UTILITY PATENT APPLICATION TRANSMITTAL
(only for new nonprovisional applications under 37 CFR 1.53(b))

Form PTO/SB/05 (04-04), approved for use through 07/31/2006
Express Mail Label No. EV331755319US

Attorney Docket No.: SRC028
First Inventor: Daniel Poznanovic et al.
Title: SYSTEM AND METHOD OF ENHANCING EFFICIENCY AND UTILIZATION OF MEMORY BANDWIDTH IN RECONFIGURABLE HARDWARE

Commissioner for Patents
P.O. Box 1450
Alexandria, VA 22313-1450

APPLICATION ELEMENTS indicated on the form:
1. Fee Transmittal Form (submitted in original and duplicate for fee processing)
3. Specification [total pages 25] (preferred arrangement set forth in the form: descriptive title of the invention; cross references to related applications; statement regarding federally sponsored R&D; reference to sequence listing, a table, or a computer program listing appendix; background of the invention; brief summary of the invention; brief description of the drawings; detailed description; claim(s); abstract of the disclosure)
4. Drawing(s) [total sheets 12]
5. Oath or Declaration [total pages 3], newly executed (original or copy)
9. Assignment Papers (cover sheet & document(s))
14. Return Receipt Postcard (MPEP 503)
17. Other: Certificate of Mailing by Express Mail

The remaining form items (Application Data Sheet under 37 CFR 1.76, CD-ROM or CD-R submissions, sequence listings, English translation, IDS citations per Form PTO/SB/08A, preliminary amendment, certified copies of priority documents, nonpublication request under 35 USC 122(b)(2)(B)(i), and the Box 18 continuing-application information) are unmarked.

19. CORRESPONDENCE ADDRESS: Customer Number 25235

\\\CS - 80404/0033 - 68264 v1

XILINX 1002

EXPRESS MAIL NO. EV331755319US

Attorney Docket No. SRC028
Client/Matter No. 804040033001

IN THE UNITED STATES PATENT AND TRADEMARK OFFICE

In re Application of:

Daniel Poznanovic, David E. Caliga,
and Jeffrey Hammes

Serial No. NEW

For: SYSTEM AND METHOD OF
ENHANCING EFFICIENCY AND
UTILIZATION OF MEMORY
BANDWIDTH IN RECONFIGURABLE
HARDWARE

Filed: Herewith

CERTIFICATE OF MAILING BY EXPRESS MAIL

Commissioner for Patents
P.O. Box 1450
Alexandria, VA 22313-1450

Sir:

The undersigned hereby certifies that the following documents:

1. Utility Patent Application Transmittal;
2. Fee Transmittal and $928 filing fee;
3. Utility Patent Application - 22 pgs. Spec., 3 pgs. Claims, 1 pg. Abstract;
4. Executed Declaration for Utility Patent Application;
5. 12 sheets of drawings;
6. Recordation Form Cover Sheet PTO-1595 with Executed Assignment and Recording Fee of $40.00;
7. Return postcard; and
8. Certificate of Mailing By Express Mail

relating to the above application, were deposited as "Express Mail" Mailing Label No. EV331755319US, with the United States Postal Service, addressed to Commissioner for Patents, P.O. Box 1450, Alexandria, VA 22313-1450.

William J. Kubida, Reg. No. 29,664

HOGAN & HARTSON LLP
One Tabor Center
1200 17th Street, Suite 1500
Denver, Colorado 80202
(719) 448-5909 Tel
(303) 899-7333 Fax
FEE TRANSMITTAL
for FY 2004
(Effective 10/01/2003. Patent fees are subject to annual revision.)

Complete if Known:
First Named Inventor: Daniel Poznanovic et al.
Attorney Docket No.: SRC028

TOTAL AMOUNT OF PAYMENT: ($) 968.00

METHOD OF PAYMENT: check. The Director is authorized to credit any overpayments and to charge any additional fee(s) or any underpayment of fee(s).

FEE CALCULATION:
1. Basic filing fee: utility filing fee, $770.00 paid.
2. Extra claim fees for utility and reissue: SUBTOTAL (2) $158.00.
3. Additional fees: recording each patent assignment per property (times number of properties), $40.00 paid; SUBTOTAL (3) $40.00.

SUBMITTED BY: (signature); Registration No. (Attorney/Agent)
PATENT APPLICATION
ATTORNEY DOCKET NO. SRC028
Client/Matter No. 804040033001

Express Mail Label No. EV331755319US

SYSTEM AND METHOD OF ENHANCING EFFICIENCY
AND UTILIZATION OF MEMORY BANDWIDTH IN
RECONFIGURABLE HARDWARE

1. Related Applications.
[0001] The present invention claims the benefit of U.S. Provisional Patent Application Serial No. 60/479,339, filed on June 18, 2003, which is incorporated herein by reference in its entirety.
BACKGROUND OF THE INVENTION

1. Field of the Invention.
[0002] The present invention relates, in general, to enhancing the efficiency and utilization of memory bandwidth in reconfigurable hardware. More specifically, the invention relates to implementing explicit memory hierarchies in reconfigurable processors that make efficient use of off-board, on-board, and on-chip storage and available algorithm locality. These explicit memory hierarchies avoid many of the tradeoffs and complexities found in the traditional memory hierarchies of microprocessors.
2. Relevant Background.
[0003] Over the past 30 years, microprocessors have enjoyed annual performance gains averaging about 50% per year. Most of the gains can be attributed to higher processor clock speeds, more memory bandwidth, and increasing utilization of instruction level parallelism (ILP) at execution time.
[0004] As microprocessors and other dense logic devices (DLDs) consume data at ever-increasing rates, it becomes more of a challenge to design memory hierarchies that can keep up. Two measures of the gap between the microprocessor and memory hierarchy are bandwidth efficiency and bandwidth utilization. Bandwidth efficiency refers to the ability to exploit available locality in a program or algorithm. In the ideal situation, when there is maximum bandwidth efficiency, all available locality is utilized. Bandwidth utilization refers to the amount of memory bandwidth that is utilized during a calculation. Maximum bandwidth utilization occurs when all available memory bandwidth is utilized.
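The two measures can be made concrete with a small illustrative calculation (not part of the application as filed; the byte counts, elapsed time, and link rate below are hypothetical):

```python
def bandwidth_efficiency(contributory_bytes, transferred_bytes):
    """Fraction of transferred data that actually participates in the computation."""
    return contributory_bytes / transferred_bytes

def bandwidth_utilization(contributory_bytes, elapsed_s, peak_bytes_per_s):
    """Fraction of the link's peak bandwidth used to pass contributory data."""
    return contributory_bytes / (elapsed_s * peak_bytes_per_s)

# Hypothetical example: a kernel needs 64 MB of operands, but the memory
# system moves 256 MB of cache lines in 0.1 s over a 6.4 GB/s link.
eff = bandwidth_efficiency(64e6, 256e6)         # 0.25: 75% of traffic is waste
util = bandwidth_utilization(64e6, 0.1, 6.4e9)  # 0.1: 10% of peak carries useful data
```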
[0005] Potential performance gains from using a faster microprocessor can be reduced or even negated by a corresponding drop in bandwidth efficiency and bandwidth utilization. Thus, there has been significant effort spent on the development of memory hierarchies that can maintain high bandwidth efficiency and utilization with faster microprocessors.
[0006] One approach to improving bandwidth efficiency and utilization in memory hierarchies has been to develop ever more powerful processor caches. These caches are high-speed memories (typically SRAM) in close proximity to the microprocessor that try to keep copies of instructions and data the microprocessor may soon need. The microprocessor can store and retrieve data from the cache at a much higher rate than from a slower, more distant main memory.
[0007] In designing cache memories, there are a number of considerations to take into account. One consideration is the width of the cache line. Caches are arranged in lines to help hide memory latency and exploit spatial locality. When a load suffers a cache miss, a new cache line is loaded from main memory into the cache. The assumption is that a program being executed by the microprocessor has a high degree of spatial locality, making it likely that other memory locations in the cache line will also be required.
[0008] For programs with a high degree of spatial locality (e.g., stride-one access), wide cache lines are more efficient since they reduce the number of times a processor has to suffer the latency of a memory access. However, for programs with lower levels of spatial locality, or random access, narrow lines are best as they reduce the wasted bandwidth from the unused neighbors in the cache line. Caches designed with wide cache lines perform well with programs that have a high degree of spatial locality, but generally have poor gather/scatter performance. Likewise, caches with short cache lines have good gather/scatter performance, but lose efficiency executing programs with high spatial locality because of the additional trips to main memory.
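A rough sketch of this tradeoff, assuming a simple blocking cache in which stride-one code fully consumes each fetched line and random-access code misses on every reference (illustrative assumptions, not a model taken from the application):

```python
def line_fill_traffic(num_accesses, access_bytes, line_bytes, spatial_locality):
    """Bytes fetched from main memory under a simple blocking-cache model.

    spatial_locality=True models stride-one access, where each fetched line
    is fully consumed before the next miss; False models random access,
    where every access misses and uses only `access_bytes` of its line.
    """
    if spatial_locality:
        misses = num_accesses * access_bytes / line_bytes  # one miss per line
    else:
        misses = num_accesses                              # every access misses
    return misses * line_bytes

# 1024 eight-byte loads:
wide_seq    = line_fill_traffic(1024, 8, 128, True)    # 8192 bytes, no waste
wide_rand   = line_fill_traffic(1024, 8, 128, False)   # 131072 bytes, 93.75% wasted
narrow_rand = line_fill_traffic(1024, 8, 16, False)    # 16384 bytes, 50% wasted
```

Wide lines are ideal for the sequential sweep but fetch 15 unused neighbors per random access; narrow lines cut that waste at the cost of more trips for sequential code.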
[0009] Another consideration in cache design is cache associativity, which refers to the mapping between locations in main memory and cache sectors. At one extreme of cache associativity is a direct-mapped cache, while at the other extreme is a fully associative cache. In a direct-mapped cache, a specific memory location can be mapped to only a single cache line. Direct-mapped caches have the advantage of being fast and easy to construct in logic. The disadvantage is that they suffer the maximum number of cache conflicts. At the other extreme, a fully associative cache allows a specific location in memory to be mapped to any cache line. Fully associative caches tend to be slower and more complex due to the large amount of comparison logic they need, but suffer no cache conflict misses. Oftentimes, caches fall between the extremes of direct-mapped and fully associative caches. A design point between the extremes is a k-set associative cache, where each memory location can map to k cache sectors. These caches generally have less overhead than fully associative caches, and reduce cache conflicts by increasing the value of k.
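The mapping can be sketched as the usual modular index computation (the line size and set count below are arbitrary illustrative values):

```python
def cache_set(addr, line_bytes=64, num_sets=256):
    """Which set a byte address maps to: (addr / line size) mod number of sets.

    A direct-mapped cache is the k=1 case (one line per set); a fully
    associative cache is num_sets=1 with k equal to the total line count.
    """
    return (addr // line_bytes) % num_sets

# Two addresses exactly num_sets * line_bytes apart land in the same set:
a, b = 0x00000, 0x00000 + 256 * 64
assert cache_set(a) == cache_set(b)
# In a direct-mapped cache these two lines evict each other on every
# alternating access; a k-set associative cache with k >= 2 holds both.
```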
[0010] Another consideration in cache design is how cache lines are replaced due to a capacity or conflict miss. In a direct-mapped cache, there is only one possible cache line that can be replaced due to a miss. However, in caches with higher levels of associativity, cache lines can be replaced in more than one way. The way the cache lines are replaced is referred to as the replacement policy.
[0011] Options for the replacement policy include least recently used (LRU), random replacement, and first in-first out (FIFO). LRU is used in the majority of circumstances where the temporal locality set is smaller than the cache size, but it is normally more expensive to build in hardware than a random replacement cache. An LRU policy can also quickly degrade depending on the working set size. For example, consider an iterative application with a matrix size of N bytes running through an LRU cache of size M bytes. If N is less than M, then the policy has the desired behavior of 100% cache hits; however, if N is only slightly larger than M, the LRU policy results in 0% cache hits as lines are removed just as they are needed.
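This degradation is easy to reproduce with a toy LRU model (an illustrative sketch measured in cache lines rather than bytes):

```python
from collections import OrderedDict

def lru_hit_rate(num_lines, cache_lines, passes=4):
    """Hit rate of an LRU cache of `cache_lines` lines under repeated
    sequential sweeps over `num_lines` distinct lines (an iterative kernel)."""
    cache, hits, refs = OrderedDict(), 0, 0
    for _ in range(passes):
        for line in range(num_lines):
            refs += 1
            if line in cache:
                hits += 1
                cache.move_to_end(line)        # mark as most recently used
            else:
                if len(cache) >= cache_lines:
                    cache.popitem(last=False)  # evict least recently used
                cache[line] = True
    return hits / refs

# Working set fits (N <= M): every pass after the first hits entirely.
# Working set one line too large (N = M + 1): LRU evicts each line
# just before it is reused, and every reference misses.
```

With 4 passes, `lru_hit_rate(64, 64)` is 0.75 (the first pass must miss) while `lru_hit_rate(65, 64)` collapses to 0.0, mirroring the N-versus-M behavior described above.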
[0012] Another consideration is deciding on a write policy for the cache. Write-through caches send data through the cache hierarchy to main memory. This policy reduces cache coherency issues for multiple processor systems and is best suited for data that will not be re-read by the processor in the immediate future. In contrast, write-back caches place a copy of the data in the cache, but do not immediately update main memory. This type of caching works best when data just written to the cache is quickly requested again by the processor.
[0013] In addition to write-through and write-back caches, another kind of write policy is implemented in a write-allocate cache, where a cache line is allocated on a write that misses in cache. Write-allocate caches improve performance when the microprocessor exhibits a lot of write-followed-by-read behavior. However, when writes are not subsequently read, a write-allocate cache has a number of disadvantages: when a cache line is allocated, it is necessary to read the remaining values from main memory to complete the cache line. This adds unnecessary memory read traffic during store operations. Also, when the data is not read again, potentially useful data in the cache is displaced by the unused data.
[0014] Another consideration is the tradeoff between the size and the speed of the cache: small caches are typically much faster than larger caches, but store less data and fewer instructions. Less data means a greater chance the cache will not have the data the microprocessor is requesting (i.e., a cache miss), which can slow everything down while the data is being retrieved from main memory.
[0015] Newer cache designs reduce the frequency of cache misses by trying to predict in advance the data that the microprocessor will request. An example of this type of cache is one that supports speculative execution and branch prediction. Speculative execution allows instructions that likely will be executed to start early based on branch prediction. Results are stored in a cache called a reorder buffer and retired if the branch was correctly predicted. Of course, when mispredictions occur, instruction and data bandwidth are wasted.
[0016] There are additional considerations and tradeoffs in cache design, but it should be apparent from the considerations described hereinbefore that it is very difficult to design a single cache structure that is optimized for many different programs. This makes cache design particularly challenging for a multipurpose microprocessor that executes a wide variety of programs. Cache designers try to derive the behavior of an "average" program constructed from several actual programs that run on the microprocessor. The cache is optimized for the average program, but no actual program behaves exactly like the average program. As a result, the designed cache ends up being sub-optimal for nearly every program actually executed by the microprocessor. Thus, there is a need for memory hierarchies that have data storage and retrieval characteristics that are optimized for actual programs executed by a processor.
[0017] Designers trying to develop ever more efficient caches optimized for a variety of actual programs also face another problem: as caches add additional features, the overhead needed to implement the added features also grows. Caches today have so much overhead that microprocessor performance may be reaching a point of diminishing returns as the overhead starts to cut into performance. In the Intel Pentium III processor, for example, more than half of the 10 million transistors are dedicated to instruction cache, branch prediction, out-of-order execution and superscalar logic. The situation has prompted predictions that as microprocessors grow to a billion transistors per chip, performance increases will drop to about 20% per year. Such a prediction, if borne out, could have a significant impact on technology growth and the computer business.
[0018] Thus, there is a growing need to develop improved memory hierarchies that limit the overhead of a memory hierarchy without also reducing bandwidth efficiency and utilization.
SUMMARY OF THE INVENTION
[0019] Accordingly, an embodiment of the invention includes a reconfigurable processor that includes a computational unit and a data access unit coupled to the computational unit, where the data access unit retrieves data from an on-processor memory and supplies the data to the computational unit, and where the computational unit and the data access unit are configured by a program.
[0020] The present invention also involves a reconfigurable processor that includes a first memory of a first type and a data prefetch unit coupled to the memory, where the data prefetch unit retrieves data from a second memory of a second type different from the first type, and the first and second memory types and the data prefetch unit are configured by a program.
[0021] Another embodiment of the invention includes a reconfigurable hardware system that includes a common memory, also referred to as external memory, and one or more reconfigurable processors coupled to the common memory, where at least one of the reconfigurable processors includes a data prefetch unit to read and write data between the unit and the common memory, and where the data prefetch unit is configured by a program executed on the system.
[0022] Another embodiment of the invention includes a method of transferring data that includes transferring data between a memory and a data prefetch unit in a reconfigurable processor, transferring data between the prefetch unit and a data access unit, and transferring the data between a computational unit and the data access unit, where the computational unit, data access unit and the data prefetch unit are configured by a program.
[0023] Additional embodiments of the invention are set forth in part in the description that follows, and in part will become apparent to those skilled in the art upon examination of the following specification, or may be learned by the practice of the invention. The advantages of the invention may be realized and attained by means of the instrumentalities, combinations, compositions, and methods particularly pointed out in the appended claims.
BRIEF DESCRIPTION OF THE DRAWINGS
[0024] Figure 1 shows a reconfigurable processor in which the present invention may be implemented;
[0025] Figure 2 shows computational logic as might be loaded into a reconfigurable processor;
[0026] Figure 3 shows a reconfigurable processor as in Figure 1, but with the addition of data access units;
[0027] Figure 4 shows a reconfigurable processor as in Figure 3, but with the addition of data prefetch units;
[0028] Figure 5 shows a reconfigurable processor with the inclusion of external memory;
[0029] Figure 6 shows reconfigurable processors with external memory and with an intelligent memory controller;
[0030] Figure 7 shows a reconfigurable processor having a combination of data prefetch units and data access units feeding computational logic;
[0031] Figure 8 shows the bandwidth efficiency and utilization gains obtained when utilizing a data prefetch unit and an intelligent memory controller to perform strided memory references;
[0032] Figure 9A and Figure 9B show the bandwidth efficiency and utilization gains obtained when utilizing a data prefetch unit and an intelligent memory controller to perform subset memory references in the X-Y plane;
[0033] Figure 10A and Figure 10B show the bandwidth efficiency and utilization gains obtained when utilizing a data prefetch unit and an intelligent memory controller to perform subset memory references in the X-Z plane;
[0034] Figure 11A and Figure 11B show the bandwidth efficiency and utilization gains obtained when utilizing a data prefetch unit and an intelligent memory controller to perform subset memory references in the Y-Z plane;
[0035] Figure 12A and Figure 12B show the bandwidth efficiency and utilization gains obtained when utilizing a data prefetch unit and an intelligent memory controller to perform subset memory references in a mini-cube;
[0036] Figure 13 shows the bandwidth efficiency and utilization gains obtained when utilizing a data prefetch unit and an intelligent memory controller to perform indirect memory references;
[0037] Figure 14 shows the bandwidth efficiency and utilization gains obtained when utilizing a data prefetch unit and an intelligent memory controller to perform strided memory references together with computation.
DETAILED DESCRIPTION

1. Definitions:
[0038] Direct execution logic (DEL) — is an assemblage of dynamically reconfigurable functional elements that enables a program to establish an optimized interconnection among selected functional units in order to implement a desired computational, data prefetch and/or data access functionality for maximizing the parallelism inherent in the particular code.
[0039] Reconfigurable Processor — is a computing device that contains reconfigurable components such as FPGAs and can, through reconfiguration, instantiate an algorithm as hardware.
[0040] Reconfigurable Logic — is composed of an interconnection of functional units, control, and storage that implements an algorithm and can be loaded into a Reconfigurable Processor.
[0041] Functional Unit — is a set of logic that performs a specific operation. The operation may, for example, be arithmetic, logical, control, or data movement. Functional units are used as building blocks of reconfigurable logic.
[0042] Macro — is another name for a functional unit.
[0043] Memory Hierarchy — is a collection of memories.
[0044] Data prefetch Unit — is a functional unit that moves data between members of a memory hierarchy. The movement may be as simple as a copy, or as complex as an indirect indexed strided copy into a unit stride memory.
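Both movement patterns can be sketched in software (an illustrative model of the data movement only, not of the hardware unit itself):

```python
def strided_prefetch(memory, base, stride, count):
    """Simple strided copy: gather `count` elements starting at `base`,
    `stride` apart, into a unit-stride buffer."""
    return [memory[base + i * stride] for i in range(count)]

def indirect_indexed_prefetch(memory, index_vector, stride=1):
    """Indirect indexed strided copy: gather through an index vector
    into a unit-stride buffer."""
    return [memory[idx * stride] for idx in index_vector]

mem = list(range(100))
strided = strided_prefetch(mem, 2, 10, 5)             # [2, 12, 22, 32, 42]
gathered = indirect_indexed_prefetch(mem, [7, 3, 3, 9])  # [7, 3, 3, 9]
```

In either case the computational logic downstream sees a dense, unit-stride buffer regardless of how scattered the source data was.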
[0045] Data access Unit — is a functional unit that accesses a component of a memory hierarchy, and delivers data directly to computational logic.
[0046] Intelligent Memory Control Unit — is a control unit that has the ability to select data from its storage according to a variety of algorithms that can be selected by a data requester, such as a data prefetch unit.
[0047] Bandwidth Efficiency — is defined as the percentage of contributory data transferred between two points. Contributory data is data that actually participates in the recipient's processing.
[0048] Bandwidth Utilization — is defined as the percentage of maximum bandwidth between two points that is actually used to pass contributory data.
2. Description
[0049] A reconfigurable processor (RP) 100 implements direct executable logic (DEL) to perform computation, as well as a memory hierarchy for maintaining input data and computational results. DEL is an assemblage of dynamically reconfigurable functional elements that enables a program to establish an optimized interconnection among selected functional units in order to implement a desired computational, data prefetch and/or data access functionality for maximizing the parallelism inherent in the particular code. The term DEL may also be used to refer to the set of constructs such as code, data, configuration variables, and the like that can be loaded into RP 100 to cause RP 100 to implement a particular assemblage of functional elements.
[0050] Figure 1 presents an RP 100, which may be implemented using field programmable gate arrays (FPGAs) or other reconfigurable logic devices, that can be configured and reconfigured to contain functional units and interconnecting circuits, and a memory hierarchy comprising on-board memory banks 104, on-chip block RAM 106, registers, wires, and a connection 108 to external memory. On-chip reconfigurable components 102 create memory structures such as registers, FIFOs, wires and arrays using block RAM. Dual-ported memory 106 is shared between on-chip reconfigurable components 102. The reconfigurable processor 100 also implements user-defined computational logic (e.g., such as DEL 200 shown in Figure 2) constructed by programming an FPGA to implement a particular interconnection of computational functional units. In a particular implementation, a number of RPs 100 are implemented within a memory subsystem of a conventional computer, such as on devices that are physically installed in dual inline memory module (DIMM) sockets of a computer. In this manner the RPs 100 can be accessed by memory operations and so coexist well with a more conventional hardware platform. It should be noted that, although the exemplary implementation of the present invention illustrated includes six banks of dual ported memory 104 and two reconfigurable components 102, any number of memory banks and/or reconfigurable components may be used depending upon the particular implementation or application.
[0051] Any computer program, including complex graphics processing programs, word processing programs, database programs and the like, is a collection of algorithms that interact to implement desired functionality. In the common case in which static computing hardware resources are used (e.g., a conventional microprocessor), the computer program is compiled into a set of executable code (i.e., object code) units that are linked together to implement the computer program on the particular hardware resources. The executable code is generated specifically for a particular hardware platform. In this manner, the computer program is adapted to conform to the limitations of the static hardware platform. However, the compilation process makes many compromises based on the limitations of the static hardware platform.
[0052] Alternatively, an algorithm can be defined in a high level language and then compiled into DEL. DEL can be produced via a compiler from high level programming languages such as C or FORTRAN, or may be designed using a hardware definition language such as Verilog, VHDL, or a schematic capture tool. Computation is performed by reconfiguring a reconfigurable processor with the DEL and flowing data through the computation. In this manner, the hardware resources are essentially adapted to conform to the program rather than the program being adapted to conform to the hardware resources.
[0053] For purposes of this description, a single reconfigurable processor will be presented first. A sample of computational logic 201 is shown in Figure 2. This simple assemblage of functional units performs computation of two results ("A+B" and "A+B-(B*C)") from three input variables or operands "A", "B" and "C". In practice, computational units 201 can be implemented to perform very simple or arbitrarily complex computations. The input variables (operands) and output or result variables may be of any size necessary for a particular application. Theoretically, any number of operands and result variables may be used/generated by a particular DEL. Great complexity of computation can be supported by adding additional reconfigurable chips and processors.
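The two results named above can be modeled as a pure dataflow expression. The function below is a software sketch of the Figure 2 assemblage, with each functional unit reduced to a single operator (the function name is illustrative, not from the application):

```python
def computational_logic_201(a, b, c):
    """Software model of the Figure 2 assemblage: two adders and a
    multiplier produce two results per set of operands."""
    sum_ab = a + b               # first adder: "A+B"
    prod_bc = b * c              # multiplier:  "B*C"
    result2 = sum_ab - prod_bc   # subtractor:  "A+B-(B*C)"
    return sum_ab, result2

# One set of operands in, two results out; in the pipelined hardware form,
# a new (A, B, C) triple could enter on every clock pulse.
```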
[0054] For greatest performance the DEL 200 is constructed as parallel pipelined logic blocks composed of computational functional units capable of taking data and producing results with each clock pulse. The highest possible performance that can be achieved is computation of a set of results with each clock pulse. To achieve this, data should be available at the same rate the computation can consume the data. The rate at which data can be supplied to DEL 200 is determined, at least in significant part, by the memory bandwidth utilization and efficiency. Maximal computational performance can be achieved with parallel and pipelined DEL together with maximizing the memory bandwidth utilization and efficiency. Unlike conventional static hardware platforms, however, the memory hierarchy provided in an RP 100 is reconfigurable. In accordance with the present invention, through the use of data access units and associated memory hierarchy components, computational demands and memory bandwidth can be matched.
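The one-result-per-clock behavior of a filled pipeline can be illustrated with a small cycle-level model. The stage count, the feed discipline, and all names here are assumptions made for illustration, not details taken from the specification:

```python
from collections import deque

def pipelined_del(operands, depth=3):
    """Toy model of a depth-stage pipelined DEL: one operand set
    enters per clock pulse and, after the fill latency, one result
    set (A+B, A+B-(B*C)) retires per clock pulse."""
    stages = deque([None] * depth)   # pipeline registers
    results, clock, feeding = [], 0, True
    feed = iter(operands)
    while feeding or any(s is not None for s in stages):
        clock += 1
        finished = stages.pop()          # final stage retires this clock
        if finished is not None:
            a, b, c = finished
            results.append((a + b, a + b - b * c))
        nxt = next(feed, None)           # memory supplies the next operand set
        if nxt is None:
            feeding = False
        stages.appendleft(nxt)
    return clock, results

clocks, out = pipelined_del([(2, 3, 4), (1, 1, 1), (0, 5, 2)])
print(clocks, out)  # 6 [(5, -7), (2, 1), (5, -5)]
```

Three operand sets drain in len(operands) + depth = 6 clocks. If memory could not supply an operand set on every clock, bubbles (None stages) would lower the steady-state rate, which is why memory bandwidth sets the performance ceiling.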
[0055] High memory bandwidth efficiency is achieved when only data required for computation is moved within the memory hierarchy. Figure 3 shows a simple logic block 300 comprising computational functional units 301, control (not shown), and data access functional units 303. The data access unit 303 presents data directly to the computational logic 301. In this manner, data is moved from a memory device 305 to the computational logic and from the computational logic back into a memory device 305 or block RAM memory 307 within an RP 100.
[0056] Figure 4 illustrates the logic block 300 with the addition of a data prefetch unit 401. The data prefetch unit 401 moves data from one member of the memory hierarchy 305 to another 308. Data prefetch unit 401 operates independently of the other functional units 301, 302 and 303 and can therefore operate prior to, in parallel with, or after the computational logic. This independence of operation permits hiding the latency associated with obtaining data for use in computation. The data prefetch unit deposits data into the memory hierarchy within RP 100, where computational logic 301, 302 and 303 can access it through data access units. In the example of Figure 4, prefetch unit 401 is configured to deposit data into block RAM memory 308. Hence, the prefetch units 401 may be operated independently of the logic block 300 that uses the prefetched data.
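The latency-hiding role of the prefetch unit can be mimicked in software with a producer thread that fills a small staging buffer while the consumer computes. Everything here (the use of threads, the buffer size, the squaring workload) is an illustrative stand-in for the hardware described, not part of the specification:

```python
import threading
import queue
import time

def prefetch_demo(data):
    """A prefetch thread stands in for data prefetch unit 401,
    depositing operands into a bounded buffer (standing in for block
    RAM 308) while the consumer loop, standing in for the
    computational logic, works on earlier operands."""
    block_ram = queue.Queue(maxsize=2)   # small on-chip staging buffer
    DONE = object()                      # end-of-stream marker

    def prefetch_unit():
        for item in data:
            time.sleep(0.001)            # model main-memory latency
            block_ram.put(item)          # deposit into the staging buffer
        block_ram.put(DONE)

    threading.Thread(target=prefetch_unit, daemon=True).start()
    results = []
    while (item := block_ram.get()) is not DONE:
        results.append(item * item)      # compute overlaps the next fetch
    return results

print(prefetch_demo(range(8)))  # [0, 1, 4, 9, 16, 25, 36, 49]
```

Because the producer runs ahead of the consumer, the per-item memory latency is overlapped with computation rather than added to it, which is the independence of operation described above.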
[0057] An important feature of the present invention is that many types of data prefetch units can be defined so that the prefetch hardware can be configured to conform to the needs of the algorithms currently implemented by the computational logic. The specific characteristics of
