`
UNITED STATES DEPARTMENT OF COMMERCE

United States Patent and Trademark Office
`
`July 12, 2018
`
`THIS IS TO CERTIFY THAT ANNEXED IS A TRUE COPY FROM THE
`RECORDS OF THIS OFFICE OF THE FILE WRAPPER AND CONTENTS
`OF:
`
`APPLICATION NUMBER: 10/869,200
`FILING DATE: June 16, 2004
`PATENT NUMBER: 7149867
`ISSUE DATE: December 12, 2006
`
`Certified by
`
`Under Secretary of Commerce
`for Intellectual Property
`and Director of the United States
`Patent and Trademark Office
`
`Petitioners Amazon
`Ex. 1010, p. 1 of 399
`
`
`
`
UTILITY PATENT APPLICATION TRANSMITTAL
(Only for new nonprovisional applications under 37 CFR 1.53(b))

PTO/SB/05 (04-04)
Approved for use through 07/31/2006
`
Attorney Docket No.: SRC028
First Inventor: Daniel Poznanovic et al.
Express Mail Label No.: EV331755319US
Title: SYSTEM AND METHOD OF ENHANCING EFFICIENCY AND UTILIZATION OF MEMORY BANDWIDTH IN RECONFIGURABLE HARDWARE
`
`Commissioner for Patents
`P.O. Box 1450
`Alexandria, VA 22313-1450
`
`
6. [ ] Application Data Sheet (See 37 CFR 1.76)
7. [ ] CD-ROM or CD-R in duplicate, large table or Computer Program (Appendix)
8. Nucleotide and/or Amino Acid Sequence Submission (if applicable, all necessary)
   a. [ ] Computer Readable Form
   b. Specification Sequence Listing on:
      i. [ ] CD-ROM or CD-R (2 copies); or
      ii. [ ] paper
   c. [ ] Statements verifying identity of above copies
`
APPLICATION ELEMENTS

1. [X] Fee Transmittal Form (submit an original and a duplicate for fee processing)
2. [ ] Applicant claims small entity status. See 37 CFR 1.27
3. [X] Specification [total pages 26] (preferred arrangement set forth below)
   - Descriptive title of the Invention
   - Cross References to Related Applications
   - Statement Regarding Fed sponsored R&D
   - Reference to sequence listing, a table, or a computer program listing appendix
   - Background of the Invention
   - Brief Summary of the Invention
`
ACCOMPANYING APPLICATION PARTS

9. [X] Assignment Papers (cover sheet/document(s))
10. [ ] 37 CFR 3.73(b) Statement (when there is an assignee)  [X] Power of Attorney
11. [ ] English Translation Document
12. [ ] IDS & Form PTO/SB/08A  [ ] Copies of IDS Citations
`
   - Brief Description of the Drawings
   - Detailed Description
   - Claim(s)
   - Abstract of the Disclosure
4. [X] Drawing(s) [total sheets 12]
5. [X] Oath or Declaration [total pages 3]
   a. [X] Newly executed (original or copy)
   b. [ ] Copy from prior appl. (37 C.F.R. § 1.63(d)) (for continuation/divisional with Box 18 completed)
      i. [ ] DELETION OF INVENTOR(S): Signed statement attached deleting inventor(s) named in prior application, see 37 C.F.R. §§ 1.63(d)(2) and 1.33(b).
18. If a CONTINUING APPLICATION, check appropriate box, and supply the requisite information below and in a preliminary amendment, or in an Application Data Sheet under 37 CFR 1.76:
[ ] Continuation  [ ] Divisional  [ ] Continuation-in-part (CIP)
Prior application information: Examiner:
`
13. [ ] Preliminary Amendment
14. [X] Return Receipt Postcard (MPEP 503)
15. [ ] Certified Copy of Priority Document(s)
16. [ ] Nonpublication Request Under 35 USC 122(b)(2)(B)(i). Applicant must attach form PTO/SB/35
17. [X] Other: Certificate of Mailing by Express Mail

of prior application No.:  Group/Art Unit:

FOR CONTINUATION OR DIVISIONAL APPS only: The entire disclosure of the prior application, from which an oath or declaration is supplied under Box 5b, is considered a part of the disclosure of the accompanying continuation or divisional application and is hereby incorporated by reference. The incorporation can only be relied upon when a portion has been inadvertently omitted from the submitted application parts.
`
19. CORRESPONDENCE ADDRESS
[X] Customer Number 25235  or  [ ] Correspondence address below
Name / Address / City / State / ZIP / Country / Telephone / Fax

Name (Print/Type): William J. Kubida    Registration No. 29,664
(Signature)    Date
`
`
\\\CS - 80404/0033 - 68264 v1
`
`
`
`
`EXPRESS MAIL NO. EV331755319US
`Attorney Docket No. SRC028
`Client/Matter No. 80404.0033.001
`
`IN THE UNITED STATES PATENT AND TRADEMARK OFFICE
`
`In re Application of:
`
`Daniel Poznanovic, David E. Caliga,
`and Jeffrey Hammes
`
`Serial No. NEW
`
`Filed: Herewith
`
`For: SYSTEM AND METHOD OF
`ENHANCING EFFICIENCY AND
`UTILIZATION OF MEMORY
`BANDWIDTH IN RECONFIGURABLE
`HARDWARE
`
`CERTIFICATE OF MAILING BY EXPRESS MAIL
`
`Commissioner for Patents
`P.O. Box 1450
`Alexandria, VA 22313-1450
`
`Sir:
`
The undersigned hereby certifies that the following documents:
1. Utility Patent Application Transmittal;
2. Fee Transmittal and $928 filing fee;
3. Utility Patent Application - 22 pgs. Spec, 3 pgs. Claims, 1 pg. Abstract;
4. Executed Declaration for Utility Patent Application;
5. 12 sheets of drawings;
6. Recordation Form Cover Sheet PTO-1595 with Executed Assignment and Recording Fee of $40.00;
7. Return postcard; and
8. Certificate of Mailing By Express Mail
relating to the above application, were deposited as "Express Mail", Mailing Label No. EV331755319US, with the United States Postal Service, addressed to Commissioner for Patents, P.O. Box 1450, Alexandria, VA 22313-1450.
`
One Tabor Center
1200 17th Street, Suite 1500
Denver, Colorado 80202
(719) 448-5909 Tel
(303) 899-7333 Fax
`
`
`
`
FEE TRANSMITTAL
for FY 2004
Effective 10/01/2003. Patent fees are subject to annual revision.

[ ] Applicant claims small entity status. See 37 CFR 1.27

TOTAL AMOUNT OF PAYMENT: ($) 968.00

Complete if Known
Application Number:
Filing Date: Herewith
First Named Inventor: Daniel Poznanovic et al.
Examiner Name:
Group/Art Unit:
Attorney Docket No.: SRC028
`
FEE CALCULATION (continued)
3. ADDITIONAL FEES

Large Entity Fee ($) / Small Entity Fee ($) / Fee Description / Fee Paid
130 / 65 / Surcharge - late filing fee or oath
50 / 25 / Surcharge - late provisional filing fee or cover sheet
130 / 130 / Non-English specification
2,520 / 2,520 / For filing a request for ex parte reexamination
920* / 920* / Requesting publication of SIR prior to Examiner action
1,840* / 1,840* / Requesting publication of SIR after Examiner action
110 / 55 / Extension for reply within first month
420 / 210 / Extension for reply within second month
950 / 475 / Extension for reply within third month
1,480 / 740 / Extension for reply within fourth month
2,010 / 1,005 / Extension for reply within fifth month
330 / 165 / Notice of Appeal
330 / 165 / Filing a brief in support of an appeal
290 / 145 / Request for oral hearing
`
METHOD OF PAYMENT (check all that apply)
[X] check  [ ] credit card  [ ] money order  [ ] other  [ ] none
[ ] Deposit Account
    Deposit Account Number: 50-1123
    Deposit Account Name: Hogan & Hartson L.L.P.

The Director is authorized to: (check all that apply)
[ ] Charge fee(s) indicated below  [X] Credit any overpayments
[X] Charge any additional fee(s) or any underpayment of fee(s)
[ ] Charge fee(s) indicated below, except for the filing fee, to the above-identified deposit account
`
FEE CALCULATION

1. BASIC FILING FEE
Large Entity Fee ($) / Small Entity Fee ($) / Fee Description / Fee Paid
770 / 385 / Utility filing fee / 770.00
340 / 170 / Design filing fee
530 / 265 / Plant filing fee
770 / 385 / Reissue filing fee
160 / 80 / Provisional filing fee

SUBTOTAL (1): ($) 770.00
2. EXTRA CLAIM FEES FOR UTILITY AND REISSUE
Extra Claims / Fee from below / Fee Paid
Total Claims: 24 - 20* = 4 x 18 = 72.00
Independent Claims: 4 - 3* = 1 x 86 = 86.00
Multiple Dependent:
*or number previously paid, if greater; For Reissues, see below

Large Entity Fee ($) / Small Entity Fee ($) / Fee Description
18 / 9 / Claims in excess of 20
86 / 43 / Independent claims in excess of 3
290 / 145 / Multiple dependent claim, if not paid
86 / 43 / **Reissue independent claims over original patent
18 / 9 / **Reissue claims in excess of 20 and over original patent

SUBTOTAL (2): ($) 158.00
`
1,510 / 1,510 / Petition to institute a public use proceeding
110 / 55 / Petition to revive - unavoidable
1,330 / 665 / Petition to revive - unintentional
1,330 / 665 / Utility issue fee (or reissue)
480 / 240 / Design issue fee
640 / 320 / Plant issue fee
130 / 130 / Petitions to the Commissioner
50 / 50 / Processing fee under 37 CFR 1.17(q)
180 / 180 / Submission of Information Disclosure Stmt
40 / 40 / Recording each patent assignment per property (times number of properties) / 40.00
770 / 385 / Filing a submission after final rejection (37 CFR § 1.129(a))
770 / 385 / For each additional invention to be examined (37 CFR § 1.129(b))
770 / 385 / Request for Continued Examination
900 / 900 / Request for expedited examination of a design application
Other fee (specify)
*Reduced by Basic Filing Fee Paid

SUBTOTAL (3): ($) 40.00
`
Complete (if applicable)
SUBMITTED BY
Name (Print/Type): William J. Kubida    Registration No. (Attorney/Agent): 29,664
Signature    Date
Telephone: (719) 448-5900
`
`
`
`
`
`PATENT APPLICATION
`ATTORNEY DOCKET No. SRC028
`Client/Matter No. 80404.0033.001
`Express Mail Label No. EV331755319US
`
`SYSTEM AND METHOD OF ENHANCING EFFICIENCY
`AND UTILIZATION OF MEMORY BANDWIDTH IN
`RECONFIGURABLE HARDWARE
`
1. Related Applications.

[0001] The present invention claims the benefit of U.S. Provisional Patent Application Serial No. 60/479,339 filed on June 18, 2003, which is incorporated herein by reference in its entirety.
`
`BACKGROUND OF THE INVENTION
`
1. Field of the Invention.
`
[0002] The present invention relates, in general, to enhancing the efficiency and utilization of memory bandwidth in reconfigurable hardware. More specifically, the invention relates to implementing explicit memory hierarchies in reconfigurable processors that make efficient use of off-board, on-board, and on-chip storage and available algorithm locality. These explicit memory hierarchies avoid many of the tradeoffs and complexities found in the traditional memory hierarchies of microprocessors.
`
2. Relevant Background.

[0003] Over the past 30 years, microprocessors have enjoyed annual performance gains averaging about 50% per year. Most of the gains can be attributed to higher processor clock speeds, more memory bandwidth and increasing utilization of instruction level parallelism (ILP) at execution time.
`
[0004] As microprocessors and other dense logic devices (DLDs) consume data at ever-increasing rates, it becomes more of a challenge to design memory hierarchies that can keep up. Two measures of the gap between the microprocessor and memory hierarchy are bandwidth efficiency and bandwidth
`
`
`
`utilization. Bandwidth efficiency refers to the ability to exploit available locality
`
`in a program or algorithm.
`
`In the ideal situation, when there is maximum
`
`bandwidth efficiency, all available locality is utilized. Bandwidth utilization refers
`to the amount of memory bandwidth that is utilized during a calculation.
`Maximum bandwidth utilization occurs when all available memory bandwidth is
`
`utilized.
`
`[0005] Potential performance gains from using a faster microprocessor can be
`reduced or even negated by a corresponding drop in bandwidth efficiency and
`bandwidth utilization. Thus, there has been significant effort spent on the
`development of memory hierarchies that can maintain high bandwidth efficiency
`and utilization with faster microprocessors.
`
`[0006] One approach to improving bandwidth efficiency and utilization in
`memory hierarchies has been to develop ever more powerful processor
`
`caches. These caches are high-speed memories (typically SRAM) in close
`
`proximity to the microprocessor that try to keep copies of instructions and data
`the microprocessor may soon need. The microprocessor can store and retrieve
`
`data from the cache at a much higher rate than from a slower, more distant
`
`main memory.
`
`[0007] In designing cache memories, there are a number of considerations to
`take into account. One consideration is the width of the cache line. Caches
`are arranged in lines to help hide memory latency and exploit spatial locality.
`When a load suffers a cache miss, a new cache line is loaded from main
`memory into the cache. The assumption is that a program being executed by
`the microprocessor has a high degree of spatial locality, making it likely that
`other memory locations in the cache line will also be required.
`
`[0008] For programs with a high degree of spatial locality (e.g., stride-one
`access), wide cache lines are more efficient since they reduce the number of
`
`times a processor has to suffer the latency of a memory access. However, for
`
`programs with lower levels of spatial locality, or random access, narrow lines
`
`
`
`
`are best as they reduce the wasted bandwidth from the unused neighbors in
`the cache line. Caches designed with wide cache lines perform well with
`programs that have a high degree of spatial locality, but generally have poor
`gather/scatter performance. Likewise, caches with short cache lines have good
gather/scatter performance, but lose efficiency executing programs with high
`spatial locality because of the additional runs to the main memory.
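The line-width tradeoff described above can be made concrete with a short sketch. This is illustrative only and not part of the application; the function name, access patterns, and sizes are all invented. It counts how much of the data fetched in whole cache lines is actually used under stride-one versus scattered access.

```python
def useful_fraction(addresses, line_words):
    """Fraction of words fetched from memory that the program actually
    uses, assuming each miss loads a whole line and nothing is evicted."""
    lines_touched = {addr // line_words for addr in addresses}
    words_used = len(set(addresses))
    words_fetched = len(lines_touched) * line_words
    return words_used / words_fetched

stride_one = list(range(64))             # high spatial locality
scattered = [i * 97 for i in range(64)]  # one useful word per line

print(useful_fraction(stride_one, 16))   # wide lines, stride-one: 1.0
print(useful_fraction(scattered, 16))    # wide lines, scattered: 0.0625
print(useful_fraction(scattered, 1))     # narrow lines, scattered: 1.0
```

Under this toy model, wide lines waste nothing on stride-one access but fetch mostly unused neighbors on scattered access, which narrow lines avoid.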
`
[0009] Another consideration in cache design is cache associativity, which refers to the mapping between locations in main memory and cache sectors. At one extreme of cache associativity is a direct-mapped cache, while at another extreme is a fully associative cache. In a direct-mapped cache, a specific memory location can be mapped to only a single cache line. Direct-mapped caches have the advantage of being fast and easy to construct in logic. The disadvantage is that they suffer the maximum number of cache conflicts. At the other extreme, a fully associative cache allows a specific location in memory to be mapped to any cache line. Fully associative caches tend to be slower and more complex due to the large amount of comparison logic they need, but suffer no cache conflict misses. Oftentimes, caches fall between the extremes of direct-mapped and fully associative caches. A design point between the extremes is a k-set associative cache, where each memory location can map to k cache sectors. These caches generally have less overhead than fully associative caches, and reduce cache conflicts by increasing the value of k.
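The placement constraint can be sketched in a few lines (hypothetical illustration, not from the application): given a cache of num_lines slots organized k ways, a memory line's set index fixes which k slots may hold it.

```python
def candidate_slots(line_addr, num_lines, k):
    """Slots a memory line may occupy in a k-way set-associative cache.
    k = 1 is direct-mapped; k = num_lines is fully associative."""
    num_sets = num_lines // k
    set_index = line_addr % num_sets
    return [set_index * k + way for way in range(k)]

print(candidate_slots(37, 8, 1))  # direct-mapped: exactly one slot, [5]
print(candidate_slots(37, 8, 4))  # 4-way: one of four slots, [4, 5, 6, 7]
print(candidate_slots(37, 8, 8))  # fully associative: any of the 8 slots
```

Raising k widens each line's set of permitted slots, which is why conflicts fall as k grows, at the cost of comparing more tags per lookup.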
`
[0010] Another consideration in cache design is how cache lines are replaced due to a capacity or conflict miss. In a direct-mapped cache, there is only one possible cache line that can be replaced due to a miss. However, in caches with higher levels of associativity, cache lines can be replaced in more than one way. The way the cache lines are replaced is referred to as the replacement policy.
`
`
`
`
[0011] Options for the replacement policy include least recently used (LRU), random replacement, and first in-first out (FIFO). LRU is used in the majority of circumstances where the temporal locality set is smaller than the cache size, but it is normally more expensive to build in hardware than a random replacement cache. An LRU policy can also quickly degrade depending on the working set size. For example, consider an iterative application with a matrix size of N bytes running through an LRU cache of size M bytes. If N is less than M, then the policy has the desired behavior of 100% cache hits; however, if N is only slightly larger than M, the LRU policy results in 0% cache hits as lines are removed just as they are needed.
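The N-versus-M cliff described above can be reproduced with a small LRU simulation. This is a sketch with invented sizes, not material from the application.

```python
from collections import OrderedDict

def lru_hit_rate(accesses, capacity):
    """Simulate an LRU cache holding `capacity` lines; return hit fraction."""
    cache, hits = OrderedDict(), 0
    for line in accesses:
        if line in cache:
            hits += 1
            cache.move_to_end(line)          # mark most recently used
        else:
            if len(cache) >= capacity:
                cache.popitem(last=False)    # evict least recently used
            cache[line] = True
    return hits / len(accesses)

M = 100  # cache capacity in lines
fits   = [i % M for i in range(10 * M)]        # working set == capacity
thrash = [i % (M + 1) for i in range(10 * M)]  # working set one line larger

print(lru_hit_rate(fits, M))    # 0.9: every access hits after warm-up
print(lru_hit_rate(thrash, M))  # 0.0: each line is evicted just before reuse
```

Growing the working set by a single line drops the hit rate from near 100% to exactly 0%, which is the degradation the paragraph describes.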
`
[0012] Another consideration is deciding on a write policy for the cache. Write-through caches send data through the cache hierarchy to main memory. This policy reduces cache coherency issues for multiple processor systems and is best suited for data that will not be re-read by the processor in the immediate future. In contrast, write-back caches place a copy of the data in the cache, but do not immediately update main memory. This type of caching works best when data just written to the cache is quickly requested again by the processor.
`
[0013] In addition to write-through and write-back caches, another kind of write policy is implemented in a write-allocate cache, where a cache line is allocated on a write that misses in cache. Write-allocate caches improve performance when the microprocessor exhibits a lot of write-followed-by-read behavior. However, when writes are not subsequently read, a write-allocate cache has a number of disadvantages: when a cache line is allocated, it is necessary to read the remaining values from main memory to complete the cache line. This adds unnecessary memory read traffic during store operations. Also, when the data is not read again, potentially useful data in the cache is displaced by the unused data.
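The extra read traffic noted above can be illustrated with a hedged sketch (invented line size and store stream, not from the application): a write-allocate miss must fetch the rest of the line before the store completes, even if nothing in that line is ever read.

```python
def allocate_read_traffic(store_addrs, line_words, write_allocate):
    """Words read from main memory solely to fill lines on write misses."""
    allocated, words_read = set(), 0
    for addr in store_addrs:
        line = addr // line_words
        if write_allocate and line not in allocated:
            words_read += line_words - 1  # fetch the line's other words
            allocated.add(line)
    return words_read

stores = list(range(0, 1024, 8))  # one stored word per 8-word line
print(allocate_read_traffic(stores, 8, write_allocate=True))   # 896
print(allocate_read_traffic(stores, 8, write_allocate=False))  # 0
```

In this toy stream of 128 one-word stores, write-allocate generates 896 words of read traffic that a no-allocate policy avoids entirely.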
`
`
`
`
[0014] Another consideration is the tradeoff between the size and the speed of the cache: small caches are typically much faster than larger caches, but store less data and fewer instructions. Less data means a greater chance the cache will not have the data the microprocessor is requesting (i.e., a cache miss), which can slow everything down while the data is being retrieved from the main memory.
`
[0015] Newer cache designs reduce the frequency of cache misses by trying to predict in advance the data that the microprocessor will request. An example of this type of cache is one that supports speculative execution and branch prediction. Speculative execution allows instructions that likely will be executed to start early based on branch prediction. Results are stored in a cache called a reorder buffer and retired if the branch was correctly predicted. Of course, when mis-predictions occur, instruction and data bandwidth are wasted.
`
[0016] There are additional considerations and tradeoffs in cache design, but it should be apparent from the considerations described hereinbefore that it is very difficult to design a single cache structure that is optimized for many different programs. This makes cache design particularly challenging for a multipurpose microprocessor that executes a wide variety of programs. Cache designers try to derive the program behavior of an "average" program constructed from several actual programs that run on the microprocessor. The cache is optimized for the average program, but no actual program behaves exactly like the average program. As a result, the designed cache ends up being sub-optimal for nearly every program actually executed by the microprocessor. Thus, there is a need for memory hierarchies that have data storage and retrieval characteristics that are optimized for actual programs executed by a processor.
`
[0017] Designers trying to develop ever more efficient caches optimized for a variety of actual programs also face another problem: as caches add additional features, the overhead needed to implement the added features also grows.
`
`
`
`
Caches today have so much overhead that microprocessor performance may be reaching a point of diminishing returns as the overhead starts to cut into performance. In the Intel Pentium III processor, for example, more than half of the 10 million transistors are dedicated to instruction cache, branch prediction, out-of-order execution and superscalar logic. The situation has prompted predictions that as microprocessors grow to a billion transistors per chip, performance increases will drop to about 20% per year. Such a prediction, if borne out, could have a significant impact on technology growth and the computer business.
`
[0018] Thus, there is a growing need to develop improved memory hierarchies that limit the overhead of a memory hierarchy without also reducing bandwidth efficiency and utilization.
`
`SUMMARY OF THE INVENTION
`
[0019] Accordingly, an embodiment of the invention includes a reconfigurable processor that includes a computational unit and a data access unit coupled to the computational unit, where the data access unit retrieves data from an on-processor memory and supplies the data to the computational unit, and where the computational unit and the data access unit are configured by a program.
`
[0020] The present invention also involves a reconfigurable processor that includes a first memory of a first type and a data prefetch unit coupled to the memory, where the data prefetch unit retrieves data from a second memory of a second type different from the first type, and the first and second memory types and the data prefetch unit are configured by a program.
`
[0021] Another embodiment of the invention includes a reconfigurable hardware system that includes a common memory, also referred to as external memory, and one or more reconfigurable processors coupled to the common memory, where at least one of the reconfigurable processors includes a data prefetch unit to read and write data between the unit and the common memory,
`
`
`
`
`and where the data prefetch unit is configured by a program executed on the
`
`system.
`
[0022] Another embodiment of the invention includes a method of transferring data that includes transferring data between a memory and a data prefetch unit in a reconfigurable processor, transferring data between the prefetch unit and a data access unit, and transferring the data between a computational unit and the data access unit, where the computational unit, data access unit and the data prefetch unit are configured by a program.
`
[0023] Additional embodiments of the invention are set forth in part in the description that follows, and in part will become apparent to those skilled in the art upon examination of the following specification, or may be learned by the practice of the invention. The advantages of the invention may be realized and attained by means of the instrumentalities, combinations, compositions, and methods particularly pointed out in the appended claims.
`
`BRIEF DESCRIPTION OF THE DRAWINGS
`
[0024] Figure 1 shows a reconfigurable processor in which the present invention may be implemented;
`
[0025] Figure 2 shows computational logic as might be loaded into a reconfigurable processor;
`
[0026] Figure 3 shows a reconfigurable processor as in Figure 1, but with the addition of data access units;
`
[0027] Figure 4 shows a reconfigurable processor as in Figure 3, but with the addition of data prefetch units;
`
[0028] Figure 5 shows a reconfigurable processor with the inclusion of external memory;
`
`
`
`
`[0029] Figure 6 shows reconfigurable processors with external memory and
`
`with an intelligent memory controller;
`
`[0030] Figure 7 shows a reconfigurable processor having a combination of
`data prefetch units and data access units feeding computational logic;
`
`[0031] Figure 8 shows the bandwidth efficiency and utilization gains obtained
`
`when utilizing a data prefetch unit and an intelligent memory controller to
`
`perform strided memory references;
`
[0032] Figure 9A and Figure 9B show the bandwidth efficiency and utilization gains obtained when utilizing a data prefetch unit and an intelligent memory controller to perform subset memory references in the X-Y plane;
`
[0033] Figure 10A and Figure 10B show the bandwidth efficiency and utilization gains obtained when utilizing a data prefetch unit and an intelligent memory controller to perform subset memory references in the X-Z plane;
`
[0034] Figure 11A and Figure 11B show the bandwidth efficiency and utilization gains obtained when utilizing a data prefetch unit and an intelligent memory controller to perform subset memory references in the Y-Z plane;
`
[0035] Figure 12A and Figure 12B show the bandwidth efficiency and utilization gains obtained when utilizing a data prefetch unit and an intelligent memory controller to perform subset memory references in a mini-cube;
`
`[0036] Figure 13 shows the bandwidth efficiency and utilization gains obtained
`
`when utilizing a data prefetch unit and an intelligent memory controller to
`
`perform indirect memory references;
`
`[0037] Figure 14 shows the bandwidth efficiency and utilization gains obtained
`
`when utilizing a data prefetch unit and an intelligent memory controller to
`
`perform strided memory reference together with computation.
`
`
`
`
DETAILED DESCRIPTION

1. Definitions:
`
[0038] Direct execution logic (DEL) - is an assemblage of dynamically reconfigurable functional elements that enables a program to establish an optimized interconnection among selected functional units in order to implement a desired computational, data prefetch and/or data access functionality for maximizing the parallelism inherent in the particular code.
`
[0039] Reconfigurable Processor - is a computing device that contains reconfigurable components such as FPGAs and can, through reconfiguration, instantiate an algorithm as hardware.
`
[0040] Reconfigurable Logic - is composed of an interconnection of functional units, control, and storage that implements an algorithm and can be loaded into a Reconfigurable Processor.
`
[0041] Functional Unit - is a set of logic that performs a specific operation. The operation may, for example, be arithmetic, logical, control, or data movement. Functional units are used as building blocks of reconfigurable logic.
`
[0042] Macro - is another name for a functional unit.

[0043] Memory Hierarchy - is a collection of memories.
`
[0044] Data Prefetch Unit - is a functional unit that moves data between members of a memory hierarchy. The movement may be as simple as a copy, or as complex as an indirect indexed strided copy into a unit stride memory.
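The kinds of movement this definition spans can be sketched in software (a hypothetical illustration; the actual unit is reconfigurable logic, not code), from a strided gather to an indirect indexed gather, each packing its results into unit-stride memory:

```python
def strided_gather(memory, base, stride, count):
    """Copy `count` elements, `stride` apart, into a unit-stride buffer."""
    return [memory[base + i * stride] for i in range(count)]

def indirect_gather(memory, index_vector):
    """Copy the elements named by an index vector into a unit-stride buffer."""
    return [memory[i] for i in index_vector]

mem = list(range(100, 200))
print(strided_gather(mem, 0, 10, 5))       # [100, 110, 120, 130, 140]
print(indirect_gather(mem, [3, 1, 4, 1]))  # [103, 101, 104, 101]
```

An "indirect indexed strided copy" composes the two: the index vector is itself read at a stride, and each resulting index selects the element to pack next.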
`
[0045] Data Access Unit - is a functional unit that accesses a component of a memory hierarchy, and delivers data directly to computational logic.
`
`
`
`
[0046] Intelligent Memory Control Unit - is a control unit that has the ability to select data from its storage according to a variety of algorithms that can be selected by a data requester, such as a data prefetch unit.
`
[0047] Bandwidth Efficiency - is defined as the percentage of contributory data transferred between two points. Contributory data is data that actually participates in the recipient's processing.

[0048] Bandwidth Utilization - is defined as the percentage of maximum bandwidth between two points that is actually used to pass contributory data.
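The two definitions reduce to simple ratios; the sketch below restates them with invented transfer counts (not figures from the application):

```python
def bandwidth_efficiency(contributory_words, transferred_words):
    """Percentage of the data actually transferred that is contributory."""
    return 100.0 * contributory_words / transferred_words

def bandwidth_utilization(contributory_words, max_words):
    """Percentage of the maximum possible transfer that is contributory."""
    return 100.0 * contributory_words / max_words

# Example: 256 contributory words out of 1,024 moved, over a link that
# could have moved 2,048 words in the same interval.
print(bandwidth_efficiency(256, 1024))   # 25.0
print(bandwidth_utilization(256, 2048))  # 12.5
```

Note the distinct denominators: efficiency is measured against what was moved, utilization against what the link could have moved.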
`
2. Description
`
[0049] A reconfigurable processor (RP) 100 implements direct executable logic (DEL) to perform computation, as well as a memory hierarchy for maintaining input data and computational results. DEL is an assemblage of dynamically reconfigurable functional elements that enables a program to establish an optimized interconnection among selected functional units in order to implement a desired computational, data prefetch and/or data access functionality for maximizing the parallelism inherent in the particular code. The term DEL may also be used to refer to the set of constructs such as code, data, configuration variables, and the like that can be loaded into RP 100 to cause RP 100 to implement a particular assemblage of functional elements.
`
[0050] Figure 1 presents an RP 100, which may be implemented using field programmable gate arrays (FPGAs) or other reconfigurable logic devices, that can be configured and reconfigured to contain functional units and interconnecting circuits, and a memory hierarchy comprising on-board memory banks 104, on-chip block RAM 106, registers, wires, and a connection 108 to external memory. On-chip reconfigurable components 102 create memory structures such as registers, FIFOs, wires and arrays using block RAM. Dual-ported memory 106 is shared between on-chip reconfigurable components 102. The reconfigurable processor 100 also implements user-defined computational
`
`
`
`
logic (e.g., such as DEL 200 shown in Figure 2) constructed by programming an FPGA to implement a particular interconnection of computational functional units. In a particular implementation, a number of RPs 100 are implemented within a memory subsystem of a conventional computer, such as on devices that are physically installed in dual inline memory module (DIMM) sockets of a computer. In this manner the RPs 100 can be accessed by memory operations and so coexist well with a more conventional hardware platform. It should be noted that, although the exemplary implementation of the present invention illustrated includes six banks of dual-ported memory 104 and two reconfigurable components 102, any number of memory banks and/or reconfigurable components may be used depending upon the particular implementation or application.
`
`[0051] Any computer program,
`including complex graphics processing
`programs, word processing programs, database programs and the like, is a
`collection of algorithms that interact to implement desired functionality.
`In the
`common case in which static computing hardware resources are used (e.g., a
`
`conventional microprocessor), the computer program is compiled into a set of
`executable code (i.e., object code) units that are linked together to implement
`the computer program on the particular hardware resources. The executable
`
`code is generated specifically for a particular hardware platform.
`In this
`manner, the computer program is adapted to conform to the limitations of the
`static hardware platform. However, the compilation process makes many
`compromises based on the limitations of the static hardware platform.
`
[0052] Alternatively, an algorithm can be defined in a high level language then compiled into DEL. DEL can be produced via a compiler from high level programming languages such as C or FORTRAN, or may be designed using a hardware definition language such as Verilog, VHDL or a schematic capture tool. Computation is performed by reconfiguring a reconfigurable processor with the DEL and flowing data through the computation. In this manner, the
`
`
`
`
`hardware resources are essentially adapted to conform to the program rather
`than the program being adapted to conform to the hardware resources.
`
[0053] For purposes of this description a single reconfigurable processor will be presented first. A sample of computational logic 201 is shown in Figure 2. This simple assemblage of functional units performs computation of two results ("A+B" and "A+B-(B*C)") from three input variables or operands "A", "B" and "C". In practice, computational units 201 can be implemented to perform very simple or arbitrarily complex computations. The input variables (operands) and output or result variables may be of any size necessary for a particular application. Theoretically, any number of operands and result variables may be used/generated by a particular DEL. Great complexity of computation can be supported by adding additional reconfigurable chips and processors.
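A sequential software sketch of the Figure 2 assemblage (illustrative only; in the RP these are parallel pipelined functional units producing both results each clock, not sequential statements):

```python
def figure2_logic(a, b, c):
    add = a + b      # adder functional unit: A+B
    mul = b * c      # multiplier functional unit: B*C
    sub = add - mul  # subtractor fed by the other two units: A+B-(B*C)
    return add, sub  # the two results of the assemblage

print(figure2_logic(2, 3, 4))  # (5, -7)
```

The dataflow structure, with the adder and multiplier feeding the subtractor, mirrors the interconnection of functional units that the DEL would instantiate in hardware.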
`
`[0054] For greatest performance the DEL 200 is constructed as parallel
`pipelined logic blocks composed of computational functional units capable of
`
`taking data and producing results with each clock pulse. The highest possible
`
`performance that can be achieved is computation of a set of results with each
`
`clock pulse. To achieve this, data should be available at the same rate the
`