`US007149867B2
`
(12) United States Patent
`Poznanovic et al.
`
(10) Patent No.:
`(45) Date of Patent:
`
`US 7,149,867 B2
`Dec. 12, 2006
`
`(54) SYSTEM AND METHOD OF ENHANCING
`EFFICIENCY AND UTILIZATION OF
`MEMORY BANDWIDTH IN
`RECONFIGURABLE HARDWARE
`
`(75)
`
`Inventors: Daniel Poznanovic, Colorado Springs,
`CO (US); David E. Caliga, Colorado
`Springs, CO (US); Jeffrey Hammes,
`Colorado Springs, CO (US)
`
`(73) Assignee: SRC Computers, Inc., Colorado
`Springs, CO (US)
`
`( *) Notice:
`
`Subject to any disclaimer, the term of this
`patent is extended or adjusted under 35
U.S.C. 154(b) by 0 days.
`
`(21) Appl. No.: 10/869,200
`
`(22) Filed:
`
`Jun. 16, 2004
`
`(65)
`
`Prior Publication Data
`
US 2004/0260884 A1
`
`Dec. 23, 2004
`
`Related U.S. Application Data
`
`(60) Provisional application No. 60/479,339, filed on Jun.
`18, 2003.
`
`(51)
`
Int. Cl.
G06F 12/00 (2006.01)
`(52) U.S. Cl. ....................................... 711/170; 711/154
`(58) Field of Classification Search ........ 711/170-173;
`712/15
`See application file for complete search history.
`
`(56)
`
`References Cited
`
U.S. PATENT DOCUMENTS

6,076,152 A       6/2000  Ruppenthal et al.
6,243,791 B1*     6/2001  Vondran, Jr. .............. 711/120
6,247,110 B1      6/2001  Ruppenthal et al.
6,460,122 B1*    10/2002  Otterness et al. .......... 711/154
6,507,898 B1*     1/2003  Gibson et al. ............. 711/168
6,594,736 B1      6/2003  Parks
6,714,041 B1*     3/2004  Darling et al. ............. 326/38
2003/0046492 A1*  3/2003  Gschwind et al. ........... 711/118
2003/0046530 A1*  3/2003  Poznanovic ................ 713/100
2003/0084244 A1*  5/2003  Paulraj ................... 711/118
2003/0088737 A1*  5/2003  Burton .................... 711/118
2003/0208658 A1* 11/2003  Magoshi ................... 711/122
2005/0044327 A1*  2/2005  Howard et al. ............. 711/147
`
`OTHER PUBLICATIONS
`
"Summary: The Cache Read/Write Process," The PC Guide, 2001,
www.pcguide.com/ref/mbsys/cache/func.htm.*
Chien et al., "Safe and Protected Execution for the Morph/AMRM
Reconfigurable Processor," IEEE, 1999, pp. 1-13.*
`IEEE 100: The Authoritative Dictionary of IEEE Standards Terms,
`Standards Information Network, 2000, pp. 874.*
`
`* cited by examiner
`
Primary Examiner-Gary Portka
Assistant Examiner-Shane M. Thomas
(74) Attorney, Agent, or Firm-William J. Kubida; Michael
C. Martensen; Hogan & Hartson LLP
`
`(57)
`
`ABSTRACT
`
A reconfigurable processor that includes a computational
unit and a data prefetch unit coupled to the computational
unit, where the data prefetch unit retrieves data from a
memory and supplies the data to the computational unit
through memory and a data access unit, and where the data
prefetch unit, memory, and data access unit are configured by
a program. Also, a reconfigurable hardware system that
includes a common memory; and one or more reconfigurable
processors coupled to the common memory, where at
least one of the reconfigurable processors includes a data
prefetch unit to read and write data between the unit and the
common memory, and where the data prefetch unit is
configured by a program executed on the system. In addition,
a method of transferring data that includes transferring
data between a memory and a data prefetch unit in a
reconfigurable processor; and transferring the data between
a computational unit and the data prefetch unit.
`
`19 Claims, 12 Drawing Sheets
`
`300
`
`'
`'
`'
`'
`i I ME;ORY I
`
`.. :
`' ' ' ' ' '
`
`- - - - - - - - - - - - I
`
`TO DOWNSTREAM
`LOOIC
`
`Petitioners Amazon
`Ex. 1001, p. 1
`
`
`
[Drawing sheets 1 through 12 (FIGS. 1-14) are omitted from this text rendering;
they are described in the Brief Description of the Drawings. Recoverable labels
include: FIG. 1 and FIG. 2 (user logic 1 and 2, e.g. FPGAs, 102; dual-ported
memory 106; six banks of dual-ported on-board memory; computational logic for
A+B and A+B-(B*C)); FIG. 3 and FIG. 4 (logic block 300 with memory banks A-D
305, data access units LD_BANK_A/B/C and ST_BANK_D 303, MULT and ADD
functional units 301, block RAM memory 307, strided prefetch unit 401,
reconfigurable processor memory 403); FIG. 5 and FIG. 6 (stride-1 prefetch
units 501, 601, external memory, intelligent memory controller 603); FIG. 7
(stream units 703, strided prefetch units 701, 705).]
`
`
`
`SYSTEM AND METHOD OF ENHANCING
`EFFICIENCY AND UTILIZATION OF
`MEMORY BANDWIDTH IN
`RECONFIGURABLE HARDWARE
`
`RELATED APPLICATIONS
`
The present invention claims the benefit of U.S. Provisional
Patent Application Ser. No. 60/479,339, filed on Jun.
18, 2003, which is incorporated herein by reference in its
entirety.
`
`BACKGROUND OF THE INVENTION
`
`1. Field of the Invention
The present invention relates, in general, to enhancing the
efficiency and utilization of memory bandwidth in reconfigurable
hardware. More specifically, the invention relates to
implementing explicit memory hierarchies in reconfigurable
processors that make efficient use of off-board, on-board, and
on-chip storage and available algorithm locality. These
explicit memory hierarchies avoid many of the tradeoffs and
complexities found in the traditional memory hierarchies of
microprocessors.
`2. Relevant Background
`Over the past 30 years, microprocessors have enjoyed
`annual performance gains averaging about 50% per year.
`Most of the gains can be attributed to higher processor clock
`speeds, more memory bandwidth and increasing utilization
`of instruction level parallelism (ILP) at execution time.
As microprocessors and other dense logic devices (DLDs)
consume data at ever-increasing rates, it becomes more of a
challenge to design memory hierarchies that can keep up.
Two measures of the gap between the microprocessor and
memory hierarchy are bandwidth efficiency and bandwidth
utilization. Bandwidth efficiency refers to the ability to
exploit available locality in a program or algorithm. In the
ideal situation, when there is maximum bandwidth efficiency,
all available locality is utilized. Bandwidth utilization
refers to the amount of memory bandwidth that is
utilized during a calculation. Maximum bandwidth utilization
occurs when all available memory bandwidth is utilized.
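The two measures can be made concrete with a small sketch. The numbers below are hypothetical and not from the patent; they simply show how each percentage is computed from the amount of contributory data (data that actually participates in the computation), the amount of data transferred, and the peak bandwidth of the link.

```python
# Hypothetical example: 256 bytes were moved but only 64 contributed to the
# computation, over a link whose peak is 1024 bytes per unit time.

def bandwidth_efficiency(contributory_bytes, transferred_bytes):
    """Percentage of transferred data that actually contributes to processing."""
    return 100.0 * contributory_bytes / transferred_bytes

def bandwidth_utilization(contributory_bytes, peak_bytes):
    """Percentage of maximum bandwidth actually used for contributory data."""
    return 100.0 * contributory_bytes / peak_bytes

print(bandwidth_efficiency(64, 256))    # 25.0
print(bandwidth_utilization(64, 1024))  # 6.25
```

Note that a transfer can be efficient (little wasted data) yet still underutilize the link, and vice versa; the two measures are independent.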
Potential performance gains from using a faster microprocessor
can be reduced or even negated by a corresponding
drop in bandwidth efficiency and bandwidth utilization.
Thus, there has been significant effort spent on the development
of memory hierarchies that can maintain high bandwidth
efficiency and utilization with faster microprocessors.
One approach to improving bandwidth efficiency and
utilization in memory hierarchies has been to develop ever
more powerful processor caches. These caches are high-speed
memories (typically SRAM) in close proximity to the
microprocessor that try to keep copies of instructions and
data the microprocessor may soon need. The microprocessor
can store and retrieve data from the cache at a much higher
rate than from a slower, more distant main memory.
In designing cache memories, there are a number of
considerations to take into account. One consideration is the
width of the cache line. Caches are arranged in lines to help
hide memory latency and exploit spatial locality. When a
load suffers a cache miss, a new cache line is loaded from
main memory into the cache. The assumption is that a
program being executed by the microprocessor has a high
degree of spatial locality, making it likely that other memory
locations in the cache line will also be required.
For programs with a high degree of spatial locality (e.g.,
stride-one access), wide cache lines are more efficient since
they reduce the number of times a processor has to suffer the
latency of a memory access. However, for programs with
lower levels of spatial locality, or random access, narrow
lines are best as they reduce the wasted bandwidth from the
unused neighbors in the cache line. Caches designed with
wide cache lines perform well with programs that have a
high degree of spatial locality, but generally have poor
gather/scatter performance. Likewise, caches with short
cache lines have good gather/scatter performance, but lose
efficiency executing programs with high spatial locality
because of the additional runs to the main memory.
Another consideration in cache design is cache associativity,
which refers to the mapping between locations in
main memory and cache sectors. At one extreme of cache
associativity is a direct-mapped cache, while at another
extreme is a fully associative cache. In a direct-mapped
cache, a specific memory location can be mapped to only a
single cache line. Direct-mapped caches have the advantage
of being fast and easy to construct in logic. The disadvantage
is that they suffer the maximum number of cache conflicts.
At the other extreme, a fully associative cache allows a
specific location in memory to be mapped to any cache line.
Fully associative caches tend to be slower and more complex
due to the large amount of comparison logic they need,
but suffer no cache conflict misses. Oftentimes, caches fall
between the extremes of direct-mapped and fully associative
caches. A design point between the extremes is a k-set
associative cache, where each memory location can map to
k cache sectors. These caches generally have less overhead
than fully associative caches, and reduce cache conflicts by
increasing the value of k.
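The mapping from a memory address to a cache set can be sketched as follows. This is our own illustration, not text from the patent, and the parameters (line size, number of sets, k) are hypothetical.

```python
# Illustrative k-set associative address mapping (hypothetical parameters).
LINE_SIZE = 64   # bytes per cache line
NUM_SETS = 128   # number of sets in the cache
K = 4            # lines per set (the associativity)

def cache_set(address):
    """An address may reside in any of the K lines of exactly one set."""
    line_number = address // LINE_SIZE   # which cache-line-sized block
    return line_number % NUM_SETS        # which set that block maps to

# Direct-mapped is the K = 1 special case; fully associative is NUM_SETS = 1
# with K equal to the total number of lines.
print(cache_set(0))         # set 0
print(cache_set(64 * 128))  # wraps back to set 0: a potential conflict
```

Two addresses whose block numbers differ by a multiple of NUM_SETS compete for the same k lines, which is exactly the conflict that raising k relieves.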
Another consideration in cache design is how cache lines
are replaced due to a capacity or conflict miss. In a direct-mapped
cache, there is only one possible cache line that can
be replaced due to a miss. However, in caches with higher
levels of associativity, cache lines can be replaced in more
than one way. The way the cache lines are replaced is
referred to as the replacement policy.
Options for the replacement policy include least recently
used (LRU), random replacement, and first in-first out
(FIFO). LRU is used in the majority of circumstances where
the temporal locality set is smaller than the cache size, but
it is normally more expensive to build in hardware than a
random replacement cache. An LRU policy can also quickly
degrade depending on the working set size. For example,
consider an iterative application with a matrix size of N
bytes running through an LRU cache of size M bytes. If N is
less than M, then the policy has the desired behavior of
100% cache hits; however, if N is only slightly larger than
M, the LRU policy results in 0% cache hits as lines are
removed just as they are needed.
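The working-set cliff described above is easy to reproduce in a few lines. The simulation below is our own sketch (the sizes are hypothetical; only N relative to M matters): an LRU cache of M lines is swept repeatedly and sequentially over a working set of N lines.

```python
# Demonstrates the LRU cliff: N <= M gives 100% hits, N slightly > M gives 0%.
from collections import OrderedDict

def lru_hit_rate(n_lines, m_lines, sweeps=10):
    cache = OrderedDict()
    hits = accesses = 0
    # Sweep 0 warms the cache; hits are counted on the later sweeps.
    for sweep in range(sweeps + 1):
        for line in range(n_lines):
            if sweep > 0:
                accesses += 1
            if line in cache:
                cache.move_to_end(line)        # mark as most recently used
                if sweep > 0:
                    hits += 1
            else:
                cache[line] = True
                if len(cache) > m_lines:
                    cache.popitem(last=False)  # evict the least recently used
    return 100.0 * hits / accesses

print(lru_hit_rate(90, 100))   # fits in the cache: 100.0
print(lru_hit_rate(101, 100))  # one line too many: 0.0
```

With 101 lines in a 100-line cache, each sequential access evicts exactly the line that will be needed soonest, so the hit rate collapses to zero rather than degrading gracefully.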
Another consideration is deciding on a write policy for the
cache. Write-through caches send data through the cache
hierarchy to main memory. This policy reduces cache coherency
issues for multiple processor systems and is best suited
for data that will not be re-read by the processor in the
immediate future. In contrast, write-back caches place a
copy of the data in the cache, but do not immediately
update main memory. This type of caching works best when
data just written to the cache is quickly requested again by
the processor.
In addition to write-through and write-back caches,
another kind of write policy is implemented in a write-allocate
cache, where a cache line is allocated on a write that
misses in cache. Write-allocate caches improve performance
`
when the microprocessor exhibits a lot of write followed by
read behavior. However, when writes are not subsequently
read, a write-allocate cache has a number of disadvantages:
when a cache line is allocated, it is necessary to read the
remaining values from main memory to complete the cache
line. This adds unnecessary memory read traffic during store
operations. Also, when the data is not read again, potentially
useful data in the cache is displaced by the unused data.
Another consideration is made between the size and the
speed of the cache: small caches are typically much faster
than larger caches, but store less data and fewer instructions.
Less data means a greater chance the cache will not have
data the microprocessor is requesting (i.e., a cache miss),
which can slow everything down while the data is being
retrieved from the main memory.
Newer cache designs reduce the frequency of cache
misses by trying to predict in advance the data that the
microprocessor will request. An example of this type of
cache is one that supports speculative execution and branch
prediction. Speculative execution allows instructions that
likely will be executed to start early based on branch
prediction. Results are stored in a cache called a reorder
buffer and retired if the branch was correctly predicted. Of
course, when mis-predictions occur, instruction and data
bandwidth are wasted.
There are additional considerations and tradeoffs in cache
design, but it should be apparent from the considerations
described hereinbefore that it is very difficult to design a
single cache structure that is optimized for many different
programs. This makes cache design particularly challenging
for a multipurpose microprocessor that executes a wide
variety of programs. Cache designers try to derive the
program behavior of an "average" program constructed from
several actual programs that run on the microprocessor. The
cache is optimized for the average program, but no actual
program behaves exactly like the average program. As a
result, the designed cache ends up being sub-optimal for
nearly every program actually executed by the microprocessor.
Thus, there is a need for memory hierarchies that
have data storage and retrieval characteristics that are optimized
for actual programs executed by a processor.
Designers trying to develop ever more efficient caches
optimized for a variety of actual programs also face another
problem: as caches add additional features, the overhead
needed to implement the added features also grows. Caches
today have so much overhead that microprocessor performance
may be reaching a point of diminishing returns as the
overhead starts to cut into performance. In the Intel Pentium
III processor, for example, more than half of the 10 million
transistors are dedicated to instruction cache, branch prediction,
out-of-order execution and superscalar logic. The
situation has prompted predictions that as microprocessors
grow to a billion transistors per chip, performance increases
will drop to about 20% per year. Such a prediction, if borne
out, could have a significant impact on technology growth
and the computer business.
Thus, there is a growing need to develop improved
memory hierarchies that limit the overhead of a memory
hierarchy without also reducing bandwidth efficiency and
utilization.

SUMMARY OF THE INVENTION

Accordingly, an embodiment of the invention includes a
reconfigurable processor that includes a computational unit
and a data access unit coupled to the computational unit,
where the data access unit retrieves data from an on-processor
memory and supplies the data to the computational
unit, and where the computational unit and the data
access unit are configured by a program.
The present invention also involves a reconfigurable
processor that includes a first memory of a first type and a
data prefetch unit coupled to the memory, where the data
prefetch unit retrieves data from a second memory of a
second type different from the first type, and the first and
second memory types and the data prefetch unit are configured
by a program.
Another embodiment of the invention includes a reconfigurable
hardware system that includes a common memory,
also referred to as external memory, and one or more
reconfigurable processors coupled to the common memory,
where at least one of the reconfigurable processors includes
a data prefetch unit to read and write data between the unit
and the common memory, and where the data prefetch unit
is configured by a program executed on the system.
Another embodiment of the invention includes a method
of transferring data that includes transferring data between a
memory and a data prefetch unit in a reconfigurable processor,
transferring data between the prefetch unit and a data
access unit, and transferring the data between a computational
unit and the data access unit, where the computational
unit, data access unit and the data prefetch unit are configured
by a program.
Additional embodiments of the invention are set forth in
part in the description that follows, and in part will become
apparent to those skilled in the art upon examination of the
following specification, or may be learned by the practice of
the invention. The advantages of the invention may be
realized and attained by means of the instrumentalities,
combinations, compositions, and methods particularly
pointed out in the appended claims.

BRIEF DESCRIPTION OF THE DRAWINGS

FIG. 1 shows a reconfigurable processor in which the
present invention may be implemented;
FIG. 2 shows computational logic as might be loaded into
a reconfigurable processor;
FIG. 3 shows a reconfigurable processor as in FIG. 1, but
with the addition of data access units;
FIG. 4 shows a reconfigurable processor as in FIG. 3, but
with the addition of data prefetch units;
FIG. 5 shows a reconfigurable processor with the inclusion
of external memory;
FIG. 6 shows reconfigurable processors with external
memory and with an intelligent memory controller;
FIG. 7 shows a reconfigurable processor having a combination
of data prefetch units and data access units feeding
computational logic;
FIG. 8 shows the bandwidth efficiency and utilization
gains obtained when utilizing a data prefetch unit and an
intelligent memory controller to perform strided memory
references;
FIG. 9A and FIG. 9B show the bandwidth efficiency and
utilization gains obtained when utilizing a data prefetch unit
and an intelligent memory controller to perform subset
memory references in the X-Y plane;
FIG. 10A and FIG. 10B show the bandwidth efficiency
and utilization gains obtained when utilizing a data prefetch
unit and an intelligent memory controller to perform subset
memory references in the X-Z plane;
FIG. 11A and FIG. 11B show the bandwidth efficiency
and utilization gains obtained when utilizing a data prefetch
`
`
`
unit and an intelligent memory controller to perform subset
memory references in the Y-Z plane;
FIG. 12A and FIG. 12B show the bandwidth efficiency
and utilization gains obtained when utilizing a data prefetch
unit and an intelligent memory controller to perform subset
memory references in a mini-cube;
FIG. 13 shows the bandwidth efficiency and utilization
gains obtained when utilizing a data prefetch unit and an
intelligent memory controller to perform indirect memory
references;
FIG. 14 shows the bandwidth efficiency and utilization
gains obtained when utilizing a data prefetch unit and an
intelligent memory controller to perform strided memory
references together with computation.
`
`DETAILED DESCRIPTION
`
1. Definitions:
Direct execution logic (DEL)-is an assemblage of
dynamically reconfigurable functional elements that enables
a program to establish an optimized interconnection among
selected functional units in order to implement a desired
computational, data prefetch and/or data access functionality
for maximizing the parallelism inherent in the particular
code.
Reconfigurable Processor-is a computing device that
contains reconfigurable components such as FPGAs and
can, through reconfiguration, instantiate an algorithm as
hardware.
Reconfigurable Logic-is composed of an interconnection
of functional units, control, and storage that implements
an algorithm and can be loaded into a Reconfigurable
Processor.
Functional Unit-is a set of logic that performs a specific
operation. The operation may, for example, be arithmetic,
logical, control, or data movement. Functional units are used
as building blocks of reconfigurable logic.
Macro-is another name for a functional unit.
Memory Hierarchy-is a collection of memories.
Data prefetch Unit-is a functional unit that moves data
between members of a memory hierarchy. The movement
may be as simple as a copy, or as complex as an indirect
indexed strided copy into a unit stride memory.
Data access Unit-is a functional unit that accesses a
component of a memory hierarchy, and delivers data directly
to computational logic.
Intelligent Memory Control Unit-is a control unit that
has the ability to select data from its storage according to a
variety of algorithms that can be selected by a data requester,
such as a data prefetch unit.
Bandwidth Efficiency-is defined as the percentage of
contributory data transferred between two points. Contributory
data is data that actually participates in the recipient's
processing.
Bandwidth Utilization-is defined as the percentage of
maximum bandwidth between two points that is actually
used to pass contributory data.
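The range of data movement a data prefetch unit can perform, from a plain copy up to an indirect indexed strided copy into unit-stride memory, can be sketched in software. This is our own illustration of those access patterns; the function names and parameters are hypothetical, not the patent's.

```python
# Hypothetical sketches of prefetch-unit data movement. A "unit-stride memory"
# is modeled as a contiguous Python list that the gathered elements land in.

def strided_copy(src, base, stride, count):
    """Gather `count` elements starting at `base`, stepping by `stride`,
    into a new unit-stride (contiguous) buffer."""
    return [src[base + i * stride] for i in range(count)]

def indirect_indexed_strided_copy(src, index_table, stride):
    """Indirect variant: an index table selects which strided elements
    to gather, again landing in a contiguous buffer."""
    return [src[idx * stride] for idx in index_table]

memory = list(range(100))  # stand-in for a memory bank
print(strided_copy(memory, 2, 10, 5))                        # [2, 12, 22, 32, 42]
print(indirect_indexed_strided_copy(memory, [7, 3, 9], 10))  # [70, 30, 90]
```

The point of both patterns is that only contributory elements cross the link; the scattered neighbors that a wide cache line would have dragged along are never transferred.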
2. Description
A reconfigurable processor (RP) 100 implements direct
executable logic (DEL) to perform computation, as well as a
memory hierarchy for maintaining input data and computational
results. DEL is an assemblage of dynamically reconfigurable
functional elements that enables a program to
establish an optimized interconnection among selected functional
units in order to implement a desired computational,
data prefetch and/or data access functionality for maximizing
the parallelism inherent in the particular code. The term
`
`6
`DEL may also be used to refer to the set of constructs such
`as code, data, configuration variables, and the like that can
`be loaded into RP 100 to cause RP 100 to implement a
`particular assemblage of functional elements.
`FIG. 1 presents an RP 100, which may be implemented
`using field programmable gate arrays (FPGAs) or other
`reconfigurable logic devices, that can be configured and
`reconfigured to contain functional units and interconnecting
`circuits, and a memory hierarchy comprising on-board
`10 memory banks 104, on-chip block RAM 106, registers
`wires, and a connection 108 to external memory. On-chip
`reconfigurable components 102 create memory structures
`such as registers, FIFOs, wires and arrays using block RAM.
`Dual-ported memory 106 is shared between on-chip recon-
`15 figurable components 102. The reconfigurable processor 100
`also implements user-defined computational logic ( e.g., such
`as DEL 200 shown in FIG. 2) constructed by programming
`an FPGA to implement a particular interconnection of
`computational functional units. In a particular implementa-
`20 tion, a number of RPs 100 are implemented within a
`memory subsystem of a conventional computer, such as on
`devices that are physically installed in dual inline memory
`module (DIMM) sockets of a computer. In this marmer the
`RPs 100 can be accessed by memory operations and so
`25 coexist well with a more conventional hardware platform. It
`should be noted that, although the exemplary implementa(cid:173)
`tion of the present invention illustrated includes six banks of
`dual ported memory 104 and two reconfigurable compo(cid:173)
`nents 102, any number of memory banks and/or reconfig-
`30 urable components may be used depending upon the par(cid:173)
`ticular implementation or application.
`Any computer program, including complex graphics pro(cid:173)
`cessing programs, word processing programs, database pro(cid:173)
`grams and the like, is a collection of algorithms that interact
`35 to implement desired functionality. In the common case in
`which static computing hardware resources are used ( e.g., a
`conventional microprocessor), the computer program is
`compiled into a set of executable code (i.e., object code)
`units that are linked together to implement the computer
`40 program on the particular hardware resources. The execut(cid:173)
`able code is generated specifically for a particular hardware
`platform. In this marmer, the computer program is adapted
`to conform to the limitations of the static hardware platform.
`However, the compilation process makes many compro-
`45 mises based on the limitations of the static hardware plat(cid:173)
`form.
Alternatively, an algorithm can be defined in a high level
language then compiled into DEL. DEL can be produced via
a compiler from high level programming languages such as
C or FORTRAN, or may be designed using a hardware
definition language such as Verilog, VHDL or a schematic
capture tool. Computation is performed by reconfiguring a
reconfigurable processor with the DEL and flowing data
through the computation. In this manner, the hardware
resources are essentially adapted to conform to the program
rather than the program being adapted to conform to the
hardware resources.
For purposes of this description a single reconfigurable
processor will be presented first. A sample of computational
logic 201 is shown in FIG. 2. This simple assemblage of
functional units performs computation of two results ("A+B"
and "A+B-(B*C)") from three input variables or operands
"A", "B" and "C". In practice, computational units 201 can
be implemented to perform very simple or arbitrarily complex
computations. The input variables (operands) and output
or result variables may be of any size necessary for a
particular application. Theoretically, any number of operands
and result variables may be used/generated by a
particular DEL. Great complexity of computation can be
supported by adding additional reconfigurable chips and
processors.
For greatest performance the DEL 200 is constructed as
parallel pipelined logic blocks composed of computational
functional units capable of taking data and producing results
with each clock pulse. The highest possible performance that
can be achieved is computation of a set of results with each
clock pulse. To achieve this, data should be available at the
same rate the computation can consume the data. The rate at
which data can be supplied to DEL 200 is determined, at
least in significant part, by the memory bandwidth utilization
and efficiency. Maximal computational performance
can be achieved with parallel and pipelined DEL together
with maximizing the memory bandwidth utilization and
efficiency. Unlike conventional static hardware platforms,
however, the memory hierarchy provided in an RP 100 is
reconfigurable. In accordance with the present invention,
through the use of data access units and associated memory
hierarchy components, computational demands and memory
bandwidth can be matched.
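The dataflow of the FIG. 2 example can be sketched in software. This is our own model, not the patent's implementation: each operand set flowing through the functional-unit graph yields the result pair (A+B, A+B-(B*C)), the way the pipelined hardware would produce one pair per clock pulse once the pipeline is full.

```python
# Software model of the FIG. 2 functional-unit graph: an ADD and a MULT
# operate in parallel, and a second ADD (here, a subtract) combines them.

def del_pipeline(stream):
    """`stream` yields (a, b, c) operand sets; one result pair per set."""
    for a, b, c in stream:
        s = a + b           # ADD functional unit: A+B
        p = b * c           # MULT functional unit, in parallel: B*C
        yield s, s - p      # combine: A+B-(B*C)

inputs = [(1, 2, 3), (4, 5, 6)]
print(list(del_pipeline(inputs)))  # [(3, -3), (9, -21)]
```

In hardware the ADD and MULT stages run concurrently on different operand sets, so once primed the pipeline emits a result pair on every clock, provided the memory hierarchy can deliver operands at that rate.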
`High memory bandwidth efficiency is achieved when only
`data required for computation is moved within the memory
`hierarchy. FIG. 3 shows a simple logic block 300 comprising
`computational functional units 301, control (not shown), and
`data access functional units 303. The data access unit 303
`presents data directly to the computational logic 301. In this
`manner, data is moved from a memory