(12) United States Patent
Poznanovic et al.

(10) Patent No.: US 7,149,867 B2
(45) Date of Patent: Dec. 12, 2006
`
`(54) SYSTEM AND METHOD OF ENHANCING
`EFFICIENCY AND UTILIZATION OF
`MEMORY BANDWIDTH IN
`RECONFIGURABLE HARDWARE
`
(75) Inventors: Daniel Poznanovic, Colorado Springs, CO (US);
David E. Caliga, Colorado Springs, CO (US);
Jeffrey Hammes, Colorado Springs, CO (US)

(73) Assignee: SRC Computers, Inc., Colorado Springs, CO (US)
`
( * ) Notice: Subject to any disclaimer, the term of this
patent is extended or adjusted under 35
U.S.C. 154(b) by 0 days.
`
`(21) Appl. No.: 10/869,200
`
(22) Filed: Jun. 16, 2004

(65) Prior Publication Data
US 2004/0260884 A1    Dec. 23, 2004
Related U.S. Application Data

(60) Provisional application No. 60/479,339, filed on Jun.
18, 2003.

(51) Int. Cl.
G06F 12/00 (2006.01)
(52) U.S. Cl. ....................................... 711/170; 711/154
(58) Field of Classification Search ........ 711/170-173,
712/15
See application file for complete search history.
`
(56) References Cited

U.S. PATENT DOCUMENTS

 6,076,152 A      6/2000  Huppenthal et al.
 6,243,791 B1*    6/2001  Vondran, Jr. ................ 711/120
 6,247,110 B1     6/2001  Huppenthal et al.
 6,460,122 B1*   10/2002  Otterness et al. ............ 711/154
 6,507,898 B1*    1/2003  Gibson et al. ............... 711/168
 6,594,736 B1     6/2003  Parks
 6,714,041 B1*    3/2004  Darling et al. .............. 326/38
2003/0046492 A1*  3/2003  Gschwind et al. ............. 711/118
2003/0046530 A1*  3/2003  Poznanovic .................. 713/100
2003/0084244 A1*  5/2003  Paulraj ..................... 711/118
2003/0088737 A1*  5/2003  Burton ...................... 711/118
2003/0208658 A1* 11/2003  Magoshi ..................... 711/122
2005/0044327 A1*  2/2005  Howard et al. ............... 711/147
`
`OTHER PUBLICATIONS
`
`“Summary: The Cache Read/Write Process,” The PC Guide, 2001,
`www.pcguide.com/ref/mbsys/cache/func.htm.*
Chien et al., “Safe and Protected Execution for the Morph/AMRM
Reconfigurable Processor,” IEEE, 1999, pp. 143.*
IEEE 100: The Authoritative Dictionary of IEEE Standards Terms,
Standards Information Network, 2000, pp. 874.*
`* cited by examiner
`
Primary Examiner - Gary Portka
Assistant Examiner - Shane M. Thomas
(74) Attorney, Agent, or Firm - William J. Kubida; Michael
C. Martensen; Hogan & Hartson LLP
`
(57) ABSTRACT

A reconfigurable processor that includes a computational
unit and a data prefetch unit coupled to the computational
unit, where the data prefetch unit retrieves data from a
memory and supplies the data to the computational unit
through memory and a data access unit, and where the data
prefetch unit, memory, and data access unit are configured
by a program. Also, a reconfigurable hardware system that
includes a common memory; and one or more reconfig-
urable processors coupled to the common memory, where at
least one of the reconfigurable processors includes a data
prefetch unit to read and write data between the unit and the
common memory, and where the data prefetch unit is
configured by a program executed on the system. In addi-
tion, a method of transferring data that includes transferring
data between a memory and a data prefetch unit in a
reconfigurable processor; and transferring the data between
a computational unit and the data prefetch unit.
`19 Claims, 12 Drawing Sheets
`
[Representative drawing: a reconfigurable processor memory hierarchy with on-board memory banks A-F, strided data prefetch units, block RAM memory, and paths to upstream reconfigurable processor memory and downstream logic.]
`
[Drawing sheets 1-12, FIGS. 1-14, omitted; the figures are described in the Brief Description of the Drawings below.]
`
`SYSTEM AND METHOD OF ENHANCING
`EFFICIENCY AND UTILIZATION OF
`MEMORY BANDWIDTH IN
`RECONFIGURABLE HARDWARE
`
`RELATED APPLICATIONS
`
The present invention claims the benefit of U.S. Provisional
Patent Application Ser. No. 60/479,339, filed on Jun. 18,
2003, which is incorporated herein by reference in its
entirety.
`
`BACKGROUND OF THE INVENTION
`
`1. Field of the Invention
`
The present invention relates, in general, to enhancing the
efficiency and utilization of memory bandwidth in reconfig-
urable hardware. More specifically, the invention relates to
implementing explicit memory hierarchies in reconfigurable
processors that make efficient use of off-board, on-board,
and on-chip storage and available algorithm locality. These
explicit memory hierarchies avoid many of the tradeoffs and
complexities found in the traditional memory hierarchies of
microprocessors.
`2. Relevant Background
Over the past 30 years, microprocessors have enjoyed
performance gains averaging about 50% per year.
`Most of the gains can be attributed to higher processor clock
`speeds, more memory bandwidth and increasing utilization
`of instruction level parallelism (ILP) at execution time.
As microprocessors and other dense logic devices (DLDs)
consume data at ever-increasing rates, it becomes more of a
challenge to design memory hierarchies that can keep up.
Two measures of the gap between the microprocessor and
the memory hierarchy are bandwidth efficiency and band-
width utilization. Bandwidth efficiency refers to the ability
to exploit available locality in a program or algorithm. In the
ideal situation, when there is maximum bandwidth effi-
ciency, all available locality is utilized. Bandwidth utiliza-
tion refers to the amount of memory bandwidth that is
utilized during a calculation. Maximum bandwidth utiliza-
tion occurs when all available memory bandwidth is uti-
lized.
`
Potential performance gains from using a faster micro-
processor can be reduced or even negated by a correspond-
ing drop in bandwidth efficiency and bandwidth utilization.
Thus, significant effort has been spent on the development
of memory hierarchies that can maintain high bandwidth
efficiency and utilization with faster microprocessors.
One approach to improving bandwidth efficiency and
`utilization in memory hierarchies has been to develop ever
`more powerful processor caches. These caches are high-
`speed memories (typically SRAM) in close proximity to the
`microprocessor that try to keep copies of instructions and
`data the microprocessor may soon need. The microprocessor
`can store and retrieve data from the cache at a much higher
`rate than from a slower, more distant main memory.
In designing cache memories, there are a number of
considerations to take into account. One consideration is the
width of the cache line. Caches are arranged in lines to help
hide memory latency and exploit spatial locality. When a
load suffers a cache miss, a new cache line is loaded from
main memory into the cache. The assumption is that a
program being executed by the microprocessor has a high
degree of spatial locality, making it likely that other memory
locations in the cache line will also be required.
`
`
For programs with a high degree of spatial locality (e.g.,
stride-one access), wide cache lines are more efficient since
they reduce the number of times a processor has to suffer the
latency of a memory access. However, for programs with
lower levels of spatial locality, or random access, narrow
lines are best as they reduce the wasted bandwidth from the
unused neighbors in the cache line. Caches designed with
wide cache lines perform well with programs that have a
high degree of spatial locality, but generally have poor
gather/scatter performance. Likewise, caches with short
cache lines have good gather/scatter performance, but lose
efficiency executing programs with high spatial locality
because of the additional trips to main memory.
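For a concrete sense of this tradeoff, the short C program below (a hypothetical illustration; the line and word sizes are assumptions chosen to match the stride example later in this specification) computes how much of the fetched bandwidth survives as useful data when a wide cache line meets a strided access pattern:

    #include <stdio.h>

    /* Hypothetical illustration (not from the patent): the fraction of
     * each fetched cache line that a strided access actually uses,
     * assuming 128-byte lines holding sixteen 8-byte words. */
    int main(void) {
        const int line_bytes = 128;   /* cache line width */
        const int word_bytes = 8;     /* one data word    */
        const int words_per_line = line_bytes / word_bytes;

        for (int stride = 1; stride <= words_per_line; stride *= 2) {
            /* A stride-1 sweep uses all 16 words of each line; a
             * stride of 16 words uses only one word per line fetched. */
            int words_used = words_per_line / stride;
            double efficiency = 100.0 * words_used * word_bytes / line_bytes;
            printf("stride %2d: %6.2f%% of fetched bandwidth is useful\n",
                   stride, efficiency);
        }
        return 0;
    }

A stride of one yields 100% efficiency; a stride of sixteen words (128 bytes) yields 6.25%, the figure revisited in the detailed description below.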
`Another consideration in cache design is cache associa-
`tivity, which refers to the mapping between locations in
`main memory and cache sectors. At one extreme of cache
associativity is a direct-mapped cache, while at the other
extreme is a fully associative cache. In a direct-mapped
cache, a specific memory location can be mapped to only a
`single cache line. Direct-mapped caches have the advantage
`of being fast and easy to construct in logic. The disadvantage
`is that they suffer the maximum number of cache conflicts.
`At the other extreme, a fully associative cache allows a
`specific location in memory to be mapped to any cache line.
`Fully associative caches tend to be slower and more com-
`plex due to the large amount of comparison logic they need,
`but suffer no cache conflict misses. Oftentimes, caches fall
`between the extremes of direct-mapped and fully associative
`caches. A design point between the extremes is a k-set
`associative cache, where each memory location can map to
`k cache sectors. These caches generally have less overhead
`than fully associative caches, and reduce cache conflicts by
`increasing the value of k.
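A minimal sketch of this mapping (the line size, set count, and names below are assumptions for illustration, not taken from the patent) shows where each design point sits; direct-mapped is the k=1 case, and fully associative is the single-set case:

    #include <stdint.h>

    #define LINE_BYTES 64u    /* assumed cache line size */
    #define NUM_SETS   256u   /* assumed number of sets  */

    /* Split an address into a set index and a tag; the offset within
     * the line is addr % LINE_BYTES. */
    static inline uint32_t cache_set(uint64_t addr) {
        return (uint32_t)((addr / LINE_BYTES) % NUM_SETS);  /* set index */
    }

    static inline uint64_t cache_tag(uint64_t addr) {
        return addr / (LINE_BYTES * NUM_SETS);              /* tag bits  */
    }

    /* A line may reside in any of the k ways of set cache_set(addr);
     * raising k reduces conflict misses at the cost of more
     * comparison logic. */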
`Another consideration in cache design is how cache lines
`are replaced due to a capacity or conflict miss. In a direct-
`mapped cache, there is only one possible cache line that can
be replaced due to a miss. However, in caches with higher
levels of associativity, cache lines can be replaced in more
than one way. The way in which cache lines are replaced is
referred to as the replacement policy.
Options for the replacement policy include least recently
used (LRU), random replacement, and first-in, first-out
(FIFO). LRU is used in the majority of circumstances where
the temporal locality set is smaller than the cache size, but
it is normally more expensive to build in hardware than a
random replacement cache. An LRU policy can also quickly
degrade depending on the working set size. For example,
consider an iterative application with a matrix of size N
bytes running through an LRU cache of size M bytes. If N
is less than M, then the policy has the desired behavior of
100% cache hits; however, if N is only slightly larger than
M, the LRU policy results in 0% cache hits, as lines are
evicted just as they are needed.
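The cliff described above is easy to reproduce. The sketch below (the cache size, pass count, and unit-stride sweep are assumptions for illustration) models a fully associative LRU cache and sweeps a working set of N lines through a cache of M lines:

    #include <stdio.h>
    #include <string.h>

    #define M 8   /* cache capacity in lines (assumed) */

    /* Minimal fully associative LRU model: entries[0] is the most
     * recently used line, entries[used-1] the least recently used. */
    static int entries[M];
    static int used;

    static int touch(int line) {              /* returns 1 on a hit */
        for (int i = 0; i < used; i++) {
            if (entries[i] == line) {         /* hit: move to front */
                memmove(&entries[1], &entries[0], (size_t)i * sizeof(int));
                entries[0] = line;
                return 1;
            }
        }
        if (used < M) used++;                 /* miss: evict the LRU */
        memmove(&entries[1], &entries[0], (size_t)(used - 1) * sizeof(int));
        entries[0] = line;
        return 0;
    }

    int main(void) {
        for (int n = M; n <= M + 1; n++) {    /* working set of n lines */
            used = 0;
            int hits = 0, refs = 0;
            for (int pass = 0; pass < 100; pass++)
                for (int line = 0; line < n; line++) {
                    hits += touch(line);
                    refs++;
                }
            printf("working set %d, cache %d: %.0f%% hits\n",
                   n, M, 100.0 * hits / refs);
        }
        return 0;
    }

With N equal to M, the steady state is all hits; with N one line larger, the line needed next is always the one just evicted, so every reference misses.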
Another consideration is deciding on a write policy for the
cache. Write-through caches send data through the cache
hierarchy to main memory. This policy reduces cache coher-
ency issues for multiple-processor systems and is best suited
for data that will not be re-read by the processor in the
immediate future. In contrast, write-back caches place a
copy of the data in the cache but do not immediately
update main memory. This type of caching works best when
data just written to the cache is quickly requested again by
the processor.
`In addition to write-through and write-back caches,
another kind of write policy is implemented in a write-
allocate cache, where a cache line is allocated on a write that
`
`misses in cache. Write-allocate caches improve performance
`
when the microprocessor exhibits a pattern of writes followed
by reads. However, when writes are not subsequently
read, a write-allocate cache has a number of disadvantages.
`When a cache line is allocated, it is necessary to read the
`remaining values from main memory to complete the cache
`line. This adds unnecessary memory read traffic during store
`operations. Also, when the data is not read again, potentially
`useful data in the cache is displaced by the unused data.
Another consideration is the tradeoff between the size and
the speed of the cache: small caches are typically much
faster than larger caches, but store less data and fewer
instructions. Less data means a greater chance the cache will
not have the data the microprocessor is requesting (i.e., a
cache miss), which can slow everything down while the data
is retrieved from main memory.
`Newer cache designs reduce the frequency of cache
`misses by trying to predict in advance the data that the
`microprocessor will request. An example of this type of
`cache is one that supports speculative execution and branch
`prediction. Speculative execution allows instructions that
`likely will be executed to start early based on branch
`prediction. Results are stored in a cache called a reorder
`buffer and retired if the branch was correctly predicted. Of
course, when mispredictions occur, instruction and data
`bandwidth are wasted.
There are additional considerations and tradeoffs in cache
design, but it should be apparent from the considerations
described hereinbefore that it is very difficult to design a
single cache structure that is optimized for many different
programs. This makes cache design particularly challenging
for a multipurpose microprocessor that executes a wide
variety of programs. Cache designers try to derive the
behavior of an “average” program constructed from several
actual programs that run on the microprocessor. The cache
is optimized for this average program, but no actual program
behaves exactly like it. As a result, the designed cache ends
up being sub-optimal for nearly every program actually
executed by the microprocessor. Thus, there is a need for
memory hierarchies that have data storage and retrieval
characteristics that are optimized for the actual programs
executed by a processor.
`Designers trying to develop ever more efficient caches
`optimized for a variety of actual programs also face another
`problem: as caches add additional features, the overhead
`needed to implement the added features also grows. Caches
`today have so much overhead that microprocessor perfor-
`mance may be reaching a point of diminishing returns as the
overhead starts to cut into performance. In the Intel Pentium
III processor, for example, more than half of the 10 million
`transistors are dedicated to instruction cache, branch pre-
`diction, out-of-order execution and superscalar logic. The
`situation has prompted predictions that as microprocessors
`grow to a billion transistors per chip, performance increases
`will drop to about 20% per year. Such a prediction, if borne
`out, could have a significant impact on technology growth
`and the computer business.
`Thus,
`there is a growing need to develop improved
`memory hierarchies that limit the overhead of a memory
`hierarchy without also reducing bandwidth efficiency and
`utilization.
`
`SUMMARY OF THE INVENTION
`
Accordingly, an embodiment of the invention includes a
reconfigurable processor that includes a computational unit
and a data access unit coupled to the computational unit,
where the data access unit retrieves data from an on-
processor memory and supplies the data to the computa-
tional unit, and where the computational unit and the data
access unit are configured by a program.
`The present
`invention also involves a reconfigurable
`processor that includes a first memory of a first type and a
`data prefetch unit coupled to the memory, where the data
`prefetch unit retrieves data from a second memory of a
`second type different from the first type, and the first and
`second memory types and the data prefetch unit are config-
`ured by a program.
Another embodiment of the invention includes a reconfig-
urable hardware system that includes a common memory,
`also referred to as external memory, and one or more
`reconfigurable processors coupled to the common memory,
`where at least one of the reconfigurable processors includes
`a data prefetch unit to read and write data between the unit
`and the common memory, and where the data prefetch unit
`is configured by a program executed on the system.
Another embodiment of the invention includes a method
of transferring data that includes transferring data between a
`memory and a data prefetch unit in a reconfigurable pro-
`cessor, transferring data between the prefetch unit and a data
`access unit, and transferring the data between a computa-
`tional unit and the data access unit, where the computational
`unit, data access unit and the data prefetch unit are config-
`ured by a program.
Additional embodiments of the invention are set forth in
part in the description that follows, and in part will become
`apparent to those skilled in the art upon examination of the
`following specification, or may be learned by the practice of
`the invention. The advantages of the invention may be
`realized and attained by means of the instrumentalities,
`combinations,
`compositions,
`and methods particularly
`pointed out in the appended claims.
`
`BRIEF DESCRIPTION OF THE DRAWINGS
`
`FIG. 1 shows a reconfigurable processor in which the
`present invention may be implemented;
`FIG. 2 shows computational logic as might be loaded into
`a reconfigurable processor;
`FIG. 3 shows a reconfigurable processor as in FIG. 1, but
`with the addition of data access units;
`FIG. 4 shows a reconfigurable processor as in FIG. 3, but
`with the addition of data prefetch units;
FIG. 5 shows a reconfigurable processor with the inclusion
of external memory;
`FIG. 6 shows reconfigurable processors with external
`memory and with an intelligent memory controller;
`FIG. 7 shows a reconfigurable processor having a com-
`bination of data prefetch units and data access units feeding
`computational logic;
`FIG. 8 shows the bandwidth efficiency and utilization
`gains obtained when utilizing a data prefetch unit and an
`intelligent memory controller to perform strided memory
`references;
`FIG. 9A and FIG. 9B show the bandwidth efficiency and
`utilization gains obtained when utilizing a data prefetch unit
`and an intelligent memory controller to perform subset
memory references in the X-Y plane;
`FIG. 10A and FIG. 10B show the bandwidth efficiency
`and utilization gains obtained when utilizing a data prefetch
`unit and an intelligent memory controller to perform subset
memory references in the X-Z plane;
`FIG. 11A and FIG. 11B show the bandwidth efficiency
`and utilization gains obtained when utilizing a data prefetch
`unit and an intelligent memory controller to perform subset
memory references in the Y-Z plane;
FIG. 12A and FIG. 12B show the bandwidth efficiency
and utilization gains obtained when utilizing a data prefetch
unit and an intelligent memory controller to perform subset
memory references in a mini-cube;
FIG. 13 shows the bandwidth efficiency and utilization
gains obtained when utilizing a data prefetch unit and an
intelligent memory controller to perform indirect memory
references; and
FIG. 14 shows the bandwidth efficiency and utilization
gains obtained when utilizing a data prefetch unit and an
intelligent memory controller to perform strided memory
references together with computation.
`
`DETAILED DESCRIPTION
`
1. Definitions:

Direct Execution Logic (DEL): an assemblage of dynami-
cally reconfigurable functional elements that enables a
program to establish an optimized interconnection among
selected functional units in order to implement a desired
computational, data prefetch and/or data access functionality
for maximizing the parallelism inherent in the particular
code.

Reconfigurable Processor: a computing device that contains
reconfigurable components such as FPGAs and can, through
reconfiguration, instantiate an algorithm as hardware.

Reconfigurable Logic: an interconnection of functional
units, control, and storage that implements an algorithm and
can be loaded into a Reconfigurable Processor.

Functional Unit: a set of logic that performs a specific
operation. The operation may, for example, be arithmetic,
logical, control, or data movement. Functional units are used
as building blocks of reconfigurable logic.

Macro: another name for a functional unit.

Memory Hierarchy: a collection of memories.

Data Prefetch Unit: a functional unit that moves data
between members of a memory hierarchy. The movement
may be as simple as a copy, or as complex as an indirect
indexed strided copy into a unit-stride memory.

Data Access Unit: a functional unit that accesses a
component of a memory hierarchy and delivers data directly
to computational logic.

Intelligent Memory Control Unit: a control unit that has the
ability to select data from its storage according to a variety
of algorithms that can be selected by a data requester, such
as a data prefetch unit.

Bandwidth Efficiency: the percentage of contributory data
transferred between two points. Contributory data is data
that actually participates in the recipient's processing.

Bandwidth Utilization: the percentage of maximum band-
width between two points that is actually used to pass
contributory data.
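To make these two percentages concrete (the byte counts below are assumptions, chosen to match the stride-128 example in the description that follows):

    #include <stdio.h>

    /* Worked illustration of the two definitions above: a 128-byte
     * block is moved, of which the recipient actually consumes only
     * one 8-byte word. The counts are assumed for exposition. */
    int main(void) {
        double contributory = 8.0;    /* bytes that participate      */
        double transferred  = 128.0;  /* bytes actually moved        */
        double peak         = 128.0;  /* bytes the link could carry  */

        printf("bandwidth efficiency:  %.2f%%\n",
               100.0 * contributory / transferred);
        printf("bandwidth utilization: %.2f%%\n",
               100.0 * contributory / peak);
        return 0;
    }

Both percentages evaluate to 6.25% in this scenario: only 8 of every 128 transferred bytes contribute, and only 6.25% of the peak bandwidth carries contributory data.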
`2. Description
A reconfigurable processor (RP) 100 implements direct
execution logic (DEL) to perform computation, as well as a
memory hierarchy for maintaining input data and computa-
`tional results. DEL is an assemblage of dynamically recon-
`figurable functional elements that enables a program to
`establish an optimized interconnection among selected func-
`tional units in order to implement a desired computational,
`data prefetch and/or data access functionality for maximiz-
`ing the parallelism inherent in the particular code. The term
`
`DEL may also be used to refer to the set of constructs such
`as code, data, configuration variables, and the like that can
`be loaded into RP 100 to cause RP 100 to implement a
`particular assemblage of functional elements.
FIG. 1 presents an RP 100, which may be implemented
using field programmable gate arrays (FPGAs) or other
reconfigurable logic devices, that can be configured and
reconfigured to contain functional units and interconnecting
circuits, and a memory hierarchy comprising on-board
memory banks 104, on-chip block RAM 106, registers,
wires, and a connection 108 to external memory. On-chip
reconfigurable components 102 create memory structures
such as registers, FIFOs, wires and arrays using block RAM.
Dual-ported memory 106 is shared between on-chip recon-
figurable components 102. The reconfigurable processor 100
also implements user-defined computational logic (e.g., such
as DEL 200 shown in FIG. 2) constructed by programming
an FPGA to implement a particular interconnection of
computational functional units. In a particular implementa-
tion, a number of RPs 100 are implemented within the
memory subsystem of a conventional computer, such as on
devices that are physically installed in dual inline memory
module (DIMM) sockets of a computer. In this manner the
RPs 100 can be accessed by memory operations and so
coexist well with a more conventional hardware platform. It
should be noted that, although the exemplary implementa-
tion of the present invention illustrated includes six banks of
dual-ported memory 104 and two reconfigurable compo-
nents 102, any number of memory banks and/or reconfig-
urable components may be used depending upon the par-
ticular implementation or application.
`Any computer program, including complex graphics pro-
`cessing programs, word processing programs, database pro-
`grams and the like, is a collection of algorithms that interact
`to implement desired functionality. In the common case in
`which static computing hardware resources are used (e.g., a
`conventional microprocessor),
`the computer program is
`compiled into a set of executable code (i.e., object code)
`units that are linked together to implement the computer
`program on the particular hardware resources. The execut-
`able code is generated specifically for a particular hardware
`platform. In this manner, the computer program is adapted
`to conform to the limitations of the static hardware platform.
`However, the compilation process makes many compro-
`mises based on the limitations of the static hardware plat-
`form.
`
Alternatively, an algorithm can be defined in a high-level
language and then compiled into DEL. DEL can be produced
via a compiler from high-level programming languages such as
`C or FORTRAN or may be designed using a hardware
`definition language such as Verilog, VHDL or a schematic
`capture tool. Computation is performed by reconfiguring a
`reconfigurable processor with the DEL and flowing data
`through the computation.
`In this manner,
`the hardware
`resources are essentially adapted to conform to the program
`rather than the program being adapted to conform to the
`hardware resources.
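For example (a hypothetical source fragment; the patent does not supply program text), a loop such as the following C routine could be compiled into DEL, with the multiply and add instantiated as functional units that accept a new element on each clock:

    /* Hypothetical fragment of the kind a DEL compiler could turn
     * into pipelined reconfigurable logic: one multiplier and one
     * adder, streaming one element through the pipeline per clock. */
    void scale_add(int n, const double *x, const double *y,
                   double *result, double a) {
        for (int i = 0; i < n; i++)
            result[i] = a * x[i] + y[i];
    }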
`
For purposes of this description, a single reconfigurable
`processor will be presented first. A sample of computational
`logic 201 is shown in FIG. 2. This simple assemblage of
functional units performs computation of two results (“A+B”
and “A+B-(B*C)”) from three input variables or operands
“A”, “B” and “C”. In practice, computational units 201 can
`be implemented to perform very simple or arbitrarily com-
`plex computations. The input variables (operands) and out-
`put or result variables may be of any size necessary for a
`particular application. Theoretically, any number of oper-
`
`ands and result variables may be used/generated by a
`particular DEL. Great complexity of computation can be
`supported by adding additional reconfigurable chips and
`processors.
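Rendered as software for clarity (a sketch only; in the patent this is an interconnection of hardware functional units producing both results on each clock pulse, not sequential statements), the FIG. 2 dataflow is:

    /* Sketch of the FIG. 2 computation: one adder shared by both
     * results, plus a multiplier and a subtractor for the second. */
    typedef struct { double r1, r2; } results_t;

    static results_t compute(double a, double b, double c) {
        double sum = a + b;        /* shared adder output      */
        results_t out;
        out.r1 = sum;              /* first result:  A+B       */
        out.r2 = sum - (b * c);    /* second result: A+B-(B*C) */
        return out;
    }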
`
`For greatest performance the DEL 200 is constructed as
`parallel pipelined logic blocks composed of computational
`functional units capable of taking data and producing results
`with each clock pulse. The highest possible performance that
`can be achieved is computation of a set of results with each
`clock pulse. To achieve this, data should be available at the
`same rate the computation can consume the data. The rate at
`which data can be supplied to DEL 200 is determined, at
`least in significant part, by the memory bandwidth utiliza-
`tion and efficiency. Maximal computational performance
`can be achieved with parallel and pipelined DEL together
`with maximizing the memory bandwidth utilization and
efficiency. Unlike conventional static hardware platforms,
however, the memory hierarchy provided in an RP 100 is
reconfigurable. In accordance with the present invention,
`through the use of data access units and associated memory
`hierarchy components, computational demands and memory
`bandwidth can be matched.
`
`High memory bandwidth efficiency is achieved when only
`data required for computation is moved within the memory
`hierarchy. FIG. 3 shows a simple logic block 300 comprising
`computational functional units 301, control (not shown), and
data access functional units 303. The data access unit 303
presents data directly to the computational logic 301. In this
`manner, data is moved from a memory device 305 to the
`computational logic and from the computational logic back
`into a memory device 305 or block RAM memory 307
`within an RP 100.
`
FIG. 4 illustrates the logic block 300 with the addition of
a data prefetch unit 401. The data prefetch unit 401 moves
`data from one member of the memory hierarchy 305 to
`another 308. Data prefetch unit 401 operates independently
`of other functional units 301, 302 and 303 and can therefore
`operate prior to,
`in parallel with, or after computational
`logic. This independence of operation permits hiding the
`latency associated with obtaining data for use in computa-
`tion. The data prefetch unit deposits data into the memory
`hierarchy within RP 100, where computational logic 301,
`302 and 303 can access it through data access units. In the
`example of FIG. 4, prefetch unit 401 is configured to deposit
`data into block RAM memory 308. Hence, the prefetch units
`401 may be operated independently of logic block 300 that
`uses prefetched data.
An important feature of the present invention is that many
types of data prefetch units can be defined so that the
prefetch hardware can be configured to conform to the needs
`of the algorithms currently implemented by the computa-
`tional logic. The specific characteristics of the prefetch can
`be matched with the needs of the computational logic and
`the format and location of data in the memory hierarchy. For
example, FIG. 9A and FIG. 9B show an external memory
that is organized in a 128-byte (16-word) block structure.
This organization is optimized for stride-1 access of cache-
based computers. A stride-128 access can result in a very
inefficient use of bandwidth from the memory, since an extra
120 bytes of data are moved for every 8 bytes of requested
data, yielding a 6.25% bandwidth efficiency.
`FIG. 5 shows an example of data prefetch in which there
`are no bandwidth gains since all data fetched from external
`memory blocks is also transferred and used in computational
`units 301 through memory bank access units 303. However,
`bandwidth utilization is increased due to the ability of the
`
`data prefetch units 501 to initiate a data transfer in advance
`of the requirement for data by computational logic.
In accordance with an embodiment of the present inven-
tion, data prefetch units 601 are configured to communicate
with an intelligent memory controller 603 (FIG. 6), which
can extract only the desired 8 bytes of data, discard the
remainder of the memory block, and transmit to the data
prefetch unit only the requested portion of the stride-128
data. The prefetch units 601 then deliver that data to the
appropriate memory components within the memory hier-
archy of the logic block 300.
FIG. 6 shows the prefetch units 601 delivering data to the
RP's onboard memory banks 305. An onboard memory bank
data access unit 303 then delivers the data to computational
logic 301 when required. The data prefetch units 601 couple
with an intelligent memory controller 603 in the implemen-
tation of FIG. 6 that supports a strided reference pattern,
which yields a 100% bandwidth efficiency in contrast to the
6.25% efficiency noted above. Although illustrated as a
single block of external memory, multiple external memo-
ries may be employed as well.
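A software model of this strided prefetch path (a sketch under assumed names; the patent realizes this behavior in hardware between the controller and the prefetch unit) gathers one 8-byte word per 128-byte block, so only contributory data crosses the link:

    #include <stddef.h>
    #include <stdint.h>

    #define WORDS_PER_BLOCK 16  /* 128-byte blocks of 8-byte words (assumed) */

    /* The intelligent memory controller selects the one requested word
     * from each block and discards the rest, so the link to the
     * prefetch unit carries 100% contributory data instead of 6.25%. */
    void strided_prefetch(const uint64_t *external_mem, /* block-structured */
                          uint64_t *onboard_bank,       /* destination bank */
                          size_t count)                 /* words to gather  */
    {
        for (size_t i = 0; i < count; i++)
            onboard_bank[i] = external_mem[i * WORDS_PER_BLOCK];
    }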
FIG. 7 shows the combination of data prefetch units 701
and data access units 703 feeding computational logic 301
such that bandwidth efficiency and utilization are maxi-
mized. In this example, strided data prefetch units 701 fetch
only the required data words from external memory. FIG. 8
demonstrates the efficiency gains enabled by this combina-
tion. Prefetch units 701 deliver the data into stream memory
components 705 that are accessed by stream data access
units 703. The stream data access units 703 fet