`Case 2:17-cv-00547-RAJ-RJK Document 1-2 Filed 10/18/17 Page 1 of 22 Page|D# 64

EXHIBIT B

`
US007149867B2

(12) United States Patent
Poznanovic et al.

(10) Patent No.: US 7,149,867 B2
(45) Date of Patent: Dec. 12, 2006

(54) SYSTEM AND METHOD OF ENHANCING EFFICIENCY AND UTILIZATION OF MEMORY BANDWIDTH IN RECONFIGURABLE HARDWARE

(75) Inventors: Daniel Poznanovic, Colorado Springs, CO (US); David E. Caliga, Colorado Springs, CO (US); Jeffrey Hammes, Colorado Springs, CO (US)

(73) Assignee: SRC Computers, Inc., Colorado Springs, CO (US)

(*) Notice: Subject to any disclaimer, the term of this patent is extended or adjusted under 35 U.S.C. 154(b) by 0 days.

(21) Appl. No.: 10/869,200

(22) Filed: Jun. 16, 2004

(65) Prior Publication Data: US 2004/0260884 A1, Dec. 23, 2004

Related U.S. Application Data

(60) Provisional application No. 60/479,339, filed on Jun. 18, 2003.

(51) Int. Cl. G06F 12/00 (2006.01)
(52) U.S. Cl. 711/170; 711/154
(58) Field of Classification Search: 711/170-173; 712/15. See application file for complete search history.

(56) References Cited

U.S. PATENT DOCUMENTS

6,076,152 A      6/2000  Huppenthal et al.
6,243,791 B1*    6/2001  Vondran, Jr. .............. 711/120
6,247,110 B1     6/2001  Huppenthal et al.
6,460,122 B1*   10/2002  Otterness et al. .......... 711/154
6,507,898 B1*    1/2003  Gibson et al. ............. 711/168
6,594,736 B1     6/2003  Parks
6,714,041 B1*    3/2004  Darling et al. ............ 326/38
2003/0046492 A1*   3/2003  Gschwind et al. ......... 711/118
2003/0046530 A1*   3/2003  Poznanovic .............. 713/100
2003/0084244 A1*   5/2003  Paulraj .................. 711/118
2003/0088737 A1*   5/2003  Burton ................... 711/118
2003/0208658 A1*  11/2003  Magoshi .................. 711/122
2005/0044327 A1*   2/2005  Howard et al. ............ 711/147

OTHER PUBLICATIONS

"Summary: The Cache Read/Write Process," The PC Guide, 2001, www.pcguide.com/ref/mbsys/cache/func.htm.*
Chien et al., "Safe and Protected Execution for the Morph/AMRM Reconfigurable Processor," IEEE, 1999, pp. 143.*
IEEE 100: The Authoritative Dictionary of IEEE Standards Terms, Standards Information Network, 2000, pp. 874.*

* cited by examiner

Primary Examiner: Gary Portka
Assistant Examiner: Shane M. Thomas
(74) Attorney, Agent, or Firm: William J. Kubida; Michael C. Martensen; Hogan & Hartson LLP

(57) ABSTRACT

A reconfigurable processor that includes a computational unit and a data prefetch unit coupled to the computational unit, where the data prefetch unit retrieves data from a memory and supplies the data to the computational unit through memory and a data access unit, and where the data prefetch unit, memory, and data access unit are configured by a program. Also, a reconfigurable hardware system that includes a common memory and one or more reconfigurable processors coupled to the common memory, where at least one of the reconfigurable processors includes a data prefetch unit to read and write data between the unit and the common memory, and where the data prefetch unit is configured by a program executed on the system. In addition, a method of transferring data that includes transferring data between a memory and a data prefetch unit in a reconfigurable processor, and transferring the data between a computational unit and the data prefetch unit.

19 Claims, 12 Drawing Sheets

[Representative drawing: a reconfigurable processor with memory banks A-F, data prefetch units, on-chip block RAM, and connections to upstream and downstream logic]
U.S. Patent          Dec. 12, 2006          Sheets 1-5 of 12          US 7,149,867 B2

[Drawing sheets: FIGS. 1-7 (reconfigurable processor, on-board memory banks, data prefetch units, block RAM, external memory, and intelligent memory controller diagrams)]
U.S. Patent          Dec. 12, 2006          Sheet 6 of 12          US 7,149,867 B2

[Drawing sheet: FIG. 8 (reference numerals 801, 805)]
`
`
`
`e
`tne
`aP71/8l/
`Case 2:17-cv-00547-RAJ-RJK Document 1-2 Filed 10/18/17 Page 9 of 22 PageID# 7227
`5
`ARM
`C
`2
`
`m6w0omDa2K1.J%mD
`
`021.1dfmnHae2h1S
`
`Onomwa7P2SeoaU
`
`t
`
`2Mf9079Sm.U
`
`gma8P9,
`
`#2mB
`
`”w
`I.
`13:.
`_.at_.av
`
`mm.GE
`
`
`
`Mmmqumun—mZd‘w—H5.0
`
`
`
`
U.S. Patent          Dec. 12, 2006          Sheet 8 of 12          US 7,149,867 B2

[Drawing sheet: FIGS. 10A and 10B]
`
U.S. Patent          Dec. 12, 2006          Sheet 9 of 12          US 7,149,867 B2

[Drawing sheet: FIGS. 11A and 11B]
U.S. Patent          Dec. 12, 2006          Sheet 10 of 12          US 7,149,867 B2

[Drawing sheet: FIGS. 12A and 12B]
U.S. Patent          Dec. 12, 2006          Sheet 11 of 12          US 7,149,867 B2

[Drawing sheet: FIG. 13 (reference numerals 1301, 1303, 1305)]
`
U.S. Patent          Dec. 12, 2006          Sheet 12 of 12          US 7,149,867 B2

[Drawing sheet: FIG. 14 (reference numeral 1403)]
`
`SYSTEM AND METHOD OF ENHANCING
`EFFICIENCY AND UTILIZATION OF
`MEMORY BANDWIDTH IN
`RECONFIGURABLE HARDWARE
`
`RELATED APPLICATIONS
`
The present invention claims the benefit of U.S. Provi-
`sional Patent application Ser. No. 60/479,339 filed on Jun.
`18, 2003, which is incorporated herein by reference in its
`entirety.
`
`BACKGROUND OF THE INVENTION
`
`1. Field of the Invention
`
The present invention relates, in general, to enhancing the
efficiency and utilization of memory bandwidth in reconfig-
urable hardware. More specifically, the invention relates to
implementing explicit memory hierarchies in reconfigurable
processors that make efficient use of off-board, on-board, and
on-chip storage and available algorithm locality. These
explicit memory hierarchies avoid many of the tradeoffs and
complexities found in the traditional memory hierarchies of
microprocessors.
`2. Relevant Background
`Over the past 30 years, microprocessors have enjoyed
`annual performance gains averaging about 50% per year.
`Most of the gains can be attributed to higher processor clock
`speeds, more memory bandwidth and increasing utilization
`of instruction level parallelism (ILP) at execution time.
`As microprocessors and other dense logic devices (DLDs)
`consume data at ever-increasing rates it becomes more of a
`challenge to design memory hierarchies that can keep up.
Two measures of the gap between the microprocessor and
memory hierarchy are bandwidth efficiency and bandwidth
utilization. Bandwidth efficiency refers to the ability to
exploit available locality in a program or algorithm. In the
ideal situation, when there is maximum bandwidth effi-
ciency, all available locality is utilized. Bandwidth utiliza-
tion refers to the amount of memory bandwidth that is
utilized during a calculation. Maximum bandwidth utiliza-
tion occurs when all available memory bandwidth is uti-
lized.
`
Potential performance gains from using a faster micro-
processor can be reduced or even negated by a correspond-
ing drop in bandwidth efficiency and bandwidth utilization.
Thus, there has been significant effort spent on the devel-
opment of memory hierarchies that can maintain high band-
width efficiency and utilization with faster microprocessors.
One approach to improving bandwidth efficiency and
utilization in memory hierarchies has been to develop ever
`more powerful processor caches. These caches are high-
`speed memories (typically SRAM) in close proximity to the
`microprocessor that try to keep copies of instructions and
`data the microprocessor may soon need. The microprocessor
`can store and retrieve data from the cache at a much higher
`rate than from a slower, more distant main memory.
`In designing cache memories,
`there are a number of
`considerations to take into account. One consideration is the
`
`width of the cache line. Caches are arranged in lines to help
`hide memory latency and exploit spatial locality. When a
`load suffers a cache miss, a new cache line is loaded from
`main memory into the cache. The assumption is that a
`program being executed by the microprocessor has a high
`degree of spatial locality, making it likely that other memory
`locations in the cache line will also be required.
`
`
For programs with a high degree of spatial locality (e.g.,
stride-one access), wide cache lines are more efficient since
they reduce the number of times a processor has to suffer the
latency of a memory access. However, for programs with
lower levels of spatial locality, or random access, narrow
lines are best as they reduce the wasted bandwidth from the
unused neighbors in the cache line. Caches designed with
wide cache lines perform well with programs that have a
high degree of spatial locality, but generally have poor
gather/scatter performance. Likewise, caches with short
cache lines have good gather/scatter performance, but lose
efficiency executing programs with high spatial locality
because of the additional trips to main memory.
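The line-width tradeoff described above can be made concrete with a small sketch. This is an illustration, not part of the patent: it estimates what fraction of the bytes fetched in whole cache lines a program actually consumes, for a given line width and access stride, assuming no reuse between lines.

```python
# Sketch (illustrative parameters): fraction of fetched bytes a program
# actually uses when reading words at a fixed stride through a cache
# organized in lines, with no reuse between lines.
def line_efficiency(line_bytes, stride_bytes, word_bytes=8, n_accesses=10_000):
    """Bytes consumed / bytes fetched for a strided read pattern."""
    lines_touched = set()
    for i in range(n_accesses):
        addr = i * stride_bytes
        lines_touched.add(addr // line_bytes)   # each new line is one fetch
    fetched = len(lines_touched) * line_bytes
    used = n_accesses * word_bytes
    return used / fetched

# Unit stride: every byte of each wide line is eventually used.
print(line_efficiency(128, 8))     # -> 1.0
# Stride of one full line: only one 8-byte word per 128-byte line is used.
print(line_efficiency(128, 128))   # -> 0.0625
```

The two extremes mirror the text: wide lines are ideal for stride-one access but waste bandwidth on gather/scatter patterns.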
Another consideration in cache design is cache associa-
tivity, which refers to the mapping between locations in
main memory and cache sectors. At one extreme of cache
associativity is a direct-mapped cache, while at the other
extreme is a fully associative cache. In a direct-mapped
cache, a specific memory location can be mapped to only a
single cache line. Direct-mapped caches have the advantage
of being fast and easy to construct in logic. The disadvantage
is that they suffer the maximum number of cache conflicts.
At the other extreme, a fully associative cache allows a
specific location in memory to be mapped to any cache line.
Fully associative caches tend to be slower and more com-
plex due to the large amount of comparison logic they need,
but suffer no cache conflict misses. Oftentimes, caches fall
between the extremes of direct-mapped and fully associative
caches. A design point between the extremes is a k-set
associative cache, where each memory location can map to
k cache sectors. These caches generally have less overhead
than fully associative caches, and reduce cache conflicts by
increasing the value of k.
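The address-to-set mapping behind these associativity levels can be sketched in a few lines (parameters here are arbitrary, chosen only for illustration):

```python
# Sketch: which cache set an address maps to in a k-way set-associative
# cache. k = 1 degenerates to direct-mapped; k = num_lines is fully
# associative (a single set any line can occupy).
def cache_set(addr, line_bytes=64, num_lines=256, k=4):
    num_sets = num_lines // k
    return (addr // line_bytes) % num_sets

# Two addresses exactly num_sets lines apart land in the same set; a
# direct-mapped cache (k=1) would conflict, while k ways let up to k
# such lines coexist.
a, b = 0x0000, 64 * (256 // 4)
print(cache_set(a), cache_set(b))   # both map to set 0
```

Increasing k does not change which set an address selects; it increases how many conflicting lines that set can hold, which is why larger k reduces conflict misses.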
Another consideration in cache design is how cache lines
are replaced due to a capacity or conflict miss. In a direct-
mapped cache, there is only one possible cache line that can
be replaced due to a miss. However, in caches with higher
levels of associativity, cache lines can be replaced in more
than one way. The way the cache lines are replaced is
referred to as the replacement policy.
Options for the replacement policy include least recently
used (LRU), random replacement, and first in-first out
(FIFO). LRU is used in the majority of circumstances where
the temporal locality set is smaller than the cache size, but
it is normally more expensive to build in hardware than a
random replacement cache. An LRU policy can also quickly
degrade depending on the working set size. For example,
consider an iterative application with a matrix of size N
bytes running through an LRU cache of size M bytes. If N is
less than M, then the policy has the desired behavior of
100% cache hits. However, if N is only slightly larger than
M, the LRU policy results in 0% cache hits, as lines are
removed just as they are needed.
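The N-versus-M cliff described above is easy to reproduce with a small LRU simulation (a sketch in units of whole lines rather than bytes; not from the patent):

```python
from collections import OrderedDict

# Sketch: stream a working set of n_lines repeatedly through an LRU cache
# holding m_lines, and measure the hit rate.
def lru_hit_rate(n_lines, m_lines, passes=10):
    cache = OrderedDict()
    hits = total = 0
    for _ in range(passes):
        for line in range(n_lines):
            total += 1
            if line in cache:
                hits += 1
                cache.move_to_end(line)        # mark most recently used
            else:
                if len(cache) >= m_lines:
                    cache.popitem(last=False)  # evict least recently used
                cache[line] = True
    return hits / total

print(lru_hit_rate(100, 128))   # fits: only the first pass misses -> 0.9
print(lru_hit_rate(129, 128))   # one line too many: every access misses -> 0.0
```

With a working set even one line larger than the cache, LRU evicts each line just before it is needed again, exactly the pathological behavior the text describes.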
Another consideration is deciding on a write policy for the
cache. Write-through caches send data through the cache
hierarchy to main memory. This policy reduces cache coher-
ency issues for multiple-processor systems and is best suited
for data that will not be re-read by the processor in the
immediate future. In contrast, write-back caches place a
copy of the data in the cache, but do not immediately
update main memory. This type of caching works best when
data just written to the cache is quickly requested again by
the processor.
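The traffic difference between the two write policies can be sketched by counting main-memory write transactions for a store trace. This is an idealized illustration (no capacity evictions; dirty lines are flushed once at the end), not the patent's mechanism:

```python
# Sketch: count main-memory write transactions under the two policies
# described above, for a trace of store addresses.
def memory_writes(trace, policy, line_bytes=64):
    dirty = set()
    writes = 0
    for addr in trace:
        line = addr // line_bytes
        if policy == "write-through":
            writes += 1        # every store propagates to main memory
        else:                  # write-back: just mark the line dirty
            dirty.add(line)
    if policy == "write-back":
        writes += len(dirty)   # each dirty line is written once on flush
    return writes

trace = [0, 8, 16, 24, 0, 8]   # six stores, all within one 64-byte line
print(memory_writes(trace, "write-through"))   # -> 6
print(memory_writes(trace, "write-back"))      # -> 1
```

Repeated stores to the same line collapse to one transaction under write-back, which is why it wins when written data is quickly rewritten or re-read.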
`In addition to write-through and write-back caches,
`another kind of write policy is implemented in a write-
`allocate cache where a cache line is allocated on a write that
`
`misses in cache. Write-allocate caches improve performance
`
`
`
`when the microprocessor exhibits a lot of write followed by
`read behavior. However, when writes are not subsequently
`read, a write-allocate cache has a number of disadvantages:
`When a cache line is allocated, it is necessary to read the
`remaining values from main memory to complete the cache
`line. This adds unnecessary memory read traffic during store
`operations. Also, when the data is not read again, potentially
`useful data in the cache is displaced by the unused data.
Another consideration is the tradeoff between the size and the
speed of the cache: small caches are typically much faster
`than larger caches, but store less data and fewer instructions.
`Less data means a greater chance the cache will not have
`data the microprocessor is requesting (i.e., a cache miss)
`which can slow everything down while the data is being
`retrieved from the main memory.
`Newer cache designs reduce the frequency of cache
`misses by trying to predict in advance the data that the
`microprocessor will request. An example of this type of
`cache is one that supports speculative execution and branch
`prediction. Speculative execution allows instructions that
`likely will be executed to start early based on branch
`prediction. Results are stored in a cache called a reorder
buffer and retired if the branch was correctly predicted. Of
course, when mis-predictions occur, instruction and data
bandwidth are wasted.
There are additional considerations and tradeoffs in cache
`
`design, but it should be apparent from the considerations
`described hereinbefore that it is very difficult to design a
`single cache structure that is optimized for many different
`programs. This makes cache design particularly challenging
`for a multipurpose microprocessor that executes a wide
`variety of programs. Cache designers try to derive the
program behavior of an “average” program constructed from
`several actual programs that run on the microprocessor. The
`cache is optimized for the average program, but no actual
`program behaves exactly like the average program. As a
`result, the designed cache ends up being sub-optimal for
`nearly every program actually executed by the micropro-
`cessor. Thus, there is a need for memory hierarchies that
`have data storage and retrieval characteristics that are opti-
`mized for actual programs executed by a processor.
`Designers trying to develop ever more efficient caches
`optimized for a variety of actual programs also face another
`problem: as caches add additional features, the overhead
`needed to implement the added features also grows. Caches
`today have so much overhead that microprocessor perfor-
`mance may be reaching a point of diminishing returns as the
`overhead starts to cut into performance. In the Intel Pentium
`III processor for example, more than half of the 10 million
`transistors are dedicated to instruction cache, branch pre-
`diction, out-of-order execution and superscalar logic. The
`situation has prompted predictions that as microprocessors
`grow to a billion transistors per chip, performance increases
`will drop to about 20% per year. Such a prediction, if borne
`out, could have a significant impact on technology growth
`and the computer business.
`Thus,
`there is a growing need to develop improved
`memory hierarchies that limit the overhead of a memory
`hierarchy without also reducing bandwidth efficiency and
`utilization.
`
`SUMMARY OF THE INVENTION
`
`Accordingly, an embodiment of the invention includes a
`reconfigurable processor that includes a computational unit
`and a data access unit coupled to the computational unit,
`where the data access unit retrieves data from an on-
`
`processor memory and supplies the data to the computa-
`tional unit, and where the computational unit and the data
`access unit are configured by a program.
`The present
`invention also involves a reconfigurable
`processor that includes a first memory of a first type and a
`data prefetch unit coupled to the memory, where the data
`prefetch unit retrieves data from a second memory of a
`second type different from the first type, and the first and
`second memory types and the data prefetch unit are config-
`ured by a program.
`Another embodiment of the invention includes a recon-
`
`figurable hardware system that includes a common memory,
`also referred to as external memory, and one or more
`reconfigurable processors coupled to the common memory,
`where at least one of the reconfigurable processors includes
`a data prefetch unit to read and write data between the unit
`and the common memory, and where the data prefetch unit
`is configured by a program executed on the system.
`Another embodiment of the invention includes a method
`
`of transferring data that includes transferring data between a
`memory and a data prefetch unit in a reconfigurable pro-
`cessor, transferring data between the prefetch unit and a data
`access unit, and transferring the data between a computa-
`tional unit and the data access unit, where the computational
`unit, data access unit and the data prefetch unit are config-
`ured by a program.
`Additional embodiments of the invention are set forth in
`
`part in the description that follows, and in part will become
`apparent to those skilled in the art upon examination of the
`following specification, or may be learned by the practice of
`the invention. The advantages of the invention may be
`realized and attained by means of the instrumentalities,
`combinations,
`compositions,
`and methods particularly
`pointed out in the appended claims.
`
`BRIEF DESCRIPTION OF THE DRAWINGS
`
`FIG. 1 shows a reconfigurable processor in which the
`present invention may be implemented;
`FIG. 2 shows computational logic as might be loaded into
`a reconfigurable processor;
`FIG. 3 shows a reconfigurable processor as in FIG. 1, but
`with the addition of data access units;
`FIG. 4 shows a reconfigurable processor as in FIG. 3, but
`with the addition of data prefetch units;
FIG. 5 shows a reconfigurable processor with the inclusion
`of external memory;
`FIG. 6 shows reconfigurable processors with external
`memory and with an intelligent memory controller;
`FIG. 7 shows a reconfigurable processor having a com-
`bination of data prefetch units and data access units feeding
`computational logic;
`FIG. 8 shows the bandwidth efficiency and utilization
`gains obtained when utilizing a data prefetch unit and an
`intelligent memory controller to perform strided memory
`references;
`FIG. 9A and FIG. 9B show the bandwidth efficiency and
`utilization gains obtained when utilizing a data prefetch unit
`and an intelligent memory controller to perform subset
memory references in the X-Y plane;
`FIG. 10A and FIG. 10B show the bandwidth efficiency
`and utilization gains obtained when utilizing a data prefetch
`unit and an intelligent memory controller to perform subset
memory references in the X-Z plane;
`FIG. 11A and FIG. 11B show the bandwidth efficiency
`and utilization gains obtained when utilizing a data prefetch
`
`
`
`unit and an intelligent memory controller to perform subset
memory references in the Y-Z plane;
`FIG. 12A and FIG. 12B show the bandwidth efficiency
`and utilization gains obtained when utilizing a data prefetch
`unit and an intelligent memory controller to perform subset
`memory references in a mini-cube;
`FIG. 13 shows the bandwidth efficiency and utilization
`gains obtained when utilizing a data prefetch unit and an
`intelligent memory controller to perform indirect memory
`references;
`FIG. 14 shows the bandwidth efficiency and utilization
`gains obtained when utilizing a data prefetch unit and an
`intelligent memory controller to perform strided memory
`reference together with computation.
`
`DETAILED DESCRIPTION
`
1. Definitions:

Direct execution logic (DEL) is an assemblage of
dynamically reconfigurable functional elements that enables
a program to establish an optimized interconnection among
selected functional units in order to implement a desired
computational, data prefetch and/or data access functionality
for maximizing the parallelism inherent in the particular
code.

Reconfigurable Processor is a computing device that
contains reconfigurable components such as FPGAs and
can, through reconfiguration, instantiate an algorithm as
hardware.

Reconfigurable Logic is composed of an interconnection
of functional units, control, and storage that implements
an algorithm and can be loaded into a Reconfigurable
Processor.

Functional Unit is a set of logic that performs a specific
operation. The operation may, for example, be arithmetic,
logical, control, or data movement. Functional units are used
as building blocks of reconfigurable logic.

Macro is another name for a functional unit.

Memory Hierarchy is a collection of memories.

Data prefetch Unit is a functional unit that moves data
between members of a memory hierarchy. The movement
may be as simple as a copy, or as complex as an indirect
indexed strided copy into a unit stride memory.
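The most complex movement named in that definition, an indirect indexed strided copy into a unit-stride memory, can be sketched as follows. The function name, shapes, and parameters are illustrative assumptions, not from the patent:

```python
# Sketch of an indirect, indexed, strided copy: gather scattered elements
# from a large memory into a dense, unit-stride buffer that downstream
# computational logic can stream through one element per step.
def indirect_strided_gather(memory, index_vector, stride, offset=0):
    """Return [memory[offset + i*stride] for each index i], unit-stride."""
    return [memory[offset + i * stride] for i in index_vector]

memory = list(range(100))
# Indices select rows; stride 10 models a row length of 10 words,
# offset 2 selects the third column.
print(indirect_strided_gather(memory, [0, 3, 7], stride=10, offset=2))
# -> [2, 32, 72]
```

The point of the pattern is that only the contributory elements cross the memory hierarchy; the result buffer is dense even though the source accesses are scattered.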
Data access Unit is a functional unit that accesses a
component of a memory hierarchy, and delivers data directly
to computational logic.

Intelligent Memory Control Unit is a control unit that
has the ability to select data from its storage according to a
variety of algorithms that can be selected by a data requester,
such as a data prefetch unit.

Bandwidth Efficiency is defined as the percentage of
contributory data transferred between two points. Contribu-
tory data is data that actually participates in the recipient's
processing.

Bandwidth Utilization is defined as the percentage of
maximum bandwidth between two points that is actually
used to pass contributory data.
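The two definitions above reduce to simple ratios. The figures below are invented purely for illustration (an 8 KB contributory payload carried inside 64 KB of line-granularity traffic, over a link with an assumed 16 GB/s peak):

```python
# Sketch: the two metrics defined above, as percentages.
def bandwidth_efficiency(contributory_bytes, transferred_bytes):
    """Percentage of transferred data that participates in the
    recipient's processing."""
    return 100.0 * contributory_bytes / transferred_bytes

def bandwidth_utilization(contributory_bytes, elapsed_s, peak_bytes_per_s):
    """Percentage of the link's maximum bandwidth actually used to
    pass contributory data."""
    return 100.0 * (contributory_bytes / elapsed_s) / peak_bytes_per_s

print(bandwidth_efficiency(8_000, 64_000))        # -> 12.5 (%)
print(bandwidth_utilization(8_000, 1e-6, 16e9))   # -> 50.0 (%)
```

A transfer can thus saturate half the link (high utilization) while wasting seven eighths of what it moves (low efficiency); the patent's prefetch and access units aim to raise both numbers at once.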
`2. Description
A reconfigurable processor (RP) 100 implements direct
executable logic (DEL) to perform computation, as well as a
`memory hierarchy for maintaining input data and computa-
`tional results. DEL is an assemblage of dynamically recon-
`figurable functional elements that enables a program to
`establish an optimized interconnection among selected func-
`tional units in order to implement a desired computational,
`data prefetch and/or data access functionality for maximiz-
`ing the parallelism inherent in the particular code. The term
`
`DEL may also be used to refer to the set of constructs such
`as code, data, configuration variables, and the like that can
`be loaded into RP 100 to cause RP 100 to implement a
`particular assemblage of functional elements.
`FIG. 1 presents an RP 100, which may be implemented
`using field programmable gate arrays (FPGAs) or other
`reconfigurable logic devices, that can be configured and
`reconfigured to contain functional units and interconnecting
`circuits, and a memory hierarchy comprising on-board
memory banks 104, on-chip block RAM 106, registers,
wires, and a connection 108 to external memory. On-chip
reconfigurable components 102 create memory structures
such as registers, FIFOs, wires and arrays using block RAM.
Dual-ported memory 106 is shared between on-chip recon-
figurable components 102. The reconfigurable processor 100
`also implements user-defined computational logic (e.g., such
`as DEL 200 shown in FIG. 2) constructed by programming
`an FPGA to implement a particular interconnection of
`computational functional units. In a particular implementa-
`tion, a number of RPs 100 are implemented within a
`memory subsystem of a conventional computer, such as on
`devices that are physically installed in dual inline memory
`module (DIMM) sockets of a computer. In this manner the
`RPs 100 can be accessed by memory operations and so
`coexist well with a more conventional hardware platform. It
`should be noted that, although the exemplary implementa-
`tion of the present invention illustrated includes six banks of
`dual ported memory 104 and two reconfigurable compo-
`nents 102, any number of memory banks and/or reconfig-
`urable components may be used depending upon the par-
`ticular implementation or application.
`Any computer program, including complex graphics pro-
`cessing programs, word processing programs, database pro-
`grams and the like, is a collection of algorithms that interact
`to implement desired functionality. In the common case in
`which static computing hardware resources are used (e.g., a
`conventional microprocessor),
`the computer program is
`compiled into a set of executable code (i.e., object code)
`units that are linked together to implement the computer
`program on the particular hardware resources. The execut-
`able code is generated specifically for a particular hardware
`platform. In this manner, the computer program is adapted
`to conform to the limitations of the static hardware platform.
`However, the compilation process makes many compro-
`mises based on the limitations of the static hardware plat-
`form.
`
`Alternatively, an algorithm can be defined in a high level
`language then compiled into DEL. DEL can be produced via
`a compiler from high level programming languages such as
`C or FORTRAN or may be designed using a hardware
`definition language such as Verilog, VHDL or a schematic
`capture tool. Computation is performed by reconfiguring a
`reconfigurable processor with the DEL and flowing data
`through the computation.
`In this manner,
`the hardware
`resources are essentially adapted to conform to the program
`rather than the program being adapted to conform to the
`hardware resources.
`
`For purposes of this description a single reconfigurable
`processor will be presented first. A sample of computational
`logic 201 is shown in FIG. 2. This simple assemblage of
`functional units performs computation of two results (“A+
B” and “A+B-(B*C)”) from three input variables or operands
`“A”, “B” and “C”. In practice, computational units 201 can
`be implemented to perform very simple or arbitrarily com-
`plex computations. The input variables (operands) and out-
`put or result variables may be of any size necessary for a
`particular application. Theoretically, any number of oper-
`
`
`
`ands and result variables may be used/generated by a
`particular DEL. Great complexity of computation can be
`supported by adding additional reconfigurable chips and
`processors.
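As a software analogy only (the patent's assemblage is hardware functional units, not code), the FIG. 2 computation can be sketched as a dataflow that consumes one operand set and produces both results per step, mimicking one result set per clock pulse:

```python
# Sketch: the two-result computation of FIG. 2 as a streaming dataflow.
# Each step consumes one (A, B, C) triple and yields (A+B, A+B-(B*C)),
# with the first adder's output shared by both result paths.
def del_pipeline(a_stream, b_stream, c_stream):
    for a, b, c in zip(a_stream, b_stream, c_stream):
        s = a + b            # adder functional unit, shared
        yield s, s - b * c   # multiplier feeding a second adder/subtractor

for r1, r2 in del_pipeline([1, 2], [10, 20], [3, 4]):
    print(r1, r2)   # -> 11 -19, then 22 -58
```

Sharing the A+B intermediate between the two outputs is what an optimized interconnection of functional units buys: the common subexpression is computed once per clock, not once per result.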
`
`For greatest performance the DEL 200 is constructed as
`parallel pipelined logic blocks composed of computational
`functional units capable of taking data and producing results
`with each clock pulse. The highest possible performance that
`can be achieved is computation of a set of results with each
`clock pulse. To achieve this, data should be available at the
`same rate the computation can consume the data. The rate at
`which data can be supplied to DEL 200 is determined, at
`least in significant part, by the memory bandwidth utiliza-
`tion and efficiency. Maximal computational performance
`can be achieved with parallel and pipelined DEL together
`with maximizing the memory bandwidth utilization and
`efficiency. Unlike conventional static hardware platforms,
`however, the memory hierarchy provided in a RP 100 is
`reconfigurable. In accordance with the present invention,
`through the use of data access units and associated memory
`hierarchy components, computational demands and memory
`bandwidth can be matched.
`
`High memory bandwidth efficiency is achieved when only
`data required for computation is moved within the memory
`hierarchy. FIG. 3 shows a simple logic block 300 comprising
`computational functional units 301, control (not shown), and
`data access functional units 303. The data access unit 303
`
`presents data directly to the computational logic 301. In this
`manner, data is moved from a memory device 305 to the
`computational logic and from the computational logic back
`into a memory device 305 or block RAM memory 307
`within an RP 100.
`
`FIG. 4 illustrates the logic block 300 with an addition of
`a data prefetch unit 401. The data prefetch unit 401 moves
`data from one member of the memory hierarchy 305 to
`another 308. Data prefetch unit 401 operates independently
`of other functional units 301, 302 and 303 and