`
United States Patent                    (10) Patent No.:     US 7,149,867 B2
Poznanovic et al.                       (45) Date of Patent:      Dec. 12, 2006
(54) SYSTEM AND METHOD OF ENHANCING
EFFICIENCY AND UTILIZATION OF
MEMORY BANDWIDTH IN
RECONFIGURABLE HARDWARE
`
`(75) Inventors: Daniel Poznanovic, Colorado Springs,
`CO (US); David E. Caliga, Colorado
`Springs, CO (US); Jeffrey Hammes,
`Colorado Springs, CO (US)
`(73) Assignee: SRC Computers, Inc., Colorado
`Springs, CO (US)
`
(*) Notice: Subject to any disclaimer, the term of this
patent is extended or adjusted under 35
U.S.C. 154(b) by 0 days.
`
`(21) Appl. No.: 10/869,200
`
(22) Filed: Jun. 16, 2004
`
`(65)
`
`Prior Publication Data
`US 2004/0260884 A1
`Dec. 23, 2004
`Related U.S. Application Data
`(60) Provisional application No. 60/479,339, filed on Jun.
`18, 2003.
`s
(51) Int. Cl.
     G06F 12/00            (2006.01)
(52) U.S. Cl. ........................... 711/170; 711/154
(58) Field of Classification Search ........ 711/170-173;
                                             712/1
     See application file for complete search history.
(56)                References Cited

             U.S. PATENT DOCUMENTS

    6,076,152 A      6/2000  Huppenthal et al.
    6,243,791 B1*    6/2001  Vondran, Jr. .............. 711/120
    6,247,110 B1     6/2001  Huppenthal et al.
    6,460,122 B1*   10/2002  Otterness et al. .......... 711/154
    6,507,898 B1*    1/2003  Gibson et al. ............. 711/168
    6,594,736 B1     7/2003  Parks
    6,714,041 B1*    3/2004  Darling et al. ............ 326/38
 2003/0046492 A1*    3/2003  Gschwind et al. ........... 711/118
 2003/0046530 A1*    3/2003  Poznanovic ................ 713/100
 2003/0084244 A1*    5/2003  Paulraj ................... 711/118
 2003/0088737 A1*    5/2003  Burton .................... 711/118
 2003/0208658 A1*   11/2003  Magoshi ................... 711/122
 2005/0044327 A1*    2/2005  Howard et al. ............. 711/147
`
                 OTHER PUBLICATIONS

"Summary: The Cache Read/Write Process," The PC Guide, 2001,
www.pcguide.com/ref/mbsys/cache/func.htm.
Chien et al., "Safe and Protected Execution for the Morph/AMRM
Reconfigurable Processor," IEEE, 1999, pp. 1-13.
IEEE 100: The Authoritative Dictionary of IEEE Standards Terms,
Standards Information Network, 2000, p. 874.*

* cited by examiner

Primary Examiner - Gary Portka
Assistant Examiner - Shane M. Thomas
(74) Attorney, Agent, or Firm - William J. Kubida; Michael
C. Martensen; Hogan & Hartson LLP
`(57)
`ABSTRACT
`
`A reconfigurable processor that includes a computational
`unit and a data prefetch unit coupled to the computational
`unit, where the data prefetch unit retrieves data from a
`memory and Supplies the data to the computational unit
`through memory and a data access unit, and where the data
`prefetch unit, memory, and data access unit is configured by
`a program. Also, a reconfigurable hardware system that
`includes a common memory; and one or more reconfig
`urable processors coupled to the common memory, where at
`least one of the reconfigurable processors includes a data
`prefetch unit to read and write data between the unit and the
`common memory, and where the data prefetch unit is
`configured by a program executed on the system. In addi
`tion, a method of transferring data that includes transferring
`data between a memory and a data prefetch unit in a
`reconfigurable processor; and transferring the data between
`a computational unit and the data prefetch unit.
`
`19 Claims, 12 Drawing Sheets
`
[Representative drawing: a reconfigurable processor memory hierarchy with memory banks, strided data prefetch units, block RAM memory, and connections to upstream reconfigurable processor memory and downstream logic.]
`
`
`
SYSTEM AND METHOD OF ENHANCING
EFFICIENCY AND UTILIZATION OF
MEMORY BANDWIDTH IN
RECONFIGURABLE HARDWARE
`
`RELATED APPLICATIONS
`
The present invention claims the benefit of U.S. Provisional
Patent Application Ser. No. 60/479,339, filed on Jun.
`18, 2003, which is incorporated herein by reference in its
`entirety.
`
`10
`
`BACKGROUND OF THE INVENTION
`
`2
`For programs with a high degree of spatial locality (e.g.,
`stride-one access), wide cache lines are more efficient since
`they reduce the number of times a processor has to suffer the
`latency of a memory access. However, for programs with
`lower levels of spatial locality, or random access, narrow
`lines are best as they reduce the wasted bandwidth from the
`unused neighbors in the cache line. Caches designed with
`wide cache lines perform well with programs that have a
`high degree of spatial locality, but generally have poor
`gather/scatter performance. Likewise, caches with short
`cache lines have good gather/scatter performance, but loose
`efficiency executing programs with high spatial locality
`because of the additional runs to the main memory.
`Another consideration in cache design is cache associa
`tivity, which refers to the mapping between locations in
`main memory and cache sectors. At one extreme of cache
`associativity is a direct-mapped cache, while at another
`extreme is a fully associative cache. In a direct mapped
`cache, a specific memory location can be mapped to only a
`single cache line. Direct-mapped caches have the advantage
`of being fast and easy to construct in logic. The disadvantage
`is that they suffer the maximum number of cache conflicts.
`At the other extreme, a fully associative cache allows a
`specific location in memory to be mapped to any cache line.
`Fully associative caches tend to be slower and more com
`plex due to the large amount of comparison logic they need,
`but suffer no cache conflict misses. Oftentimes, caches fall
`between the extremes of direct-mapped and fully associative
`caches. A design point between the extremes is a k-set
`associative cache, where each memory location can map to
`k cache sectors. These caches generally have less overhead
`than fully associative caches, and reduce cache conflicts by
`increasing the value of k.
`Another consideration in cache design is how cache lines
`are replaced due to a capacity or conflict miss. In a direct
`mapped cache, there is only one possible cache line that can
`be replaced due to a miss. However, in caches with higher
`levels of associativity, cache lines can be replaced in more
`that one way. The way the cache lines are replaced is
`referred to as the replacement policy.
`Options for the replacement policy include least recently
`used (LRU), random replacement, and first in-first out
`(FIFO). LRU is used in the majority of circumstances where
`the temporal locality set is smaller than the cache size, but
`it is normally more expensive to build in hardware than a
`random replacement cache. An LRU policy can also quickly
`degrade depending on the working set size. For example,
`consider an iterative application with a matrix size of N
`bytes running through a LRU cache of size M bytes. If N is
`less than M, then the policy has the desired behavior of
`100% cache hits, however, if N is only slightly larger than
`M, the LRU policy results in 0% cache hits as lines are
`removed just as they are needed.
`Another consideration is deciding on a write policy for the
`cache. Write-through caches send data through the cache
`hierarchy to main memory. This policy reduces cache coher
`ency issues for multiple processor Systems and is best Suited
`for data that will not be re-read by the processor in the
`immediate future. In contrast, write-back caches place a
`copy of the data in the cache, but does not immediately
`update main memory. This type of caching works best when
`a data just written to the cache is quickly requested again by
`the processor.
`In addition to write-through and write-back caches,
`another kind of write policy is implemented in a write
`allocate cache where a cache line is allocated on a write that
`misses in cache. Write-allocate caches improve performance
`
For programs with a high degree of spatial locality (e.g.,
stride-one access), wide cache lines are more efficient since
they reduce the number of times a processor has to suffer the
latency of a memory access. However, for programs with
lower levels of spatial locality, or random access, narrow
lines are best, as they reduce the wasted bandwidth from the
unused neighbors in the cache line. Caches designed with
wide cache lines perform well with programs that have a high
degree of spatial locality, but generally have poor
gather/scatter performance. Likewise, caches with short cache
lines have good gather/scatter performance, but lose
efficiency executing programs with high spatial locality
because of the additional runs to the main memory.
Another consideration in cache design is cache associativity,
which refers to the mapping between locations in main
memory and cache sectors. At one extreme of cache
associativity is a direct-mapped cache, while at the other
extreme is a fully associative cache. In a direct-mapped cache,
a specific memory location can be mapped to only a single
cache line. Direct-mapped caches have the advantage of being
fast and easy to construct in logic. The disadvantage is that
they suffer the maximum number of cache conflicts. At the
other extreme, a fully associative cache allows a specific
location in memory to be mapped to any cache line. Fully
associative caches tend to be slower and more complex due to
the large amount of comparison logic they need, but suffer no
cache conflict misses. Oftentimes, caches fall between the
extremes of direct-mapped and fully associative caches. A
design point between the extremes is a k-set associative
cache, where each memory location can map to k cache
sectors. These caches generally have less overhead than fully
associative caches, and reduce cache conflicts by increasing
the value of k.
Another consideration in cache design is how cache lines are
replaced due to a capacity or conflict miss. In a direct-mapped
cache, there is only one possible cache line that can be
replaced due to a miss. However, in caches with higher levels
of associativity, cache lines can be replaced in more than one
way. The way the cache lines are replaced is referred to as the
replacement policy.
Options for the replacement policy include least recently used
(LRU), random replacement, and first in-first out (FIFO).
LRU is used in the majority of circumstances where the
temporal locality set is smaller than the cache size, but it is
normally more expensive to build in hardware than a random
replacement cache. An LRU policy can also quickly degrade
depending on the working set size. For example, consider an
iterative application with a matrix size of N bytes running
through an LRU cache of size M bytes. If N is less than M,
then the policy has the desired behavior of 100% cache hits;
however, if N is only slightly larger than M, the LRU policy
results in 0% cache hits as lines are removed just as they are
needed.
Another consideration is deciding on a write policy for the
cache. Write-through caches send data through the cache
hierarchy to main memory. This policy reduces cache
coherency issues for multiple processor systems and is best
suited for data that will not be re-read by the processor in the
immediate future. In contrast, a write-back cache places a
copy of the data in the cache, but does not immediately
update main memory. This type of caching works best when
data just written to the cache is quickly requested again by the
processor.
In addition to write-through and write-back caches, another
kind of write policy is implemented in a write-allocate cache,
where a cache line is allocated on a write that misses in the
cache. Write-allocate caches improve performance
`when the microprocessor exhibits a lot of write followed by
`read behavior. However, when writes are not subsequently
`read, a write-allocate cache has a number of disadvantages:
`When a cache line is allocated, it is necessary to read the
`remaining values from main memory to complete the cache
`line. This adds unnecessary memory read traffic during store
`operations. Also, when the data is not read again, potentially
`useful data in the cache is displaced by the unused data.
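For illustration only, the store-miss behavior of the three write policies discussed above can be sketched in C as follows; this is a generic software model, not the cache design of any particular processor, and the names are hypothetical. The extra line fill performed in the write-allocate case is the unnecessary memory read traffic noted above:

    /* Traffic to main memory caused by one store that misses in the cache. */
    enum write_policy { WRITE_THROUGH, WRITE_BACK, WRITE_ALLOCATE };

    struct mem_traffic {
        unsigned long line_reads;   /* cache-line fills from main memory */
        unsigned long word_writes;  /* words written through to memory   */
    };

    static void store_miss(enum write_policy policy, struct mem_traffic *t)
    {
        switch (policy) {
        case WRITE_THROUGH:
            t->word_writes++;   /* data is sent through to main memory  */
            break;
        case WRITE_BACK:
            /* with no allocation on a write miss, the word goes straight
             * to memory; a later hit would be captured in the cache and
             * written back only when the dirty line is evicted          */
            t->word_writes++;
            break;
        case WRITE_ALLOCATE:
            t->line_reads++;    /* fill the rest of the line from memory */
            /* the written word then stays in the cache until eviction   */
            break;
        }
    }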
Another consideration is the tradeoff between the size and
the speed of the cache: small caches are typically much faster
`than larger caches, but store less data and fewer instructions.
`Less data means a greater chance the cache will not have
`data the microprocessor is requesting (i.e., a cache miss)
`which can slow everything down while the data is being
`retrieved from the main memory.
`Newer cache designs reduce the frequency of cache
`misses by trying to predict in advance the data that the
`microprocessor will request. An example of this type of
cache is one that supports speculative execution and branch
`prediction. Speculative execution allows instructions that
`likely will be executed to start early based on branch
`prediction. Results are stored in a cache called a reorder
`buffer and retired if the branch was correctly predicted. Of
course, when mis-predictions occur, instruction and data
`bandwidth are wasted.
`There are additional considerations and tradeoffs in cache
`design, but it should be apparent from the considerations
`described hereinbefore that it is very difficult to design a
`single cache structure that is optimized for many different
`programs. This makes cache design particularly challenging
`for a multipurpose microprocessor that executes a wide
variety of programs. Cache designers try to derive the
program behavior of an "average" program constructed from
several actual programs that run on the microprocessor. The
cache is optimized for the average program, but no actual
program behaves exactly like the average program. As a
result, the designed cache ends up being sub-optimal for
`nearly every program actually executed by the micropro
`cessor. Thus, there is a need for memory hierarchies that
`have data storage and retrieval characteristics that are opti
`mized for actual programs executed by a processor.
`Designers trying to develop ever more efficient caches
`optimized for a variety of actual programs also face another
`problem: as caches add additional features, the overhead
`needed to implement the added features also grows. Caches
`today have so much overhead that microprocessor perfor
`mance may be reaching a point of diminishing returns as the
`overhead starts to cut into performance. In the Intel Pentium
`III processor for example, more than half of the 10 million
`transistors are dedicated to instruction cache, branch pre
diction, out-of-order execution and superscalar logic. The
`situation has prompted predictions that as microprocessors
`grow to a billion transistors per chip, performance increases
`will drop to about 20% per year. Such a prediction, if borne
`out, could have a significant impact on technology growth
`and the computer business.
`Thus, there is a growing need to develop improved
`memory hierarchies that limit the overhead of a memory
`hierarchy without also reducing bandwidth efficiency and
`utilization.
`
`4
`processor memory and Supplies the data to the computa
`tional unit, and where the computational unit and the data
`access unit are configured by a program.
`The present invention also involves a reconfigurable
`processor that includes a first memory of a first type and a
`data prefetch unit coupled to the memory, where the data
`prefetch unit retrieves data from a second memory of a
`second type different from the first type, and the first and
`second memory types and the data prefetch unit are config
`ured by a program.
`Another embodiment of the invention includes a recon
`figurable hardware system that includes a common memory,
`also referred to as external memory, and one or more
`reconfigurable processors coupled to the common memory,
`where at least one of the reconfigurable processors includes
`a data prefetch unit to read and write data between the unit
`and the common memory, and where the data prefetch unit
`is configured by a program executed on the system.
`Another embodiment of the invention includes a method
`of transferring data that includes transferring data between a
`memory and a data prefetch unit in a reconfigurable pro
`cessor, transferring data between the prefetch unit and a data
`access unit, and transferring the data between a computa
`tional unit and the data access unit, where the computational
`unit, data access unit and the data prefetch unit are config
`ured by a program.
`Additional embodiments of the invention are set forth in
`part in the description that follows, and in part will become
`apparent to those skilled in the art upon examination of the
`following specification, or may be learned by the practice of
`the invention. The advantages of the invention may be
`realized and attained by means of the instrumentalities,
`combinations, compositions, and methods particularly
`pointed out in the appended claims.
`
`BRIEF DESCRIPTION OF THE DRAWINGS
`
`FIG. 1 shows a reconfigurable processor in which the
`present invention may be implemented;
FIG. 2 shows computational logic as might be loaded into
a reconfigurable processor;
FIG. 3 shows a reconfigurable processor as in FIG. 1, but
with the addition of data access units;
FIG. 4 shows a reconfigurable processor as in FIG. 3, but
with the addition of data prefetch units;
FIG. 5 shows a reconfigurable processor with the inclusion
of external memory;
FIG. 6 shows reconfigurable processors with external
memory and with an intelligent memory controller;
FIG. 7 shows a reconfigurable processor having a
combination of data prefetch units and data access units
feeding computational logic;
FIG. 8 shows the bandwidth efficiency and utilization
gains obtained when utilizing a data prefetch unit and an
intelligent memory controller to perform strided memory
references;
FIG. 9A and FIG. 9B show the bandwidth efficiency and
utilization gains obtained when utilizing a data prefetch unit
and an intelligent memory controller to perform subset
memory references in the X-Y plane;
FIG. 10A and FIG. 10B show the bandwidth efficiency
and utilization gains obtained when utilizing a data prefetch
unit and an intelligent memory controller to perform subset
memory references in the X-Z plane;
FIG. 11A and FIG. 11B show the bandwidth efficiency
and utilization gains obtained when utilizing a data prefetch
unit and an intelligent memory controller to perform subset
memory references in the Y-Z plane;
FIG. 12A and FIG. 12B show the bandwidth efficiency
and utilization gains obtained when utilizing a data prefetch
unit and an intelligent memory controller to perform subset
memory references in a mini-cube;
FIG. 13 shows the bandwidth efficiency and utilization
gains obtained when utilizing a data prefetch unit and an
intelligent memory controller to perform indirect memory
references; and
FIG. 14 shows the bandwidth efficiency and utilization
gains obtained when utilizing a data prefetch unit and an
intelligent memory controller to perform strided memory
references together with computation.
`
`DETAILED DESCRIPTION
`
`5
`
`10
`
`15
`
`25
`
`30
`
`35
`
1. Definitions:
Direct execution logic (DEL) is an assemblage of
dynamically reconfigurable functional elements that enables
a program to establish an optimized interconnection among
selected functional units in order to implement a desired
computational, data prefetch and/or data access functionality
for maximizing the parallelism inherent in the particular
code.
Reconfigurable Processor is a computing device that
contains reconfigurable components such as FPGAs and
can, through reconfiguration, instantiate an algorithm as
hardware.
Reconfigurable Logic is composed of an interconnection
of functional units, control, and storage that implements an
algorithm and can be loaded into a Reconfigurable
Processor.
Functional Unit is a set of logic that performs a specific
operation. The operation may, for example, be arithmetic,
logical, control, or data movement. Functional units are used
as building blocks of reconfigurable logic.
Macro is another name for a functional unit.
Memory Hierarchy is a collection of memories.
Data Prefetch Unit is a functional unit that moves data
between members of a memory hierarchy. The movement
may be as simple as a copy, or as complex as an indirect
indexed strided copy into a unit stride memory (a software
sketch of such a copy follows these definitions).
Data Access Unit is a functional unit that accesses a
component of a memory hierarchy, and delivers data directly
to computational logic.
Intelligent Memory Control Unit is a control unit that
has the ability to select data from its storage according to a
variety of algorithms that can be selected by a data requester,
such as a data prefetch unit.
Bandwidth Efficiency is defined as the percentage of
contributory data transferred between two points.
Contributory data is data that actually participates in the
recipient's processing.
Bandwidth Utilization is defined as the percentage of
maximum bandwidth between two points that is actually
used to pass contributory data.
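As a point of reference for the Data Prefetch Unit definition above, the following C fragment is a purely illustrative software sketch (the actual unit is reconfigurable hardware, and the names here are hypothetical) of an indirect indexed strided copy that gathers scattered elements from a large memory into a compact, unit-stride buffer such as block RAM:

    #include <stddef.h>

    /* Gather `count` elements from `src`: the i-th element copied is the
     * one selected by every `stride`-th entry of the index table, and the
     * results are deposited contiguously (unit stride) in `dst`.          */
    static void indirect_strided_copy(const double *src,
                                      const size_t *index_table,
                                      size_t count, size_t stride,
                                      double *dst)
    {
        for (size_t i = 0; i < count; i++)
            dst[i] = src[index_table[i * stride]];
    }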
`2. Description
`A reconfigurable processor (RP) 100 implements direct
executable logic (DEL) to perform computation, as well as a
`memory hierarchy for maintaining input data and computa
`tional results. DEL is an assemblage of dynamically recon
`figurable functional elements that enables a program to
`establish an optimized interconnection among selected func
`tional units in order to implement a desired computational,
`data prefetch and/or data access functionality for maximiz
ing the parallelism inherent in the particular code. The term
`DEL may also be used to refer to the set of constructs such
`as code, data, configuration variables, and the like that can
`be loaded into RP 100 to cause RP 100 to implement a
`particular assemblage of functional elements.
`FIG. 1 presents an RP 100, which may be implemented
`using field programmable gate arrays (FPGAs) or other
`reconfigurable logic devices, that can be configured and
`reconfigured to contain functional units and interconnecting
`circuits, and a memory hierarchy comprising on-board
memory banks 104, on-chip block RAM 106, registers, wires,
and a connection 108 to external memory. On-chip
reconfigurable components 102 create memory structures
such as registers, FIFOs, wires and arrays using block RAM.
Dual-ported memory 106 is shared between on-chip
reconfigurable components 102. The reconfigurable processor
100 also implements user-defined computational logic (e.g.,
DEL 200 shown in FIG. 2) constructed by programming an
FPGA to implement a particular interconnection of
computational functional units. In a particular
implementation, a number of RPs 100 are implemented within
a memory subsystem of a conventional computer, such as on
devices that are physically installed in dual inline memory
module (DIMM) sockets of a computer. In this manner the
RPs 100 can be accessed by memory operations and so
coexist well with a more conventional hardware platform. It
should be noted that, although the exemplary implementation
of the present invention illustrated includes six banks of
dual-ported memory 104 and two reconfigurable components
102, any number of memory banks and/or reconfigurable
components may be used depending upon the particular
implementation or application.
`Any computer program, including complex graphics pro
`cessing programs, word processing programs, database pro
`grams and the like, is a collection of algorithms that interact
`to implement desired functionality. In the common case in
`which static computing hardware resources are used (e.g., a
`conventional microprocessor), the computer program is
`compiled into a set of executable code (i.e., object code)
`units that are linked together to implement the computer
`program on the particular hardware resources. The execut
`able code is generated specifically for a particular hardware
`platform. In this manner, the computer program is adapted
`to conform to the limitations of the static hardware platform.
`However, the compilation process makes many compro
`mises based on the limitations of the static hardware plat
`form.
`Alternatively, an algorithm can be defined in a high level
`language then compiled into DEL. DEL can be produced via
`a compiler from high level programming languages such as
`C or FORTRAN or may be designed using a hardware
definition language such as Verilog, VHDL or a schematic
`capture tool. Computation is performed by reconfiguring a
`reconfigurable processor with the DEL and flowing data
`through the computation. In this manner, the hardware
`resources are essentially adapted to conform to the program
`rather than the program being adapted to conform to the
`hardware resources.
`For purposes of this description a single reconfigurable
`processor will be presented first. A sample of computational
`logic 201 is shown in FIG. 2. This simple assemblage of
functional units performs computation of two results ("A+B"
and "A+B-(B*C)") from three input variables or operands
"A", "B" and "C". In practice, computational units 201 can
`be implemented to perform very simple or arbitrarily com
`plex computations. The input variables (operands) and out
`put or result variables may be of any size necessary for a
`particular application. Theoretically, any number of oper
`
and result variables may be used/generated by a
`particular DEL. Great complexity of computation can be
supported by adding additional reconfigurable chips and
`processors.
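For purposes of illustration only, the dataflow of FIG. 2 can be written in C as shown below; a DEL implementation would evaluate the same expression graph as pipelined hardware, producing a new pair of results on every clock, whereas this scalar sketch merely states the computation performed (the function name is hypothetical):

    /* Two results computed from operands A, B and C, as in FIG. 2. */
    static void fig2_compute(double a, double b, double c,
                             double *r1, double *r2)
    {
        double sum = a + b;      /* first result:  A + B          */
        *r1 = sum;
        *r2 = sum - (b * c);     /* second result: A + B - (B*C)  */
    }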
For greatest performance, the DEL 200 is constructed as
`parallel pipelined logic blocks composed of computational
`functional units capable of taking data and producing results
`with each clock pulse. The highest possible performance that
`can be achieved is computation of a set of results with each
`clock pulse. To achieve this, data should be available at the
`same rate the computation can consume the data. The rate at
`which data can be supplied to DEL 200 is determined, at
`least in significant part, by the memory bandwidth utiliza
`tion and efficiency. Maximal computational performance
`can be achieved with parallel and pipelined DEL together
`with maximizing the memory bandwidth utilization and
`efficiency. Unlike conventional static hardware platforms,
`however, the memory hierarchy provided in a RP 100 is
`reconfigurable. In accordance with the present invention,
`through the use of data access units and associated memory
`hierarchy components, computational demands and memory
`bandwidth can be matched.
`High memory bandwidth efficiency is achieved when only
`data required for computation is moved within the memory
hierarchy. FIG. 3 shows a simple logic block 300 comprising
`computational functional units 301, control (not shown), and
`data access functional units 303. The data access unit 303
`presents data directly to the computational logic 301. In this
`manner, data is moved from a memory device 305 to the
`computational logic and from the computational logic back
`into a memory device 305 or block RAM memory 307
`within an RP 100.
FIG. 4 illustrates the logic block 300 with the addition of
a data prefetch unit 401. The data prefetch unit 401 moves
`data from one member of the memory hierarchy 305 to
`another 308. Data prefetch unit 401 operates independently
`of other functional units 301, 302 and 303 and can therefore
`operate prior to, in parallel with, or after computational
`logic. This independence of operation permits hiding the
`latency associated with obtaining data for use in computa
`tion. The data prefetch unit deposits data into the memory
`hierarchy within RP 100, where computational logic 301,
`302 and 303 can access it through data access units. In the
example of FIG. 4, prefetch unit 401 is configured to deposit
`data into block RAM memory 308. Hence, the prefetch units
`401 may be operated independently of logic block 300 that
`uses prefetched data.
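The following C fragment is a behavioral software stand-in for the arrangement just described (it is not the hardware of FIG. 4, and the names and sizes are hypothetical): the prefetch step gathers a tile from an on-board memory bank into a block-RAM-like buffer before, or in parallel with, the computational loop, so the latency of the gather is not paid inside the computation:

    #include <stddef.h>

    #define TILE 512

    static double block_ram[TILE];      /* stands in for block RAM 308     */

    /* Data prefetch unit: moves data from a memory bank (e.g., 305) into
     * block RAM, independently of the computational logic.               */
    static void prefetch_tile(const double *bank, size_t base, size_t stride)
    {
        for (size_t i = 0; i < TILE; i++)
            block_ram[i] = bank[base + i * stride];
    }

    /* Computational logic fed through a data access unit: by the time
     * this runs, its operands are already on chip.                       */
    static double compute_tile(void)
    {
        double acc = 0.0;
        for (size_t i = 0; i < TILE; i++)
            acc += block_ram[i];
        return acc;
    }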
`An important feature of the present invention is that many
`types of data prefetch units can be defined so that the
`prefetch hardware can be configured to conform to the needs
`of the algorithms currently implemented by the computa
`tional logic. The specific characteristics of the prefetch can
`be matched with the needs of the computational logic and
`the format and location of data in the memory hierarchy. For
`example, FIG. 9A and FIG. 9B show an external memory
`that is organized in a 128 byte (16 word) block structure.
This organization is optimized for stride-1 access of cache-
based computers. A stride-128 access can result in a very
inefficient use of bandwidth from the memory, since an extra
120 bytes of data are moved for every 8 bytes of requested
data, yielding a 6.25% bandwidth efficiency.
`FIG. 5 shows an example of data prefetch in which there
`are no bandwidth gains since all data fetched from external
`memory blocks is also transferred and used in computational
`units 301 through memory bank access units 303. However,
`bandwidth utilization is increased due to the ability of the
`
`55
`
`8
`data prefetch units 501 to initiate a data transfer in advance
`of the requirement for data by computational logic.
`In accordance with an embodiment of the present inven
`tion, data prefetch units 601 are configured to communicate
`with an intelligent memory controller 603 in FIG. 6 and can
`extract only the desired 8 bytes of data, discard the remain
`der of the memory block, and transmit to the data prefetch
`unit only the requested portion of the stride 128 data. The
`prefetch units 601 then delivers that data to the appropriate
`memory components within the memory hierarchy of the
`logic block 300.
`FIG. 6 shows the prefetch units 601 delivering data to the
`RP's onboard memory banks 305. An onboard memory bank
`data access unit 303 then delivers the data to computational
`logic 301 when required. The data prefetch units 501 couple
`with an intelligent memory controller 601 in the implemen
`tation of FIG. 6 that supports a strided reference pattern,
`which yields a 100% bandwidth efficiency in contrast to the
`6.25% efficiency. Although illustrated as a single block of
`external memory, multiple numbers of external memories
`may be employed as well.
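The controller behavior described above can be modeled in software as follows; this is an illustrative sketch only, not the disclosed implementation, and the names are hypothetical. A strided request serviced by the intelligent controller returns only the requested 8-byte words, so every byte transferred is contributory, whereas a conventional block-oriented memory moves a full 128-byte block for each requested word (8/128, or 6.25%, contributory):

    #include <stddef.h>
    #include <string.h>

    #define BLOCK_BYTES 128

    /* Intelligent controller: gather `count` 8-byte words separated by
     * `byte_stride`, returning only the requested bytes.                 */
    static size_t strided_read(const unsigned char *mem, size_t byte_stride,
                               size_t count, double *dst)
    {
        for (size_t i = 0; i < count; i++)
            memcpy(&dst[i], mem + i * byte_stride, sizeof(double));
        return count * sizeof(double);      /* bytes actually transferred */
    }

    /* Conventional block-based service of the same request, for comparison. */
    static size_t block_read_traffic(size_t count)
    {
        return count * BLOCK_BYTES;         /* one full block per word    */
    }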
In FIG. 7, the combination of data prefetch units 701 and
data access units 703 feeding computational logic 301 such
that bandwidth efficiency and utilization are maximized is
shown. In th