throbber
(12)
`
`United States Patent
`P0Znanovic et al.
`
`(10) Patent No.:
`(45) Date of Patent:
`
`US 7,149,867 B2
`Dec. 12, 2006
`
`US0071.49867B2
`
`(54) SYSTEM AND METHOD OF ENHANCING
`EFFICIENCY AND UTILIZATION OF
`MEMORY BANDWDTH IN
`RECONFIGURABLE HARDWARE
`
`(75) Inventors: Daniel Poznanovic, Colorado Springs,
`CO (US); David E. Caliga, Colorado
`Springs, CO (US); Jeffrey Hammes,
`Colorado Springs, CO (US)
`(73) Assignee: SRC Computers, Inc., Colorado
`Springs, CO (US)
`
`(*) Notice:
`
`Subject to any disclaimer, the term of this
`patent is extended or adjusted under 35
`U.S.C. 154(b) by 0 days
`
`(21) Appl. No.: 10/869,200
`
`(22) Filed:
`
`Jun. 16, 2004
`
`(65)
`
`Prior Publication Data
`US 2004/0260884 A1
`Dec. 23, 2004
`Related U.S. Application Data
`(60) Provisional application No. 60/479,339, filed on Jun.
`18, 2003.
`s
`(51) Int. Cl.
`(2006.01)
`G06F 12/00
`(52) U.S. Cl. ....................................... 711/170, 711/154
`(58) Field of Classification Search
`711/17O 173:
`712/1 s
`- - - - - - - -
`See application file for complete search histo
`pp
`p
`ry.
`References Cited
`U.S. PATENT DOCUMENTS
`
`(56)
`
`6/2000 Huppenthal et al.
`6,076, 152 A
`6,243,791 B1* 6/2001 Vondran, Jr. ................ T11 120
`6,247,110 B1
`6/2001 Huppenthal et al.
`6,460,122 B 1 * 10/2002 Otterness et al. ........... T11,154
`6,507,898 B1* 1/2003 Gibson et al. .............. T11 168
`6,594.736 B1
`6, 2003 Parks
`
`6,714,041 B1* 3/2004 Darling et al. ................ 326/38
`2003/0046492 A1
`3/2003 Gschwind et al.
`... 711.118
`2003/0046530 A1
`3/2003 Poznanovic ................. T13/100
`2003/0084.244 A1* 5/2003 Paulraj ....................... T11 118
`2003, OO88737 A1* 5, 2003 Burton ...
`... 711 118
`2003/0208658 A1* 11/2003 Magoshi ...........
`... 711/122
`2005/0044327 A1
`2/2005 Howard et al. ............. T11 147
`
`OTHER PUBLICATIONS
`“Summary: The Cache Read/Write Process.” The PC Guide, 2001,
`www.pc.guide.com/ref mbsyst cachef func.htm.
`Chien et al., “Safe and Protected Execution for the Morph/AMRM
`Reconfigurable Processor," IEEE, 1999, pp. 1-13.
`IEEE 100: The Authoritative Dictionary of IEEE Standards Terms,
`Standards Information Network, 2000, pp. 874.*
`* cited by examiner
`y
`Primary Examiner Gary Portka
`Assistant Examiner Shane M. Thomas
`(74) Attorney, Agent, or Firm—William J. Kubida; Michael
`C. Martensen; Hogan & Hartson LLP
`(57)
`ABSTRACT
`
`A reconfigurable processor that includes a computational
`unit and a data prefetch unit coupled to the computational
`unit, where the data prefetch unit retrieves data from a
`memory and Supplies the data to the computational unit
`through memory and a data access unit, and where the data
`prefetch unit, memory, and data access unit is configured by
`a program. Also, a reconfigurable hardware system that
`includes a common memory; and one or more reconfig
`urable processors coupled to the common memory, where at
`least one of the reconfigurable processors includes a data
`prefetch unit to read and write data between the unit and the
`common memory, and where the data prefetch unit is
`configured by a program executed on the system. In addi
`tion, a method of transferring data that includes transferring
`data between a memory and a data prefetch unit in a
`reconfigurable processor; and transferring the data between
`a computational unit and the data prefetch unit.
`
`19 Claims, 12 Drawing Sheets
`
`r
`
`MMORY
`BANKB
`
`r
`
`MMORY
`RANKc
`
`MEMORY
`BANKA
`
`To USRAM
`LXI
`
`TUPSTREAM
`RECONFISURABLE
`PROCESSOR MEMORY
`- - - - - - - - - - - - -
`
`305 f
`
`MEMORY
`
`
`
`
`
`PREFETCH
`STRIDED |
`BANKF
`
`Y Y
`PREFETCH
`NT
`
`
`
`MEMRY
`
`MEMry
`BANKO
`
`BLOCKRAM
`MMORY
`
`
`
`TodayNSTREAM
`L03C
`
`Intel Exhibit 1001 - 1
`
`

`

`U.S. Patent
`
`Dec. 12, 2006
`
`Sheet 1 of 12
`
`US 7,149,867 B2
`
`
`
`
`
`
`
`
`
`
`
`
`
`
`
`
`XJESTI (\/9d- “fire) Z O|SOOT,
`
`Intel Exhibit 1001 - 2
`
`

`

`U.S. Patent
`
`8,941.}7
`
`2B
`
`6IW0u0nu1‘mm1%mm1uIIIImcMoxz<mn:mxz<mn:mennDn.................................................................................u
`
`
`
`nowwon
`
`mom
`
`
`
`2m1nfPomm0u2Wtuwm%5mTH\8a
`
`522
`
`7i6mGE
`
`
`
`
`
`SSm$0szaxz<mU2,».x003Eosms.
`
`
`
`8m3%an992.315
`
`mom
`
`Intel Exhi
`
`it 1001 - 3
`
`Intel Exhibit 1001 - 3
`
`
`
`

`

`U.S. Patent
`
`Dec. 12, 2006
`
`Sheet 3 of 12
`
`US 7,149,867 B2
`
`809
`
`ÅRHOWE'W
`
`
`
`
`
`
`
`90€
`
`00€
`
`909
`
`Intel Exhibit 1001 - 4
`
`

`

`U.S. Patent
`
`2,1ceD
`
`00
`
`6................................
`
`2AH\mom
`
`
`
`FSofia895m895mLIr\8
`
`
`
`
`
`IUPmmwmn.IOqummmIOHmmm—WE
`
`IOmemmm
`
`wMQWFW
`
`IOHmnfimn.
`
`10$“.me
`
`_‘QOhw
`
`EQEmLJ\Sm
`
`IDI—SAW
`
`mom
`
`
`
`
`
`mommijNono>m0_>_m—>_._.Zm0_I_._m._.Z_
`
`
`
`>mO_>_m=>_4<ZMmFXm
`
`
`
`>m02m=2I_<ZMM._.Xm
`
`fu.oW4mtmeueuhWSM8m
`
`“08
`
`RV...................................................................................UmIII
`
`7,
`
`68,
`
`m
`
`ME05:0xz<m8m1:5.505$05.).
`
`SEEx004m
`
`>MO—2w2
`
`>m0_>_m__2
`
`0v_z<m_mom
`
`7m.OE
`
`mGE
`
`Intel Exhi
`
`it 1001 - 5
`
`Intel Exhibit 1001 - 5
`
`
`
`

`

`U.S. Patent
`
`Dec. 12, 2006
`
`Sheet 5 of 12
`
`US 7,149,867 B2
`
`
`
`52$.kan:25$“925¢me0..
`
`5::oo<FLOZ
`
`anan8m
`
`569:
`
`5m
`
`uuuuuuuuuuuuuuuuuuuuuuuuuuuuuuuuuuuuuuuuuuuuuuuuuuuuuuuuuuuuuuuuuuuuuuuuuuuuuu
`
`onz<mI5
`
`mom_>_<mXOOIE
`
`>mO_>_m=>_
`
`>mO_>_w_>_
`
`Dxz<m
`
`mom
`
`IOFMumma
`
`DQOkm
`
`10.5“.ng
`
`Dmo_m._.m
`
`1051me
`
`89umLJ\
`
`won
`
`n.OI
`
`Intel Exhi
`
`it 1001 - 6
`
`Intel Exhibit 1001 - 6
`
`
`
`
`
`
`
`

`

`U.S. Patent
`
`Dec. 12, 2006
`
`Sheet 6 of 12
`
`US 7,149,867 B2
`
`
`
`s
`
`3
`
`Intel Exhibit 1001 - 7
`
`

`

`U.S. Patent
`
`Dec. 12, 2006
`
`Sheet 7 of 12
`
`US 7,149,867 B2
`
`
`
`Sar NX2-N-Sa-Aa Ne-N
`
`MA
`a.
`A
`4.
`
`
`
`Intel Exhibit 1001 - 8
`
`

`

`U.S. Patent
`
`Dec. 12, 2006
`
`Sheet 8 of 12
`
`US 7,149,867 B2
`
`
`
`
`
`Z-Z-Z-Z-Z-Z-Z-Z~7
`
`ZOETZ, ZOEZOEZ, ZOEZOEZI, Z
`
`
`
`
`
`N
`A N AN A
`N
`A
`SNNNNNNNNNNSNNNYNNNNNNNNNN
`
`
`
`4-------------------------------------Z-Z-Z-Z
`
`Intel Exhibit 1001 - 9
`
`

`

`U.S. Patent
`
`Dec. 12, 2006
`
`Sheet 9 of 12
`
`US 7,149,867 B2
`
`
`
`
`
`Intel Exhibit 1001 - 10
`
`

`

`U.S. Patent
`
`US 7,149,867 B2
`
`
`
`
`
`
`
`
`
`EEEEEEEE|
`
`Intel Exhibit 1001 - 11
`
`

`

`U.S. Patent
`
`Dec. 12, 2006
`
`Sheet 11 of 12
`
`US 7,149,867 B2
`
`
`
`1.
`1
`D
`
`1.
`
`Of) 5 H
`
`s
`O
`
`A s
`
`Intel Exhibit 1001 - 12
`
`

`

`U.S. Patent
`
`Dec. 12, 2006
`
`Sheet 12 of 12
`
`US 7,149,867 B2
`
`
`
`C
`
`1.
`t
`
`O 3.
`
`S
`O
`
`s
`
`Intel Exhibit 1001 - 13
`
`

`

`US 7,149,867 B2
`
`1.
`SYSTEMAND METHOD OF ENHANCNG
`EFFICIENCY AND UTILIZATION OF
`MEMORY BANDWIDTH IN
`RECONFIGURABLE HARDWARE
`
`RELATED APPLICATIONS
`
`The present invention claims the benefit of U.S. Provi
`sional Patent application Ser. No. 60/479,339 filed on Jun.
`18, 2003, which is incorporated herein by reference in its
`entirety.
`
`10
`
`BACKGROUND OF THE INVENTION
`
`2
`For programs with a high degree of spatial locality (e.g.,
`stride-one access), wide cache lines are more efficient since
`they reduce the number of times a processor has to suffer the
`latency of a memory access. However, for programs with
`lower levels of spatial locality, or random access, narrow
`lines are best as they reduce the wasted bandwidth from the
`unused neighbors in the cache line. Caches designed with
`wide cache lines perform well with programs that have a
`high degree of spatial locality, but generally have poor
`gather/scatter performance. Likewise, caches with short
`cache lines have good gather/scatter performance, but loose
`efficiency executing programs with high spatial locality
`because of the additional runs to the main memory.
`Another consideration in cache design is cache associa
`tivity, which refers to the mapping between locations in
`main memory and cache sectors. At one extreme of cache
`associativity is a direct-mapped cache, while at another
`extreme is a fully associative cache. In a direct mapped
`cache, a specific memory location can be mapped to only a
`single cache line. Direct-mapped caches have the advantage
`of being fast and easy to construct in logic. The disadvantage
`is that they suffer the maximum number of cache conflicts.
`At the other extreme, a fully associative cache allows a
`specific location in memory to be mapped to any cache line.
`Fully associative caches tend to be slower and more com
`plex due to the large amount of comparison logic they need,
`but suffer no cache conflict misses. Oftentimes, caches fall
`between the extremes of direct-mapped and fully associative
`caches. A design point between the extremes is a k-set
`associative cache, where each memory location can map to
`k cache sectors. These caches generally have less overhead
`than fully associative caches, and reduce cache conflicts by
`increasing the value of k.
`Another consideration in cache design is how cache lines
`are replaced due to a capacity or conflict miss. In a direct
`mapped cache, there is only one possible cache line that can
`be replaced due to a miss. However, in caches with higher
`levels of associativity, cache lines can be replaced in more
`that one way. The way the cache lines are replaced is
`referred to as the replacement policy.
`Options for the replacement policy include least recently
`used (LRU), random replacement, and first in-first out
`(FIFO). LRU is used in the majority of circumstances where
`the temporal locality set is smaller than the cache size, but
`it is normally more expensive to build in hardware than a
`random replacement cache. An LRU policy can also quickly
`degrade depending on the working set size. For example,
`consider an iterative application with a matrix size of N
`bytes running through a LRU cache of size M bytes. If N is
`less than M, then the policy has the desired behavior of
`100% cache hits, however, if N is only slightly larger than
`M, the LRU policy results in 0% cache hits as lines are
`removed just as they are needed.
`Another consideration is deciding on a write policy for the
`cache. Write-through caches send data through the cache
`hierarchy to main memory. This policy reduces cache coher
`ency issues for multiple processor Systems and is best Suited
`for data that will not be re-read by the processor in the
`immediate future. In contrast, write-back caches place a
`copy of the data in the cache, but does not immediately
`update main memory. This type of caching works best when
`a data just written to the cache is quickly requested again by
`the processor.
`In addition to write-through and write-back caches,
`another kind of write policy is implemented in a write
`allocate cache where a cache line is allocated on a write that
`misses in cache. Write-allocate caches improve performance
`
`1. Field of the Invention
`The present invention relates, in general, to enhancing the
`efficiency and utilization of memory bandwidth in reconfig
`urable hardware. More specifically, the invention relates to
`implementing explicit memory hierarchies in reconfigurable
`processors that make efficient use of off-board, on-board,
`on-chip storage and available algorithm locality. These
`explicit memory hierarchies avoid many of the tradeoffs and
`complexities found in the traditional memory hierarchies of
`microprocessors.
`2. Relevant Background
`Over the past 30 years, microprocessors have enjoyed
`annual performance gains averaging about 50% per year.
`Most of the gains can be attributed to higher processor clock
`speeds, more memory bandwidth and increasing utilization
`of instruction level parallelism (ILP) at execution time.
`As microprocessors and other dense logic devices (DLDs)
`consume data at ever-increasing rates it becomes more of a
`challenge to design memory hierarchies that can keep up.
`Two measures of the gap between the microprocessor and
`memory hierarchy are bandwidth efficiency and bandwidth
`utilization. Bandwidth efficiency refers to the ability to
`exploit available locality in a program or algorithm. In the
`ideal situation, when there is maximum bandwidth effi
`ciency, all available locality is utilized. Bandwidth utiliza
`tion refers to the amount of memory bandwidth that is
`utilized during a calculation. Maximum bandwidth utiliza
`tion occurs when all available memory bandwidth is uti
`lized.
`Potential performance gains from using a faster micro
`processor can be reduced or even negated by a correspond
`ing drop in bandwidth efficiency and bandwidth utilization.
`Thus, there has been significant effort spent on the devel
`opment of memory hierarchies that can maintain high band
`width efficiency and utilization with faster microprocessors.
`One approach to improving bandwidth efficiency and
`utilization in memory hierarchies has been to develop ever
`more powerful processor caches. These caches are high
`speed memories (typically SRAM) in close proximity to the
`microprocessor that try to keep copies of instructions and
`data the microprocessor may soon need. The microprocessor
`can store and retrieve data from the cache at a much higher
`rate than from a slower, more distant main memory.
`In designing cache memories, there are a number of
`considerations to take into account. One consideration is the
`width of the cache line. Caches are arranged in lines to help
`hide memory latency and exploit spatial locality. When a
`load suffers a cache miss, a new cache line is loaded from
`main memory into the cache. The assumption is that a
`program being executed by the microprocessor has a high
`degree of spatial locality, making it likely that other memory
`locations in the cache line will also be required.
`
`15
`
`25
`
`30
`
`35
`
`40
`
`45
`
`50
`
`55
`
`60
`
`65
`
`Intel Exhibit 1001 - 14
`
`

`

`US 7,149,867 B2
`
`3
`when the microprocessor exhibits a lot of write followed by
`read behavior. However, when writes are not subsequently
`read, a write-allocate cache has a number of disadvantages:
`When a cache line is allocated, it is necessary to read the
`remaining values from main memory to complete the cache
`line. This adds unnecessary memory read traffic during store
`operations. Also, when the data is not read again, potentially
`useful data in the cache is displaced by the unused data.
`Another consideration is made between the size and the
`speed of the cache: Small caches are typically much faster
`than larger caches, but store less data and fewer instructions.
`Less data means a greater chance the cache will not have
`data the microprocessor is requesting (i.e., a cache miss)
`which can slow everything down while the data is being
`retrieved from the main memory.
`Newer cache designs reduce the frequency of cache
`misses by trying to predict in advance the data that the
`microprocessor will request. An example of this type of
`cache is one that Supports speculative execution and branch
`prediction. Speculative execution allows instructions that
`likely will be executed to start early based on branch
`prediction. Results are stored in a cache called a reorder
`buffer and retired if the branch was correctly predicted. Of
`course, when mis-predictions occur instruction and data
`bandwidth are wasted.
`There are additional considerations and tradeoffs in cache
`design, but it should be apparent from the considerations
`described hereinbefore that it is very difficult to design a
`single cache structure that is optimized for many different
`programs. This makes cache design particularly challenging
`for a multipurpose microprocessor that executes a wide
`variety of programs. Cache designers try to derive the
`program behavior of “average' program constructed from
`several actual programs that run on the microprocessor. The
`cache is optimized for the average program, but no actual
`program behaves exactly like the average program. As a
`result, the designed cache ends up being Sub-optimal for
`nearly every program actually executed by the micropro
`cessor. Thus, there is a need for memory hierarchies that
`have data storage and retrieval characteristics that are opti
`mized for actual programs executed by a processor.
`Designers trying to develop ever more efficient caches
`optimized for a variety of actual programs also face another
`problem: as caches add additional features, the overhead
`needed to implement the added features also grows. Caches
`today have so much overhead that microprocessor perfor
`mance may be reaching a point of diminishing returns as the
`overhead starts to cut into performance. In the Intel Pentium
`III processor for example, more than half of the 10 million
`transistors are dedicated to instruction cache, branch pre
`diction, out-of-order execution and SuperScalar logic. The
`situation has prompted predictions that as microprocessors
`grow to a billion transistors per chip, performance increases
`will drop to about 20% per year. Such a prediction, if borne
`out, could have a significant impact on technology growth
`and the computer business.
`Thus, there is a growing need to develop improved
`memory hierarchies that limit the overhead of a memory
`hierarchy without also reducing bandwidth efficiency and
`utilization.
`
`4
`processor memory and Supplies the data to the computa
`tional unit, and where the computational unit and the data
`access unit are configured by a program.
`The present invention also involves a reconfigurable
`processor that includes a first memory of a first type and a
`data prefetch unit coupled to the memory, where the data
`prefetch unit retrieves data from a second memory of a
`second type different from the first type, and the first and
`second memory types and the data prefetch unit are config
`ured by a program.
`Another embodiment of the invention includes a recon
`figurable hardware system that includes a common memory,
`also referred to as external memory, and one or more
`reconfigurable processors coupled to the common memory,
`where at least one of the reconfigurable processors includes
`a data prefetch unit to read and write data between the unit
`and the common memory, and where the data prefetch unit
`is configured by a program executed on the system.
`Another embodiment of the invention includes a method
`of transferring data that includes transferring data between a
`memory and a data prefetch unit in a reconfigurable pro
`cessor, transferring data between the prefetch unit and a data
`access unit, and transferring the data between a computa
`tional unit and the data access unit, where the computational
`unit, data access unit and the data prefetch unit are config
`ured by a program.
`Additional embodiments of the invention are set forth in
`part in the description that follows, and in part will become
`apparent to those skilled in the art upon examination of the
`following specification, or may be learned by the practice of
`the invention. The advantages of the invention may be
`realized and attained by means of the instrumentalities,
`combinations, compositions, and methods particularly
`pointed out in the appended claims.
`
`BRIEF DESCRIPTION OF THE DRAWINGS
`
`FIG. 1 shows a reconfigurable processor in which the
`present invention may be implemented;
`FIG. 2 shows computational logic as might be loaded into
`a reconfigurable processor,
`FIG. 3 shows a reconfigurable processor as in FIG. 1, but
`with the addition of data access units;
`FIG. 4 shows a reconfigurable processor as in FIG. 3, but
`with the addition of data prefetch units:
`FIG. 5 shows reconfigurable processor with the inclusion
`of external memory;
`FIG. 6 shows reconfigurable processors with external
`memory and with an intelligent memory controller,
`FIG. 7 shows a reconfigurable processor having a com
`bination of data prefetch units and data access units feeding
`computational logic;
`FIG. 8 shows the bandwidth efficiency and utilization
`gains obtained when utilizing a data prefetch unit and an
`intelligent memory controller to perform Strided memory
`references;
`FIG. 9A and FIG.9B show the bandwidth efficiency and
`utilization gains obtained when utilizing a data prefetch unit
`and an intelligent memory controller to perform Subset
`memory references in X-Y plane;
`FIG. 10A and FIG. 10B show the bandwidth efficiency
`and utilization gains obtained when utilizing a data prefetch
`unit and an intelligent memory controller to perform Subset
`memory references in X-Z plane;
`FIG. 11A and FIG. 11B show the bandwidth efficiency
`and utilization gains obtained when utilizing a data prefetch
`
`10
`
`15
`
`25
`
`30
`
`35
`
`40
`
`45
`
`50
`
`55
`
`60
`
`SUMMARY OF THE INVENTION
`
`Accordingly, an embodiment of the invention includes a
`reconfigurable processor that includes a computational unit
`and a data access unit coupled to the computational unit,
`where the data access unit retrieves data from an on
`
`65
`
`Intel Exhibit 1001 - 15
`
`

`

`US 7,149,867 B2
`
`5
`unit and an intelligent memory controller to perform Subset
`memory references in Y-Z plane;
`FIG. 12A and FIG. 12B show the bandwidth efficiency
`and utilization gains obtained when utilizing a data prefetch
`unit and an intelligent memory controller to perform Subset
`memory references in a mini-cube:
`FIG. 13 shows the bandwidth efficiency and utilization
`gains obtained when utilizing a data prefetch unit and an
`intelligent memory controller to perform indirect memory
`references;
`FIG. 14 shows the bandwidth efficiency and utilization
`gains obtained when utilizing a data prefetch unit and an
`intelligent memory controller to perform Strided memory
`reference together with computation.
`
`DETAILED DESCRIPTION
`
`5
`
`10
`
`15
`
`25
`
`30
`
`35
`
`1. Definitions:
`Direct execution logic (DEL)—is an assemblage of
`dynamically reconfigurable functional elements that enables
`a program to establish an optimized interconnection among
`selected functional units in order to implement a desired
`computational, data prefetch and/or data access functionality
`for maximizing the parallelism inherent in the particular
`code.
`Reconfigurable Processor is a computing device that
`contains reconfigurable components such as FPGAs and
`can, through reconfiguration, instantiate an algorithm as
`hardware.
`Reconfigurable Logic—is composed of an interconnec
`tion of functional units, control, and storage that implements
`an algorithm and can be loaded into a Reconfigurable
`Processor.
`Functional Unit is a set of logic that performs a specific
`operation. The operation may for example be arithmetic,
`logical, control, or data movement. Functional units are used
`as building blocks of reconfigurable logic.
`Macro is another name for a functional unit.
`Memory Hierarchy is a collection of memories
`Data prefetch Unit is a functional unit that moves data
`between members of a memory hierarchy. The movement
`may be as simple as a copy, or as complex as an indirect
`indexed strided copy into a unit stride memory.
`Data access Unit is a functional unit that accesses a
`component of a memory hierarchy, and delivers data directly
`to computational logic.
`Intelligent Memory Control Unit is a control unit that
`has the ability to select data from its storage according to a
`variety of algorithms that can be selected by a data requester,
`Such as a data prefetch unit.
`50
`Bandwidth Efficiency is defined as the percentage of
`contributory data transferred between two points. Contribu
`tory data is data that actually participates in the recipients
`processing.
`Bandwidth Utilization is defined as the percentage of
`55
`maximum bandwidth between two points that is actually
`used to pass contributory data.
`2. Description
`A reconfigurable processor (RP) 100 implements direct
`executable logic (DEL) to perform computation, as well a
`memory hierarchy for maintaining input data and computa
`tional results. DEL is an assemblage of dynamically recon
`figurable functional elements that enables a program to
`establish an optimized interconnection among selected func
`tional units in order to implement a desired computational,
`data prefetch and/or data access functionality for maximiz
`ing the parallelism inherent in the particular code. The term
`
`40
`
`45
`
`60
`
`65
`
`6
`DEL may also be used to refer to the set of constructs such
`as code, data, configuration variables, and the like that can
`be loaded into RP 100 to cause RP 100 to implement a
`particular assemblage of functional elements.
`FIG. 1 presents an RP 100, which may be implemented
`using field programmable gate arrays (FPGAs) or other
`reconfigurable logic devices, that can be configured and
`reconfigured to contain functional units and interconnecting
`circuits, and a memory hierarchy comprising on-board
`memory banks 104, on-chip block RAM 106, registers
`wires, and a connection 108 to external memory. On-chip
`reconfigurable components 102 create memory structures
`such as registers, FIFOs, wires and arrays using block RAM.
`Dual-ported memory 106 is shared between on-chip recon
`figurable components 102. The reconfigurable processor 100
`also implements user-defined computational logic (e.g., Such
`as DEL 200 shown in FIG. 2) constructed by programming
`an FPGA to implement a particular interconnection of
`computational functional units. In a particular implementa
`tion, a number of RPs 100 are implemented within a
`memory Subsystem of a conventional computer, such as on
`devices that are physically installed in dual inline memory
`module (DIMM) sockets of a computer. In this manner the
`RPs 100 can be accessed by memory operations and so
`coexist well with a more conventional hardware platform. It
`should be noted that, although the exemplary implementa
`tion of the present invention illustrated includes six banks of
`dual ported memory 104 and two reconfigurable compo
`nents 102, any number of memory banks and/or reconfig
`urable components may be used depending upon the par
`ticular implementation or application.
`Any computer program, including complex graphics pro
`cessing programs, word processing programs, database pro
`grams and the like, is a collection of algorithms that interact
`to implement desired functionality. In the common case in
`which static computing hardware resources are used (e.g., a
`conventional microprocessor), the computer program is
`compiled into a set of executable code (i.e., object code)
`units that are linked together to implement the computer
`program on the particular hardware resources. The execut
`able code is generated specifically for a particular hardware
`platform. In this manner, the computer program is adapted
`to conform to the limitations of the static hardware platform.
`However, the compilation process makes many compro
`mises based on the limitations of the static hardware plat
`form.
`Alternatively, an algorithm can be defined in a high level
`language then compiled into DEL. DEL can be produced via
`a compiler from high level programming languages such as
`C or FORTRAN or may be designed using a hardware
`definition language Such as Verilog, VHDL or a schematic
`capture tool. Computation is performed by reconfiguring a
`reconfigurable processor with the DEL and flowing data
`through the computation. In this manner, the hardware
`resources are essentially adapted to conform to the program
`rather than the program being adapted to conform to the
`hardware resources.
`For purposes of this description a single reconfigurable
`processor will be presented first. A sample of computational
`logic 201 is shown in FIG. 2. This simple assemblage of
`functional units performs computation of two results ("A+
`B' and “A+B-(B*C)) from three input variables or operands
`“A”, “B” and “C”. In practice, computational units 201 can
`be implemented to perform very simple or arbitrarily com
`plex computations. The input variables (operands) and out
`put or result variables may be of any size necessary for a
`particular application. Theoretically, any number of oper
`
`Intel Exhibit 1001 - 16
`
`

`

`US 7,149,867 B2
`
`10
`
`15
`
`25
`
`30
`
`7
`ands and result variables may be used/generated by a
`particular DEL. Great complexity of computation can be
`Supported by adding additional reconfigurable chips and
`processors.
`For greatest performance the DEL 200 is constructed as
`parallel pipelined logic blocks composed of computational
`functional units capable of taking data and producing results
`with each clock pulse. The highest possible performance that
`can be achieved is computation of a set of results with each
`clock pulse. To achieve this, data should be available at the
`same rate the computation can consume the data. The rate at
`which data can be supplied to DEL 200 is determined, at
`least in significant part, by the memory bandwidth utiliza
`tion and efficiency. Maximal computational performance
`can be achieved with parallel and pipelined DEL together
`with maximizing the memory bandwidth utilization and
`efficiency. Unlike conventional static hardware platforms,
`however, the memory hierarchy provided in a RP 100 is
`reconfigurable. In accordance with the present invention,
`through the use of data access units and associated memory
`hierarchy components, computational demands and memory
`bandwidth can be matched.
`High memory bandwidth efficiency is achieved when only
`data required for computation is moved within the memory
`hierarchy. FIG.3 shows a simple logic block 300 comprising
`computational functional units 301, control (not shown), and
`data access functional units 303. The data access unit 303
`presents data directly to the computational logic 301. In this
`manner, data is moved from a memory device 305 to the
`computational logic and from the computational logic back
`into a memory device 305 or block RAM memory 307
`within an RP 100.
`FIG. 4 illustrates the logic block 300 with an addition of
`a data prefetch unit 401. The data prefetch unit 401 moves
`35
`data from one member of the memory hierarchy 305 to
`another 308. Data prefetch unit 401 operates independently
`of other functional units 301, 302 and 303 and can therefore
`operate prior to, in parallel with, or after computational
`logic. This independence of operation permits hiding the
`latency associated with obtaining data for use in computa
`tion. The data prefetch unit deposits data into the memory
`hierarchy within RP 100, where computational logic 301,
`302 and 303 can access it through data access units. In the
`example of FIG.4, prefetch unit 401 is configured to deposit
`data into block RAM memory 308. Hence, the prefetch units
`401 may be operated independently of logic block 300 that
`uses prefetched data.
`An important feature of the present invention is that many
`types of data prefetch units can be defined so that the
`prefetch hardware can be configured to conform to the needs
`of the algorithms currently implemented by the computa
`tional logic. The specific characteristics of the prefetch can
`be matched with the needs of the computational logic and
`the format and location of data in the memory hierarchy. For
`example, FIG. 9A and FIG. 9B show an external memory
`that is organized in a 128 byte (16 word) block structure.
`This organization is optimized for Stride 1 access of cache
`based computers. A stride 128 access can result in a very
`inefficient use of bandwidth from the memory, since an extra
`120 bytes of data is moved for every 8 bytes of requested
`data yielding a 6.25% bandwidth efficiency.
`FIG. 5 shows an example of data prefetch in which there
`are no bandwidth gains since all data fetched from external
`memory blocks is also transferred and used in computational
`units 301 through memory bank access units 303. However,
`bandwidth utilization is increased due to the ability of the
`
`55
`
`8
`data prefetch units 501 to initiate a data transfer in advance
`of the requirement for data by computational logic.
`In accordance with an embodiment of the present inven
`tion, data prefetch units 601 are configured to communicate
`with an intelligent memory controller 603 in FIG. 6 and can
`extract only the desired 8 bytes of data, discard the remain
`der of the memory block, and transmit to the data prefetch
`unit only the requested portion of the stride 128 data. The
`prefetch units 601 then delivers that data to the appropriate
`memory components within the memory hierarchy of the
`logic block 300.
`FIG. 6 shows the prefetch units 601 delivering data to the
`RP's onboard memory banks 305. An onboard memory bank
`data access unit 303 then delivers the data to computational
`logic 301 when required. The data prefetch units 501 couple
`with an intelligent memory controller 601 in the implemen
`tation of FIG. 6 that supports a strided reference pattern,
`which yields a 100% bandwidth efficiency in contrast to the
`6.25% efficiency. Although illustrated as a single block of
`external memory, multiple numbers of external memories
`may be employed as well.
`In FIG. 7, the combination of data prefetch units 701 and
`data access units 703 feeding computational logic 301 such
`that bandwidth efficiency and utilization are maximized is
`shown in FIG. 7. In th

This document is available on Docket Alarm but you must sign up to view it.


Or .

Accessing this document will incur an additional charge of $.

After purchase, you can access this document again without charge.

Accept $ Charge
throbber

Still Working On It

This document is taking longer than usual to download. This can happen if we need to contact the court directly to obtain the document and their servers are running slowly.

Give it another minute or two to complete, and then try the refresh button.

throbber

A few More Minutes ... Still Working

It can take up to 5 minutes for us to download a document if the court servers are running slowly.

Thank you for your continued patience.

This document could not be displayed.

We could not find this document within its docket. Please go back to the docket page and check the link. If that does not work, go back to the docket and refresh it to pull the newest information.

Your account does not support viewing this document.

You need a Paid Account to view this document. Click here to change your account type.

Your account does not support viewing this document.

Set your membership status to view this document.

With a Docket Alarm membership, you'll get a whole lot more, including:

  • Up-to-date information for this case.
  • Email alerts whenever there is an update.
  • Full text search for other cases.
  • Get email alerts whenever a new case matches your search.

Become a Member

One Moment Please

The filing “” is large (MB) and is being downloaded.

Please refresh this page in a few minutes to see if the filing has been downloaded. The filing will also be emailed to you when the download completes.

Your document is on its way!

If you do not receive the document in five minutes, contact support at support@docketalarm.com.

Sealed Document

We are unable to display this document, it may be under a court ordered seal.

If you have proper credentials to access the file, you may proceed directly to the court's system using your government issued username and password.


Access Government Site

We are redirecting you
to a mobile optimized page.





Document Unreadable or Corrupt

Refresh this Document
Go to the Docket

We are unable to display this document.

Refresh this Document
Go to the Docket