throbber
Delivering Acceleration:
`The Potential for Increased HPC
`Application Performance
`Using Reconfigurable Logic
`
`David Caliga
`SRC Computers, Inc
`4240 N. Nevada Ave
`Colorado Springs, CO 80907
`719-262-0213
`
`David Peter Barker
`SUPERsmith
`PO Box 2226
`Salinas, CA 93902-2226
`831-442-8343
`
`david.caliga@srccomp.com
`
`dbarker@supersmith.com
`
`ABSTRACT
`SRC Computers, Inc. has integrated adaptive computing into
`its SRC-6 high-end server,
`incorporating reconfigurable
`processors as peers to the microprocessors. Performance
`improvements resulting from reconfigurable computing can
`provide orders of magnitude speedups for a wide variety of
`algorithms. Reconfigurable logic in Field Programmable Gate
`Arrays (FPGAs) has shown great advantage to date in special
`purpose applications and specialty hardware. SRC Computers
`is working to bring this technology into the general purpose
`HPC world via an advanced system interconnect and enhanced
`compiler technology.
`
`Categories and Subject Descriptors
`Compiler technology
`
`General Terms
`Algorithms, Performance, Design, Experimentation
`
`Keywords
`Reconfigurable computing, FPGA
`
`Permission to make digital or hard copies of all or part of
`this work for personal or classroom use is granted without
`fee provided that copies are not made or distributed for
`profit or commercial advantage, and that copies bear this
`notice and the full citation on the first page. To copy
`otherwise, to republish, to post on servers or to redistribute
`to lists, requires prior specific permission and/or a fee.
`
`SC2001 November 2001, Denver (c) 2001 ACM 1-58113-
`293-X/01/0011 $5.00
`Proceedings of the ACM/IEEE SC2001 Conference (SC’01)
`1-58113-293-X/01 $20.00 © 2001 ACM
`
`Patent Owner Saint Regis Mohawk Tribe
`Ex. 2008, p. 1
`
`

`

`Real Crossbu
`
`T
`
`Wrlta
`
`ar
`
`i'!_"" I I
`
`;Proc
`
`;Pr0c
`
`L2
`
`L2
`
`mot
`
`Private
`Mammy
`
`IPOI-SIM
`
`Commodity p.Process or Board
`
`1. INTRODUCTION
`The SRC-6 system is a unique architecture capable of
`supporting a combination of up to 512 Intel microprocessors
`and 256 Multi-Adaptive Processors (MAP) on a common
`shared memory. A patented SRC crossbar switch design
`supports the connection of these processors with up to 256
`memory ports with each port containing 16 banks of SDRAM
`(see Figure 1). The architectural design of the memory sub-
`system provides flat access latency from any microprocessor or
`reconfigurable processor
`in the SRC-6 system.
`This
`combination will offer in excess of 3 TFlops of theoretical
`peak [256 Pentium 4 processors and 128 MAPs on 32b
`floating-point data]. SRC Computers is pushing forward to
`harness this into sustained performance.
`
`This paper explains how SRC Computers, Inc. has made
`advances
`in
`the
`reconfigurable
`computing
`field
`by
`incorporating FPGA technology at many levels in the SRC-6
`system architecture.
`
`2. RECONFIGURABLE COMPUTING
`DEFINED
`In simplest terms, reconfigurable computing, based on FPGA
`technology,
`could
`be
`defined
`as
`the
`capability
`of
`reprogramming hardware to execute logic that is designed and
`optimized for
`a
`specific user’s
`algorithms. Associated
`compiling technology can provide a transparent method of
`integrating the computational capability of FPGA technology
`and microprocessors into a single application executable code.
`The use of such integrated compiling technologies enables
`reconfigurable architectures to extend beyond the FPGAs and
`the “glueware” that attaches them to the host computer.
`
`Automatic compilation of applications onto reconfigurable
`architectures generates the logic for both the specific hardware
`configuration and also the execution management of the FPGA
`resources. The
`compilation environment must
`also be
`extensible to a wide range of user applications on a single
`system.
`
`Also described is SRC’s patented FPGA-based MAP that user
`applications
`can
`utilize
`to
`deliver
`algorithm-specific
`computational acceleration.
`
`SRC Computers has taken into account all of these factors in
`developing its unique reconfigurable computing capability for
`the SRC-6 system.
`
`Figure 1. Overview of SRC-6 System Architecture
`
`Proceedings of the ACM/IEEE SC2001 Conference (SC’01)
`1-58113-293-X/01 $20.00 © 2001 ACM
`
`Patent Owner Saint Regis Mohawk Tribe
`Ex. 2008, p. 2
`
`

`

`l.1II-II-II:Iml M-IIrri:r.,'
`
`3. MULTI-ADAPTIVE PROCESSOR
`(MAP)
`its Multi-Adaptive
`the SRC-6 system is
`The heart of
`Processing feature provided by MAP units (see Figure 2).
`These units utilize hardware-implemented functions, which can
`greatly accelerate
`application algorithms over
`compiler
`implemented instruction sets for microprocessors.
`
`Major architectural characteristics of MAP include:
`
`•
`
`•
`
`Each MAP executes independently of the general-purpose
`processors, including loading and storing needed data,
`after being provided with a list of commands to execute.
`The control of the MAP is done by the application via a
`Command List (COMLIST). The COMLIST contains a
`list of controlling instructions for
`the Control Chip.
`Examples of functions performed by these instructions are
`Direct Memory Access (DMA) reads and writes and the
`execution synchronization with the User Array.
`• MAP units have access to Common Memory (CM).
`•
`Common Memory addresses specified by commands or
`generated by user applications are virtual addresses.
`Addresses
`are
`translated,
`virtual
`to
`physical,
`by
`Translation Look-aside Buffer (TLB) entries with the
`DMA logic of each MAP.
`The User Array portion of a MAP unit is configured for
`algorithmic requirements. This logic can read and write
`on-board memory through multiple ports and can interact
`with the control logic and DMA Engine.
`By means of chain ports, MAP units can communicate
`between
`themselves without
`using
`any memory
`bandwidth. Thus a particular MAP can send partial
`results to another unit, or similarly, can receive such
`partial results from another MAP.
`Full
`system interrupt and semaphore capability is
`available between MAP units or the microprocessors.
`
`•
`
`•
`
`•
`
`Multi-processor applications can easily utilize the MAP by
`identifying the relevant portions of the parallel computational
`algorithms. The application can utilize N microprocessors and
`M MAPs.
`
`4. RECONFIGURATION LOGIC
`PERFORMANCE POTENTIAL
`The attraction of using FPGAs in SRC’s MAP is the ability to
`generate algorithm specific logic, which has the potential of
`orders of magnitude speedups for computationally intensive
`algorithms. There have been many demonstrations showing
`this magnitude of
`speedup in algorithms
`for genetic
`sequencing, encryption/de-encryption,
`string searches and
`integer forms of image and signal processing.
`
`The performance improvement of these algorithms can come
`from any or all of the following:
`
`•
`
`• Memory bandwidth improvements
`- Six ports to memory
`Data flow parallelism
`- Bit-sized data allows for multiple parallel processing
`streams per 64 bits of data read from memory or
`algorithms that may need multiple 32 or 64-bit input
`values
`Computational block level re-scheduling
`- Re-schedule independent computations to be concurrent
`in time
`Instruction Set Architecture (ISA) effectiveness
`- Create operations that are “right-sized” in bits relative to
`the type of data, i.e. 6-bit or 256-bit integer operations
`
`•
`
`•
`
`show potential
`examples
`following
`The
`improvement contributors of an FPGA over
`microprocessor.
`
`performance
`that of a
`
`Figure 2. MAP Block Diagram
`
`Proceedings of the ACM/IEEE SC2001 Conference (SC’01)
`1-58113-293-X/01 $20.00 © 2001 ACM
`
`Patent Owner Saint Regis Mohawk Tribe
`Ex. 2008, p. 3
`
`

`

`Example 1:
`
`This example shows a performance comparison between an FPGA and a 900-MHz microprocessor.
`
`Perform an integer add operation on each 4 bits of a data stream. The MAP reads in 64 bits of data from each memory bank.
`The algorithm will use 3 memory banks to read data from the input stream and the logic will create 16 parallel computation
`streams.
`
`Feature
`
`Clock rate
`Memory bandwidth
`Data flow parallelism
`Block level re-scheduling
`ISA effectiveness
`
`Total
`
`Speedup over
`Microprocessor
`100/900 or 1/9
`3
`16
`1
`10
`
`53.3
`
`Derivation
`
`100 for the FPGA and 900 for the microprocessor
`3 memory banks to read data from input stream
`16 parallel computation streams
`None
`Number of instructions on the microprocessor required to perform
`the equivalent integer operation on the 4-bit data value
`1/9 *3 * 16* 1 * 10 = 53.3
`
`Algorithm Pseudo Code
`
`Segment of Logic Flow
`
`Loop over 4-bit values in 32-bit data value
`(isrc) and add a 4-bit value to input
`value. Store resulting 4-bit value (ires)
`
` len = 4 //4-bit value
` ipos = 29 //start at position 29
`
` Do 100 j = 1, 16
` ibgn = (j-1)* 4 + 1
`
` mvbits (isrc,ibgn,len,itemp,ipos)
`
` ires = idest + t_b4(j)
` //4-bit value stored in ires
`
` mvbits (ires,ipos,len,idest,ibgn)
`100 Continue
`
`J = 0
`
`J = J + 1
`
`Address
`Generator
`Mem Bank1
`
`Data
`I+1
`
`I
`
`Address
`Generator
`Mem Bank2
`
`Data
`I+1
`
`I
`
`Address
`Generator
`Mem Bank3
`
`Data
`I+1
`
`I
`
`t_b*4
`
`(0:3)
`
`(60:63)
`(56:59)
`
`(4:7)
`
`t_b*4
`
`(0:3)
`
`(60:63)
`(56:59)
`
`(4:7)
`
`(0:3)
`
`(60:63)
`(56:59)
`
`(4:7)
`
`Yes
`
`I4_Add
`
`. . . .
`
`I4_Add
`
`J < End
`
`Proceedings of the ACM/IEEE SC2001 Conference (SC’01)
`1-58113-293-X/01 $20.00 © 2001 ACM
`
`Patent Owner Saint Regis Mohawk Tribe
`Ex. 2008, p. 4
`
`

`

`Example 2:
`
`The following example shows a comparison between an FPGA and the Alpha EV6.
`
`Perform a 32-bit floating-point convolution filter on a set of vector data. The convolution filter will be 64 points. The filter will be
`stored in a set of registers in the FPGA chip. Two values of the input data will be read every clock. The output computation rate
`will generate two output values every clock.
`
`Operation
`Read input data values
`Operations every clock
`
`FPGA
`2 values every clock
`128 Multiply-Adds
`
`DEC Alpha EV6
`1 value every clock
`1 Mult and 1 Add
`
`Feature
`
`Clock rate
`Memory bandwidth
`Data flow parallelism
`Block level re-scheduling
`ISA effectiveness
`
`Speedup over
`Alpha EV6
`100/800 or 1/8
`2
`1
`64
`3/2
`
`Total Speedup
`
`24
`
`Derivation
`
`100 for FPGA and 800 for Alpha EV6
`2 values read every clock
`N/A
`Convolution filter is 64 points
`Number of instructions on the FPGA relative to the microprocessor per
`clock
`1/8 * 2 * 1 * 64 * 3/2 = 24
`
`A segment of the filter code is shown here:
`
`Filter Pseudo Code
`C Loop over output points
`
` Do 100 n = 1, nout
` sum = 0.
`
`C Loop over filter coefficients
` Do 200 j = 1, ncoef
` sum = data_in(n+j-1) * filter(j)
`100
`continue
`
` data_out(n) = sum
`
`100 continue
`
`Logic Flow
`Loop over FIlter Coeficients
`
`sum = 0
`
`j = 0
`
`j = j + 2
`
`AG 1
`
`Data
`I+1
`
`I
`
`Rj+1
`
`sum
`
`Rj
`
`Yes
`
`FpMult
`
`FpMult
`
`FpAdd
`
`FpAdd
`
`j < ncoef
`
`Proceedings of the ACM/IEEE SC2001 Conference (SC’01)
`1-58113-293-X/01 $20.00 © 2001 ACM
`
`Patent Owner Saint Regis Mohawk Tribe
`Ex. 2008, p. 5
`
`

`

`Example 3:
`
`This example also shows a comparison between an FPGA and the Pentium 4, using integer data.
`
`Integer arithmetic function units take much less space within an FPGA chip than the floating-point unit. Let’s change the
`previous example to 32-bit integer data. The convolution filter will be 256 points. The filter will be stored in a set of registers in
`the FPGA chip. The input/output data will each be striped across two memory banks. Four values of the input data will be read
`every clock. The output computation rate will generate two output values every clock.
`
`Operation
`Read input data values
`Operations every clock
`
`FPGA
`4 values every clock
`512 Multiply-Adds
`
`Pentium 4
`1 value every clock
`1 Mult or 1 Add
`
`Feature
`
`Clock rate
`Memory bandwidth
`Data flow parallelism
`Block level re-scheduling
`ISA effectiveness
`
`Speedup over
`Pentium 4
`100/1700 or 1/17
`4
`1
`256
`2
`
`Total Speedup
`
`120.5
`
`Derivation
`
`100 for the FPGA and 1700 for the Pentium 4
`4 values read every clock
`N/A
`Convolution filter is 256 points
`Number of
`instructions on the FPGA relative
`microprocessor per clock
`1/17 * 4 * 1 * 256 * 2 = 120.5
`
`to the
`
`4.1 Amdahl’s Law
`A generally used measure of performance or speedup of an
`algorithm for various architectures is Amdahl’s Law. It has been
`used for vector and parallel systems to show the benefit of
`optimizing portions of code. Amdahl’s Law points out that given
`the percent of time spent
`in a portion of code the overall
`application may get only marginal overall speedup even though
`the algorithm was made to execute much faster. We can apply
`the same principal
`to algorithms moved into MAP.
`Let’s
`examine the previous examples and look at
`the application
`speedup up given various percentages of time spent
`in the
`algorithm.
`
`These examples have shown application speeds for several
`types of integer and floating-point types of problems. The
`benefits of using FPGAs are obviously for algorithms that
`have high performance speedups and high percentages of
`computation time.
`
`MAP Algorithm
`Speedup
`
`Percent of time spent
`in algorithm
`80
`90
`95
`99
`
`Ex. 1
`53x
`
`Ex. 2
`24x
`
`Ex. 3
`120x
`
`Application
`Speedup
`4.3
`7.3
`11.2
`19.5
`
`4.7
`8.5
`14.7
`34.9
`
`4.8
`9.30
`17.3
`54.8
`
`Proceedings of the ACM/IEEE SC2001 Conference (SC’01)
`1-58113-293-X/01 $20.00 © 2001 ACM
`
`Patent Owner Saint Regis Mohawk Tribe
`Ex. 2008, p. 6
`
`

`

`5. PROGRAMMING FLOW VS. DATA
`FLOW
`Algorithms that will be moved into the MAP should be thought
`of as data flow problems from the hardware logic perspective.
`The following example will look at the CAXPY algorithm
`[def: A*X(j) + Y(j) = Z(j)].
`The traditional way for a
`programmer to think of CAXPY is as a loop over the number
`of elements in arrays X and Y. However, in hardware, it is
`
`thought of as a data flow through logic. The logic will read
`values for X and Y every clock and send the values through the
`logic definition to the set of FpMults and FpAdds and create a
`value for Z every clock.
`
`The compiler, as part of its analysis of code segments, creates a
`data flow graph and performs dependency analysis. This
`information can then be used to create an algorithm data flow
`that will be put into hardware logic for the FPGAs.
`
`Algorithm Programming Flow
`
`Algorithm Data Flow
`
`address0, stride, count
`
`address0, stride, count
`
`M_Bank
`1
`
`/ 64
`
`AG 1
`
`AG 2
`
`end_of_data_flag
`
`end_of_data_flag
`
`M_Bank
`2
`
`/ 64
`
`Xr(j)
`/ 32
`
`Xi(j)
`/ 32
`
`Ar
`
`Ai
`
`Yr(j)
`
`/ 32
`
`Yi(j)
`
`/ 32
`
`FpMult
`
`FpMult
`
`FpMult
`
`FpMult
`
`FpAdd(neg)
`
`FpAdd
`
`FpAdd
`
`FpAdd
`
`/ 32
`
`/ 32
`
`Zr(j)
`
`Zi(j)
`
`address0, stride, count
`
`/ 64
`
`M_Bank
`3
`
`AG 3
`
`J = 0
`
`J = J + 1
`
`AG 1
`
`AG 2
`
`X
`
`/ 64
`
`Xr(j)
`/ 32
`
`Xi(j)
`/ 32
`
`Ar
`
`Ai
`
`Y
`
`/ 64
`
`Yr(j)
`
`/ 32
`
`Yi(j)
`
`/ 32
`
`FpMult
`
`FpMult
`
`FpMult
`
`FpMult
`
`Yes
`
`FpAdd(neg)
`
`FpAdd
`
`FpAdd
`
`FpAdd
`
`/ 32
`
`/ 32
`
`Zr(j)
`
`Zi(j)
`
`AG 3
`
`/ 64
`
`Z
`
`J < N
`
`END
`
`MEM 1
`X(j)
`
`MEM 2
`Y(j)
`
`MEM 3
`Z(j)
`
`MEM 4
`
`MEM 5
`
`MEM 6
`
`Proceedings of the ACM/IEEE SC2001 Conference (SC’01)
`1-58113-293-X/01 $20.00 © 2001 ACM
`
`Patent Owner Saint Regis Mohawk Tribe
`Ex. 2008, p. 7
`
`

`

`Output
`Data
`
`Output
`Data
`
`Phase 3
`
`1
`clk
`
`Output
`Data
`
`Output
`Data
`
`Function Unit
`
`Phase 2
`
`8
`clks
`
`Function Unit
`
`Elapsed clock - 0
`
`Function Unit
`
`Elapsed clock - 1
`
`Function Unit
`
`Data 1
`
`Data 2
`
`Elapsed clock - 3
`
`Function Unit
`
`Phase 1
`
`1
`clk
`
`Data 1
`
`Data 3
`
`Output
`Data
`
`Data 1
`
`Data 2
`
`Data 3
`
`Data 4
`
`Data 5
`
`Data 6
`
`Data 7
`
`Data 8
`
`Data 9
`
`Elapsed clock - 9
`
`Function Unit
`
`Output
`Data
`
`Data 1
`
`Data 2
`
`Data 3
`
`Data 4
`
`Data 5
`
`Data 6
`
`Data 7
`
`Data 8
`
`Data 9
`
`Data
`
`10
`
`Data
`
`11
`
`Elapsed clock - 11
`
`Input
`Data
`
`Input
`Data
`
`Data 1
`
`Input
`Data
`
`Data 2
`
`Input
`Data
`
`Data 4
`
`Input
`Data
`
`Data
`
`10
`
`Input
`Data
`
`Data
`
`12
`
`Figure 3. Data Flow through Pipelined Function Unit
`
`The arithmetic function units instantiated in the FPGAs have a
`pipeline design. This means that a new data value can be input
`into the function unit every clock.
`There is a latency
`associated with each function unit before a final result
`is
`produced from the input data value. After the first data value
`comes out, subsequent output values will come out every
`clock. Assume that the latency for an FpMult and FpAdd are
`both 10 clocks.
`
`Figure 3 shows how data flows through a pipelined function
`unit.
`In addition, the figure shows a function unit that has
`multiple phases. Floating-point units often have pre- and post-
`processing for format conversion, i.e., conversion from IEEE
`Floating-Point representation into an internal representation.
`
`Going back to the CAXPY example, Table 1 shows how many
`clocks it will take for data samples to pass through stages of
`the hardware logic.
`
`The processing time of hardware logic is similar to that of very
`long vectors on vector processor systems. The time in clocks
`to process a set of data, of length Nelem, through CAXPY
`would be:
`
`Time (clocks)
`
`= Nelems + Latency
`= Nelems + 35 clocks
`
`Proceedings of the ACM/IEEE SC2001 Conference (SC’01)
`1-58113-293-X/01 $20.00 © 2001 ACM
`
`Patent Owner Saint Regis Mohawk Tribe
`Ex. 2008, p. 8
`
`

`

`Table 1. Data Sample Processing Latency
`
`Data Flow Logic
`
`Clock
`2 … 13 … 24 … 35
`
`36
`
`1
`
`address0, stride, count
`
`address0, stride, count
`
`M_Bank1
`
`[0:31]
`
`[32:64]
`
`Address
`Generator 1
`
`Address
`Generator 2
`
`end_of_data_flag
`
`end_of_data_flag
`
`M_Bank2
`
`[0:31]
`
`[32:64]
`
`FpMult
`
`FpMult
`
`FpMult
`
`FpMult
`
`FpAdd(neg)
`
`FpAdd
`
`FpAdd
`
`FpAdd
`
`[0:31]
`
`[32:64]
`
`address0, stride, count
`
`M_Bank3
`
`Address
`Generator 3
`
`D 1
`
`D 2
`
`D 3
`
`D 4
`
`D 5
`
`D 6
`
`D 1
`
`D 2
`
`D 3
`
`D 4
`
`D 5
`
`D 1
`
`D 2
`
`D 3
`
`D 4
`
`D 1
`
`D 2
`
`D 3
`
`D 1
`
`D 2
`
`6. DESIGNING ALGORITHMS FOR MAP
`One of the major issues that have hindered the use of FPGAs to
`date for general-purpose scientific algorithms has been the lack
`of an ability to use floating-point arithmetic. This impediment
`has come from two related factors.
`
`The first has been the actual size, gate count, of FPGAs. Until
`recently, the size of FPGAs has been under 1M gates. The
`advent of multi-million gate FPGAs has dramatically changed
`the possibility of the type of logic than can be loaded into a
`single FPGA chip.
`
`The second limitation has been the size of floating-point (32-
`or 64-bit) functional units for FPGAs. The number of gates
`required to define IEEE-754 compliant function units is much
`greater than those for integer function units.
`
`Table 2 shows the approximate number of function units that
`can be defined in an SRC MAP III.
`
`Another consideration is the amount of the algorithm logic that
`will fit in a single FPGA chip. The MAP environment has the
`capability to extend the algorithm across multiple FPGA chips
`and multiple MAPs. The MAP User Array interconnect
`provides the ability to transfer 24 bytes per clock. A transfer
`can be configured dependent upon the data type being
`transferred. A transfer could send three 8-byte elements, six 4-
`byte elements, twenty-four 1-byte elements, etc. The transfer
`between MAPs via the Chain Ports can send 12 bytes per
`clock.
`This MAP interconnect does not use any switch
`bandwidth.
`
`Table 2. Function Count for MAP III
`
`64b
`
`Function Unit Type
`Integer, 32b
`Multiply
`Add
`Multiply
`Add
`Multiply
`Add
`Multiply
`Add
`
`Floating Point, 32b
`
`64b
`
`Function Count
`140
`3000
`36
`1570
`48
`108
`24
`48
`
`Proceedings of the ACM/IEEE SC2001 Conference (SC’01)
`1-58113-293-X/01 $20.00 © 2001 ACM
`
`Patent Owner Saint Regis Mohawk Tribe
`Ex. 2008, p. 9
`
`

`

`7. SRC COMPILER TECHNOLOGY AND
`TOOLS
`SRC is extending a well-known vendor’s FORTRAN and C
`compilers to target compilation of computationally intensive
`portions of applications for FPGAs. Part of the compilation
`process is the generation of a data flow graph (DFG) and the
`dependency analysis of the application code. The DFG and
`dependency analysis are used to define the layout of
`programmable logic in the FPGAs and data layout in on-board
`memory. Two execution-time components will be generated.
`The first is the hardware logic defined from the data and
`control flow analysis. The second is the definition of a
`“Command List”, or COMLIST, for the execution control of
`the User Array environment and issuance of direct memory
`access (DMA) instructions.
`
`Current hardware design can be accomplished through the use
`of Hardware Definition Languages (HDL). An HDL is a high
`level language similar in concept to a software high level
`language such as C. The process of converting the HDL into a
`hardware design consisting of wires and transistors etc.
`is
`called “synthesis”. Synthesis is somewhat analogous to normal
`software compilation.
`
`There are a great number of tools and products available that
`work with various HDL. The compilation strategy for the MAP
`is to leverage the ready availability of synthesis tools etc.
`available for HDL (see Figure 4). The basic premise is that
`compilation for the MAP shall consist of translating the
`software language of the user,( i.e. C or FORTRAN) into an
`HDL representation. This HDL representation can readily be
`processed by existing hardware design toolsets into the
`bitstream used to configure the User Logic (U_Logic).
`
`The
`language is VERILOG.
`The chosen HDL target
`VERILOG language allows representation of hardware design
`at various levels of abstraction. At
`the lowest
`level,
`it
`is
`possible to write VERILOG code that represents individual
`gates and wires. At the opposite extreme, one can write so-
`called “behavioral” code, which has a high degree of
`abstraction. In fact, it is possible to write behavioral code that
`cannot be synthesized into hardware. Such code is useful for
`the purposes of simulation.
`
`It is not the intent of the compiling strategy that the synthesis
`of VERILOG shall
`include the synthesis of floating-point
`operations etc. One of the primary constraints for the compiler
`is compile time. The user of
`the compiler will have
`expectations that the compilation should be done quickly. This
`presents a significant challenge in the “place and route”
`process. Since the compiler will be using existing commercial
`tools, the compiling system must function in such a way as to
`make the job of these tools as easy as possible.
`
`the compiler-generated
`To facilitate the synthesis process,
`VERILOG code shall consist solely of
`instantiation and
`hookup of an existing set of pre-defined VERILOG modules.
`A VERILOG module is similar in concept
`to a software
`subprogram, or subroutine. Importantly, experienced hardware
`designers at a very low design level can create these pre-
`defined modules, in a pre-placed manner. This makes them
`efficient in terms of speed and resource usage, and, since they
`are “relationally placed”, the place and route portion of the
`final synthesis can proceed faster.
`
`The number and type of pre-defined modules is relatively
`small. Just as normal software compilation builds the behavior
`of the user’s program from a small set of basic op-codes, so too
`can the U_Logic configuration be built up from a finite set of
`pre-defined modules,
`such as
`integer/floating-point add,
`multiply, divide, etc..
`
`The development of domain-specific modules (e.g. Convolve,
`FFT) that can be used as building blocks in the design of
`FPGA-based programs is a necessity. These modules provide
`a bridge between the algorithm developer and the hardware
`logic “designer”. The module will have a counterpart in an
`FPGA macro library. These macros will be incorporated into
`the DFG and used in the hardware definition language (HDL).
`Examples of potential libraries for the generation of macros are
`signal and image processing and linear algebra routines.
`In
`addition, application or user-specific modules can be defined
`and added to an FPGA macro library set.
`
`The ability of the user to optionally manipulate the compiler-
`generated logic is extremely important. This manipulation step
`must be available for those customers that want to get the last
`“drop” of algorithmic performance out of the hardware. The
`DFG generated by the high-level language (HLL) compiler
`
`Proceedings of the ACM/IEEE SC2001 Conference (SC’01)
`1-58113-293-X/01 $20.00 © 2001 ACM
`
`Figure 4 . Diagram of the SRC Compiler Process
`
`Patent Owner Saint Regis Mohawk Tribe
`Ex. 2008, p. 10
`
`

`

`will be an optional output. This graph will be modifiable in a
`DFG Editor
`to
`provide
`potentially
`higher
`levels
`of
`optimization.
`
`The on-board memory of the MAP will be used like a
`software-controlled
`cache
`for
`a more
`conventional
`microprocessor. An added bonus is that this “cache” can have
`different user-defined caching strategies. The data access and
`compute strategies will maximize temporal
`locality with
`respect to the use of resident, on-board data and will overlap
`computation on this data with streaming in the “working set”
`of data for the next phase of the computation. We will generate
`both the Common Memory DMA instructions and the on-board
`data layout from the data-flow/dependence analysis.
`
`the
`rapid prototyping of
`the provision for
`Furthermore,
`algorithm in MAP is a desirable tool for the optimization
`process. We are investigating potential
`tools for DFG
`translation
`for
`input
`to
`commercial
`packages;
`e.g.,
`MATLAB/Simulink could provide easy access to prototyping
`and optimizations that a wide variety of people use today.
`
`The available resources of the U_logic FPGA chips in the
`MAP are a finite resource. It will be a simple matter to compile
`user code into a final U_Logic configuration that exceeds the
`physical resources available in the chips. There will obviously
`have to be some feedback from various portions of the
`synthesis tools to the compilation system. The compilation
`system will also have to apply various heuristics in regards to
`precisely what portions of the user code should be targeted for
`the MAP in the first place, and what sort of transformations
`might be applied to the user code to allow for optimal
`performance. The feedback from the synthesis tools and the
`compilation heuristics can only be determined from direct
`experience. Initial versions of the compiler will necessarily
`omit this functionality. Adding this functionality based on the
`learned experience represents a significant portion of the
`overall effort required to create a mature compiler.
`
`8. ALGORITHM STUDIES
`The benefit of using reconfigurable computing has been shown
`in many domain areas[1][2][5]. Several algorithms will be
`reviewed to show the performance improvements over RISC
`and microprocessor-based systems.
`The definition of the
`hardware logic for these algorithms started with DFGs from
`
`The MAP compiler generates an
`our HLL compiler.
`it uses for the generation of HDL.
`equivalent DFG that
`Algorithm performance on MAP is from hardware simulation
`of
`the logic.
`In addition, optimization techniques and
`experiences will be discussed relative to compiler generated
`HDL.
`
`The SRC-developed compiler is in a prototype stage of HDL
`generation. The compiler has demonstrated the ability to
`import pre-defined function unit modules and generate logic
`with correct
`results.
`Development on the compiler
`is
`proceeding in a directed, methodical manner. It does not yet
`fully compile the discussed cases. It can be expected that
`compilation of the following cases will be achieved by the time
`of the presentation of this paper at the SC01 conference.
`
`8.1 Case 1: Convolution – Zero Pha se Filter
`This algorithm has been discussed briefly earlier in the paper.
`Let’s examine the 32b floating-point convolution problem with
`a 64-point zero phase filter.
`The critical challenge for the
`compiler is to know how to take a simple convolution code and
`define a method for the scheduling of operations to take
`advantage of
`the opportunity to use a large number of
`multiply/adds relative to traditional processor implementation.
`The MAP can read up to 384b of data from on-board memory
`every clock.
`
`The existing compiler’s HDL implementation would utilize
`only a single input data element every clock. The performance
`of this level of optimization would be 2 Flops/clock. The MAP
`logic can consume the two 32b input data values every clock.
`If the logic can process only one element every clock, then the
`logic has to stall for a clock in order to consume the second
`input data value. Therefore, the processing rate can only create
`an output data element every other clock.
`
`In order to optimize performance relative to the reading of the
`input data, hardware logic was developed that loaded the data
`appropriately into two sets of shift registers for the even and
`odd data elements. Two processing streams were defined to
`process the even and odd input data elements.
`
`Figure 5 shows how the input data is loaded into the shift
`registers for the even/odd input data values. The shift registers
`are loaded up from the read of the first thirty-two 64b data
`values from memory.
`
`input data
`
`o
`
`e
`
`Shift Register for Odds
`
`1
`
`3
`
`5
`
`7
`
`9
`
`. . .
`
`57
`
`59
`
`61
`
`63
`
`64
`
`62
`
`60
`
`58
`
`Shift Register for Evens
`. . .
`
`56
`
`8
`
`6
`
`4
`
`2
`
`0
`
`Fill Direction
`
`Fill Direction
`
`Figure 5. Data Streams Going into Shift Registers
`
`Proceedings of the ACM/IEEE SC2001 Conference (SC’01)
`1-58113-293-X/01 $20.00 © 2001 ACM
`
`Patent Owner Saint Regis Mohawk Tribe
`Ex. 2008, p. 11
`
`

`

`After the shift registers are full, the multiplication of the data
`values with the filter coefficients can start. The process is
`pipelined so that the next input values will go into the shift
`registers. The generation of the output points is shown in
`Figure 6 as the contribution of the two shift registers.
`
`This strategy is similar to loop unrolling that the compiler can
`often generate. We are in the process of developing heuristics
`
`for the compiler so that it can automatically generate this level
`of optimization strategy.
`
`The computational performance on MAP for this convolution
`algorithm will be 256 flops/clock.
`This compares to 2
`flops/clock on the Alpha EV6.
`
`Shift Register - Odd Numbered Values
`
`Shift Register - Even Numbered
`Values
`
`Output
`Point
`
`2
`
`0
`
`1
`
`1
`
`3
`
`4
`
`2
`
`0
`
`0
`
`1
`
`1
`
`3
`
`5
`
`7
`
`9
`
`. . .
`
`57
`
`59
`
`61
`
`63
`
`64
`
`62
`
`60
`
`58
`
`56
`
`. . .
`
`8
`
`6
`
`4
`
`2
`
`x F
`2
`
`F4
`
`xF
`6
`
`xF
`8
`
`x
`
`x
`
`x
`
`F60
`
`x
`
`F62
`
`x
`
`F64
`
`x
`
`x
`
`F63
`
`x
`
`F61
`
`x
`
`F59
`
`x
`
`F57
`
`x
`
`x F
`9
`
`x F
`7
`
`F5
`
`x
`
`x F
`3
`
`F1
`
`Clock
`
`1
`
`2
`
`32
`
`F56
`
`F58
`
`2
`
`3
`
`33
`
`3
`
`5
`
`7
`
`9
`
`11
`
`. . .
`
`59 61
`
`63
`
`65
`
`66
`
`64
`
`62
`
`60
`
`58
`
`. . .
`
`10
`
`8
`
`6
`
`4
`
`2
`
`F1
`
`xF
`3
`
`xF
`9
`
`xF
`7
`
`xF
`5
`
`x
`
`F57
`
`x
`
`F59
`
`x
`
`F61
`
`x
`
`F63
`
`x
`
`x
`
`F64
`
`x
`
`F62
`
`x
`
`F60
`
`x
`
`F58
`
`x
`
`F10
`
`x
`
`x F
`8
`
`F6
`
`x
`
`F4
`
`xF
`2
`
`9
`
`. . .
`
`59
`
`61
`
`63
`
`65
`
`66
`
`64
`
`62
`
`60
`
`58
`
`8
`
`6
`
`3
`
`5
`
`7
`
`11
`
`. . .
`
`10
`
`4
`
`2
`
`x F
`2
`
`F4
`
`xF
`6
`
`xF
`8
`
`x
`
`F56
`
`x
`
`F58
`
`x
`
`F60
`
`x
`
`F62
`
`x
`
`F64
`
`x
`
`x
`
`F63
`
`x
`
`F61
`
`x
`
`F59
`
`x
`
`F57
`
`x
`
`x F
`9
`
`x F
`7
`
`F5
`
`x
`
`x F
`3
`
`F1
`
`34
`
`5
`
`7
`
`9
`
`11
`
`13
`
`. . .
`
`61
`
`63
`
`65
`
`67
`
`68
`
`66
`
`64
`
`62
`
`60
`
`. . .
`
`12
`
`10
`
`8
`
`6
`
`4
`
`4
`
`x
`
`x
`
`x
`
`x
`
`x
`
`x
`
`x
`
`x
`
`F1
`
`xF
`3
`
`xF
`9
`
`xF
`7
`
`xF
`5
`
`x
`
`F57
`
`F59
`
`F61
`
`F63
`
`F64
`
`F62
`
`F60
`
`F58
`
`x
`
`F10
`
`x
`
`x F
`8
`
`F6
`
`x
`
`F4
`
`xF
`2
`
`5
`
`7
`
`9
`
`11
`
`13
`
`. . .
`
`61
`
`63
`
`65
`
`67
`
`68
`
`66
`
`64
`
`62
`
`60
`
`. . .
`
`12
`
`10
`
`8
`
`6
`
`4
`
`5
`
`x F
`2
`
`F4
`
`xF
`6
`
`xF
`8
`
`x
`
`F56
`
`F58
`
`F60
`
`F62
`
`F64
`
`x
`x
`x
`x
`x
`x
`x
`x
`x
`Figure 6. Processing Streams for Data
`
`F63
`
`F61
`
`F59
`
`F57
`
`x
`
`x F
`7
`
`x F
`9
`
`F5
`
`x
`
`x F
`3
`
`F1
`
`Figure 6. Processing Streams for Data
`
`Proceedings of the ACM/IEEE SC2001 Conference (SC’01)
`1-58113-293-X/01 $20.00 © 2001 ACM
`
`Patent Owner Saint Regis Mohawk Tribe
`Ex. 2008, p. 12
`
`

`

`8.2 Case 2: Routine SCORE from MAXSEGS
`MAXSEGS is a Smith-Waterman implementation of a genetic
`sequence alignment algorithm. The focal point of this analysis
`has been on the routine, SCORE. The routine, developed by
`Alex Ropelewski[4] of Pittsburgh Supercomputing Center,, is a
`fully vectorized version of the Waterman-Eggert dynamic
`program algorithm. The computationally intensive portion of
`SCORE is shown in the following code segment.
`
`do 200 k=1,ls1+ls2-1,1
` i=min(k,ls2)
` j=max(1,(k+1-ls2))
`
` do 210 index = start(k),end(k),ls2
` left = index-(ls2+1)
` diag = (index-(ls2+1))-1
` up = index-1
`
` extgvi(i)= max(0,extgvi(i)+gap,
` simils(left)+gap+newgap)
` extgvj(j)= max(0,extgvj(j)+gap,
` simils(up)+gap+newgap)
`
` simils(index)=max(simils(diag)+
` wt(s1(j),s2(i)),
` extgvi(i),extgvj(j),0)
` i=i-1
` j=j+1
`210 continue
`200 continue
`
`The goal for any algorithm in the MAP is to create a data flow
`that will maximize the use of input data and computational
`results within a single pass of the logic flow. The challenge
`with this code is to create a data flow that would maximize the
`use of 32-bit data coming from the input arrays EXTGVI,
`EXTGVJ and SIMILS. As explained in Ropelewski et al., the
`key lies in the diagonal access patterns of the algorithm.
`
`Arrays are accessed across diagonals of the matrix in this
`fashion:
`
`Values of start and end shown in the code segment are the
`starting and ending diagonal values − (1,1) and (4,4),
`respectively.
`
`The nested loops are unrolled by a factor of two to utilize the
`two 32-bit values read from on-board memory in the MAP. A
`macro can be made for
`the inner
`loop computation of
`EXTGVI, EXTGVJ and SIMILS. The logic flow for the
`computational macro is shown in Figure 7.
`
`The hardware implementation of SCORE is pipelined (see
`Figure 8). The execution of the SIMILS macro can take a new
`value every clock and generate the values

This document is available on Docket Alarm but you must sign up to view it.


Or .

Accessing this document will incur an additional charge of $.

After purchase, you can access this document again without charge.

Accept $ Charge
throbber

Still Working On It

This document is taking longer than usual to download. This can happen if we need to contact the court directly to obtain the document and their servers are running slowly.

Give it another minute or two to complete, and then try the refresh button.

throbber

A few More Minutes ... Still Working

It can take up to 5 minutes for us to download a document if the court servers are running slowly.

Thank you for your continued patience.

This document could not be displayed.

We could not find this document within its docket. Please go back to the docket page and check the link. If that does not work, go back to the docket and refresh it to pull the newest information.

Your account does not support viewing this document.

You need a Paid Account to view this document. Click here to change your account type.

Your account does not support viewing this document.

Set your membership status to view this document.

With a Docket Alarm membership, you'll get a whole lot more, including:

  • Up-to-date information for this case.
  • Email alerts whenever there is an update.
  • Full text search for other cases.
  • Get email alerts whenever a new case matches your search.

Become a Member

One Moment Please

The filing “” is large (MB) and is being downloaded.

Please refresh this page in a few minutes to see if the filing has been downloaded. The filing will also be emailed to you when the download completes.

Your document is on its way!

If you do not receive the document in five minutes, contact support at support@docketalarm.com.

Sealed Document

We are unable to display this document, it may be under a court ordered seal.

If you have proper credentials to access the file, you may proceed directly to the court's system using your government issued username and password.


Access Government Site

We are redirecting you
to a mobile optimized page.





Document Unreadable or Corrupt

Refresh this Document
Go to the Docket

We are unable to display this document.

Refresh this Document
Go to the Docket