`The Potential for Increased HPC
`Application Performance
`Using Reconfigurable Logic
`
`David Caliga
`SRC Computers, Inc
`4240 N. Nevada Ave
`Colorado Springs, CO 80907
`719-262-0213
`
`David Peter Barker
`SUPERsmith
`PO Box 2226
`Salinas, CA 93902-2226
`831-442-8343
`
`david.caliga@srccomp.com
`
`dbarker@supersmith.com
`
`ABSTRACT
SRC Computers, Inc. has integrated adaptive computing into
its SRC-6 high-end server, incorporating reconfigurable
processors as peers to the microprocessors. Performance
improvements resulting from reconfigurable computing can
provide orders-of-magnitude speedups for a wide variety of
algorithms. Reconfigurable logic in Field Programmable Gate
Arrays (FPGAs) has shown great advantage to date in
special-purpose applications and specialty hardware. SRC
Computers is working to bring this technology into the
general-purpose HPC world via an advanced system
interconnect and enhanced compiler technology.
`
`Categories and Subject Descriptors
`Compiler technology
`
`General Terms
`Algorithms, Performance, Design, Experimentation
`
`Keywords
`Reconfigurable computing, FPGA
`
`Permission to make digital or hard copies of all or part of
`this work for personal or classroom use is granted without
`fee provided that copies are not made or distributed for
`profit or commercial advantage, and that copies bear this
`notice and the full citation on the first page. To copy
`otherwise, to republish, to post on servers or to redistribute
`to lists, requires prior specific permission and/or a fee.
`
SC2001 November 2001, Denver (c) 2001 ACM 1-58113-293-X/01/0011 $5.00
`
`
`
`
`1. INTRODUCTION
`The SRC-6 system is a unique architecture capable of
`supporting a combination of up to 512 Intel microprocessors
`and 256 Multi-Adaptive Processors (MAP) on a common
shared memory. A patented SRC crossbar switch design
supports the connection of these processors to up to 256
memory ports, each port containing 16 banks of SDRAM
(see Figure 1). The architectural design of the memory
subsystem provides flat access latency from any
microprocessor or reconfigurable processor in the SRC-6
system. This combination will offer in excess of 3 TFlops of
theoretical peak [256 Pentium 4 processors and 128 MAPs on
32b floating-point data]. SRC Computers is pushing forward
to harness this into sustained performance.
`
This paper explains how SRC Computers, Inc. has made
advances in the reconfigurable computing field by
incorporating FPGA technology at many levels in the SRC-6
system architecture.
`
`2. RECONFIGURABLE COMPUTING
`DEFINED
In simplest terms, reconfigurable computing, based on FPGA
technology, could be defined as the capability of
reprogramming hardware to execute logic that is designed and
optimized for a specific user's algorithms. Associated
compiling technology can provide a transparent method of
integrating the computational capability of FPGA technology
and microprocessors into a single application executable code.
The use of such integrated compiling technologies enables
reconfigurable architectures to extend beyond the FPGAs and
the "glueware" that attaches them to the host computer.
`
Automatic compilation of applications onto reconfigurable
architectures generates the logic for both the specific hardware
configuration and the execution management of the FPGA
resources. The compilation environment must also be
extensible to a wide range of user applications on a single
system.
`
Also described is SRC's patented FPGA-based MAP, which
user applications can utilize to deliver algorithm-specific
computational acceleration.

SRC Computers has taken all of these factors into account in
developing its unique reconfigurable computing capability for
the SRC-6 system.
`
`Figure 1. Overview of SRC-6 System Architecture
`
`
3. MULTI-ADAPTIVE PROCESSOR (MAP)
The heart of the SRC-6 system is its Multi-Adaptive
Processing capability, provided by MAP units (see Figure 2).
These units utilize hardware-implemented functions, which can
greatly accelerate application algorithms over
compiler-implemented instruction sets for microprocessors.
`
Major architectural characteristics of MAP include:

• Each MAP executes independently of the general-purpose
processors, including loading and storing needed data, after
being provided with a list of commands to execute.
• The control of the MAP is done by the application via a
Command List (COMLIST). The COMLIST contains a list of
controlling instructions for the Control Chip. Examples of
functions performed by these instructions are Direct Memory
Access (DMA) reads and writes and execution synchronization
with the User Array (see the sketch at the end of this section).
• MAP units have access to Common Memory (CM).
• Common Memory addresses specified by commands or
generated by user applications are virtual addresses. Addresses
are translated, virtual to physical, by Translation Look-aside
Buffer (TLB) entries within the DMA logic of each MAP.
• The User Array portion of a MAP unit is configured for
algorithmic requirements. This logic can read and write
on-board memory through multiple ports and can interact with
the control logic and DMA Engine.
• By means of chain ports, MAP units can communicate
between themselves without using any memory bandwidth.
Thus a particular MAP can send partial results to another unit,
or similarly, can receive such partial results from another MAP.
• Full system interrupt and semaphore capability is available
between MAP units or the microprocessors.

Multi-processor applications can easily utilize the MAP by
identifying the relevant portions of the parallel computational
algorithms. The application can utilize N microprocessors and
M MAPs.
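
As a rough illustration of this control model, the following
host-side sketch (in the Fortran style used elsewhere in this
paper) builds a COMLIST, hands it to a MAP, and waits on a
semaphore for completion. All routine names here are
hypothetical, invented for illustration; the paper does not
specify the actual SRC API.

C     Hypothetical host-side control flow (illustrative names only).
      integer icl, imap, n
      call comlist_begin     (icl)
      call comlist_dma_read  (icl, x, n)     ! stream operands to on-board memory
      call comlist_exec      (icl)           ! run the configured User Array logic
      call comlist_dma_write (icl, z, n)     ! stream results back to Common Memory
      call map_execute       (imap, icl)     ! Control Chip interprets the COMLIST
      call map_wait          (imap)          ! semaphore wait for completion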
`
`4. RECONFIGURATION LOGIC
`PERFORMANCE POTENTIAL
The attraction of using FPGAs in SRC's MAP is the ability to
generate algorithm-specific logic, which has the potential of
orders-of-magnitude speedups for computationally intensive
algorithms. There have been many demonstrations showing
this magnitude of speedup in algorithms for genetic
sequencing, encryption/decryption, string searches, and
integer forms of image and signal processing.
`
The performance improvement of these algorithms can come
from any or all of the following:

• Memory bandwidth improvements - six ports to memory.
• Data flow parallelism - bit-sized data allows for multiple
parallel processing streams per 64 bits of data read from
memory, or algorithms may need multiple 32- or 64-bit input
values.
• Computational block-level re-scheduling - re-schedule
independent computations to be concurrent in time.
• Instruction Set Architecture (ISA) effectiveness - create
operations that are "right-sized" in bits relative to the type of
data, i.e., 6-bit or 256-bit integer operations.
`
The following examples show potential performance
improvement contributors of an FPGA over those of a
microprocessor.
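
In each example, the contributors combine multiplicatively.
Writing the FPGA and microprocessor clock rates as f_FPGA
and f_uP, and the memory bandwidth, data flow parallelism,
block-level re-scheduling, and ISA effectiveness factors as B,
P, R, and I, the estimated speedup is:

    Speedup = (f_FPGA / f_uP) * B * P * R * I

The tables below evaluate this product for each algorithm.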
`
`Figure 2. MAP Block Diagram
`
`
`
`
`Example 1:
`
`This example shows a performance comparison between an FPGA and a 900-MHz microprocessor.
`
`Perform an integer add operation on each 4 bits of a data stream. The MAP reads in 64 bits of data from each memory bank.
`The algorithm will use 3 memory banks to read data from the input stream and the logic will create 16 parallel computation
`streams.
`
Feature                    Speedup over Microprocessor   Derivation

Clock rate                 100/900 or 1/9                100 MHz for the FPGA and 900 MHz for the microprocessor
Memory bandwidth           3                             3 memory banks to read data from the input stream
Data flow parallelism      16                            16 parallel computation streams
Block level re-scheduling  1                             None
ISA effectiveness          10                            Number of instructions on the microprocessor required to
                                                         perform the equivalent integer operation on the 4-bit data value

Total                      53.3                          1/9 * 3 * 16 * 1 * 10 = 53.3
`
Algorithm Pseudo Code

Loop over the 4-bit values in the 64-bit data value (isrc), add
a 4-bit table value (t_b4) to each, and store the resulting 4-bit
value (ires) into the destination word (idest):

      len   = 4               ! 4-bit field
      ipos  = 0               ! working bit position in itemp
      itemp = 0

      do 100 j = 1, 16
         ibgn = (j-1)*4       ! starting bit of field j

         call mvbits (isrc, ibgn, len, itemp, ipos)   ! extract field j
         ires = itemp + t_b4(j)                       ! 4-bit integer add
         call mvbits (ires, ipos, len, idest, ibgn)   ! store result field
100   continue
`
[Logic flow: an address generator on each of the three memory
banks streams 64b words, each split into sixteen 4-bit fields
(0:3) through (60:63) feeding parallel I4_Add units; the loop
over J repeats until the end of the data.]
`
`
`
`
`Example 2:
`
`The following example shows a comparison between an FPGA and the Alpha EV6.
`
`Perform a 32-bit floating-point convolution filter on a set of vector data. The convolution filter will be 64 points. The filter will be
`stored in a set of registers in the FPGA chip. Two values of the input data will be read every clock. The output computation rate
`will generate two output values every clock.
`
Operation                 FPGA                  DEC Alpha EV6
Read input data values    2 values every clock  1 value every clock
Operations every clock    128 Multiply-Adds     1 Mult and 1 Add

Feature                    Speedup over Alpha EV6   Derivation

Clock rate                 100/800 or 1/8           100 MHz for the FPGA and 800 MHz for the Alpha EV6
Memory bandwidth           2                        2 values read every clock
Data flow parallelism      1                        N/A
Block level re-scheduling  64                       Convolution filter is 64 points
ISA effectiveness          3/2                      Number of instructions on the FPGA relative to the
                                                    microprocessor per clock

Total Speedup              24                       1/8 * 2 * 1 * 64 * 3/2 = 24
`
`A segment of the filter code is shown here:
`
Filter Pseudo Code

C     Loop over output points
      do 100 n = 1, nout
         sum = 0.
C        Loop over filter coefficients
         do 200 j = 1, ncoef
            sum = sum + data_in(n+j-1) * filter(j)
200      continue
         data_out(n) = sum
100   continue
`
[Logic flow: an address generator streams pairs of input
values; the loop over the filter coefficients (j = j + 2) feeds
two FpMult units and an FpAdd accumulation tree that
maintains the running sum until j reaches ncoef.]
`
`
`
`
`Example 3:
`
This example also shows a comparison between an FPGA and
the Pentium 4, using integer data.

Integer arithmetic function units take much less space within
an FPGA chip than floating-point units. Let's change the
previous example to 32-bit integer data. The convolution filter
will be 256 points. The filter will be stored in a set of registers
in the FPGA chip. The input/output data will each be striped
across two memory banks. Four values of the input data will
be read every clock. The output computation rate will generate
two output values every clock.
`
Operation                 FPGA                  Pentium 4
Read input data values    4 values every clock  1 value every clock
Operations every clock    512 Multiply-Adds     1 Mult or 1 Add

Feature                    Speedup over Pentium 4   Derivation

Clock rate                 100/1700 or 1/17         100 MHz for the FPGA and 1700 MHz for the Pentium 4
Memory bandwidth           4                        4 values read every clock
Data flow parallelism      1                        N/A
Block level re-scheduling  256                      Convolution filter is 256 points
ISA effectiveness          2                        Number of instructions on the FPGA relative to the
                                                    microprocessor per clock

Total Speedup              120.5                    1/17 * 4 * 1 * 256 * 2 = 120.5
`
4.1 Amdahl's Law
A generally used measure of the performance or speedup of an
algorithm on various architectures is Amdahl's Law. It has been
used for vector and parallel systems to show the benefit of
optimizing portions of code. Amdahl's Law points out that,
given the percent of time spent in a portion of code, the overall
application may get only marginal overall speedup even though
the algorithm was made to execute much faster. We can apply
the same principle to algorithms moved into MAP. Let's
examine the previous examples and look at the application
speedup given various percentages of time spent in the
algorithm.
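
In this setting, if a fraction f of the run time is spent in the
accelerated algorithm and the MAP speeds that portion up by
a factor s, the overall application speedup is:

    Application speedup = 1 / ((1 - f) + f/s)

For example, with f = 0.95 and s = 24 (Example 2), the overall
speedup is 1 / (0.05 + 0.95/24), or about 11.2, as shown in the
table below.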
`
These examples have shown application speedups for several
types of integer and floating-point problems. The benefits of
using FPGAs are greatest for algorithms that have high
performance speedups and high percentages of computation
time.
`
Percent of time      Application Speedup
spent in algorithm   Ex. 1 (53x)   Ex. 2 (24x)   Ex. 3 (120x)
80                   4.7           4.3           4.8
90                   8.5           7.3           9.3
95                   14.7          11.2          17.3
99                   34.9          19.5          54.8
`
`
`
`
5. PROGRAMMING FLOW VS. DATA FLOW
Algorithms that will be moved into the MAP should be thought
of as data flow problems from the hardware logic perspective.
The following example will look at the CAXPY algorithm
[def: A*X(j) + Y(j) = Z(j)]. The traditional way for a
programmer to think of CAXPY is as a loop over the number
of elements in arrays X and Y. However, in hardware, it is
thought of as a data flow through logic. The logic will read
values for X and Y every clock, send the values through the
logic definition to the set of FpMults and FpAdds, and create a
value for Z every clock.
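
As a point of reference, here is the loop form of CAXPY as a
programmer would write it (a minimal sketch; the complex
arithmetic on the real and imaginary parts shown in the figure
is implied by the complex type):

C     CAXPY in its traditional programming-flow form: one loop
C     over the N complex elements of X and Y. The MAP instead
C     evaluates the same expression as a pipelined data flow.
      subroutine caxpy (n, a, x, y, z)
      integer n, j
      complex a, x(n), y(n), z(n)
      do 100 j = 1, n
         z(j) = a * x(j) + y(j)
100   continue
      return
      end

In the data flow form, the four FpMults and four FpAdds in the
figure implement this complex multiply-add on the 32b real
and imaginary parts (Xr, Xi, Yr, Yi) directly.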
`
`The compiler, as part of its analysis of code segments, creates a
`data flow graph and performs dependency analysis. This
`information can then be used to create an algorithm data flow
`that will be put into hardware logic for the FPGAs.
`
[Figure: Algorithm Programming Flow vs. Algorithm Data
Flow for CAXPY. Address generators (AG 1-3) stream 64b X
and Y values from memory banks 1 and 2 and split them into
32b real and imaginary parts (Xr, Xi, Yr, Yi); four FpMult and
four FpAdd units combine them with Ar and Ai to produce
Zr(j) and Zi(j), which are written back through memory bank 3
each clock until J reaches N.]
`
`
`
[Figure 3 shows a function unit with three pipeline phases (a
1-clock pre-processing phase, an 8-clock core, and a 1-clock
post-processing phase); successive input data values advance
one stage per clock, so after the pipeline fills, one output
emerges every clock.]
`Figure 3. Data Flow through Pipelined Function Unit
`
The arithmetic function units instantiated in the FPGAs have a
pipelined design. This means that a new data value can be input
into a function unit every clock. There is a latency associated
with each function unit before a final result is produced from
the input data value. After the first data value comes out,
subsequent output values will come out every clock. Assume
that the latencies for an FpMult and an FpAdd are both 10
clocks.

Figure 3 shows how data flows through a pipelined function
unit. In addition, the figure shows a function unit that has
multiple phases. Floating-point units often have pre- and
post-processing for format conversion, i.e., conversion from
IEEE floating-point representation into an internal
representation.
`
`Going back to the CAXPY example, Table 1 shows how many
`clocks it will take for data samples to pass through stages of
`the hardware logic.
`
The processing time of hardware logic is similar to that of very
long vectors on vector processor systems. The time in clocks
to process a set of data of length Nelems through CAXPY
would be:

    Time (clocks) = Nelems + Latency = Nelems + 35 clocks
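
As a worked example (assuming the 100 MHz FPGA clock
used in the earlier examples), processing Nelems = 1,000,000
takes 1,000,035 clocks, or about 10 ms; the 35-clock pipeline
latency is negligible for long data sets, just as vector startup
time is for long vectors.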
`
`
`
`
`Table 1. Data Sample Processing Latency
`
[Table 1 residue: the CAXPY data flow logic, with address
generators on three memory banks feeding four FpMults and
four FpAdds. Samples D1, D2, ... enter on successive clocks
and advance one pipeline stage at a time (clocks 1, 2 ... 13 ...
24 ... 35); the first sample reaches the output at clock 36,
consistent with the 35-clock latency above.]
`
`6. DESIGNING ALGORITHMS FOR MAP
One of the major issues that has hindered the use of FPGAs
for general-purpose scientific algorithms has been the
inability to use floating-point arithmetic. This impediment
has come from two related factors.
`
The first has been the actual size, in gate count, of FPGAs.
Until recently, FPGAs were under 1M gates. The advent of
multi-million-gate FPGAs has dramatically changed the type
of logic that can be loaded into a single FPGA chip.
`
The second limitation has been the size of floating-point (32-
or 64-bit) functional units for FPGAs. The number of gates
required to define IEEE-754 compliant function units is much
greater than that for integer function units.
`
`Table 2 shows the approximate number of function units that
`can be defined in an SRC MAP III.
`
Another consideration is the amount of the algorithm logic that
will fit in a single FPGA chip. The MAP environment has the
capability to extend the algorithm across multiple FPGA chips
and multiple MAPs. The MAP User Array interconnect
provides the ability to transfer 24 bytes per clock. A transfer
can be configured dependent upon the data type being
transferred: a transfer could send three 8-byte elements, six
4-byte elements, twenty-four 1-byte elements, etc. The transfer
between MAPs via the Chain Ports can send 12 bytes per
clock. This MAP interconnect does not use any switch
bandwidth.
`
Table 2. Function Count for MAP III

Function Unit Type               Function Count
Integer, 32b          Multiply   140
                      Add        3000
Integer, 64b          Multiply   36
                      Add        1570
Floating Point, 32b   Multiply   48
                      Add        108
Floating Point, 64b   Multiply   24
                      Add        48
`
`
`
`
`7. SRC COMPILER TECHNOLOGY AND
`TOOLS
`SRC is extending a well-known vendor’s FORTRAN and C
`compilers to target compilation of computationally intensive
`portions of applications for FPGAs. Part of the compilation
`process is the generation of a data flow graph (DFG) and the
`dependency analysis of the application code. The DFG and
`dependency analysis are used to define the layout of
`programmable logic in the FPGAs and data layout in on-board
`memory. Two execution-time components will be generated.
`The first is the hardware logic defined from the data and
`control flow analysis. The second is the definition of a
`“Command List”, or COMLIST, for the execution control of
`the User Array environment and issuance of direct memory
`access (DMA) instructions.
`
Current hardware design can be accomplished through the use
of Hardware Definition Languages (HDLs). An HDL is a
high-level language similar in concept to a software high-level
language such as C. The process of converting the HDL into a
hardware design consisting of wires, transistors, etc. is called
"synthesis". Synthesis is somewhat analogous to normal
software compilation.
`
There are a great number of tools and products available that
work with various HDLs. The compilation strategy for the MAP
is to leverage the ready availability of synthesis tools for HDLs
(see Figure 4). The basic premise is that compilation for the
MAP shall consist of translating the software language of the
user (i.e., C or FORTRAN) into an HDL representation. This
HDL representation can readily be processed by existing
hardware design toolsets into the bitstream used to configure
the User Logic (U_Logic).
`
The chosen HDL target language is VERILOG. The VERILOG
language allows representation of hardware design at various
levels of abstraction. At the lowest level, it is possible to write
VERILOG code that represents individual gates and wires. At
the opposite extreme, one can write so-called "behavioral"
code, which has a high degree of abstraction. In fact, it is
possible to write behavioral code that cannot be synthesized
into hardware. Such code is useful for the purposes of
simulation.
`
It is not the intent of the compiling strategy that the synthesis
of VERILOG shall include the synthesis of floating-point
operations and the like. One of the primary constraints for the
compiler is compile time: the user of the compiler will expect
compilation to be done quickly. This presents a significant
challenge in the "place and route" process. Since the compiler
will be using existing commercial tools, the compiling system
must function in such a way as to make the job of these tools
as easy as possible.
`
To facilitate the synthesis process, the compiler-generated
VERILOG code shall consist solely of the instantiation and
hookup of an existing set of pre-defined VERILOG modules.
A VERILOG module is similar in concept to a software
subprogram, or subroutine. Importantly, experienced hardware
designers can create these pre-defined modules at a very low
design level, in a pre-placed manner. This makes them
efficient in terms of speed and resource usage, and, since they
are "relationally placed", the place and route portion of the
final synthesis can proceed faster.
`
The number and type of pre-defined modules is relatively
small. Just as normal software compilation builds the behavior
of the user's program from a small set of basic op-codes, so too
can the U_Logic configuration be built up from a finite set of
pre-defined modules, such as integer/floating-point add,
multiply, divide, etc.
`
`The development of domain-specific modules (e.g. Convolve,
`FFT) that can be used as building blocks in the design of
`FPGA-based programs is a necessity. These modules provide
`a bridge between the algorithm developer and the hardware
`logic “designer”. The module will have a counterpart in an
`FPGA macro library. These macros will be incorporated into
`the DFG and used in the hardware definition language (HDL).
`Examples of potential libraries for the generation of macros are
signal and image processing and linear algebra routines. In
addition, application- or user-specific modules can be defined
and added to an FPGA macro library set.
`
`The ability of the user to optionally manipulate the compiler-
`generated logic is extremely important. This manipulation step
`must be available for those customers that want to get the last
`“drop” of algorithmic performance out of the hardware. The
DFG generated by the high-level language (HLL) compiler
will be an optional output. This graph will be modifiable in a
DFG Editor to provide potentially higher levels of
optimization.

Figure 4. Diagram of the SRC Compiler Process
`
The on-board memory of the MAP will be used like the
software-controlled cache of a more conventional
microprocessor. An added bonus is that this "cache" can have
different user-defined caching strategies. The data access and
compute strategies will maximize temporal locality with
respect to the use of resident, on-board data and will overlap
computation on this data with streaming in the "working set"
of data for the next phase of the computation. We will generate
both the Common Memory DMA instructions and the on-board
data layout from the data-flow/dependence analysis.
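
The compute/streaming overlap described above is, in effect,
double buffering. The following is a generic sketch of that
pattern (routine names are hypothetical, for illustration only;
the paper does not define this interface):

C     Generic double-buffering sketch: compute on the resident
C     block while the next "working set" streams into the other
C     buffer. All routine names are hypothetical.
      integer icur, inext, itmp, iblk, nblk
      icur  = 1
      inext = 2
      do 100 iblk = 1, nblk
         if (iblk .lt. nblk) then
            call dma_read_start (buf(1,inext), iblk+1)  ! stream next block in
         end if
         call compute_block (buf(1,icur))               ! work on resident block
         call dma_wait ()                               ! next block now resident
         itmp  = icur                                   ! swap buffers
         icur  = inext
         inext = itmp
100   continue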
`
Furthermore, the provision for rapid prototyping of the
algorithm in MAP is a desirable tool for the optimization
process. We are investigating potential tools for DFG
translation for input to commercial packages; e.g.,
MATLAB/Simulink could provide easy access to the
prototyping and optimizations that a wide variety of people
use today.
`
The resources of the U_Logic FPGA chips in the MAP are
finite. It will be a simple matter to compile user code into a
final U_Logic configuration that exceeds the physical
resources available in the chips. There will obviously have to
be some feedback from various portions of the synthesis tools
to the compilation system. The compilation system will also
have to apply various heuristics in regard to precisely what
portions of the user code should be targeted for the MAP in
the first place, and what sorts of transformations might be
applied to the user code to allow for optimal performance. The
feedback from the synthesis tools and the compilation
heuristics can only be determined from direct experience.
Initial versions of the compiler will necessarily omit this
functionality. Adding this functionality based on the learned
experience represents a significant portion of the overall effort
required to create a mature compiler.
`
8. ALGORITHM STUDIES
The benefit of using reconfigurable computing has been shown
in many domain areas [1][2][5]. Several algorithms will be
reviewed to show the performance improvements over RISC
and microprocessor-based systems. The definition of the
hardware logic for these algorithms started with DFGs from
our HLL compiler. The MAP compiler generates an equivalent
DFG that it uses for the generation of HDL. Algorithm
performance on MAP is from hardware simulation of the
logic. In addition, optimization techniques and experiences
will be discussed relative to compiler-generated HDL.
`
The SRC-developed compiler is in a prototype stage of HDL
generation. The compiler has demonstrated the ability to
import pre-defined function unit modules and generate logic
with correct results. Development on the compiler is
proceeding in a directed, methodical manner. It does not yet
fully compile the discussed cases. It can be expected that
compilation of the following cases will be achieved by the time
of the presentation of this paper at the SC01 conference.
`
8.1 Case 1: Convolution – Zero Phase Filter
This algorithm has been discussed briefly earlier in the paper.
Let's examine the 32b floating-point convolution problem with
a 64-point zero phase filter. The critical challenge for the
compiler is to know how to take a simple convolution code and
define a method for the scheduling of operations that takes
advantage of the opportunity to use a large number of
multiply/adds relative to a traditional processor
implementation. The MAP can read up to 384b of data from
on-board memory every clock.
`
The existing compiler's HDL implementation would utilize
only a single input data element every clock, and the
performance at this level of optimization would be 2
Flops/clock. The memory delivers two 32b input data values
to the MAP logic every clock; if the logic can process only one
element per clock, it has to stall for a clock in order to
consume the second input value. Therefore, the processing rate
can only create an output data element every other clock.
`
`In order to optimize performance relative to the reading of the
`input data, hardware logic was developed that loaded the data
`appropriately into two sets of shift registers for the even and
`odd data elements. Two processing streams were defined to
`process the even and odd input data elements.
`
`Figure 5 shows how the input data is loaded into the shift
`registers for the even/odd input data values. The shift registers
`are loaded up from the read of the first thirty-two 64b data
`values from memory.
`
`Figure 5. Data Streams Going into Shift Registers
`
`
`
`
`After the shift registers are full, the multiplication of the data
`values with the filter coefficients can start. The process is
`pipelined so that the next input values will go into the shift
`registers. The generation of the output points is shown in
`Figure 6 as the contribution of the two shift registers.
`
This strategy is similar to the loop unrolling that compilers can
often generate. We are in the process of developing heuristics
for the compiler so that it can automatically generate this level
of optimization strategy.
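
As a rough sketch of what that unrolling looks like in source
form (an assumed rendering, not actual compiler output), each
pass produces one even and one odd output point, matching
the two processing streams fed by the shift registers:

C     Even/odd two-stream form of the filter loop (sketch):
C     two output points are produced per outer-loop pass.
      do 100 n = 1, nout, 2
         sume = 0.
         sumo = 0.
         do 200 j = 1, ncoef
            sume = sume + data_in(n+j-1) * filter(j)
            sumo = sumo + data_in(n+j)   * filter(j)
200      continue
         data_out(n)   = sume
         data_out(n+1) = sumo
100   continue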
`
The computational performance on MAP for this convolution
algorithm will be 256 flops/clock. This compares to 2
flops/clock on the Alpha EV6.
`
[Figure 6 residue: on each clock, the odd-value and even-value
shift registers are multiplied point-wise by the filter
coefficients F1 through F64 and summed, each register
contributing one output point; the register contents advance by
two input elements per clock.]

Figure 6. Processing Streams for Data
`
`
`
`
8.2 Case 2: Routine SCORE from MAXSEGS
MAXSEGS is a Smith-Waterman implementation of a genetic
sequence alignment algorithm. The focal point of this analysis
has been the routine SCORE. The routine, developed by Alex
Ropelewski [4] of the Pittsburgh Supercomputing Center, is a
fully vectorized version of the Waterman-Eggert dynamic
programming algorithm. The computationally intensive portion
of SCORE is shown in the following code segment.
`
      do 200 k = 1, ls1+ls2-1
         i = min(k,ls2)
         j = max(1,(k+1-ls2))

         do 210 index = start(k), end(k), ls2
            left = index - (ls2+1)
            diag = (index - (ls2+1)) - 1
            up   = index - 1

            extgvi(i) = max(0, extgvi(i)+gap,
     &                      simils(left)+gap+newgap)
            extgvj(j) = max(0, extgvj(j)+gap,
     &                      simils(up)+gap+newgap)

            simils(index) = max(simils(diag) + wt(s1(j),s2(i)),
     &                          extgvi(i), extgvj(j), 0)
            i = i - 1
            j = j + 1
210      continue
200   continue
`
The goal for any algorithm in the MAP is to create a data flow
that will maximize the use of input data and computational
results within a single pass of the logic flow. The challenge
with this code is to create a data flow that maximizes the use
of the 32-bit data coming from the input arrays EXTGVI,
EXTGVJ and SIMILS. As explained in Ropelewski et al., the
key lies in the diagonal access patterns of the algorithm.

Arrays are accessed across the diagonals of the matrix. The
values of start and end shown in the code segment are the
starting and ending diagonal values, (1,1) and (4,4),
respectively.
`
The nested loops are unrolled by a factor of two to utilize the
two 32-bit values read from on-board memory in the MAP. A
macro can be made for the inner loop computation of
EXTGVI, EXTGVJ and SIMILS. The logic flow for the
computational macro is shown in Figure 7.

The hardware implementation of SCORE is pipelined (see
Figure 8). The execution of the SIMILS macro can take a new
value every clock and generate the values