Proceedings of the 2nd ACM International Workshop on FPGAs, Berkeley, CA, Feb. 1994

Programming the Hawaii Parallel Computer

Richard Halverson, Jr.
Art Lew

University of Hawaii at Manoa
Honolulu, HI
Abstract

A new application of field programmable gate arrays is featured in the prototype of the University of Hawaii parallel computer (HPC). User programs are compiled and then executed partly or completely in one or more field programmable gate arrays (FPGAs). Some compile-for-FPGA systems have yet to effectively implement full high-level language loop constructs. In this paper we show how the conditional logic used for generating jump addresses can be consolidated into one jump address calculation and computed in an FPGA. In addition, we show how this coprocessing can be easily added to reduce the number of memory transactions and cycles it takes to execute a program. An overview of the HPC shared memory architecture is presented. A shortest path programming example begins by describing the steps for automatically generating the FPGA source code and concludes with the results of a load-store analysis comparing execution with and without the FPGA.
1. Introduction

The HPC prototype is a 4" by 6" five-slot platform containing its own microprocessor bus into which processor/computers can be plugged. The system was built for demonstration purposes only, at an absolute minimal cost. One slot is reserved for the main processor board, which is an Intel MCS-51 8031 microprocessor with bus master logic and an RS-232 port for connecting to a host IBM PC. Subordinate processors plug in and interface through an expandable two-port RAM on each processor board, which allows the system to function like a cache-only shared memory multiprocessor. Present hardware allows configurations of up to three parallel 8031 microprocessors and one XILINX board. The XILINX board contains five XILINX 3090 FPGAs and five 8K-byte static RAMs. Parallel programs can be written for execution (a) exclusively on the 8031 array in SIMD fashion, (b) exclusively on the XILINX board, or (c) partly on the 8031s and partly on the XILINX board.
One difficulty in executing programs completely in reprogrammable gate arrays is handling looping. Many compile-for-FPGA projects have made great progress towards synthesizing logic from C-like blocks within loops of user programs, but much less progress towards implementing all varieties of looping mechanisms found in today's high-level programming languages. Since FPGAs still offer quite minimal reprogrammable gate counts, the added parallel processing with the custom coprocessor approach is still necessarily fine-grained. One problem with these fine-grained systems is that the resulting program often requires an increase in the number of transactions across the system bus, because operands still must be loaded and retrieved between the coprocessor and the main processor.

For example, the PRISM-II platform contains an Am29050 main processor with slots for several triple XILINX 4010 boards for custom coprocessing. So far, they have reported quite good results on "single-pass" functions without loops [1]. Single-pass functions pose a problem because the main processor must transfer the operands and results back and forth from memory, which may increase memory transactions overall. A goal of PRISM-II is to implement all the loop constructs of C, which will increase the grain size and surely reduce the overall number of memory transactions.
A Reconfigurable Processor Unit (RPU) described in [2] is an array of reprogrammable FPGAs attached to a memory. Their overall goal of compiling a subset of C into FPGA code appears similar to PRISM's, but they instead seem to have focused on implementing specific parallel models with limited and specialized constructs for looping. Unlike PRISM, however, it appears that an RPU is capable of fetching its own data from memory and writing results back, which eliminates any extra load-store transactions by the main processor. The RPU as described appears most similar to our XILINX board.
Splash 2 contains one or more boards, each with an array of 16 well-connected XILINX 4010 chips [3]. The architecture does an excellent job supporting pipelined and SIMD processor configurations. Splash 2, for example, can be programmed in dbC, which is a superset of C used on other SIMD computers. The dbC preprocessor produces C that runs on the Sun, and VHDL which defines SIMD processors with an instruction set tailored to the application, one or more of which fit into each XILINX chip. When the actual program executes, looping is still handled in the Sun, which transmits SIMD instructions to the Splash 2 board(s).
An Anyboard [4] is a highly configurable six-XILINX-chip board which plugs into an IBM PC. Since Anyboard is for prototyping hardware, their SOLDER language (similar to C) does provide if-then constructs, but other program control constructs of C would have limited value. This is because when programming in the Anyboard environment, the user thinks in terms of designing hardware, whereas in the other compile-to-FPGA projects (including ours), the goal is to translate high-level language programs (where the user is thinking about writing software). Anyboard's design mapping tool for partitioning a design across many chips, however, would be useful in any compile-to-FPGA project where a compiled function may consume more than one FPGA.

Chameleon is a workstation [5] based on LSI Logic's LR33000 32-bit RISC processor that has a Configurable Array Logic (CAL) array of more than 6,000 gates attached to the system bus. The CAL array can be configured as a coprocessor with its own memory and I/O. The Debora language used to program the logic array is C-like, but intended for describing the state transitions of sequential logic. All statements execute in parallel except those "guarded" using the IF construct. Except for the IF statement, there are no other traditional language constructs for defining control flow.
One way to use the XILINX board on an HPC system is for it to compute jump addresses from the variable operands of conditional expressions in a program. This can be performed mechanically by first representing the program as a decision table. The computations are boolean in nature, and implementing these operations in an FPGA is straightforward. Connecting the FPGA directly to the system bus allows registers to capture operands as they are written to memory, so operands are loaded into the FPGA coprocessor with no extra loads or stores by the main processor.

This paper describes how the looping control for any high-level language program can be implemented totally in a reprogrammable FPGA to increase performance and reduce processor bus transactions. Section 2 begins by reviewing decision tables. Section 3 describes the HPC architecture and its execution model. Section 4 explains the XILINX chip interface to the system bus and how the chips are programmed. Section 5 introduces the shortest path program, which we hand compile into MCS-51 processor code and a PALASM definition source file, the latter of which is compiled using the XILINX tools. Section 6 shows the results of the XILINX compilation. Section 7 gives the results of a load-store analysis comparing optimum processor load-store counts with and without the FPGA. Section 8 concludes with our current and future research plans.
2. Decision Tables

The decision table is used as the high-level programming language to compile because (a) it simplifies loop address computation for FPGA implementation, (b) it is in fact more expressive than conventional imperative languages, and (c) it is general purpose, in that it has been shown previously that any computer program can be expressed in a decision table form [6].

As Figure 1 illustrates, a decision table consists of four quadrants. The upper left contains the condition stubs, which are expressions that can be evaluated all at once. The upper right quadrant lists the condition entries, which define columns of possible expression result combinations. Multiple 'T' entries in a column indicate logically ANDed condition stubs which must be true for the "rule" (column) to fire. The lower right quadrant contains the action entries, which indicate row by row, with X's, which action stub statements (in the lower left quadrant) are to be executed when the rule fires. Note that the right half of the table (the entry table) is simply an AND-OR array, containing boolean inputs and outputs. The process of translating any program (i.e., flowchart) into a decision table is mechanical and explained in [6].
[Figure 1. Decision Table. The four quadrants: (1) CONDITION STUBS: execution begins by evaluating the conditional expressions, giving true/false results. (2) CONDITION ENTRIES: 'T' or 'F' entries indicate which results, ANDed, will cause rules 1...n to fire. (3) ACTION ENTRIES: 'X' entries in a rule column indicate which statements to the left to execute. (4) ACTION STUBS: assignment statements execute if an 'X' action entry appears for the rule selected. The ENTRY TABLE (the condition and action entries) is an AND-OR array: columns define rules; condition results cause rules to fire, causing action statements to execute.]
A decision table program executes by first evaluating all the condition stubs simultaneously. The results feed the entry table logic, which selects a unique rule. This rule selection will be performed completely in the FPGA, with no processor intervention. The selection of the rule defines a program entry point address for the processor. There, code for all the action stubs in the selected entry column is executed. The whole process repeats (starting with the reevaluation of the condition stubs) until a selected rule causes the program to terminate. A special loop variable lambda helps manage the iteration process. An example of a Pascal program translated into a decision table is shown in Section 5.
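To make this cycle concrete, the following is a minimal software sketch of decision table execution, written in Pascal (the source language of the example in Section 5). The one-variable program it encodes is hypothetical; on the HPC, step 2 (rule selection) is exactly the part that moves into the FPGA.

  program DTSketch;
  { Decision table for the hypothetical program:             }
  {   x := 0; while x < 3 do x := x + 1                      }
  { Condition stubs: lambda = 0; x < 3.                      }
  var
    lambda, x, rule: integer;
  begin
    lambda := 0;
    repeat
      { 1. evaluate all condition stubs at once              }
      { 2. the entry table (an AND-OR array) selects a rule  }
      if lambda = 0 then
        rule := 0
      else if x < 3 then
        rule := 1
      else
        rule := 2;
      { 3. execute the action stubs marked 'X' for that rule }
      case rule of
        0: begin x := 0; lambda := 1 end;   { initialize }
        1: x := x + 1;                      { loop body  }
        2: begin end                        { exit rule  }
      end
    until rule = 2;
    writeln('x = ', x)                      { prints: x = 3 }
  end.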
3. HPC Architecture

The Hawaii Parallel Computer is a type of cache-only shared memory multiprocessor. Processors may be heterogeneous. As shown in Figure 2, the architecture can consist of p processors, designated mP 0 through mP p–1. Each is attached to a two-port memory, which can be read and written asynchronously by the main CPU (mP 0) on one side, and by the single subordinate processor on the other.

[Figure 2. HPC Architecture. The main CPU (mP 0), with its RAM and I/O, addresses the "shared" memory: a bank of two-port RAMs, each mapped at 0...(m–1), behind which sit the subordinate processors mP 1 through mP p–1 and the XILINX chips. The XILINX board selects the next rule(s) to execute and is programmed using PALASM boolean equations; the processors are programmed in MCS-51 processor code.]

All the RAMs map into the same address space on the main CPU side (0..m–1), so data can be written to all RAMs simultaneously. (Each RAM can also be addressed separately.) The RAM outputs on the main CPU side are wire-ORed, so as long as subordinate processors write to exclusive locations, the main CPU will be able to read subordinate processor output without having to keep track of which processor wrote it. Although this means that each processor has free read-write capability to and from the RAM, interprocessor communication requires intervention by the main CPU. Before one processor can read what another one wrote, the main CPU must read and re-write (i.e., "refresh") the location, so that all RAMs contain the same information (at that location).

Figure 3 illustrates the HPC decision table execution model. A rule is provided by the FPGA chip, which is "refreshed" to be visible to all processors. If a processor has code to execute for that particular rule, it does so, independently of other processors. When the processor is complete, it waits for the next rule to be issued.

[Figure 3. HPC Execution Model. START leads to Select Rule, computed in the XILINX chip; Rules 0 through n–1 are executed on one or more 8031s, after which control returns to Select Rule; Rule n leads to EXIT.]

As the model illustrates, more than one processor can execute in parallel. This allows the HPC to support research in adapting this model to shared memory multi-microprocessor programming. In the shortest path example below, however, only one processor will be necessary.
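As a concrete illustration of the wire-ORed reads, write-to-all stores, and the refresh operation just described, here is a minimal simulation sketch in Pascal; the bank size, addresses and values are hypothetical.

  program RefreshSketch;
  { A software model of the HPC shared memory (Section 3):   }
  { P two-port RAMs, wire-ORed on the main CPU side.         }
  const
    P = 3;    { subordinate processors (hypothetical) }
    M = 16;   { words per RAM (hypothetical)          }
  var
    ram: array[0..P - 1, 0..M - 1] of byte;
    k, a: integer;

  function MainRead(addr: integer): byte;
  { a main-CPU read sees the wire-ORed outputs of all RAMs   }
  var
    k: integer;
    v: byte;
  begin
    v := 0;
    for k := 0 to P - 1 do
      v := v or ram[k, addr];
    MainRead := v
  end;

  procedure MainWrite(addr: integer; v: byte);
  { a main-CPU write goes to all RAMs simultaneously         }
  var
    k: integer;
  begin
    for k := 0 to P - 1 do
      ram[k, addr] := v
  end;

  begin
    for k := 0 to P - 1 do            { clear all RAMs }
      for a := 0 to M - 1 do
        ram[k, a] := 0;
    ram[1, 5] := 42;                  { subordinate 1 writes location 5 }
    MainWrite(5, MainRead(5));        { the main CPU "refreshes" it     }
    writeln(ram[2, 5])                { now subordinate 2 sees 42       }
  end.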
4. XILINX Board Programming

Eventually, the programs which configure the FPGAs will be produced automatically by a compiler from a high-level programming language; therefore, it was best to select a text-based boolean equation method for defining the FPGA logic. Several hardware description languages exist for VLSI designs. We chose PALASM because it is a flexible yet simple language, originally developed in the 1970s for designing field programmable medium scale integrated circuit chips. In order to provide a standard PALASM program format for different chip sizes and pin configurations, an FPGA interface circuit "shell" can be used, as illustrated in Figure 4. This interface circuit contains all the specific pin assignment details for a particular FPGA part and its interconnect with the other circuit components. Within it is defined a PALASM module which contains the program for latching data when written, computing the expressions, and multiplexing the results out. The PALASM macro, with pins defined as shown, uses the PALASM ".PDS" file format. Address, control, data in, and data out pin names are predefined, in order, using the same names as in the outer schematic. Following the EQUATIONS statement appear the compiler-generated boolean equations which define the input registers, combinational logic for computing the expressions, and output multiplexers for the particular user program. Tools for programming XILINX chips are described in more detail in [7].
PALASM File:

  TITLE XXXXXXX
  AUTHOR HPC PROGRAMMER
  DATE MMMMM YY, 19ZZ

  CHIP XXXXX LCA

  ;ADDRESS/CONTROL INPUT (18 PINS)
  A1 A2 A3 A4 A5 A6 A7
  A8 A9 A10 A11 A12 A13 A14 A15
  RDC WRLC WRHC ;READ, WRITE LOW, HIGH BYTE

  ;DATA INPUT (18 PINS)
  DI0 DI1 DI2 DI3 DI4 DI5 DI6 DI7
  DI8 DI9 DI10 DI11 DI12 DI13 DI14 DI15
  GCLK ZERO ;GLOBAL CLOCK, LOGICAL ZERO

  ;DATA OUTPUT (16 PINS, ACTIVE LOW)
  DO0 DO1 DO2 DO3 DO4 DO5 DO6 DO7
  DO8 DO9 DO10 DO11 DO12 DO13 DO14 DO15

  ;INTER-FPGA DAISY-CHAIN (2 PINS)
  DOEIN ;DATA IN FROM PREVIOUS CHIP
  DOE ;DATA OUT TO NEXT CHIP

  EQUATIONS

  ;(compiler generated: address select logic, input registers,
  ; expression logic, output multiplexers)

  ; END

[The surrounding Viewlogic schematic wires the FPGA pins of a particular part (e.g., XC3090PG175) for address, read and write controls, data in, data out, and the inter-FPGA daisy-chain to the PALASM .PDS module above.]

Figure 4. Programming the FPGA
5. Example: Shortest Path Program

To illustrate program compilation and execution of a decision table, a shortest path program which first appeared in [8] will be used. Figure 5 shows a Pascal version of a shortest path algorithm for a topologically sorted graph with distances stored in a two-dimensional array d. As we see, the first two statements after the begin statement will be executed once. The body of the outer while loop (including the comparison of i with 1) will be executed n–1 times. The inner while loop body is reached on each of the n–1 outer iterations and executes n/2 times on average, for a total of n(n–1)/2 times. Within the innermost loop, the if test and the j increment always occur, with the body of the if-then executing depending on characteristics of the data.

Figure 6 shows an equivalent decision table [6]. Rule 0 executes once at the beginning, which sets lambda=1. In the second pass, with lambda=1, either Rule 1 or 6 will execute, depending on whether i >= 1. Rule 1 will execute n–1 times, each time setting lambda=2. Rule 6 will execute once at the end. When lambda=2, Rules 2 and 5 compete, depending on whether j <= n. Rule 2 executes n(n–1)/2 times while Rule 5 executes n–1 times. If j <= n and lambda=2, then lambda is set to 3, causing Rule 3 or 4 to execute next. In the worst case, d[i,j]+f[j] < min causes Rule 3 to execute n(n–1)/2 times, with Rule 4 executing not at all. In the best case, Rule 3 executes n(n–1)/2–(n–1) times while Rule 4 executes n–1 times. These iteration equations for the loops in Pascal and rules in the decision table will be used for the load-store analysis described in Section 7.
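As a quick check of these counts, take n = 4: Rule 0 fires once; Rule 1 fires n–1 = 3 times; Rule 2 fires n(n–1)/2 = 6 times; Rules 3 and 4 together also fire 6 times; Rule 5 fires 3 times; and Rule 6 fires once, for 20 rule selections in all.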
6. Compiling

Figure 7 contains the memory map and a flow chart showing the assignment statements which must be compiled into MCS-51 assembly language. The d array is allocated starting at location 0000. The f and t arrays are above that, at 3200 and 32A0. The scalar variables are allocated starting at 3340 with n, followed by i, j, min, ptr, lambda, Rule•3, "d[i,j]" and "f[j]". "d[i,j]" and "f[j]" are specially allocated locations which hold the values of d[i,j] and f[j] for the current values of i and j. They are required because they are needed in the FPGA calculation of the rule, as are all variables marked with an asterisk (*). These variables will be latched in registers in the FPGA whenever written. The caret (^) indicates an output of the FPGA, which in this case is the selected next rule to execute multiplied by 3, which will serve as an offset into a jump table in the 8031 program (as LJMP instructions are 3 bytes long).

The 8031 program selects a rule by reading memory location 334E into the accumulator, which contains the next rule to execute multiplied by 3. It then performs a jump into a table of jump instructions to the code for that rule. When the rule execution is complete, it jumps to the top to select the next rule. As shown in the diagram, executing Rule 6 causes the program to terminate.
Since "d[i,j]" and "f[j]" are only used when lambda=3, they need to be updated only when lambda is set equal to 3. The only time this occurs is in Rule 2, so as shown in Figure 7, only Rule 2 must contain code to update "d[i,j]" from d[i,j] and "f[j]" from f[j]. Rule 3 executes when lambda=3 and is the only rule that uses d[i,j] and f[j]; therefore it can use "d[i,j]" and "f[j]" to save having to compute the addresses of d[i,j] and f[j] over again.
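In software terms, the 8031's select-and-dispatch cycle, together with the actions of Figures 6 and 7, behaves like the Pascal sketch below. Here the rule is selected by an ordinary case analysis; on the HPC that selection is the Rule•3 value the FPGA drives onto the bus when location 334E is read. The three-node graph used to exercise it is hypothetical.

  program RuleLoop;
  { The shortest path decision table (Figure 6), executed     }
  { rule by rule as the 8031 program of Figure 7 does.        }
  const
    NMax = 80;
  var
    d: array[1..NMax, 1..NMax] of integer;
    f, t: array[1..NMax] of integer;
    n, i, j, min, ptr, lambda, rule: integer;
  begin
    { a hypothetical 3-node topologically sorted graph }
    n := 3;
    d[1, 2] := 1; d[1, 3] := 5; d[2, 3] := 2;
    lambda := 0;
    repeat
      { select the rule from lambda and the condition stubs }
      case lambda of
        0: rule := 0;
        1: if i >= 1 then rule := 1 else rule := 6;
        2: if j <= n then rule := 2 else rule := 5;
        3: if d[i, j] + f[j] < min then rule := 3 else rule := 4
      end;
      { execute the action stubs of the selected rule }
      case rule of
        0: begin f[n] := 0; i := n - 1; lambda := 1 end;
        1: begin min := maxint; ptr := n + 1; j := i + 1; lambda := 2 end;
        2: lambda := 3;  { Figure 7 also updates "d[i,j]" and "f[j]" here }
        3: begin min := d[i, j] + f[j]; ptr := j; j := j + 1; lambda := 2 end;
        4: begin j := j + 1; lambda := 2 end;
        5: begin f[i] := min; t[i] := ptr; i := i - 1; lambda := 1 end;
        6: begin end     { exit }
      end
    until rule = 6;
    writeln('f[1] = ', f[1])   { shortest distance from node 1: 3 }
  end.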
begin
  f[n] := 0;
  i := n - 1;
  while i >= 1 do
    begin
      min := maxint;
      ptr := n + 1;
      j := i + 1;
      while j <= n do
        begin
          if d[i,j] + f[j] < min then
            begin
              min := d[i,j] + f[j];
              ptr := j
            end; {if}
          j := j + 1
        end; {while j <= n}
      f[i] := min;
      t[i] := ptr;
      i := i - 1
    end; {while i >= 1}
end

Figure 5. Pascal Shortest Path
MEMORY MAP

  0000    d[80,80]
  3200    f[80]
  32A0    t[80]
  3340*   n
  3342*   i
  3344*   j
  3348*   min
  334A    ptr
  334C*   lambda
  334E^   Rule•3
  3350*   "d[i,j]"
  3354*   "f[j]"

* Whenever a * memory location is written, the datum is at the same time captured in an FPGA register, which automatically causes Rule•3 to recompute.

^ The computed Rule•3 value is gated out of the FPGA onto the data bus when address 334E is read. The FPGA value is ORed with RAM, which must contain 00 at this location.

8031 PROGRAM

  START
  SELECT RULE: read Rule•3^; jump to the selected rule

  Rule 0: f[n] := 0; i* := n–1; lambda* := 1
  Rule 1: min* := maxint; ptr := n+1; j* := i+1; lambda* := 2
  Rule 2: "d[i,j]"* := d[i,j]; "f[j]"* := f[j]; lambda* := 3
  Rule 3: min* := "d[i,j]"+"f[j]"; ptr := j; j* := j+1; lambda* := 2
  Rule 4: j* := j+1; lambda* := 2
  Rule 5: f[i] := min; t[i] := ptr; i* := i–1; lambda* := 1
  Rule 6: EXIT

  (After each rule other than Rule 6, control returns to SELECT RULE.)

Figure 7. Memory Map and 8031 Program
  Rule:                  0 1 2 3 4 5 6

  lambda =               0 1 2 3 3 2 1
  i >= 1                 - T - - - - F
  j <= n                 - - T - - F -
  d[i,j]+f[j] < min      - - - T F - -

  f[n] := 0              X - - - - - -
  i := n - 1             X - - - - - -
  min := maxint          - X - - - - -
  ptr := n + 1           - X - - - - -
  j := i + 1             - X - - - - -
  min := d[i,j] + f[j]   - - - X - - -
  ptr := j               - - - X - - -
  j := j + 1             - - - X X - -
  f[i] := min            - - - - - X -
  t[i] := ptr            - - - - - X -
  i := i - 1             - - - - - X -
  exit                   - - - - - - X
  lambda :=              1 2 3 2 2 1 %

Figure 6. Decision Table Shortest Path
The calculation of the next rule (and its multiplication by 3) is performed in the XILINX chip. As Figure 7 indicates, within the chip are allocated registers for lambda, i, j, n, "d[i,j]", "f[j]" and min. They are clocked from the write signal and enabled by address bus decoders indicating when their respective memory locations are being written by the processor. This means that the register locations always contain the most up-to-date values, without any extra memory transactions being required of the processor. Whenever a register changes, Rule•3 is automatically recalculated in combinational logic.
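The effect of this write-capture can be modeled in a few lines. The sketch below is a hypothetical software analogue of the hardware just described, using two of the watched addresses from Figure 7 and a stubbed-out RAM write.

  program CaptureSketch;
  { A software analogue of the FPGA input registers: a normal }
  { store to a watched address also latches the datum into a  }
  { shadow register, so the coprocessor is kept current with  }
  { no extra loads or stores.                                 }
  const
    AddrI = $3342;   { address of i, from Figure 7 }
    AddrJ = $3344;   { address of j, from Figure 7 }
  var
    regI, regJ: integer;   { the FPGA's shadow registers }

  procedure StoreWord(addr, v: integer);
  begin
    { ...the ordinary RAM write would happen here...       }
    { the address decoders enable the matching latch:      }
    if addr = AddrI then
      regI := v
    else if addr = AddrJ then
      regJ := v;
    { in hardware, any latch change then recomputes Rule•3 }
    { combinationally, costing the processor nothing       }
  end;

  begin
    StoreWord(AddrI, 7);   { the program's normal store of i }
    writeln('captured i = ', regI)
  end.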
As Figure 8 suggests, prespecified "operator macros" are useful for automating the generation of the PALASM code. U1 is "instantiated" from a generic 8-bit-in, 1-bit-out ">0" macro. U2's output says whether its two inputs are equal. U3's two inputs sum to its output, which feeds U4. The single-bit outputs of U1, U2 and U4 are the results of the three condition stubs in the decision table. These three bits, along with the L1 and L0 bits of lambda, feed U5, which determines which rule to execute next.
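The U5 mapping can be checked in software. The sketch below transcribes the PALASM equations of Figure 8 (shown below) into Pascal and evaluates two of the truth-table rows; note that bit R4 contributes 16, so, for example, lambda=01 with C1 false yields 18 (Rule 6 times 3).

  program Rule3Check;
  { Evaluates the Figure 8 PALASM equations for Rule•3. }
  function Rule3(L1, L0, C1, C2, C3: boolean): integer;
  var
    R0, R1, R2, R3, R4: boolean;
  begin
    R0 := (not L1 and L0 and C1) or (L1 and L0 and C3)
          or (L1 and not L0 and not C2);
    R1 := (not L1 and L0 and C1) or (L1 and not L0 and C2)
          or (L1 and not L0 and not C2) or (not L1 and L0 and not C1);
    R2 := (L1 and not L0 and C2) or (L1 and L0 and not C3)
          or (L1 and not L0 and not C2);
    R3 := (L1 and L0 and C3) or (L1 and L0 and not C3)
          or (L1 and not L0 and not C2);
    R4 := not L1 and L0 and not C1;
    Rule3 := ord(R0) + 2 * ord(R1) + 4 * ord(R2)
             + 8 * ord(R3) + 16 * ord(R4)
  end;

  begin
    { lambda=01 (L1=0, L0=1), i >= 1 true: Rule 1, so Rule•3 = 3 }
    writeln(Rule3(false, true, true, false, false));
    { lambda=10, j <= n false: Rule 5, so Rule•3 = 15            }
    writeln(Rule3(true, false, false, false, false))
  end.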
[Figure 8. XILINX Rule•3 Computation. The FPGA input registers, with their memory addresses, are lambda (334C), i (3342), j (3344), n (3340), "d[i,j]" (3350), "f[j]" (3354) and min (3348). They feed the operator macros U1 (">0", producing C1), U2 ("=", producing C2), U3 ("+", summing "d[i,j]" and "f[j]") and U4 ("<", comparing the sum with min, producing C3). C1, C2 and C3, along with the lambda bits L1 and L0, feed U5, whose Rule•3 output is gated onto the data bus at address 334E.]

U5 truth table (L1 L0 and C1 C2 C3 select Rule•3):

  00  - - -    0000 0000   (Rule 0)
  01  T - -    0000 0011   (Rule 1)
  10  - T -    0000 0110   (Rule 2)
  11  - - T    0000 1001   (Rule 3)
  11  - - F    0000 1100   (Rule 4)
  10  - F -    0000 1111   (Rule 5)
  01  F - -    0001 0010   (Rule 6)

BIT BY BIT PALASM EQUATIONS FOR RULE•3:

  R0=/L1*L0*C1+L1*L0*C3+L1*/L0*/C2
  R1=/L1*L0*C1+L1*/L0*C2+L1*/L0*/C2+/L1*L0*/C1
  R2=L1*/L0*C2+L1*L0*/C3+L1*/L0*/C2
  R3=L1*L0*C3+L1*L0*/C3+L1*/L0*/C2
  R4=/L1*L0*/C1
  R5=0
  R6=0
  R7=0

U5 in Figure 8 shows a truth table of what Rule•3 should contain depending on the values of L1, L0, C1, C2 and C3. We notice that the lowest-order bit of Rule•3, R0, is a 1 when lambda=01 and C1=T, or when lambda=11 and C3=T, or when lambda=10 and C2=F, as is reflected in the PALASM equation for R0. The second lowest-order bit, R1, is a 1 when lambda=01 and C1=T, or when lambda=10 and C2=T, or when lambda=10 and C2=F, or when lambda=01 and C1=F. The reader may note that the equations for Rule•3 can be minimized a great deal. This is not necessary because these functions are implemented as truth tables inside the XILINX CLBs anyway.

The PALASM code for implementing the flow diagram in Figure 8 was about 10 pages. Figure 9 lists the important compilation statistics using the XILINX PDS to XNF file (PDS2XNF) tool and the automatic place and route (APR) tool. The 386 gates reflect the number of 1-, 2-, 3-, 4- or 5-input gates for implementing operators U1 through U5, as well as the output multiplexer and input register clock enable logic. The 74 latches include 2 bits for lambda, 8 each for i, j and n, and 16 each for "d[i,j]", "f[j]" and min. The maximum level of 24 reflects the critical timing path, which runs from a write of "d[i,j]" or "f[j]" to Rule•3 being available. The gates and latches consumed 74 of the 320 CLBs in the chip.

           # Gates   # Latches   Max Level   # CLBs
  PDS2XNF    386        74          24          74

           # CLBs   # Pins   # Nets   Placement Time   Routing Time
  APR        111      775      206       3:55:24         1:14:15

Figure 9. FPGA Compilation Statistics
The automatic place and route (APR) statistics show that 111 CLBs were consumed overall. The extra 37 are used in the "circuit shell" which interconnects the standard, chip-independent PALASM file to the circuit-dependent part. Those CLBs took 3 hours and 55 minutes to place. The 206 nets took 1 hour and 14 minutes to route.
7. Execution

Once the MCS file has been created for the XILINX chip, it can be loaded along with the assembled machine language file for the 8031. Executing the rules of a decision table, compared to an in-line iterative program, adds the overhead of assigning lambda. On the HPC, however, this is offset by the reduced cycle counts for calculating whether the loops should iterate or terminate, because these expressions are computed in data-flow fashion in combinational logic instead of sequentially through an ALU inside the processor. Since the operands used for loop control are captured in FPGA registers as they are written to memory by the processor, we in fact observe a reduction in the overall number of loads and stores required by the processor when the FPGA is used.

The actual load plus store counts for the HPC were not used for this analysis because they would be meaningless on an 8031. Instead, the analysis assumes an optimal RISC processor, where the operands for computing any expression are loaded exactly once for each statement, and all intermediate results are stored in registers. Only data memory loads and stores were counted (i.e., no constants); for example, j := j + 1 counts as one load and one store. No optimizations were assumed amongst groups of statements within a block.

Figure 10 shows the worst-case loads+stores count as the array size n increases from 10 to 80. When n=80, we see that the original Pascal program would require a total of 45,904 loads and stores. When the decision table–FPGA method is used, the load+store count drops 13.68% to 39,625. This improvement results mostly because operands tested for loop control must be fetched upon each loop iteration in the Pascal program, whereas on the HPC they are captured automatically by the FPGA. For loop control on the HPC, only the Rule•3 variable is read, no matter how complex the branch determination equation might be.
[Figure 10. Load-Store Analysis. Worst-case loads+stores versus array size n (10 through 80) for the Pascal program and for the HPC; at n=80 the Pascal curve reaches 45,904 and the HPC curve 39,625.]

Note that the load-store analysis somewhat understates the actual performance improvement we would observe, because it ignores the elimination of the sequential processing previously required by the processor to evaluate the conditional expressions.

8. Conclusion

We have described an application of field programmable gate arrays featured on a prototype parallel computer at the University of Hawaii. The project's goal is an automated system where high-level user programs can be compiled and then executed partly or completely in FPGAs. In this paper, we showed how the conditional logic used for generating a program's jump addresses can be extracted from a user program for "execution" on an FPGA. The design is general purpose in the sense that any computer algorithm can be expressed in decision table form and therefore executed on the HPC, although (as in the case of all architectures, including von Neumann's) possibly inefficiently. We do, however, show how this technique can offer coprocessing without adding the burden of extra memory transactions to load and store data. An overview of the HPC shared memory architecture was presented. A shortest path programming example was used to describe the steps for automatically generating the FPGA source code. We concluded with the results of a load-store analysis comparing execution of the program on an optimal RISC processor with and without the FPGA.

Since this work, we have made further progress in automatically translating programs for execution in their entirety on reprogrammable FPGAs. We have extended our technique and enhanced our XILINX board (to how it is now described in Section 1) to allow all of a program's expressions to be computed in FPGAs, in what we now call functional memory. When all program expressions are calculated in functional memory, the necessary processor instruction set reduces to a simple set of moves and one jump. This allows the processor to be streamlined to less than that which is necessary for RISCs. No working registers or ALU are needed. Memory transactions and cycle counts for entire programs are reduced quite substantially, especially for those with complex expressions. Functional memory, however, still has its limitations. Our research is presently focused on how functional memory can be adapted to a wide range of applications.
References
[1] Wazlowski, M., Agarwal, L., Lee, T., Smith, A., Lam, E., Athanas, P., Silverman, H. and Ghosh, S., "PRISM-II compiler and architecture," Proc. IEEE Workshop on FPGAs for Cust. Comp. Mach., 1993, 9-16.

[2] Guccione, S. and Gonzalez, M., "A data-parallel programming model for reconfigurable architectures," Proc. IEEE Workshop on FPGAs for Cust. Comp. Mach., 1993, 79-87.

[3] Gokhale, M. and Minnich, R., "FPGA computing in a data parallel C," Proc. IEEE Workshop on FPGAs for Cust. Comp. Mach., 1993, 94-101.

[4] Van den Bout, D., "The Anyboard: programming and enhancements," Proc. IEEE Workshop on FPGAs for Cust. Comp. Mach., 1993, 68-77.

[5] Heeb, B. and Pfister, C., "Chameleon: A workstation of a different colour," in Lecture Notes in Comp. Sc. 705, Field Prog. Gate Arrays: Arch. and Tools for Rap. Prot., Grunbacher, H. and Hartenstein, R., eds., Springer-Verlag: Heidelberg, 1993, 152-161.

[6] Lew, A., "On the emulation of flowcharts by decision tables," Communications of the ACM, 25(12), 1982, 895-905.

[7] XILINX, The Programmable Gate Array Data Book – 1992, XILINX Inc., San Jose, CA, 1991.

[8] Lew, A. and Halverson, R., Jr., "Dynamic programming, decision tables, and the Hawaii Parallel Computer," Proc. 5th Bellman Continuum, Jan. 1993 (to appear in Computers and Mathematics with Applications).