A Data Parallel Programming Model

Chapter 7
FIGURE 7.3 Bitstream Cross-Correlation
is incremented. Comparison of the two bits can be performed with exclusive-or or exclusive-nor. At least one of the bitstream data streams moves from cell to cell, but the counter data is stationary, as shown in Figure 7.3.
In this example, stream a is shifted systolically through the array, while stream b is broadcast to each PE. The counter R accumulates the result of the correlation.
The user specifies the size and shape of the processor array by initializing the predefined variables DBC_net, which gives the number of dimensions, and DBC_net_shape, which gives the rank of each dimension. The dbC/Splash 2 compiler currently supports only linear arrays.
The keyword poly indicates the declaration of parallel variables or data types. Integers and logicals in the parallel domain may have arbitrary user-defined bit length. In the example program, the variable a is a one-bit parallel variable, and the counter R, which holds the result, is a 16-bit unsigned parallel integer. The variable b is a normal C variable that is stored on the host.
`
The all keyword indicates that all processors are to participate in the body of the compound statement. The body of the all contains an initialization of the parallel variable a and a sequential for statement. Within the for loop, there are three statements. The first statement initiates host-to-processor communication: the host writes a 1 to processor 0's a. The second updates the counter R. On each processor, the result of a XORed with the least-order bit of b is added into R. Finally, each processor in the linear array shifts its value of a to the right. The final statement of the loop simply reads and prints the value of the counter R from a specific processor (b mod 64). This type of register examination has traditionally been very difficult for programmers to design into Splash 2 programs.
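The semantics of this loop can be sketched in plain sequential C, simulating the 64-PE array on the host (correlate and NPROC are our names for this sketch, not part of dbC):

```c
#define NPROC 64   /* size of the linear PE array */

/* Sequential model of the SIMD loop: each array slot is one PE's copy
 * of the poly variables a (one bit) and R (the 16-bit counter). */
static void correlate(unsigned R[NPROC]) {
    int a[NPROC] = {0};
    for (int p = 0; p < NPROC; p++) R[p] = 0;
    for (int b = 0; b < 128; b++) {
        a[0] = 1;                        /* host writes a 1 to PE 0's a */
        for (int p = 0; p < NPROC; p++)
            R[p] += a[p] ^ (b & 1);      /* a XOR least-order bit of b  */
        for (int p = NPROC - 1; p > 0; p--)
            a[p] = a[p - 1];             /* shift a one PE to the right */
    }
}
```

After the loop, each PE's counter holds the correlation contribution seen at its position in the systolic array; the host reads the counters back one PE at a time.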
`
`7.3 COMPILING FROM dbC TO SPLASH 2
`
Compiling dbC programs for the Splash 2 system occurs in two phases. First, the dbC translator emits sequential C code with embedded parallel instructions. These parallel instructions are three-address memory-to-memory instructions ("Generic SIMD").
`
When a dbC program is compiled for a traditional SIMD machine (CM-2 or TERASYS), the generic SIMD instructions are interpreted by microcode libraries (such as the Paris microcode library for the CM-2). These runtime libraries also support intrinsic functions such as DBC_read_from_proc and DBC_net_send.
`
`Petitioner Microsoft Corporation - Ex. 1066, p. 408
To compile for Splash 2, an additional compilation phase is invoked. A Splash 2 specific phase focuses on the parallel instructions that were created by the previous phase, and assembles from those parallel instructions a specialized SIMD engine in structural and behavioral VHDL.
`
The virtual PEs making up this SIMD engine have an instruction set containing all of the generic SIMD instructions appearing in the generated C code. Thus the compiler synthesizes a different instruction set for each different program.
In addition to constructing a custom SIMD engine for the application on Splash 2, the dbC compiler also generates the host program. This program executes the sequential operations of the dbC program on the host and sends parallel instructions to the Splash 2 Array Board in the sequence specified by the dbC program. The compilation steps are enumerated below.
`
`7.3.1 Creating a Specialized SIMD Engine
`
We demonstrate the steps required to configure Splash 2 as a SIMD machine with the small example of Section 7.2.2.
Phase 1 of the dbC translator does item 1 below. All subsequent steps are performed by the Splash 2 specific phase 2. Starting from the dbC program, the steps are:
`
1. Generation of the Generic SIMD code.

2. Determination of registers and data movement between registers. The data path, rather than being the generalized data path found in general-purpose computers, is customized on a per-program basis.

3. Determination of the control structure, that is, what decoders for instructions are needed and what those decoders must control. The decoders are also customized for the program.
4. Establishment of inter-PE (and inter-chip) data paths and state machines for nearest-neighbor communication.

5. Establishment of inter-PE (and inter-chip) data paths and state machines for global combining operations. The Xi's, X0, and host must synchronize during a global reduce.
6. Generation of:

a) VHDL types for the data types;
b) VHDL SIGNALs for the variables;
c) VHDL control statements for the instruction decode;
d) Appropriate VHDL assignment statements for each of the operators;
e) Port declarations and interconnection to support nearest-neighbor communication and global reduction;
f) Generation of state machines to sequence the multi-tick operations.
`
7. Generation of the host program to perform sequential operations and send parallel instructions to the Splash 2 Array Board.
`
8. Synthesis of VHDL to Xilinx-specific configuration bitstreams, which are downloaded to the chips. This process uses commercial CAD tools.
`
opParMoveZero_1L_a(a.address, 1);
for (b = 0; b < 128; b++) {
    DBC_write_to_proc(a.address, 1, 0, 1);
    opParBxor3C_1L_a(opParAddOffset(_DBC_poly_frame_t_main, 2)
        /* t3:1:2 */, a.address, b, 1);
    opParAdd2_2L_a(R.address,
        opParAddOffset(_DBC_poly_frame_t_main, 2)
        /* t3:1:2 */, 16, 1);
    DBC_net_send(a.address, right, 1);
    printf("%d \n", DBC_read_from_proc(R, b % 64));
} /* end for */

FIGURE 7.4 C + Generic SIMD Code for Correlation
`
`7.3.2 Generic SIMD Code
`
To begin the process of compiling the correlation program for Splash 2, we translate the dbC to sequential C plus calls to Generic SIMD operators. A fragment of the Generic SIMD code for our correlation program is shown in Figure 7.4. Each "function" call prefaced by opPar is a Generic SIMD instruction.
The instruction name describes both function and parameters. The opPar prefix is followed by the operation, for example, MoveZero in the first instruction. Next, many instructions have a number signifying the number of operands, for example the "3" in the opParBxor3C_1L instruction. If one of the operands is a constant, as in the XOR instruction, a C follows. Next, after the underscore, the number of bit lengths that will follow is specified as a number followed by L. A final suffix _a indicates that the operation is to be performed unconditionally on each PE (even if the context bit is reset).
In our example, the MoveZero instruction clears the single-bit parallel variable a. The intrinsic DBC_write_to_proc writes a 1 into processor 0's a. The Bxor3C instruction performs a Boolean XOR of a and the least-order bit of b into a compiler-generated temporary. Next that temp is added into R in the Add2 instruction. The DBC_net_send shifts a from each PE to its right neighbor. Finally, the DBC_read_from_proc reads R from the selected processor (b mod 64).
`
7.3.3 Generating VHDL
`
In the second phase of compilation to Splash 2, the Generic SIMD code is processed by a specialized backend. The Splash 2 specific Phase 2 generates two chip descriptions, which are VHDL programs for computational chips X1-X16 and the control chip X0, respectively. The computational chips hold the SIMD Processing Elements, while the X0 control chip is used for host-PE communications, global combining, and instruction broadcast.
`
In Phase 2, we create an instruction set derived from the opPar commands and intrinsic calls generated by Phase 1. The instruction set for this SIMD engine
`
²The latter is a variable on the host, and therefore a constant from the point of view of the SIMD array.
`
FIGURE 7.5 SIMD Instruction Format (high bits 33 and 32: stop and start multi-tick operation; then the PE#, Operand, and Opcode fields)
`
is customized to the specific instruction instance. For example, the Boolean XOR instruction we synthesize expects the operands to be the variable a and the least-order bit of b and the result to go into t3. Thus there is no need for runtime computation of source and destination, a data path to compute and gain access to arbitrary source and destination, or much of the other complexity that comes with a general-purpose instruction set.
`
The instructions are in a fixed format, shown in Figure 7.5. The least-order eight bits contain the opcode. The next 16 bits, labeled "Operand" in the figure, contain an immediate value, if required by the instruction. For example, our XOR instruction, the opParBxor3C_1L_a, requires a constant to be passed as one of the operands to the operation. In the example (see Figure 7.4), the current value of b is the second operand of the XOR, and thus gets passed to each PE through the Operand field.
The third field of the instruction, PE#, is an optional processor number used for those instructions that are to be executed only by specific single processors. In our cross-correlation program, for example, the DBC_read_from_proc instruction reads the result from a different processor on each iteration of the loop. The generated instruction therefore writes the current value of b mod 64 in the PE# field. The final high-order bits are used to synchronize between X0 and the host in multi-tick operations.
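Under the field layout just described, an instruction word can be sketched as a packing function. The exact bit positions of the PE# field and the multi-tick bits are our reading of Figure 7.5, so treat them as assumptions:

```c
#include <stdint.h>

/* Sketch of the fixed SIMD instruction format of Figure 7.5.
 * Field positions for PE# and the multi-tick bits are assumptions
 * read off the figure, not a verified layout. */
#define OPCODE_SHIFT   0   /* bits 7..0  : opcode             */
#define OPERAND_SHIFT  8   /* bits 23..8 : immediate operand  */
#define PE_SHIFT      24   /* bits 31..24: target PE number   */
#define START_SHIFT   32   /* start multi-tick operation      */
#define STOP_SHIFT    33   /* stop multi-tick operation       */

static uint64_t simd_instr(uint8_t opcode, uint16_t operand, uint8_t pe) {
    return ((uint64_t)opcode  << OPCODE_SHIFT)
         | ((uint64_t)operand << OPERAND_SHIFT)
         | ((uint64_t)pe      << PE_SHIFT);
}
```

The read-from-processor instruction of the example would then be simd_instr(3, 16, b % 64): opcode 3, operand 16 (the bit length), and the selected PE number.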
Figure 7.6 outlines the interaction between the host and the generated SIMD engine. The controlling program on the host executes a sequence of instructions, some of which are executed locally on the host and some of which are sent as commands to the SIMD engine. In our example, the loop control of the for-loop is done on the
FIGURE 7.6 The Generated SIMD Engine for Cross-Correlation
`
`
FIGURE 7.7 A Single Computational Chip
`
host. In the body of the loop, the parallel instructions are broadcast to the SIMD engine, with each PE (0-63) operating independently on its data.
Figure 7.7 shows the layout of each computational chip. Multiple PEs are instantiated on each chip, with the number of PEs per chip being determined by the user-specified size of the processor array. In this example, there are a total of 64 PEs, so four are placed on each of the 16 chips. Instructions are sent through the XL FIFO to X0, which broadcasts them over the crossbar to each chip. At the chip, the instruction is decoded and sent to all the SIMD PEs on the chip, along with the Operand and PE#.
This program contains a call to the intrinsic DBC_read_from_proc, which returns to the host the value of R on a specific PE. We use a special form of the "reduce logic" (see Section 7.4 on Global Operations) to implement this instruction. The figure shows that the SIMD instruction ("ibus") comes into the chip on the XP_XBAR port. There the opcode is decoded, and the decoded opcode, along with the other fields, is passed to each SIMD PE. The SIMD processors contain logic to execute the instructions. In addition, for this correlation program, they are connected to each other linearly through the "communicate bus" over which the value of a is shifted right. The "reduce data" shown flowing out of each SIMD PE is the value of R, which is read from successive PEs and sent to the host. The serializers and reduce logic are explained in Section 7.4. The value of R from the selected PE is sent two bits at a time out the XP_GOR and XP_GOR_VALID lines to X0.
`
`
FIGURE 7.8 The Control Chip
`
Figure 7.8 shows the layout of the control chip X0. The 36-bit instruction word comes in on X0_SIMD from the SPARC host. X0's instruction decoder and multi-tick decoder are identical to those of the computational chips. The reduce logic is similar to that of the computational chips, with the major difference being that the data enters bit-serially on the X0_GOR_RESULT_IN and X0_GOR_VALID_IN lines and goes to the host bit-serially on X0_GOR_RESULT and X0_GOR_VALID.
`
The final step is to generate the host program. Each opPar instruction is replaced with a Splash 2 specific instruction, as shown in Figure 7.9. A SPLASH_INSTRUCTION simply writes the parameter to the opcode field of the SIMD Bus and then steps the clock, which issues the instruction to X0. The SPLASH_W_INSTRUCTION writes an 8 to the opcode field, a 1 to operand, and a 0 to the PE# field. The SPLASH_RP_INSTRUCTION writes a 3 to opcode, a 16 to operand (the bit length, which is required to control the reduce logic), and the current value of i modulo 64 to the PE#. Note that the modular reduction is performed on the host, and the result of the mod is sent to the processor array. The DBC_net_send instruction is broken into two parts: the first copies a to the communicate output port, and the second copies the communicate input port into a. The nearest-neighbor communication is explained further in the next section.
`
`
SPLASH_INSTRUCTION(9);      /* OPPARMOVEZERO_1L */
for (i = 0; i < 128; i++) {
    SPLASH_W_INSTRUCTION(8 /* OPPARWRITETOPROC */, 1, 0);
    SPLASH_INSTRUCTION(7);  /* OPPARBXOR3C_1L */
    SPLASH_INSTRUCTION(6);  /* OPPARADD2_2L */
    SPLASH_INSTRUCTION(5);  /* OPPARNETSEND1 */
    SPLASH_INSTRUCTION(5);
    SPLASH_INSTRUCTION(5);
    SPLASH_INSTRUCTION(4);  /* OPPARNETSEND2 */
    printf("%d \n", SPLASH_RP_INSTRUCTION(3, 16
        /* OPPARREADFROMPROC */, (i % 64)));
}

FIGURE 7.9 Final C Program for Correlation
`
`7.4 GLOBAL OPERATIONS
`
The data parallel model encompasses a number of global operations in which all (active) Processing Elements participate. On Splash 2, we have implemented two classes of data parallel operations:

- DBC_net_send, a nearest-neighbor linear communication pattern in which each PE sends a value to its right neighbor. PE 0 receives its value from the host through the Operand field of the instruction.

- Reduce operators MAX, MIN, AND, OR, and XOR, which perform the indicated operation across the entire virtual PE array and send the result back to the host.
`
`7.4.1 Nearest-Neighbor Communication
`
Left-to-right communication is accomplished with structural connections between virtual processors. Each PE has a left and right port. The width of the communication ports is defined at compile time. These ports are hard-wired together so that the right port of processor i and the left port of processor i + 1 share a register. An exception to this is on the Xilinx chip boundaries. The Splash 2 linear interconnect is used for chip-to-chip communication. On each chip, the left port of the first virtual processor on the chip and the right port of the last virtual processor on the chip are connected to XP_LEFT and XP_RIGHT, respectively.
Communication of a value from a PE to its neighbor on the same chip requires one cycle. However, the need for inter-chip communication introduces delay. Since we latch data on chip boundaries, the inter-chip communication from the last PE on chip k to the first PE on chip k + 1 requires three cycles.³
To accommodate these differences, we separate the net send instruction into two parts, a write and a read. As shown in Figure 7.9, the NETSEND1, which writes the value to the communicate output port, is dispatched three times. Then the NETSEND2 is sent. This instruction latches the input communicate port into the receiving register.
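A sequential sketch of this two-part protocol follows. The model of two latch stages per chip boundary is our assumption, chosen so that a boundary crossing takes the three cycles stated above; all names are illustrative:

```c
#define NPE         64
#define PE_PER_CHIP 4
#define NCHIP       (NPE / PE_PER_CHIP)

/* Per-PE communicate-out registers plus two assumed latch stages on
 * each chip's right edge, making a boundary crossing take three cycles. */
static int comm_out[NPE];
static int latch1[NCHIP], latch2[NCHIP];

/* NETSEND1: clock the boundary latches, then write a to the output port. */
static void netsend1(const int a[NPE]) {
    for (int c = 0; c < NCHIP; c++) {
        latch2[c] = latch1[c];
        latch1[c] = comm_out[c * PE_PER_CHIP + PE_PER_CHIP - 1];
    }
    for (int p = 0; p < NPE; p++) comm_out[p] = a[p];
}

/* NETSEND2: latch each PE's input port into the receiving register. */
static void netsend2(int a[NPE]) {
    for (int p = NPE - 1; p > 0; p--)
        a[p] = (p % PE_PER_CHIP == 0)        /* first PE on a chip reads  */
             ? latch2[p / PE_PER_CHIP - 1]   /* across the boundary latches */
             : comm_out[p - 1];              /* same-chip neighbor */
}
```

Dispatching netsend1 three times walks the value through both boundary latches, after which a single netsend2 completes the shift for every PE at once.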
`
³A possible alternative, to synchronize the data with a three-cycle pipeline on the internal connections, is too costly in logic.
`
`
7.4.2 Reduction Operations
`
The notion of global combining, having all the PEs work together to produce a single result, is a key component of the SIMD processing model as defined by Hillis [7]. dbC has six primitive global reduction operators: MAX, MIN, SUM, AND, OR, and XOR. There is infix notation for each of these operators. For example, in
`
poly int a;
int b;
b = >?= a;

the operator >?= signifies max reduce.
The programmer can invoke a reduce directly by using the special operators. Combining operators are also generated by the compiler when parallel control constructs such as parallel-if or parallel-while are used [12]. In addition, combining logic is generated for the DBC_read_from_proc intrinsic.
`Global combining operations on traditional SIMD machines require a large
`amount of communication between the processors. This requirement is especially
`difficult for the Splash 2 implementation in that the crossbar, which could be used
`for inter-chip communication, is engaged in broadcasting instructions. Any other use
`of the crossbar interferes with the instruction pipeline. For this reason, we use the
`GLOBAL_OR lines from each computational chip to X0 and from X0 to the host to
`compute a “reduce” result bit-serially and send it to the host.
Conceptually, global reduce operations on Splash 2 are performed in two stages. In the first stage, an intermediate result is computed for all of the PEs on a computational chip. These local results are transmitted to the control chip, and the second stage computes the global result, which is transmitted to the host. However, this two-stage computation occurs bit-serially and the stages are heavily pipelined.
When a PE receives a global reduce instruction, the PE sends its register data to a serializer component and its context to a reduce_context signal (refer to Figure 7.7). On the next cycle, the serializer shifts the data to the reduce components at a rate of two bits per cycle. The serial data is masked with the reduce_context signal. Each of the reduce components collects the data from the serializers and performs the appropriate reduction on the bits. The internals of the AND, OR, and XOR are trivial. MIN and MAX are discussed in the next section. The control logic for the reduce components (Reduce Control) selects the 2-bit result from the correct reduce component and sends it to the X0 chip via XP_GOR and XP_GOR_VALID connections.
On the control chip, X0, the 2-bit results from each of the 16 computational chips are collected on X0_GOR_RESULT_IN and X0_GOR_VALID_IN (refer to Figure 7.8). Each of the reduce components collects the data and again performs the appropriate reduction on the bits. The control logic selects the 2-bit result from the correct reduce component and sends it to the host via the X0_GOR_VALID and X0_GOR_RESULT connections.
`
There are many advantages to performing global combining operations bit-serially on Splash 2. One advantage is that the approach is easy to understand and implement. An obvious bit-serial path is available (and otherwise unused) in the form of the GLOBAL_OR lines. The instruction pipeline is not disturbed in order to perform a global reduction. Another advantage is that the control logic required to
`
drive the reduce is very fast and quite small. Since the same reduce logic is reused every cycle, the speed and size of the logic is largely independent of the size of the registers to be reduced. The size of the logic is a function of the number of PEs per chip and the kinds of reduces to be performed. An unexpected advantage of the bit-serial approach to global combining operations is that the reduce operation takes relatively few cycles to complete. A bit-serial global reduction requires ⌈w/2⌉ + 13 cycles,⁴ where w is the width of the destination register in bits. Thus, an 8-bit reduce of 64 processors requires 17 cycles. A 16-bit reduce takes 21 cycles.
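The cycle count is easy to check (the helper name is ours):

```c
/* Cycles for a bit-serial global reduction of a w-bit register:
 * two bits per cycle through the serializers, plus a 13-cycle fixed
 * overhead (which includes the four-cycle instruction pipeline). */
static int reduce_cycles(int w) {
    return (w + 1) / 2 + 13;   /* integer ceil(w/2) + 13 */
}
```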
`
The MIN and MAX Global Combining Operations. A bit-serial approach to MIN and MAX global combining operations requires more thought than the AND, OR, and XOR operations. Nevertheless, the solution becomes evident if the data are processed most-significant-bit (msb) first. All that is required is one bit of state per processor to keep track of which processors are still participating in the computation. Let us first consider the generic problem of finding the maximum value in an array bit-serially.
`
Bit-serial MAX. Our method of performing MAX reduce uses an algorithm that computes one binary digit of output per cycle, starting with the most significant bit (msb). The number of cycles is determined by the width of the variable to be reduced. The input at cycle i consists of the ith msb of the register from each of the processors. A mask register (one bit per processor) is maintained to determine which processors should be ignored in subsequent cycles. The algorithm has a few simple steps:
`
1. The processor mask register is initialized to all 1s.

2. For each bit of the register:
   a) The processor window register contains the ith most significant bit from each processor.
   b) The window and mask registers are ANDed together.
   c) If the result is nonzero, the mask is set to the result and a 1 is output.
   d) Otherwise, the mask is left unchanged and a 0 is output.
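The steps above can be sketched in C as a host-side model (bitserial_max is our name, not part of the Splash 2 runtime):

```c
/* Bit-serial MAX reduce over n w-bit values, msb first.
 * mask holds one participation bit per processor. */
static unsigned bitserial_max(const unsigned vals[], int n, int w) {
    unsigned long mask = (1UL << n) - 1;   /* step 1: all processors active */
    unsigned result = 0;
    for (int bit = w - 1; bit >= 0; bit--) {
        unsigned long window = 0;          /* ith msb from each processor   */
        for (int p = 0; p < n; p++)
            window |= ((vals[p] >> bit) & 1UL) << p;
        unsigned long wm = window & mask;  /* step 2b: window AND mask      */
        if (wm != 0) {
            mask = wm;                     /* drop processors with a 0 here */
            result = (result << 1) | 1;    /* output a 1                    */
        } else {
            result <<= 1;                  /* output a 0, mask unchanged    */
        }
    }
    return result;
}
```

Each iteration emits one output bit, so a w-bit reduction finishes in w steps, matching the serial behavior described above.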
`
Example

The following example traces a four-processor system with 6-bit registers. The processors' registers contain the following values:

Processor   Decimal   Binary representation
p0          6         000110
p1          9         001001
p2          10        001010
p3          11        001011
`
⁴This includes a four-cycle instruction pipeline.
`
`
The mask is initialized to all 1s for the first iteration. On subsequent cycles, the Window and Mask registers change as follows (window and mask bits are shown for p0 p1 p2 p3):
`
`
`Cycle
`
`Bit
`
`Window
`Mask
`Window AND Mask
`
`130131132133
`
`Output
`
`1
`2
`3
`4
`5
`6
`
`5
`4
`3
`2
`l
`0
`
`0000
`0000
`0111
`1000
`l01]
`0101
`
`0
`0000
`1111
`0
`0000
`1111
`l
`01“
`1111
`0
`0000
`0111
`l
`0011
`0111
`
`
`00010011 l
`
`The output bit-serially generated is 001011 (decimal = 11).
`
Bit-serial MIN. The algorithm for finding the minimum value bit-serially is virtually identical to finding the maximum value. In essence, we find the maximum value of the one's-complement of the poly register, and the one's-complement of the result is the answer. The bit-serial reduce MAX is modified in two simple ways to get reduce MIN: the serial input (window) and the serial output are inverted.
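The inversion trick can be sketched in C as a self-contained host-side model (names are ours):

```c
/* Bit-serial MIN reduce: run the MAX algorithm on inverted input bits
 * and invert the output bits, restricting everything to w bits. */
static unsigned bitserial_min(const unsigned vals[], int n, int w) {
    unsigned long mask = (1UL << n) - 1;   /* all processors active */
    unsigned result = 0;
    for (int bit = w - 1; bit >= 0; bit--) {
        unsigned long window = 0;
        for (int p = 0; p < n; p++)        /* inverted serial input  */
            window |= (~(vals[p] >> bit) & 1UL) << p;
        unsigned long wm = window & mask;
        unsigned out = 0;
        if (wm != 0) { mask = wm; out = 1; }
        result = (result << 1) | (out ^ 1); /* inverted serial output */
    }
    return result;
}
```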
The following example, as in the MAX example, uses the four-processor system with 6-bit registers. The processors' registers contain the values 6, 9, 10, and 11.
The mask is initialized to all 1s for the first iteration. On subsequent cycles, the Window and Mask registers change as follows (window and mask bits are shown for p0 p1 p2 p3):

Cycle   Bit   Window*   Mask   Window AND Mask   Output*
1       5     1111      1111   1111              0
2       4     1111      1111   1111              0
3       3     1000      1111   1000              0
4       2     0111      1000   0000              1
5       1     0100      1000   0000              1
6       0     1010      1000   1000              0

*Note that the bits are simply inverted from the reduce MAX example.

The output bit-serially generated is 000110 (decimal = 6), which is indeed the minimum value.
`
7.4.3 Host/Processor Communication
`
Four dbC intrinsics are available for communication between the host and the processor array. DBC_read_from_proc reads a parallel variable from a specific processor. DBC_read_from_all reads a parallel value from each processor into an array in the host. DBC_write_to_proc writes a value from the host to a specific processor. Finally, DBC_write_to_all spreads a host array onto the virtual processor array, one element per processor.
`
`
The DBC_read_from_proc Operation. The DBC_read_from_proc intrinsic is implemented on Splash 2 as a modified reduce OR. The host specifies the PE to be read via the PE# field of the SIMD instruction. The PE Select component compares the PE# field to the processor ID (iproc) of each PE (see Figure 7.7). If the iproc and PE# are equal, the pe_selected signal is set to a 1. The DBC_read_from_proc call causes the register to be passed to the serializer, as with any global reduce operation. However, the reduce_context signal is set to pe_selected in place of the PE's context bit. This effectively masks out all PEs except for the PE selected by the host.
`
This approach to DBC_read_from_proc has the advantage that the logic required for the operation is negligible. If a global combining operation such as or_reduce is used by the design, that logic is recycled by DBC_read_from_proc.
`
`
`
The DBC_read_from_all Operation. The DBC_read_from_all intrinsic copies all of the values of a poly register held by PEs to an array on the host. This function could be implemented as n iterations of DBC_read_from_proc (where n is the number of PEs). However, we chose a more efficient method. The Splash 2 implementation uses n iterations of left-to-right communicate to get the poly registers to the host. The last PE's right port is connected to XP_RIGHT on the computational chip (refer to Figure 7.7). The XP_RIGHT port of the last computational chip in the Splash 2 linear path is connected to the host via the RBUS (see Figure 2.1). As the host issues n iterations of left-to-right net send calls, it reads data from the RBUS. The data appear on the RBUS in reverse order (n, n - 1, n - 2, ...). Consequently, the host fills the destination array backwards. As described in Section 7.4.1, a DBC_net_send instruction requires four cycles. Utilizing the net send, the DBC_read_from_all intrinsic requires 4n cycles to complete. This is much more efficient than n iterations of a 13+ cycle DBC_read_from_proc.
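The backward fill on the host side can be sketched as follows. The data source is parameterized because we do not know the actual RBUS access call; everything here is illustrative:

```c
/* Host-side pattern for DBC_read_from_all: n net-send shifts push the
 * poly registers off the right end of the array, so values arrive in
 * reverse order and the destination array is filled backwards.
 * next_value stands in for reading the RBUS (illustrative only). */
static void read_from_all(unsigned dest[], int n,
                          unsigned (*next_value)(void)) {
    for (int i = 0; i < n; i++)
        dest[n - 1 - i] = next_value();   /* fill backwards */
}

/* Demo source: four PEs holding 10,11,12,13; the rightmost PE's
 * value (13) reaches the RBUS first. */
static unsigned demo_vals[4] = {10, 11, 12, 13};
static int demo_idx = 3;
static unsigned demo_next(void) { return demo_vals[demo_idx--]; }
```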
`
`
Writing to the Processor Array. The DBC_write_to_proc implementation on Splash 2 is quite simple due to the fact that both the instruction and the data flow in the same direction, from host to PE array. As described in Section 7.3.3, the SIMD instruction generated by the host has three fields: opcode, operand, and PE#. The operand field contains the data to be written, and the PE# field contains the ID of the destination PE. If the PE# field of the SIMD instruction matches the iproc of the processor, the register is assigned the value of the operand field. Otherwise, the instruction is ignored. A DBC_write_to_proc call takes only one cycle.
The DBC_write_to_all is implemented as a series of n DBC_write_to_proc calls, where n is the number of PEs. This operation requires n cycles.
`
`7.5 OPTIMIZATION: MACRO INSTRUCTIONS
`
Our simple cross-correlation program has approximately 10 instructions. A more realistic application would result in many tens more. It is advantageous to reduce the number of parallel instructions for a variety of reasons. Fewer instructions require a smaller decoder, a savings in logic. By scheduling independent SIMD operations at the same clock, we introduce new instruction-level parallelism within a SIMD PE. Even more compelling, a single powerful multi-tick instruction can be clocked
`
`
`
independently at the processor array, allowing the Splash 2 system to run at a faster rate than the host SPARCstation can drive it. Our compiler, therefore, identifies opportunities for multi-tick operations and synthesizes multi-tick instructions, which are activated by a single opcode.
One such category of multi-tick operations is the reduce family of instructions discussed in the previous section. Another category, which we describe here, is parallel basic blocks, from which the compiler creates "macro instructions." A single macro instruction dispatched from the host initiates a multi-tick instruction in which one or more generic SIMD operations occur concurrently on each PE.
`
`7.5.1 Creating a Macro Instruction
`
A basic block consists of a sequence of computation with a single entry point at the top of the block, a single exit at the bottom, and no branching into or out of the block except through the single entry and exit. Figure 7.10 shows a basic block written in dbC and the corresponding generic SIMD code.
The macro instruction scheduler attempts to schedule all the operations in the block to occur in a single clock tick. This is possible only if there are no interoperation dependencies. For example, instructions (1) and (2) in Figure 7.10 are independent and can safely occur in a single clock tick, but instruction (4) depends on the completion of instruction (3). Only after t3 is registered can the add occur.
#define N 16

poly unsigned u:N, v:N, w:N, k:N;

u = DBC_iproc[0 +: N];
v = (poly unsigned:N) (DBC_nproc + 1) - DBC_iproc[0 +: N];
k = v - 1;
w = u | v | k;

(1) opParMove_1L_a(u.address, opParAddOffset(DBC_iproc.address, (0)), 16);
(2) opParMoveC_1L_a(opParAddOffset(_DBC_poly_frame_t_main, 1)
        /* t3:16:1 */, (DBC_nproc + 1), 16);
(3) opParSub3_1L_a(v.address, opParAddOffset(_DBC_poly_frame_t_main, 1)
        /* t3:16:1 */, opParAddOffset(DBC_iproc.address, (0)), 16);
(4) opParSub3C_1L_a(k.address, v.address, 1, 16);
(5) opParBor3_1L_a(opParAddOffset(_DBC_poly_frame_t_main, 1)
        /* t5:16:1 */, u.address, v.address, 16);
(6) opParBor3_1L_a(w.address, opParAddOffset(_DBC_poly_frame_t_main, 1)
        /* t5:16:1 */, k.address, 16);

FIGURE 7.10 A Basic Block and Its Translation to Generic SIMD
`
`
FIGURE 7.11 Dependency Graph
`
For each basic block, the scheduler constructs a dependency matrix M to reflect the dependencies among parallel instructions. M[i, j] = 1 implies that instruction (j) depends on instruction (i). Then, an As Soon As Possible (ASAP) scheduling algorithm is used to sequence the parallel instructions. All instructions j such that M[*, j] = 0 can be scheduled in the current tick. Once an instruction j has been scheduled, M[j, *] is set to 0, allowing the instructions that depend on j to be scheduled in the next iteration of the algorithm. Figure 7.11 shows the dependency graph for this example.
The compiler generates a single opcode for the macro instruction. When that opcode is issued, a subinstruction shift register is used to sequence through the subinstructions. For this example, a 4-tick instruction is issued. The sequencing of ticks is controlled by a 4-bit shift register.
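The ASAP pass can be sketched in C. The specific dependency edges used in the usage note below are our reading of the Figure 7.10 block (t3 feeds (3), v feeds (4) and (5), t5 and k feed (6)); function and array names are ours:

```c
#define NI 6   /* number of parallel instructions in the block */

/* ASAP schedule: tick[j] receives the 1-based tick assigned to
 * instruction j. M[i][j] = 1 means instruction j depends on i.
 * Returns the number of ticks used. Assumes the graph is acyclic. */
static int asap_schedule(int M[NI][NI], int tick[NI]) {
    int t = 0, scheduled = 0;
    for (int j = 0; j < NI; j++) tick[j] = 0;
    while (scheduled < NI) {
        t++;
        /* collect every unscheduled instruction with no pending deps */
        int ready[NI], nready = 0;
        for (int j = 0; j < NI; j++) {
            if (tick[j]) continue;
            int pending = 0;
            for (int i = 0; i < NI; i++) pending |= M[i][j];
            if (!pending) ready[nready++] = j;
        }
        /* schedule them in this tick, then clear their rows */
        for (int r = 0; r < nready; r++) {
            int j = ready[r];
            tick[j] = t;
            scheduled++;
            for (int k = 0; k < NI; k++) M[j][k] = 0;
        }
    }
    return t;
}
```

With the assumed edges, instructions (1) and (2) land in tick 1, (3) in tick 2, (4) and (5) in tick 3, and (6) in tick 4, matching the 4-tick macro instruction described above.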
`
`7.5.2 Discussion
`
Our approach differs from more general high-level synthesis systems in two respects. First, since we focus on the SIMD model, control flow is managed by the host. We are concerned only with basic blocks of parallel instructions and need not build and schedule a general control-flow graph as is done by the IBM [2] and similar systems. Second, in contrast to most high-level synthesis systems that synthesize logic for a single chip, we focus on synthesis of the entire FPGA-based parallel computing system. Our efforts are directed toward synthesis for the Splash logic array, of which generating logic for individual chips is one (important) part. We use a commercial FPGA compiler to further optimize the VHDL generated by our system.
`
`7.6 EVALUATION: GENETIC DATABASE SEARCH
`
Applications involving search for similarity in genome strings have been mapped successfully to Splash 1 [4] and Splash 2 [8]. We have compiled a dbC version of this application for Splash 2. A source stream is stored across