FIGURE 7.3 Bitstream Cross-Correlation
is incremented. Comparison of the two bits can be performed with exclusive-or or exclusive-nor. At least one of the bitstream data streams moves from cell to cell, but the counter data is stationary, as shown in Figure 7.3.
In this example, stream a is shifted systolically through the array, while stream b is broadcast to each PE. The counter R accumulates the result of the correlation.
The user specifies the size and shape of the processor array by initializing the predefined variables DBC_net, which gives the number of dimensions, and DBC_net_shape, which gives the rank of each dimension. The dbC/Splash 2 compiler currently supports only linear arrays.
The keyword poly indicates the declaration of parallel variables or data types. Integers and logicals in the parallel domain may have arbitrary user-defined bit length. In the example program, the variable a is a one-bit parallel variable, and the counter R, which holds the result, is a 16-bit unsigned parallel integer. The variable b is a normal C variable that is stored on the host.

The all keyword indicates that all processors are to participate in the body of the compound statement. The body of the all contains an initialization of the parallel variable a and a sequential for statement. Within the for loop, there are three statements. The first statement initiates host-to-processor communication: the host writes a 1 to processor 0's a. The second updates the counter R. On each processor, the result of a XORed with the least-order bit of b is added into R. Finally, each processor in the linear array shifts its value of a to the right. The final statement of the loop simply reads and prints the value of the counter R from a specific processor (b mod 64). This type of register examination has traditionally been very difficult for programmers to design into Splash 2 programs.
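The per-iteration behavior just described can be emulated on an ordinary workstation with the short C program below. This is purely an illustration of the example's semantics, reconstructed from the description above and the generated code in Figure 7.4; it is neither the dbC source nor anything the compiler emits, and the way PE 0's a is refilled after the shift is deliberately simplified.

#include <stdio.h>

#define NPROC 64     /* virtual PEs in the example            */
#define NBITS 128    /* iterations of the correlation loop    */

int main(void)
{
    unsigned char  a[NPROC] = {0};   /* one-bit parallel variable a, one per PE */
    unsigned short R[NPROC] = {0};   /* 16-bit parallel counter R, one per PE   */

    for (int b = 0; b < NBITS; b++) {
        a[0] = 1;                            /* host writes a 1 to processor 0's a     */

        for (int p = 0; p < NPROC; p++)      /* each PE adds (a XOR lsb of b) into R   */
            R[p] += (unsigned short)(a[p] ^ (b & 1));

        for (int p = NPROC - 1; p > 0; p--)  /* each PE shifts its a to the right      */
            a[p] = a[p - 1];
        a[0] = 0;                            /* PE 0 is refilled by the host next pass */

        printf("%d\n", R[b % NPROC]);        /* host reads R from processor b mod 64   */
    }
    return 0;
}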
7.3 COMPILING FROM dbC TO SPLASH 2
Compiling dbC programs for the Splash 2 system occurs in two phases. First, the dbC translator emits sequential C code with embedded parallel instructions. These parallel instructions are three-address memory-to-memory instructions ("Generic SIMD").

When a dbC program is compiled for a traditional SIMD machine (CM-2 or TERASYS), the generic SIMD instructions are interpreted by microcode libraries (such as the Paris microcode library for the CM-2). These runtime libraries also support intrinsic functions such as DBC_read_from_proc and DBC_net_send.
To compile for Splash 2, an additional compilation phase is invoked. A Splash 2 specific phase focuses on the parallel instructions that were created by the previous phase, and assembles from those parallel instructions a specialized SIMD engine in structural and behavioral VHDL.

The virtual PEs making up this SIMD engine have an instruction set containing all of the generic SIMD instructions appearing in the generated C code. Thus the compiler synthesizes a different instruction set for each different program.
In addition to constructing a custom SIMD engine for the application on Splash 2, the dbC compiler also generates the host program. This program executes the sequential operations of the dbC program on the host and sends parallel instructions to the Splash 2 Array Board in the sequence specified by the dbC program. The compilation steps are enumerated below.
7.3.1 Creating a Specialized SIMD Engine

We demonstrate the steps required to configure Splash 2 as a SIMD machine with the small example of Section 7.2.2. Phase 1 of the dbC translator does item 1 below. All subsequent steps are performed by the Splash 2 specific phase 2. Starting from the dbC program, the steps are:
1. Generation of the Generic SIMD code.

2. Determination of registers and data movement between registers. The data path, rather than being the generalized data path found in general-purpose computers, is customized on a per-program basis.

3. Determination of the control structure, that is, what decoders for instructions are needed and what those decoders must control. The decoders are also customized for the program.

4. Establishment of inter-PE (and inter-chip) data paths and state machines for nearest-neighbor communication.

5. Establishment of inter-PE (and inter-chip) data paths and state machines for global combining operations. The Xi's, X0, and host must synchronize during a global reduce.

6. Generation of:
   a) VHDL types for the data types;
   b) VHDL SIGNALS for the variables;
   c) VHDL control statements for the instruction decode;
   d) Appropriate VHDL assignment statements for each of the operators;
   e) Port declarations and interconnection to support nearest-neighbor communication and global reduction;
   f) State machines to sequence the multi-tick operations.

7. Generation of the host program to perform sequential operations and send parallel instructions to the Splash 2 Array Board.

8. Synthesis of VHDL to Xilinx-specific configuration bitstreams, which are downloaded to the chips. This process uses commercial CAD tools.
opParMoveZero_1L_a (a.address, 1);
for (b = 0; b < 128; b++) {
    DBC_write_to_proc(a.address, 1, 0, 1);
    opParBxor3c_1L_a (opParAddOffset(_DBC_poly_frame_t_main, 2)
        /* t3:1:2 */, a.address, b, 1);
    opParAdd2_2L_a (R.address,
        opParAddOffset(_DBC_poly_frame_t_main, 2)
        /* t3:1:2 */, 16, 1);
    DBC_net_send(a.address, a, right, 1);
    printf("%d \n", DBC_read_from_proc(R, b%64));
}   /* end for */

FIGURE 7.4 C + Generic SIMD Code for Correlation
7.3.2 Generic SIMD Code

To begin the process of compiling the correlation program for Splash 2, we translate the dbC to sequential C plus calls to Generic SIMD operators. A fragment of the Generic SIMD code for our correlation program is shown in Figure 7.4. Each "function" call prefaced by opPar is a Generic SIMD instruction.
The instruction name describes both function and parameters. The opPar prefix is followed by the operation, for example, MoveZero in the first instruction. Next, many instructions have a number signifying the number of operands, for example the "3" in the opParBxor3c_1L instruction. If one of the operands is a constant, as in the XOR instruction, a c follows. Next, after the underscore, the number of bit lengths that will follow is specified as a number followed by L. A final suffix _a indicates that the operation is to be performed unconditionally on each PE (even if the context bit is reset).
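For reference, the convention can be read off the XOR instruction of Figure 7.4 as follows (an illustrative annotation only, restating the rules above):

/*
 * opParBxor3c_1L_a  --  decomposition of the name:
 *
 *   opPar   Generic SIMD instruction prefix
 *   Bxor    the operation (Boolean exclusive-or)
 *   3       number of operands
 *   c       one of the operands is a constant
 *   _1L     one bit-length argument follows the operands
 *   _a      unconditional: performed on each PE even if the context bit is reset
 */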
In our example, the MoveZero instruction clears the single-bit parallel variable a. The intrinsic DBC_write_to_proc writes a 1 into processor 0's a. The Bxor3c instruction performs a Boolean XOR of a and the least-order bit of b² into a compiler-generated temporary. Next that temp is added into R in the Add2 instruction. The DBC_net_send shifts a from each PE to its right neighbor. Finally, the DBC_read_from_proc reads R from the selected processor (b mod 64).
7.3.3 Generating VHDL

In the second phase of compilation to Splash 2, the Generic SIMD code is processed by a specialized backend. The Splash 2 specific Phase 2 generates two chip descriptions, which are VHDL programs for computational chips X1-X16 and the control chip X0, respectively. The computational chips hold the SIMD Processing Elements, while the X0 control chip is used for host-PE communications, global combining, and instruction broadcast.
In Phase 2, we create an instruction set derived from the opPar commands and intrinsic calls generated by Phase 1. The instruction set for this SIMD engine is customized to the specific instruction instance. For example, the Boolean XOR instruction we synthesize expects the operands to be the variable a and the least-order bit of b and the result to go into t3. Thus there is no need for runtime computation of source and destination, a data path to compute and gain access to arbitrary source and destination, or much of the other complexity that comes with a general-purpose instruction set.

² The latter is a variable on the host, and therefore a constant from the point of view of the SIMD array.

FIGURE 7.5 SIMD Instruction Format
The instructions are in a fixed format, shown in Figure 7.5. The least-order eight bits contain the opcode. The next 16 bits, labeled "Operand" in the figure, contain an immediate value, if required by the instruction. For example, our XOR instruction, the opParBxor3c_1L_a, requires a constant to be passed as one of the operands to the operation. In the example (see Figure 7.4), the current value of b is the second operand of the XOR, and thus gets passed to each PE through the Operand field.
The third field of the instruction, PE#, is an optional processor number used for those instructions that are to be executed only by specific single processors. In our cross-correlation program, for example, the DBC_read_from_proc instruction reads the result from a different processor on each iteration of the loop. The generated instruction therefore writes the current value of b mod 64 in the PE# field. The final high-order bits are used to synchronize between X0 and the host in multi-tick operations.
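A host-side sketch of packing such an instruction word is shown below. Only the opcode and Operand widths are stated explicitly in the text; the 8-bit PE# field and the positions of the two multi-tick control bits are assumptions read off Figure 7.5, so the exact shift amounts should be treated as illustrative.

#include <stdint.h>
#include <assert.h>

typedef uint64_t simd_word_t;   /* 36 significant bits */

/* opcode in the least-order 8 bits, 16-bit immediate Operand above it,
 * then the PE# field, then start/stop multi-tick bits (assumed widths). */
static simd_word_t pack_simd(unsigned opcode, unsigned operand,
                             unsigned pe, int start_mt, int stop_mt)
{
    assert(opcode  < (1u << 8));
    assert(operand < (1u << 16));
    assert(pe      < (1u << 8));

    return  (simd_word_t)opcode
          | ((simd_word_t)operand << 8)
          | ((simd_word_t)pe      << 24)
          | ((simd_word_t)(start_mt ? 1 : 0) << 32)
          | ((simd_word_t)(stop_mt  ? 1 : 0) << 33);
}

/* Example: the read-from-proc instruction of Figure 7.9 would pack as
 * pack_simd(3, 16, b % 64, 0, 0).                                      */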
Figure 7.6 outlines the interaction between the host and the generated SIMD engine. The controlling program on the host executes a sequence of instructions, some of which are executed locally on the host and some of which are sent as commands to the SIMD engine. In our example, the loop control of the for-loop is done on the host. In the body of the loop, the parallel instructions are broadcast to the SIMD engine, with each PE (0-63) operating independently on its data.

FIGURE 7.6 The Generated SIMD Engine for Cross-Correlation

FIGURE 7.7 A Single Computational Chip
Figure 7.7 shows the layout of each computational chip. Multiple PEs are instantiated on each chip, with the number of PEs per chip being determined by the user-specified size of the processor array. In this example, there are a total of 64 PEs, so four are placed on each of the 16 chips. Instructions are sent through the XL FIFO to X0, which broadcasts them over the crossbar to each chip. At the chip, the instruction is decoded and sent to all the SIMD PEs on the chip, along with the Operand and PE#.
This program contains a call to the intrinsic DBC_read_from_proc, which returns to the host the value of R on a specific PE. We use a special form of the "reduce logic" (see Section 7.4 on Global Operations) to implement this instruction. The figure shows that the SIMD instruction ("ibus") comes into the chip on the XP_XBAR port. There the opcode is decoded, and the decoded opcode, along with the other fields, is passed to each SIMD PE. The SIMD processors contain logic to execute the instructions. In addition, for this correlation program, they are connected to each other linearly through the "communicate bus" over which the value of a is shifted right. The "reduce data" shown flowing out of each SIMD PE is the value of R, which is read from successive PEs and sent to the host. The serializers and reduce logic are explained in Section 7.4. The value of R from the selected PE is sent two bits at a time out the XP_GOR and XP_GOR_VALID lines to X0.
FIGURE 7.8 The Control Chip
Figure 7.8 shows the layout of the control chip X0. The 36-bit instruction word comes in on X0_SIMD from the SPARC host. X0's instruction decoder and multi-tick decoder are identical to those of the computational chips. The reduce logic is similar to that of the computational chips, with the major difference being that the data enters bit-serially on the X0_GOR_RESULT_IN and X0_GOR_VALID_IN lines and goes to the host bit-serially on X0_GOR_RESULT and X0_GOR_VALID.
The final step is to generate the host program. Each opPar instruction is replaced with a Splash 2 specific instruction, as shown in Figure 7.9. A SPLASH_INSTRUCTION simply writes the parameter to the operand field of the SIMD Bus and then steps the clock, which issues the instruction to X0. The SPLASH_W_INSTRUCTION writes an 8 to the opcode field, a 1 to operand, and a 0 to the PE# field. The SPLASH_RP_INSTRUCTION writes a 3 to opcode, a 16 to operand (the bit length, which is required to control the reduce logic), and the current value of i modulo 64 to the PE#. Note that the modular reduction is performed on the host, and the result of the mod is sent to the processor array. The DBC_net_send instruction is broken into two parts: the first copies a to the communicate output port, and the second copies the communicate input port into a. The nearest-neighbor communication is explained further in the next section.
SPLASH_INSTRUCTION(9);       /* OPPARMOVEZERO_1L */
for (i = 0; i < 128; i++) {
    SPLASH_W_INSTRUCTION(8 /* OPPARWRITETOPROC */, 1, 0);
    SPLASH_INSTRUCTION(7);   /* OPPARBXOR3C_1L */
    SPLASH_INSTRUCTION(6);   /* OPPARADD2_2L */
    SPLASH_INSTRUCTION(5);   /* OPPARNETSEND1 */
    SPLASH_INSTRUCTION(5);
    SPLASH_INSTRUCTION(5);
    SPLASH_INSTRUCTION(4);   /* OPPARNETSEND2 */
    printf("%d \n", SPLASH_RP_INSTRUCTION(3, 16
        /* OPPARREADFROMPROC */, (i%64)));
}

FIGURE 7.9 Final C Program for Correlation
7.4 GLOBAL OPERATIONS

The data parallel model encompasses a number of global operations in which all (active) Processing Elements participate. On Splash 2, we have implemented two classes of data parallel operations:

• DBC_net_send, a nearest-neighbor linear communication pattern in which each PE sends a value to its right neighbor. PE 0 receives its value from the host through the Operand field of the instruction.

• Reduce operators MAX, MIN, AND, OR, and XOR, which perform the indicated operation across the entire virtual PE array and send the result back to the host.
7.4.1 Nearest-Neighbor Communication

Left-to-right communication is accomplished with structural connections between virtual processors. Each PE has a left and right port. The width of the communication ports is defined at compile time. These ports are hard-wired together so that the right port of processor i and the left port of processor i + 1 share a register. An exception to this is on the Xilinx chip boundaries. The Splash 2 linear interconnect is used for chip-to-chip communication. On each chip, the left port of the first virtual processor on the chip and the right port of the last virtual processor on the chip are connected to XP_LEFT and XP_RIGHT, respectively.
Communication of a value from a PE to its neighbor on the same chip requires one cycle. However, the need for inter-chip communication introduces delay. Since we latch data on chip boundaries, the inter-chip communication from the last PE on chip k to the first PE on chip k + 1 requires three cycles.³
To accommodate these differences, we separate the net send instruction into two parts, a write and a read. As shown in Figure 7.9, the NETSEND1, which writes the value to the communicate output port, is dispatched three times. Then the NETSEND2 is sent. This instruction latches the input communicate port into the receiving register.

³ A possible alternative, to synchronize the data with a three-cycle pipeline on the internal connections, is too costly in logic.
7.4.2 Reduction Operations

The notion of global combining, having all the PEs work together to produce a single result, is a key component of the SIMD processing model as defined by Hillis [7]. dbC has six primitive global reduction operators: MAX, MIN, SUM, AND, OR, and XOR. There is infix notation for each of these operators. For example, in

poly int a;
int b;
b = >?= a;

the operator >?= signifies max reduce.
The programmer can invoke a reduce directly by using the special operators. Combining operators are also generated by the compiler when parallel control constructs such as parallel-if or parallel-while are used [12]. In addition, combining logic is generated for the DBC_read_from_proc intrinsic.
Global combining operations on traditional SIMD machines require a large amount of communication between the processors. This requirement is especially difficult for the Splash 2 implementation in that the crossbar, which could be used for inter-chip communication, is engaged in broadcasting instructions. Any other use of the crossbar interferes with the instruction pipeline. For this reason, we use the GLOBAL_OR lines from each computational chip to X0 and from X0 to the host to compute a "reduce" result bit-serially and send it to the host.
Conceptually, global reduce operations on Splash 2 are performed in two stages. In the first stage, an intermediate result is computed for all of the PEs on a computational chip. These local results are transmitted to the control chip, and the second stage computes the global result, which is transmitted to the host. However, this two-stage computation occurs bit-serially and the stages are heavily pipelined.
When a PE receives a global reduce instruction, the PE sends its register data to a serializer component and its context to a reduce_context signal (refer to Figure 7.7). On the next cycle, the serializer shifts the data to the reduce components at a rate of two bits per cycle. The serial data is masked with the reduce_context signal. Each of the reduce components collects the data from the serializers and performs the appropriate reduction on the bits. The internals of the AND, OR, and XOR are trivial. MIN and MAX are discussed in the next section. The control logic for the reduce components (Reduce Control) selects the 2-bit result from the correct reduce component and sends it to the X0 chip via XP_GOR and XP_GOR_VALID connections.
On the control chip, X0, the 2-bit results from each of the 16 computational chips are collected on X0_GOR_RESULT_IN and X0_GOR_VALID_IN (refer to Figure 7.8). Each of the reduce components collects the data and again performs the appropriate reduction on the bits. The control logic selects the 2-bit result from the correct reduce component and sends it to the host via the X0_GOR_VALID and X0_GOR_RESULT connections.
There are many advantages to performing global combining operations bit-serially on Splash 2. One advantage is that the approach is easy to understand and implement. An obvious bit-serial path is available (and otherwise unused) in the form of the GLOBAL_OR lines. The instruction pipeline is not disturbed in order to perform a global reduction. Another advantage is that the control logic required to drive the reduce is very fast and quite small. Since the same reduce logic is reused every cycle, the speed and size of the logic is largely independent of the size of the registers to be reduced. The size of the logic is a function of the number of PEs per chip and the kinds of reduces to be performed. An unexpected advantage of the bit-serial approach to global combining operations is that the reduce operation takes relatively few cycles to complete. A bit-serial global reduction requires ⌈w/2⌉ + 13 cycles,⁴ where w is the width of the destination register in bits. Thus, an 8-bit reduce of 64 processors requires 17 cycles. A 16-bit reduce takes 21 cycles.

⁴ This includes a four-cycle instruction pipeline.
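The count is easy to check against the two figures just quoted (a trivial sketch; the constant 13 is the fixed overhead, which per the footnote above includes the four-cycle instruction pipeline):

#include <assert.h>

/* Cycles for a bit-serial global reduction of a w-bit register:
 * ceil(w/2) data ticks at two bits per cycle, plus 13 cycles of overhead. */
static unsigned reduce_cycles(unsigned w)
{
    return (w + 1) / 2 + 13;
}

int main(void)
{
    assert(reduce_cycles(8)  == 17);   /* 8-bit reduce  */
    assert(reduce_cycles(16) == 21);   /* 16-bit reduce */
    return 0;
}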
The MIN and MAX Global Combining Operations. A bit-serial approach to MIN and MAX global combining operations requires more thought than the AND, OR, and XOR operations. Nevertheless, the solution becomes evident if the data are processed most-significant-bit (msb) first. All that is required is one bit of state per processor to keep track of which processors are still participating in the computation. Let us first consider the generic problem of finding the maximum value in an array bit-serially.

Bit-serial MAX. Our method of performing MAX reduce uses an algorithm that computes one binary digit of output per cycle, starting with the most significant bit (msb). The number of cycles is determined by the width of the variable to be reduced. The input at cycle i consists of the ith msb of the register from each of the processors. A mask register (one bit per processor) is maintained to determine which processors should be ignored in subsequent cycles. The algorithm has a few simple steps:

1. The processor mask register is initialized to all 1s.
2. For each bit of the register:
   a) The processor window register contains the ith most significant bit from each processor.
   b) The window and mask registers are ANDed together.
   c) If the result is nonzero, the mask is set to the result and a 1 is output.
   d) Otherwise, the mask is left unchanged and a 0 is output.
Example

The following example traces a four-processor system with 6-bit registers. The processors' registers contain the following values:

Processor    Decimal    Binary representation
   p0            6          000110
   p1            9          001001
   p2           10          001010
   p3           11          001011
The mask is initialized to all 1s for the first iteration. On subsequent cycles, the Window and Mask registers change as follows:

Cycle    Bit    Window (p0p1p2p3)    Mask    Window AND Mask    Output
  1       5           0000          1111          0000            0
  2       4           0000          1111          0000            0
  3       3           0111          1111          0111            1
  4       2           1000          0111          0000            0
  5       1           1011          0111          0011            1
  6       0           0101          0011          0001            1

The output bit-serially generated is 001011 (decimal = 11).
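The window/mask logic is compact enough to express directly in C. The sketch below models the PE array as a plain array and reproduces the trace above for the values 6, 9, 10, and 11; it is meant only to clarify the algorithm, not the hardware (MIN follows by inverting the serial input and output, as described next).

#include <stdio.h>

#define NPE   4      /* processors in the example  */
#define WIDTH 6      /* register width in bits     */

/* Bit-serial MAX reduce: one output bit per cycle, msb first.
 * mask[p] = 1 means processor p is still a candidate for the maximum. */
static unsigned bit_serial_max(const unsigned reg[NPE])
{
    unsigned char mask[NPE];
    unsigned result = 0;

    for (int p = 0; p < NPE; p++)
        mask[p] = 1;                            /* step 1: mask := all 1s */

    for (int bit = WIDTH - 1; bit >= 0; bit--) {
        unsigned char wam[NPE];                 /* window AND mask        */
        int nonzero = 0;

        for (int p = 0; p < NPE; p++) {         /* steps 2a, 2b           */
            unsigned char w = (reg[p] >> bit) & 1u;
            wam[p] = (unsigned char)(w & mask[p]);
            nonzero |= wam[p];
        }

        if (nonzero) {                          /* step 2c: keep survivors, output 1 */
            for (int p = 0; p < NPE; p++)
                mask[p] = wam[p];
            result = (result << 1) | 1u;
        } else {                                /* step 2d: mask unchanged, output 0 */
            result <<= 1;
        }
    }
    return result;
}

int main(void)
{
    const unsigned reg[NPE] = {6, 9, 10, 11};
    printf("%u\n", bit_serial_max(reg));        /* prints 11 */
    return 0;
}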
Bit-serial MIN. The algorithm for finding the minimum value bit-serially is virtually identical to finding the maximum value. In essence, we find the maximum value of the one's-complement of the poly register, and the one's-complement of the result is the answer. The bit-serial reduce MAX is modified in two simple ways to get reduce MIN: the serial input (window) and the serial output are inverted.
The following example, as in the MAX example, uses the four-processor system with 6-bit registers. The processors' registers contain the values 6, 9, 10, and 11.
The mask is initialized to all 1s for the first iteration. On subsequent cycles, the Window and Mask registers change as follows:

Cycle    Bit    Window* (p0p1p2p3)    Mask    Window AND Mask    Output*
  1       5            1111          1111          1111            0
  2       4            1111          1111          1111            0
  3       3            1000          1111          1000            0
  4       2            0111          1000          0000            1
  5       1            0100          1000          0000            1
  6       0            1010          1000          1000            0

*Note that the bits are simply inverted from the reduce MAX example.

The output bit-serially generated is 000110 (decimal = 6), which is indeed the minimum value.
7.4.3 Host/Processor Communication

Four dbC intrinsics are available for communication between the host and the processor array. DBC_read_from_proc reads a parallel variable from a specific processor. DBC_read_from_all reads a parallel value from each processor into an array in the host. DBC_write_to_proc writes a value from the host to a specific processor. Finally, DBC_write_to_all spreads a host array onto the virtual processor array, one element per processor.
The DBC_read_from_proc Operation. The DBC_read_from_proc intrinsic is implemented on Splash 2 as a modified reduce OR. The host specifies the PE to be read via the PE# field of the SIMD instruction. The PE Select component compares the PE# field to the processor ID (iproc) of each PE (see Figure 7.7). If the iproc and PE# are equal, the pe_selected signal is set to a 1. The DBC_read_from_proc call causes the register to be passed to the serializer, as with any global reduce operation. However, the reduce_context signal is set to pe_selected in place of the PE's context bit. This effectively masks out all PEs except for the PE selected by the host.
This approach to DBC_read_from_proc has the advantage that the logic required for the operation is negligible. If a global combining operation such as or_reduce is used by the design, that logic is recycled by DBC_read_from_proc.

The DBC_read_from_all Operation. The DBC_read_from_all intrinsic copies all of the values of a poly register held by PEs to an array on the host. This function could be implemented as n iterations of DBC_read_from_proc (where n is the number of PEs). However, we chose a more efficient method. The Splash 2 implementation uses n iterations of left-to-right communicate to get the poly registers to the host. The last PE's right port is connected to XP_RIGHT on the computational chip (refer to Figure 7.7). The XP_RIGHT port of the last computational chip in the Splash 2 linear path is connected to the host via the RBUS (see Figure 2.1). As the host issues n iterations of left-to-right net send calls, it reads data from the RBUS. The data appear on the RBUS in reverse order (n, n - 1, n - 2, ...). Consequently, the host fills the destination array backwards. As described in Section 7.4.1, a DBC_net_send instruction requires four cycles. Utilizing the net send, the DBC_read_from_all intrinsic requires 4n cycles to complete. This is much more efficient than n iterations of a 13+ cycle DBC_read_from_proc.
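The host side of this pattern is sketched below. The two helper calls are hypothetical placeholders for whatever host interface the compiler actually generates; the point is the n net-send issues and the backwards fill of the destination array.

/* Placeholders, declared only so the sketch compiles. */
extern void     issue_net_send(void);   /* dispatch one left-to-right net send  */
extern unsigned read_rbus(void);        /* read the value presented on the RBUS */

/* Illustrative host-side loop for DBC_read_from_all: values arrive in
 * reverse PE order, so the destination array is filled from the back.  */
void read_from_all(unsigned dest[], int n)
{
    for (int i = 0; i < n; i++) {
        issue_net_send();
        dest[n - 1 - i] = read_rbus();
    }
}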
`
`
Writing to the Processor Array. The DBC_write_to_proc implementation on Splash 2 is quite simple due to the fact that both the instruction and the data flow in the same direction, from host to PE array. As described in Section 7.3.3, the SIMD instruction generated by the host has three fields: opcode, operand, and PE#. The operand field contains the data to be written, and the PE# field contains the ID of the destination PE. If the PE# field of the SIMD instruction matches the iproc of the processor, the register is assigned the value of the operand field. Otherwise, the instruction is ignored. A DBC_write_to_proc call takes only one cycle.
The DBC_write_to_all is implemented as a series of n DBC_write_to_proc calls, where n is the number of PEs. This operation requires n cycles.
7.5 OPTIMIZATION: MACRO INSTRUCTIONS

Our simple cross-correlation program has approximately 10 instructions. A more realistic application would result in many tens more. It is advantageous to reduce the number of parallel instructions for a variety of reasons. Fewer instructions require a smaller decoder, a savings in logic. By scheduling independent SIMD operations at the same clock, we introduce new instruction-level parallelism within a SIMD PE. Even more compelling, a single powerful multi-tick instruction can be clocked independently at the processor array, allowing the Splash 2 system to run at a faster rate than the host SPARCstation can drive it. Our compiler, therefore, identifies opportunities for multi-tick operations and synthesizes multi-tick instructions, which are activated by a single opcode.
One such category of multi-tick operations is the reduce family of instructions discussed in the previous section. Another category, which we describe here, is parallel basic blocks, from which the compiler creates "macro instructions." A single macro instruction dispatched from the host initiates a multi-tick instruction in which one or more generic SIMD operations occur concurrently on each PE.
7.5.1 Creating a Macro Instruction

A basic block consists of a sequence of computation with a single entry point at the top of the block, a single exit at the bottom, and no branching into or out of the block except through the single entry and exit. Figure 7.10 shows a basic block written in dbC and the corresponding generic SIMD code.
The macro instruction scheduler attempts to schedule all the operations in the block to occur in a single clock tick. This is possible only if there are no interoperation dependencies. For example, instructions (1) and (2) in Figure 7.10 are independent and can safely occur in a single clock tick, but instruction (4) depends on the completion of instruction (3). Only after t3 is registered can the add occur.
#define N 16

poly unsigned u:N, v:N, w:N, k:N;

u = DBC_iproc[0 +: N];
v = (poly unsigned:N) (DBC_nproc + 1) - DBC_iproc[0 +: N];
k = v - 1;
w = u | v | k;

(1) opParMove_1L_a (u.address,
        opParAddOffset(DBC_iproc.address, (0)), 16);
(2) opParMoveC_1L_a (opParAddOffset(_DBC_poly_frame_t_main, 1)
        /* t3:16:1 */, (DBC_nproc + 1), 16);
(3) opParSub3_1L_a (v.address,
        opParAddOffset(_DBC_poly_frame_t_main, 1) /* t3:16:1 */,
        opParAddOffset(DBC_iproc.address, (0)), 16);
(4) opParSub3c_1L_a (k.address, v.address, 1, 16);
(5) opParBor3_1L_a (opParAddOffset(_DBC_poly_frame_t_main, 1)
        /* t5:16:1 */, u.address, v.address, 16);
(6) opParBor3_1L_a (w.address, opParAddOffset(_DBC_poly_frame_t_main, 1)
        /* t5:16:1 */, k.address, 16);

FIGURE 7.10 A Basic Block and Its Translation to Generic SIMD
FIGURE 7.11 Dependency Graph
For each basic block, the scheduler constructs a dependency matrix M to reflect the dependencies among parallel instructions. M[i, j] = 1 implies that instruction (j) depends on instruction (i). Then, an As Soon As Possible (ASAP) scheduling algorithm is used to sequence the parallel instructions. All instructions j such that M[*, j] = 0 can be scheduled in the current tick. Once an instruction j has been scheduled, M[j, *] is set to 0, allowing the instructions that depend on j to be scheduled in the next iteration of the algorithm. Figure 7.11 shows the dependency graph for this example.
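A compact C sketch of this ASAP pass is given below. The scheduler itself follows the description above; the dependency entries set up in main are inferred from the code of Figure 7.10 and the graph of Figure 7.11, so treat them as an illustration rather than compiler output. With these dependencies the sketch schedules the block in four ticks, matching the 4-tick instruction described next.

#include <stdio.h>
#include <string.h>

#define NINSTR 6    /* parallel instructions (1)-(6) of Figure 7.10 */

/* M[i][j] = 1 means instruction j depends on instruction i.
 * Each tick, schedule every instruction whose column of M is all zero,
 * then clear its row so that its dependents become ready next tick.    */
static void asap_schedule(int M[NINSTR][NINSTR])
{
    int done[NINSTR] = {0};
    int remaining = NINSTR;

    for (int tick = 1; remaining > 0; tick++) {
        int ready[NINSTR], nready = 0;

        for (int j = 0; j < NINSTR; j++) {
            if (done[j]) continue;
            int blocked = 0;
            for (int i = 0; i < NINSTR; i++)
                blocked |= M[i][j];
            if (!blocked) ready[nready++] = j;
        }
        if (nready == 0) break;                /* defensive: a basic block has no cycles */

        printf("tick %d:", tick);
        for (int k = 0; k < nready; k++) {
            int j = ready[k];
            printf(" (%d)", j + 1);
            done[j] = 1;
            remaining--;
            memset(M[j], 0, sizeof M[j]);      /* release instructions that depend on j */
        }
        printf("\n");
    }
}

int main(void)
{
    int M[NINSTR][NINSTR] = {0};
    M[1][2] = 1;                  /* (3) uses t3 produced by (2)      */
    M[2][3] = 1;                  /* (4) uses v produced by (3)       */
    M[0][4] = 1; M[2][4] = 1;     /* (5) uses u from (1), v from (3)  */
    M[3][5] = 1; M[4][5] = 1;     /* (6) uses k from (4), t5 from (5) */

    asap_schedule(M);
    return 0;
}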
The compiler generates a single opcode for the macro instruction. When that opcode is issued, a subinstruction shift register is used to sequence through the subinstructions. For this example, a 4-tick instruction is issued. The sequencing of ticks is controlled by a 4-bit shift register.
7.5.2 Discussion

Our approach differs from more general high-level synthesis systems in two respects. First, since we focus on the SIMD model, control flow is managed by the host. We are concerned only with basic blocks of parallel instructions and need not build and schedule a general control-flow graph as is done by the IBM [2] and similar systems. Second, in contrast to most high-level synthesis systems that synthesize logic for a single chip, we focus on synthesis of the entire FPGA-based parallel computing system. Our efforts are directed toward synthesis for the Splash logic array, of which generating logic for individual chips is one (important) part. We use a commercial FPGA compiler to further optimize the VHDL generated by our system.
7.6 EVALUATION: GENETIC DATABASE SEARCH

Applications involving search for similarity in genome strings have been mapped successfully to Splash 1 [4] and Splash 2 [8]. We have compiled a dbC version of this application for Splash 2. A source stream is stored acros