An Elementary Processor Architecture
with Simultaneous Instruction Issuing from Multiple Threads

Hiroaki Hirata, Kozo Kimura, Satoshi Nagamine, Yoshiyuki Mochizuki,
Akio Nishimura, Yoshimori Nakase and Teiji Nishizawa

Media Research Laboratory,
Matsushita Electric Industrial Co., Ltd.
1006 Kadoma, Osaka 571, Japan
(hirata@isl.mei.co.jp)

Abstract

In this paper, we propose a multithreaded processor architecture which improves machine throughput. In our processor architecture, instructions from different threads (not from a single thread) are issued simultaneously to multiple functional units, and these instructions can begin execution unless there are functional unit conflicts. This parallel execution scheme greatly improves the utilization of the functional units. Simulation results show that by executing two and four threads in parallel on a nine-functional-unit processor, a 2.02 and a 3.72 times speed-up, respectively, can be achieved over a conventional RISC processor.

Our architecture is also applicable to the efficient execution of a single loop. In order to control functional unit conflicts between loop iterations, we have developed a new static code scheduling technique. Another loop execution scheme, which uses the multiple control flow mechanism of our architecture, makes it possible to parallelize loops which are difficult to parallelize on vector or VLIW machines.

(The original paper appears in the Proceedings of the 19th Annual International Symposium on Computer Architecture, ACM & IEEE-CS, pp. 136-145, May 1992.)

1 Introduction

We are currently developing an integrated visualization system which creates virtual worlds and represents them through computer graphics. However, the generation of high-quality images requires great processing power. Furthermore, in order to model the real world as faithfully as possible, intensive numerical computations are also needed. In this paper, we present a processor architecture[1] used as the base processor in a parallel machine which could run such a graphics system.
In the field of computer graphics, ray-tracing and radiosity are well-known algorithms for generating realistic images. In these algorithms, intersection tests account for a large part of the processing time of the whole program. This test has inherent coarse-grained parallelism and can easily be divided into many threads (parallel processes) to be executed on multiprocessing systems. Each thread, however, executes a number of data accesses and conditional branches. In the case of a distributed shared memory system, low processor utilization can result from long latencies due to remote memory accesses. On the other hand, low utilization of functional units within a processor can result from inter-instruction dependencies and functional operation delays. Another obstacle to the optimization of single-thread execution arises if the past behavior of a repeatedly executed conditional branch does not help in predicting the target of future executions.
In order to overcome these problems, we introduced two kinds of multithreading techniques into our processor architecture: concurrent multithreading and parallel multithreading. The concurrent multithreading technique keeps the processor busy during the long latencies caused by remote memory accesses: when a thread encounters an absence of data, the processor rapidly switches to another thread. Parallel multithreading within a processor, on the other hand, is a latency-hiding technique at the instruction level: when an instruction from a thread cannot be issued because of a control or data dependency within the thread, an independent instruction from another thread is executed. This technique is especially effective in a deeply pipelined processor.

Our goal is an efficient and cost-effective processor design which is oriented specifically toward use as an elementary processor in a large-scale multiprocessor system rather than in a uniprocessor system. The idea of parallel multithreading arose from a hardware resource optimization in which several processors are
united and reorganized so that functional units can be fully used. Figure 1(a) shows a general multiprocessor system organization. Multiple threads are executed on each physical processor. For example, assume that the utilization of the busiest functional unit of a processor in Figure 1(a) is about 30% because of instruction-level dependencies. The utilization of a functional unit is defined as U = (N × L / T) × 100 [%], where N is the number of invocations of the unit, L is the issue latency of the unit, and T is the total number of execution cycles of the program. In this case, three processors could be united into one as shown in Figure 1(b), so that the utilization of the busiest functional unit could be expected to improve to nearly 30% × 3 = 90%. Consequently, on the united processor, multiple instructions from different threads are issued simultaneously and executed unless they conflict with one another by competing for the same functional unit.
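
As a concrete illustration (the numbers here are ours, not taken from the paper): a load/store unit with an issue latency of L = 2 cycles that is invoked N = 135 times in a program taking T = 900 cycles has a utilization of U = (135 × 2 / 900) × 100 = 30%.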
The programmer's view of the logical processor in Figure 1(b) is almost the same as the view of the physical processor in Figure 1(a), although the physical organization is different. In the field of numerical computation, parallel loop execution techniques developed for multiprocessors, such as the doall or doacross techniques which assign iterations to processors, are also applicable to our processor.

In this paper, we concentrate most of our architectural interest on parallel multithreading and restrict ourselves to simply outlining concurrent multithreading. The rest of this paper is organized as follows. In section 2, our multithreaded processor architecture is described. Section 3 demonstrates the effectiveness of our architecture via simulation results. In section 4, preceding work on multithreaded architectures is presented and compared with our architecture. Concluding remarks are made in section 5.

2 Processor Architecture

2.1 Machine Model

2.1.1 Hardware Organization

As shown in Figure 2, the processor is provided with several pairs of instruction queue units and decode units, called thread slots. Each thread slot, associated with a program counter, makes up a logical processor, while an instruction fetch unit and all functional units are physically shared among the logical processors.

An instruction queue unit has a buffer which holds some of the instructions following the one indicated by the program counter. The buffer size needs to be at least B = S × C words, where S is the number of thread slots and C is the number of cycles required to access the instruction cache.

Figure 1: Multiprocessor system organization. (a) Conventional organization: each physical processor contains a sequencer, functional units, a register set, and an instruction and a data cache, and is connected to memories through an interconnection network. (b) Organization using united elementary processors: several logical processors (sequencers with register sets) within one physical processor share the functional units and the instruction and data caches.

An instruction fetch unit fetches at most B instructions for one thread every C cycles from the instruction cache, and attempts to fill the buffer in the instruction queue unit. This fetching operation is done in an interleaved fashion for multiple threads, so, on average, the buffer in one instruction queue unit is filled once every B cycles. When one of the threads encounters a branch instruction, however, that thread can preempt the fetching operation. The instruction cache and fetch unit might become a bottleneck for a processor with many thread slots. In such a case, another cache and fetch unit would be needed.
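
For a sense of scale (these particular numbers are ours, not the paper's): with S = 3 thread slots and a C = 2-cycle instruction cache access, each buffer must hold at least B = 3 × 2 = 6 words, and a given thread's buffer is refilled roughly once every 6 cycles.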
`A decode unit gets an instruction from an instruc-
`tion queue unit and decodes it. The instruction set is
`based on a RISC type, and a load/store architecture is
`assumed. Branch instructions are executed within the
`decode unit. The data dependencies of an instruction
`are checked using the scoreboarding technique, and is-
`sued if it is free of dependencies. Otherwise, issuing is
`interlocked.
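
As a rough illustration of the kind of check the decode unit performs, here is a minimal scoreboard sketch in C; the register count, field names, and the function itself are our assumptions, not details from the paper.

    /* One scoreboard bit per register of the bank bound to this
       logical processor: set while a result is still pending.      */
    #define NREGS 32

    typedef struct {
        unsigned char busy[NREGS];
    } scoreboard_t;

    /* Returns 1 if an instruction reading src1/src2 and writing dst
       may be issued this cycle; otherwise the decode unit interlocks. */
    static int can_issue(const scoreboard_t *sb, int src1, int src2, int dst)
    {
        if (sb->busy[src1] || sb->busy[src2])   /* operand still being produced */
            return 0;
        if (sb->busy[dst])                      /* previous write still pending */
            return 0;
        return 1;
    }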
Figure 2: Hardware organization with three thread slots. Each thread slot pairs an instruction queue unit with a decode unit and has its own program counter; a single instruction fetch unit and instruction cache are shared. Standby stations and instruction schedule units sit between the decode units and the functional units (integer ALU, barrel shifter, integer multiplier, FP adder, FP multiplier, FP converter, and two load/store units backed by a multi-bank data cache). The large register files are divided into register sets allocated to executing, waiting, and ready threads, and queue registers connect the logical processors.

Issued instructions are dynamically scheduled by instruction schedule units and delivered to functional units. Unless an instruction conflicts with other instructions issued from other decode units over the same
functional unit, the instruction is sent immediately to the functional unit and executed. Otherwise, the instruction is arbitrated by the instruction schedule unit. When an instruction is not selected by an instruction schedule unit, it is stored in a standby station and remains there until it is selected. Standby stations allow decode units to proceed with their operations even if previously issued instructions cause resource conflicts. For example, while a shift instruction stays in a standby station, a succeeding add instruction from the same thread can be sent to the ALU. Consequently, instructions from a thread are executed out of order, though they are issued from the decode unit in order. A standby station is a simple latch whose depth is one; it differs from a reservation station in Tomasulo's algorithm[2] because issued instructions are guaranteed to be free of dependencies in our architecture.
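
A minimal model of this arbitration might look like the following sketch (one arbiter per functional unit); the slot count, structures, and names are ours, not the authors' logic.

    /* Per-functional-unit arbitration among thread slots.  Each slot may
       offer at most one instruction per cycle; the highest-priority
       requester wins, and losers wait in their one-deep standby stations. */
    #define NSLOTS 3

    typedef struct {
        int valid;          /* an issued or standby instruction wants this unit */
    } request_t;

    /* prio[0] is the slot currently holding the highest priority; the
       priority order is rotated periodically to avoid starvation.        */
    int pick_winner(const request_t req[NSLOTS], const int prio[NSLOTS])
    {
        for (int p = 0; p < NSLOTS; p++) {
            int slot = prio[p];
            if (req[slot].valid)
                return slot;    /* this slot's instruction goes to the unit */
        }
        return -1;              /* no requester: the unit idles this cycle  */
    }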
A large general-purpose register file, as well as a floating-point one, is divided into banks, each of which is used as a full register set private to a thread. Each bank has two read ports and one write port. More read ports are unnecessary because operand data can be stored in standby stations. In order to support concurrent multithreading, the physical processor is provided with more register banks than thread slots. When a thread is executed, the bank allocated for the thread is logically bound to the logical processor. The exclusive one-to-one binding between a logical processor and a register bank guarantees that the logical processor does not access any register banks other than the bank bound to it. In contrast, queue registers are special registers which enable communication between logical processors at the register-transfer level.

Some disadvantages of our machine model include the hardware cost of the register files, the multiple thread slots, and the dynamic instruction scheduling facilities. The increased complexity of the network between functional units and register banks is also a disadvantage.

2.1.2 Instruction Pipeline

Figure 3(a) shows the instruction pipeline of the logical processor, which is a modification of the superpipelined RISC pipeline shown in Figure 3(b). In the logical processor pipeline, two instruction fetch (IF) stages are followed by two decode (D) stages, a schedule (S) stage, multiple execution (EX) stages, and a write (W) stage.

The IF stages are shown for convenience; in fact, an instruction is read from the buffer of an instruction queue unit in one cycle. In stage D1, the format or type of the instruction is tested. In stage D2, the instruction is checked to see whether it is issuable. Stage S is inserted for the dynamic scheduling performed in the instruction schedule units.

Figure 3: Instruction pipeline. (a) Instruction pipeline of the multithreaded processor: IF1, IF2, D1, D2, S, EX1, EX2, W, with successive instructions I1, I2, I3 each starting one cycle later. (b) Instruction pipeline of the base RISC processor: IF1, IF2, D1, D2, EX1, EX2, W.

The number of EX stages depends on the kind of instruction. For example, in our simulation (section 3) we assumed each operation had the latencies listed in Table 1. The result latency is the number of cycles before the instruction finishes execution (i.e., the number of EX stages), and the issue latency is the number of cycles before another instruction of the same type may be issued. Since two cycles are assumed to be required to access the data cache, as well as the instruction cache, the issue latency of load/store instructions is two cycles. Other instructions can be accepted every cycle, so their issue latency is one cycle. In stage W, the result value is written back to a register.

Required operands are read from registers in stage S. A scoreboarding bit corresponding to a destination register is flagged on in stage S and off in the last EX stage. This avoids additional delay in the initiation of data-dependent instructions. That is, assuming instruction I2 uses the result of instruction I1 as a source, at least three cycles are required between I1 and I2 in Figure 3(a). The same number of cycles is also required in the base RISC pipeline of Figure 3(b).
In the case of branch instructions, an instruction fetch request is sent to the instruction fetch unit at the end of stage D1, even if a conditional branch outcome is still unknown at that point. Assume I1 is a branch instruction and I3 is the instruction executed immediately after I1. In Figure 3(b), the delay between I1 and I3 is four cycles, but the delay in Figure 3(a) is five. What is worse, it could become more than five if several threads encounter branches at the same time. It is obvious that single-thread performance is damaged. In our architecture, however, while some threads are blocked because of branches, other types of instructions from other threads are executed. Consequently, the parallel multithreading scheme has the potential to hide the delay of branches as well as other arithmetic operation delays, and to enhance machine throughput.

Table 1: Functional unit and issue/result latencies of instructions

functional unit      category of instructions   issue latency (cycles)   result latency (cycles)
Integer ALU          add/subtract               1                        2
                     logical                    1                        2
                     compare                    1                        2
Barrel Shifter       shift                      1                        2
Integer Multiplier   multiply                   1                        6
                     divide                     1                        6
FP Adder             add/subtract               1                        4
                     compare                    1                        4
                     absolute/negate            1                        2
                     move                       1                        2
FP Multiplier        multiply                   1                        8
                     divide                     1                        24
FP Converter         convert                    1                        4
Load/Store Unit      load                       2                        4
                     store                      2                        (4)

2.1.3 Support of Concurrent Multithreading

Each pair of general-purpose and floating-point register sets, together with an instruction address save register and a thread status register, is conceptually grouped into a single entity called a context frame, which is allocated to a thread. The instruction address save register and the thread status register are used as save areas for the program counter and the program status words (which hold various pieces of thread-specific state), respectively.

Since the processor has more context frames than thread slots, context switching is achieved rapidly by changing the logical binding between a logical processor and a context frame. As long as the number of threads to be scheduled does not exceed the number of physical context frames, threads can remain resident on the physical processor, reducing the overhead of saving and restoring contexts to and from memory.

Another important element of the context frame is the access requirement buffer, which holds outstanding memory access requests. When a running thread issues load/store instructions, these instructions are copied and recorded in the access requirement buffer, and deleted when they are performed.
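
Conceptually, a context frame bundles the following state; this is a sketch only, and the field widths and the depth of the access requirement buffer are our assumptions.

    /* Conceptual contents of one context frame (not a literal layout). */
    #define NREGS    32
    #define NPENDING 8          /* assumed depth of the access requirement buffer */

    typedef struct {
        unsigned long gpr[NREGS];       /* general-purpose register set           */
        double        fpr[NREGS];       /* floating-point register set            */
        unsigned long iasr;             /* instruction address save register (PC) */
        unsigned long tsr;              /* thread status register (PSW)           */
        struct {
            unsigned long instruction;  /* the outstanding load/store instruction */
            unsigned long address;      /* its effective address                  */
        } access_req[NPENDING];
        int           n_outstanding;
    } context_frame_t;

    /* A context switch rebinds a logical processor to a different frame;
       nothing has to be copied to or from memory.                        */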
The processor is pipelined, and the standby stations enable out-of-order execution of instructions. This makes
the restart mechanism of programs somewhat complex. Before a thread is switched out, the logical processor should wait until all issued instructions except load/store instructions are performed. A load/store instruction is not always performed with short latency, and this is the main source of context switching in our architecture. Therefore, the load/store instructions in the execution pipeline can be flushed. Restarting the thread is still possible because such outstanding memory access requests are also saved as a piece of the context. When the thread is resumed, the instructions in the access requirement buffer are decoded again and re-executed, followed by the instruction indicated by the instruction address save register. This mechanism also ensures imprecise but restartable interrupts on page faults. Similar mechanisms can be applied to floating-point exception handling.

Context switching is triggered by a data absence trap. The detailed conditions for such traps must be considered with care, because incorrect choices can create starvation and thrashing problems. We will report our solution, including the cache/memory architecture, in a future paper.

2.2 Instruction Scheduling Strategy

In this section, we present the dynamic instruction selection policy of the instruction schedule unit. A simple selection policy is preferable, so as not to enlarge the instruction pipeline pitch.

Figure 4 illustrates an example of the policy with multi-level priorities. A unique priority is assigned to each thread slot. Unless an issued instruction conflicts with other instructions from thread slots to which higher priorities are assigned, it is selected by the instruction schedule unit. In order to avoid starvation, the priorities are rotated: the lowest priority is assigned to the thread slot which had the highest priority before the rotation.

Figure 4: Dynamic instruction selection policy in the instruction schedule unit (A, B, and C denote thread slots).

The instruction schedule unit works in one of two modes: implicit-rotation mode and explicit-rotation mode. The mode is switched to the alternative mode through a privileged instruction. In the implicit-rotation mode, priority rotation occurs every given number of cycles (the rotation interval), as shown in Figure 4. On the other hand, in the explicit-rotation mode, the rotation of priority is controlled by software: the rotation is done when a change-priority instruction is executed on the logical processor with the highest priority. There are two reasons why our architecture supports the explicit-rotation mode. One is to aid the compiler in scheduling the code of threads executed in parallel when this is possible. The other is to parallelize loops which are difficult to parallelize using other architectures. These features are discussed in sections 2.3.2 and 2.3.3, respectively.

2.3 Parallel Execution of a Single Loop

2.3.1 Overview

Parallel execution of a single loop is available on our multithreaded machine by assigning iterations to logical processors. For example, in the case of a physical processor with n thread slots, the kth logical processor executes the (ni+k)th iterations (i = 0, 1, ...).

The explicit-rotation mode presented in section 2.2 is one of the facilities supporting the parallel execution of a loop. In this mode, context switches due to data absence are suppressed and the physical processor is occupied entirely by the threads from the loop. In the usual case, a change-priority instruction is inserted at the end of each iteration by the compiler.

A fast-fork instruction generates the same number of threads as logical processors by setting its own next instruction address into the program counters of the other thread slots and activating them. It also sets unique logical processor identifiers into special registers corresponding to each logical processor. Each thread, therefore, can be informed which iterations it executes.
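
As a purely illustrative reading of this, the compiler-visible structure of a doall loop after a fast-fork might look like the following C sketch; fast_fork(), logical_processor_id(), and NUM_LP are hypothetical names of ours, not the paper's.

    /* Hypothetical C-level view of a doall loop after a fast-fork. */
    #define NUM_LP 4                          /* number of logical processors */

    extern void fast_fork(void);              /* hypothetical intrinsic */
    extern int  logical_processor_id(void);   /* hypothetical intrinsic */

    void doall(double a[], int n)
    {
        fast_fork();                          /* every logical processor resumes here */
        int k = logical_processor_id();       /* unique id, 0 .. NUM_LP-1             */

        /* the k-th logical processor runs iterations k, k+NUM_LP, k+2*NUM_LP, ... */
        for (int i = k; i < n; i += NUM_LP)
            a[i] = a[i] * 2.0;                /* arbitrary independent loop body      */
    }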
In the case of doall-type loops, it is unnecessary for logical processors to communicate with one another. But the execution of doacross-type loops requires communication between logical processors. One solution would be communication through memory. In order to reduce the communication overhead, however, we provide the processor with queue registers. They are queue-structured registers connected to the functional units.

The connection topology among logical processors through the queue registers is important when considering the trade-off between hardware cost and effectiveness. One simple and effective connection topology is a ring network, as shown in Figure 5.
Figure 5: Logical view of queue registers (each logical processor reads from the queue register fed by the previous logical processor in the ring and writes to the queue register leading to the next one).

Many doacross-type loops encountered in practice have a one-iteration difference. When a loop has a difference of more than one iteration, data must be relayed through several logical processors, which could create overhead. Some techniques, however, are available to convert loops so that the iteration difference is reduced to one; loop unrolling is one such technique[3].

Queue registers are enabled and disabled by special instructions. When enabled, two queue registers, one for reading from the previous logical processor and one for writing to the next, are mapped onto two general-purpose or floating-point registers. References to such registers are interpreted as references to queue registers, reducing data-movement overheads between queue registers and general/floating-point registers. Full/empty bits attached to the queue registers are available as scoreboarding bits.
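
To make the full/empty semantics concrete, here is a small software model of one queue register; this is our own sketch, and the depth, types, and function names are assumptions, not taken from the paper.

    /* Toy model of a queue register with full/empty semantics.  A read
       from an empty queue interlocks the consumer, exactly as an unset
       scoreboarding bit would; a write to a full queue stalls the producer. */
    #define QDEPTH 4

    typedef struct {
        unsigned long data[QDEPTH];
        int head, tail, count;          /* count == 0 means "empty" */
    } queue_reg_t;

    int qreg_write(queue_reg_t *q, unsigned long v)   /* producer side */
    {
        if (q->count == QDEPTH)
            return 0;                   /* full: writer must stall      */
        q->data[q->tail] = v;
        q->tail = (q->tail + 1) % QDEPTH;
        q->count++;
        return 1;
    }

    int qreg_read(queue_reg_t *q, unsigned long *v)   /* consumer side */
    {
        if (q->count == 0)
            return 0;                   /* empty: reader is interlocked */
        *v = q->data[q->head];
        q->head = (q->head + 1) % QDEPTH;
        q->count--;
        return 1;
    }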

2.3.2 Static Code Scheduling

As mentioned in section 1, in the case of computer graphics programs it is often difficult to predict the control sequence of a thread before execution, because the sequence is determined dynamically by data-dependent branches. In such cases, in order to obtain high throughput, the compiler can employ no technique other than a simple list scheduling algorithm[4]. The compiler reorders the code without consideration of other threads, and concentrates on shortening the processing time of each thread. A short processing time for a single thread means a high instruction issue rate, because the number of instructions is constant whether the code is scheduled or not. Though this scheduling approach has no control over resource conflicts, it aims at flooding the functional units with sufficient instructions through the parallel execution of such scheduled code.

On the other hand, in the case of numerical computation programs it is often possible to predict the execution-time behavior of a loop. Therefore, the compiler employs more judicious code scheduling techniques.

For example, software pipelining[5, 6], which is an effective code scheduling technique for VLIW architectures, employs a resource reservation table to avoid resource conflicts. Such a technique is also applicable to our architecture, but it would postpone the issue of instructions which would cause a resource conflict. So, straightforward use of the algorithm could miss opportunities to issue instructions. Consequently, we have developed a new algorithm which makes the most of the standby stations and instruction schedule units.

Our algorithm employs not only a resource reservation table but also a standby table whose entries correspond to the standby stations. Our algorithm differs from software pipelining in the point described below. When all of the instructions without dependencies at an issuing cycle have resource conflicts, a software pipeliner would generate a NOP. Our code scheduler, however, checks the standby table entries for those instructions. If an entry is not marked, the instruction is determined to be issued and the entry is marked; this means the instruction will be stored in a standby station. The explicit-rotation mode enables the compiler to know which instruction is selected. The resource reservation table is used not only to determine whether there is a resource conflict, but also to tell the compiler when the instruction in the standby station is executed.
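
A rough sketch of one scheduling cycle of this idea follows; it is our reconstruction for illustration only, not the authors' scheduler, and the table shapes and helper fields are assumptions.

    /* One cycle of the modified list-scheduling step described above.
       resv[u][t] != 0 means functional unit u is already reserved at
       cycle t; standby[u] != 0 means unit u's one-deep standby station
       is already promised to an earlier instruction.                   */
    #include <stddef.h>

    #define NUNITS   9
    #define MAXCYCLE 256

    typedef struct {
        int unit;         /* functional unit this instruction needs */
        int ready;        /* 1 if all operands are available now    */
        int scheduled;    /* 1 once the scheduler has placed it     */
    } instr_t;

    void schedule_cycle(instr_t *code, size_t n, int cycle,
                        int resv[NUNITS][MAXCYCLE], int standby[NUNITS])
    {
        for (size_t i = 0; i < n; i++) {
            instr_t *in = &code[i];
            if (in->scheduled || !in->ready)
                continue;
            int u = in->unit;
            if (!resv[u][cycle]) {           /* unit free: issue and execute now      */
                resv[u][cycle] = 1;
                in->scheduled = 1;
            } else if (!standby[u]) {        /* unit busy: issue into standby station */
                standby[u] = 1;              /* cleared once the unit accepts it, as  */
                in->scheduled = 1;           /* read off the reservation table        */
            }
            /* otherwise leave the instruction for a later cycle; a plain
               software pipeliner would have emitted a NOP here instead.  */
        }
    }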
2.3.3 Eager Execution of Sequential Loop Iterations

Using an example, this section shows that our architecture can parallelize loops which are difficult to parallelize using vector or VLIW architectures.

The source code fragment of a sample program is shown in Figure 6. This program traverses a linear linked list. Although it is a simple example, it demonstrates the application of our execution scheme to a loop structure that is fundamental to many application programs.

    ptr = header;
    while ( ptr != NULL ) {
        tmp = a * ((ptr->point)->x)
            + b * ((ptr->point)->y) + c;
        if ( tmp < 0 )
            break;
        ptr = ptr->next;
    }

Figure 6: A sample program (written in C)

Each iteration is executed on a logical processor, and the value of ptr is passed through queue registers. The
compiler generates multiple versions of the variable ptr so that iterations can be initiated without having to wait for preceding iterations to complete. Such optimization with respect to a variable is a variation of the use of variables in dataflow machines. Figure 7 illustrates the parallel execution scheme. Each thread receives the value of ptr from the thread executing the immediately preceding iteration, and passes the next value to the thread executing the immediately succeeding iteration. After that, the thread continues to execute the rest of the loop body.

Figure 7: Parallel execution of the while loop (successive iterations use renamed versions ptr, ptr', ptr'', ...; the exiting thread kills the other threads).

A thread can initiate an iteration which might not have been executed in the sequential execution. For this reason, we call this scheme eager execution.

The instruction selection policy in the instruction schedule units plays an important part in preserving the semantics of the sequential program. The execution of an iteration is not acknowledged until its thread gets the highest priority, and it is the thread with the highest priority that executes the earliest iteration remaining at that point. By using multi-level priorities, the instruction selection policy guarantees that the execution of succeeding iterations does not hinder the execution of preceding iterations.

When a thread has exited from the loop, it tries to stop the operation of the other thread slots and to kill the other running threads. Such a special instruction is valid only for the thread with the highest priority. When a thread without the highest priority tries to execute this instruction, it is put into an interlocked state until it is given the highest priority.

If the variable tmp is global and live after the loop, the compiler should use a special store instruction which is performed only by the thread with the highest priority. This guarantees that writes to the memory location of tmp are performed in order with respect to the source code. Some instructions are also provided to ensure consistency between the contexts of threads before and after the execution of the loop.
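
Conceptually, the work each logical processor performs looks something like the following sketch; recv_from_prev(), send_to_next(), wait_for_highest_priority(), and kill_other_threads() are our hypothetical stand-ins for the queue-register accesses and the special instructions, not an API from the paper.

    /* One logical processor's share of the eager while-loop execution. */
    struct point { double x, y; };
    struct node  { struct point *point; struct node *next; };

    extern struct node *recv_from_prev(void);        /* read queue register  */
    extern void         send_to_next(struct node *); /* write queue register */
    extern void         wait_for_highest_priority(void);
    extern void         kill_other_threads(void);

    void loop_body(double a, double b, double c)
    {
        for (;;) {
            struct node *ptr = recv_from_prev();      /* ptr for this iteration        */

            if (ptr == NULL) {                        /* loop is over: commit the exit */
                wait_for_highest_priority();
                kill_other_threads();
                break;
            }
            send_to_next(ptr->next);                  /* eagerly hand off the next one */

            double tmp = a * ptr->point->x + b * ptr->point->y + c;
            if (tmp < 0) {                            /* the early exit ("break")      */
                wait_for_highest_priority();          /* only the earliest iteration   */
                kill_other_threads();                 /* may commit it                 */
                break;
            }
        }
    }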

3 Estimation

3.1 Simulation Model and Assumptions

In the following four sections, we present simulation results for our parallel multithreading architecture. Since cache simulation has not yet been implemented in our simulator, we assumed that all cache accesses hit.

We used two kinds of functional unit configurations (see Figure 2). One consists of seven heterogeneous units; the other consists of those units plus another load/store unit. In practice, a data cache might consist of several banks so that it can be accessed simultaneously by two load/store units, but our simulation assumed there were no bank conflicts. The latencies of each instruction are listed in Table 1. None of the machines simulated in this paper is equipped with delayed branch mechanisms, branch prediction facilities, or overlapped register windows.

In order to evaluate our architecture, we use the speed-up ratio as a criterion. It is defined as the ratio of the total cycles required by sequential execution to those required by multithreaded execution. Sequential execution means the execution of the sequential version of the program on a RISC-based processor whose instruction pipeline is shown in Figure 3(b).

3.2 Speed-up by Parallel Multithreading

This section presents simulation results for our multithreaded machine. As an application program, we have chosen a ray-tracing program, which can easily be parallelized at every pixel. The program, written in C, was compiled by a commercial RISC compiler with an optimization option. The object code was executed on a workstation, and the traced instruction sequences were translated for use with our simulator.

Table 2 lists the speed-up ratios when the rotation interval in the instruction schedule unit is eight cycles. In the case of a processor with one load/store unit, a 1.83 times speed-up is gained by the parallel execution of two instruction streams, even though not all of the hardware of a single-threaded processor is duplicated in the processor. A performance improvement of a factor of 2.89 is still possible with four thread slots, though the incremental gain (2.89/1.83 = 1.58) is smaller than that attained by increasing from one to two thread slots (1.83). When eight thread slots are provided, the utilization of the busiest functional unit, the load/store unit, reaches 99%; this is why the speed-up ratio saturates at only 3.22. The addition of another load/store unit improves the speed-up ratios by 10.4∼79.8%.
Table 2: Speed-up ratio by parallel multithreading

no. of      with one load/store unit            with two load/store units
thread      without           with              without           with
slots       standby station   standby station   standby station   standby station
2           1.79              1.83              2.01              2.02
4           2.84              2.89              3.68              3.72
8           3.22              3.22              5.68              5.79

Table 3: Tradeoff between speed-up and employed parallelism (speed-up ratio)

total no. of issued      no. of issued instructions from each thread slot (D)
instructions (D×S)       1         2         4         8
2                        2.02      1.31      —         —
4                        3.72      2.43      1.52      —
8                        5.79      4.37      2.79      1.66

Standby stations improve the speed-up ratio by 0∼2.2%. This low improvement is due to the poor parallelism within a single instruction stream. For application programs whose threads are rich in fine-grained parallelism, greater improvement can be achieved.

We also simulated a machine whose thread slots are provided with private instruction caches and instruction fetch units. Such an organization provides only a slight speed-up: the achieved speed-up ratios are the same as those in Table 2, except that 1.79 and 5.79 are replaced by 1.80 and 5.80, respectively. This shows that the delay due to instruction fetch conflicts is hidden, and that sharing an instruction cache between thread slots is feasible.

We also examined the execution cycles with various rotation intervals (2^n cycles, where n is 0∼8). In this application program, however, the rotation interval did not have much influence on performance, though eight or sixteen cycles seems slightly superior.

3.3 Multithreading with Superscalar Design

In our simulator, each thread slot can support multiple instruction issuing from a single instruction stream, similar to superscalar architectures. In this section, using the same program as in section 3.2, we discuss hybrid architectures which employ both coarse- and fine-grained parallelism.

A hybrid processor is represented here as a (D,S)-processor, where D is the maximum number of instructions which a thread slot issues each cycle, and S is the number of thread slots. For a fair comparison, the hardware cost of a (d,s)-processor is almost the same as that of a (1,d×s)-processor. For example, the instruction fetch bandwidth is d×s words per cycle in both processors. Although a (d×2,s/2)-processor is provided with half as many register sets as a (d,s)-processor, it requires double the number of register ports. Each thread slot checks the dependencies of the instructions in an instruction window of size D, and issues the instructions without dependencies every cycle. The instruction window is refilled every cycle. Some techniques[7, 8] have been proposed to boost superscalar performance, but they are not provided for the thread slot because