An Elementary Processor Architecture
with Simultaneous Instruction Issuing from Multiple Threads
`
`Hiroaki Hirata, Kozo Kimura, Satoshi Nagamine, Yoshiyuki Mochizuki,
`Akio Nishimura, Yoshimori Nakase and Teiji Nishizawa
`
`Media Research Laboratory,
`Matsushita Electric Industrial Co., Ltd.
1006 Kadoma, Osaka 571, Japan
`(hirata@isl.mei.co.jp)
`
`Abstract
`
`In this paper, we propose a multithreaded processor
architecture which improves machine throughput. In
our processor architecture, instructions from different
`threads (not a single thread) are issued simultaneously
`to multiple functional units, and these instructions can
`begin execution unless there are functional unit con-
`flicts. This parallel execution scheme greatly improves
`the utilization of the functional unit. Simulation results
`show that by executing two and four threads in parallel
`on a nine-functional-unit processor, a 2.02 and a 3.72
`times speed-up, respectively, can be achieved over a
`conventional RISC processor.
`Our architecture is also applicable to the efficient ex-
`ecution of a single loop. In order to control functional
`unit conflicts between loop iterations, we have devel-
`oped a new static code scheduling technique. Another
`loop execution scheme, by using the multiple control
`flow mechanism of our architecture, makes it possible
`to parallelize loops which are difficult to parallelize in
`vector or VLIW machines.
`
`1 Introduction
`
`We are currently developing an integrated visualiza-
`tion system which creates virtual worlds and represents
`them through computer graphics. However, the gener-
`ation of high quality images requires great processing
`power. Furthermore, in order to model the real world as
`faithfully as possible, intensive numerical computations
are also needed. In this paper, we present a processor
architecture[1] used as the base processor in a parallel
machine which could run such a graphics system.

(The original paper appears in the Proceedings of the 19th
Annual International Symposium on Computer Architecture,
ACM & IEEE-CS, pp. 136-145, May 1992.)
In the field of computer graphics, ray-tracing and ra-
diosity are well-known algorithms for generating realistic
images. In these algorithms, intersection tests account
for a large part of the processing time of the whole
program. These tests have inherent coarse-grained
parallelism and can easily be split into many threads
(parallel processes) to be executed on multiprocessing
systems.
`Each thread, however, executes a number of data ac-
`cesses and conditional branches. In the case of a dis-
`tributed shared memory system, low processor utiliza-
`tion can result from long latencies due to remote mem-
`ory access. On the other hand, low utilization of func-
`tional units within a processor can result from inter-
`instruction dependencies and functional operation de-
lays. Another obstacle to the optimization of single-
thread execution arises when the past behavior of a re-
peatedly executed conditional branch does not help in
predicting the target of its future executions.
`In order to overcome these problems, we introduced
`two kinds of multithreading techniques into our proces-
`sor architecture: concurrent multithreading and parallel
multithreading. The concurrent multithreading tech-
nique keeps the processor active during the long laten-
cies of remote memory accesses: when a thread encoun-
ters an absence of data, the processor rapidly switches
to another thread.
`threading within a processor is a latency-hiding tech-
`nique at the instruction level. When an instruction
`from a thread is not able to be issued because of ei-
`ther a control or data dependency within the thread,
`an independent instruction from another thread is exe-
`cuted. This technique is especially effective in a deeply
`pipelined processor.
`Our goal is to achieve an efficient and cost-effective
`processor design which is oriented specifically for use as
`an elementary processor in a large scale multiprocessor
system rather than for use in a uniprocessor system.
`The idea of parallel multithreading arose from hard-
`ware resource optimization where several processors are
`
`
`
`
`united and reorganized so that functional units could be
`fully used. Figure 1(a) shows a general multiprocessor
`system organization. Multiple threads are executed on
`each physical processor. For example, assume that the
`utilization of the busiest functional unit of a processor
`in Figure 1(a) is about 30% because of the instruction
level dependency. The utilization of a functional unit
is defined as U = (N · L / T) × 100 [%], where N is the
number of invocations of the unit, L is the issue latency
of the unit, and T is the total number of execution cycles
of the program. In this case, three processors could be united
`into one as shown in Figure 1(b), so that the utilization
`of the busiest functional unit could be expected to be
`improved nearly to 30×3=90%. Consequently, on the
`united processor, multiple instructions from different
`threads are issued simultaneously and executed unless
`they conflict with one another to compete for the same
`functional unit.
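As a small worked example of this definition (our own sketch; the figures below are the hypothetical 30% case assumed above, not measurements from the paper):

#include <stdio.h>

/* Utilization of a functional unit: U = (N * L / T) * 100 [%],
   where N is the number of invocations, L is the issue latency,
   and T is the total number of execution cycles of the program. */
static double utilization(double n, double l, double t)
{
    return 100.0 * n * l / t;
}

int main(void)
{
    /* hypothetical figures: 3,000 invocations, issue latency 1 cycle,
       10,000 total cycles -> U = 30%, as assumed for Figure 1(a) */
    printf("U = %.1f%%\n", utilization(3000.0, 1.0, 10000.0));
    return 0;
}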
The programmer’s view of a logical processor in
Figure 1(b) is almost the same as the view of a physi-
cal processor in Figure 1(a), although the physical or-
ganization is different. In the field of numerical com-
putation, parallel loop execution techniques developed
for multiprocessors, such as the doall or doacross tech-
niques which assign iterations to processors, are also
applicable to our processor.
`In this paper, we will concentrate most of our ar-
`chitectural interest upon parallel multithreading and
`restrict ourselves to simply outlining concurrent multi-
`threading. The rest of this paper is organized as follows.
`In section 2, our multithreaded processor architecture
`is addressed. Section 3 demonstrates the effectiveness
of our architecture via simulation results. In section
4, previous work on multithreaded architectures is re-
viewed and compared with our architecture. Conclud-
ing remarks are made in section 5.
`
`2 Processor Architecture
`
`2.1 Machine Model
`
`2.1.1 Hardware Organization
`
`As shown in Figure 2, the processor is provided with
`several instruction queue unit and decode unit pairs,
`called thread slots. Each thread slot, associated with
`a program counter, makes up a logical processor, while
`an instruction fetch unit and all functional units are
`physically shared among logical processors.
`An instruction queue unit has a buffer which saves
`some instructions succeeding the instruction indicated
`by the program counter. The buffer size needs to be at
`least B = S×C words, where S is the number of thread
`slots, and C is the number of cycles required to access
`the instruction cache.
`
[Figure 1: Multiprocessor system organization. (a) Conventional organization: each physical processor comprises an instruction and a data cache, a sequencer, functional units, and a register set, connected to the interconnection network and memories. (b) Organization using united elementary processors: several logical processors share the caches and functional units of one physical processor.]
`
`An instruction fetch unit fetches at most B instruc-
`tions for one thread every C cycles from the instruc-
`tion cache, and attempts to fill the buffer in the in-
struction queue unit. This fetching operation is done
in an interleaved fashion for multiple threads, so, on
average, the buffer in each instruction queue unit is
refilled once every B cycles. When one of the threads en-
counters a branch instruction, however, that thread can
preempt the fetching operation. The instruction cache
and fetch unit might become a bottleneck for a processor
with many thread slots. In such a case, another cache
and fetch unit would be needed.
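A minimal sketch of the buffer-size relation B = S × C (our own helper function; the example values are hypothetical):

/* Minimum instruction-queue buffer size per thread slot, in words.
   S thread slots sharing an instruction cache with a C-cycle access
   time need at least B = S * C words so that, with interleaved
   fetching, each buffer can be refilled about once every B cycles. */
static int min_buffer_words(int thread_slots, int icache_cycles)
{
    return thread_slots * icache_cycles;
}
/* e.g. 4 thread slots and a 2-cycle instruction cache give B = 8 words */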
`A decode unit gets an instruction from an instruc-
`tion queue unit and decodes it. The instruction set is
`based on a RISC type, and a load/store architecture is
`assumed. Branch instructions are executed within the
decode unit. The data dependencies of an instruction
are checked using the scoreboarding technique, and the
instruction is issued if it is free of dependencies. Other-
wise, issuing is interlocked.
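The following sketch (our own simplification, not the processor's actual logic) illustrates the kind of scoreboard test a decode unit could apply before issuing an instruction:

#include <stdbool.h>

#define NUM_REGS 32

/* One busy bit per register of the thread's bank.  A destination
   register's bit is set when its producing instruction is issued
   (stage S) and cleared in the last EX stage. */
typedef struct { bool busy[NUM_REGS]; } Scoreboard;

/* An instruction may issue only if neither source operand is still
   pending and its destination is not already being written. */
static bool can_issue(const Scoreboard *sb, int src1, int src2, int dst)
{
    return !sb->busy[src1] && !sb->busy[src2] && !sb->busy[dst];
}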
[Figure 2: Hardware organization with three thread slots. An instruction cache and an instruction fetch unit are shared by all thread slots. Each thread slot consists of a program counter, an instruction queue unit, and a decode unit, followed by standby stations and instruction schedule units. The shared functional units are an integer ALU, a barrel shifter, an integer multiplier, an FP adder, an FP multiplier, an FP converter, and two load/store units that access a data cache with banks 0 to n. The large register files are divided into register sets allocated to the executing, waiting, and ready threads, alongside the queue registers.]

Issued instructions are dynamically scheduled by in-
struction schedule units and delivered to functional
units. Unless an instruction conflicts with other in-
structions issued from other decode units over the same
functional unit, the instruction is sent immediately to
the functional unit and executed. Otherwise, the in-
`struction is arbitrated by the instruction schedule unit.
`When an instruction is not selected by an instruction
`schedule unit, it is stored in a standby station and re-
`mains there until it is selected. Standby stations can al-
`low decode units to proceed with their operations even
`if previously issued instructions cause resource conflicts.
`For example, while a shift instruction stays in a standby
`station, a succeeding add instruction from the same
`thread could be sent to the ALU. Consequently, instruc-
`tions from a thread are executed out of order, though
`they are issued from a decode unit in order. A standby
`station is a simple latch whose depth is one, and differs
`from a reservation station in Tomasulo’s algorithm[2]
`because issued instructions are guaranteed to be free of
`dependencies in our architecture.
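To illustrate how standby stations interact with the arbitration described above, the following sketch (our own model; the types and names are made up) arbitrates one functional unit among thread slots, parking a losing instruction in its one-deep standby station until it wins:

#include <stdbool.h>

#define SLOTS 4

typedef struct {
    bool valid;   /* a previously issued instruction is parked here */
    int  inst;    /* opaque instruction identifier (illustration only) */
} StandbyStation;

/* Arbitration for one functional unit in one cycle.  priority[0] is the
   index of the highest-priority thread slot.  issued[s] is true when
   slot s has newly issued an instruction (inst_of[s]) for this unit; we
   assume a slot does not issue a new instruction for a unit while one of
   its instructions is still parked for that unit.  Returns the
   instruction sent to the unit, or -1 if the unit stays idle. */
static int arbitrate(const int priority[SLOTS],
                     const bool issued[SLOTS], const int inst_of[SLOTS],
                     StandbyStation standby[SLOTS])
{
    int winner_slot = -1, winner_inst = -1;

    for (int p = 0; p < SLOTS && winner_slot < 0; p++) {
        int s = priority[p];
        if (standby[s].valid) {            /* a parked instruction retries */
            winner_slot = s;
            winner_inst = standby[s].inst;
            standby[s].valid = false;
        } else if (issued[s]) {            /* newly issued instruction wins */
            winner_slot = s;
            winner_inst = inst_of[s];
        }
    }

    /* newly issued instructions that lost wait in their standby stations */
    for (int s = 0; s < SLOTS; s++)
        if (issued[s] && s != winner_slot && !standby[s].valid) {
            standby[s].valid = true;
            standby[s].inst  = inst_of[s];
        }

    return winner_inst;
}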
A large general-purpose register file, as well as a
floating-point one, is divided into banks, each of which
`is used as a full register set private to a thread. Each
`bank has two read ports and one write port. More
`read ports are unnecessary because operand data can be
`stored in standby stations. In order to support concur-
`rent multithreading, the physical processor is provided
`with more register banks than thread slots. When a
`thread is executed, the bank allocated for the thread
`is logically bound to the logical processor. The exclu-
`
`sive one-to-one binding between a logical processor and
`a register bank guarantees that the logical processor
does not access any register banks other than the bank
bound to it. In contrast, queue registers are special
registers which enable communication between logical
processors at the register-transfer level.
`Some disadvantages with our machine model include
`the hardware cost of register files, multiple thread slots,
`and dynamic instruction scheduling facilities. The in-
`creased complexity in the network between functional
`units and register banks is also a disadvantage.
`
`2.1.2
`
`Instruction Pipeline
`
`Figure 3(a) shows the instruction pipeline of the logical
`processor, which is a modification of a superpipelined
`RISC processor as shown in Figure 3(b). In the logi-
cal processor pipeline, two instruction fetch (IF) stages
are followed by two decode (D) stages, followed by a
schedule (S) stage, followed by multiple execution (EX)
stages, followed by a write (W) stage.
`The IF stages are shown for convenience, and, in fact,
`an instruction is read from a buffer of an instruction
`queue unit in one cycle. In stage D1, the format or type
of the instruction is tested. In stage D2, the instruction
is checked to see whether it is issuable. Stage S is inserted
for the dynamic scheduling in the instruction schedule units.
`
`3
`
`SAMSUNG-1013
`Page 3 of 10
`
`
`
[Figure 3: Instruction pipeline. (a) Instruction pipeline of the multithreaded processor: successive instructions I1, I2, I3 each pass through stages IF1, IF2, D1, D2, S, EX1, EX2, W. (b) Instruction pipeline of the base RISC processor: successive instructions pass through stages IF1, IF2, D1, D2, EX1, EX2, W.]
`
`The number of EX stages is dependent on the kind of
`instruction. For example, in our simulation (section 3),
`we assumed each operation had latencies as listed in Ta-
`ble 1. The result latency means the number of cycles
`before the instruction finishes execution (i.e. the num-
`ber of EX stages), and the issue latency is the number
`of stages before another instruction of the same type
`may be issued. Since two cycles are assumed to be
`required to access the data cache as well as the instruc-
`tion cache, the issue latency of load/store instructions
`is two cycles. Other instructions can be accepted in ev-
`ery cycle. So, the issue latency of these instructions is
`one cycle. In stage W, the result value is written back
`to a register.
`Required operands are read from registers in stage
`S. A scoreboarding bit corresponding to a destination
register is flagged on in stage S and off in the last of
the EX stages. This avoids additional delay in the
`initiation of data-dependent instructions. That is, as-
`suming instruction I2 uses the result of instruction I1 as
`a source, at least three cycles are required between I1
`and I2 in Figure 3(a). The same cycles are also required
`in the base RISC pipeline in Figure 3(b).
`In the case of branch instructions, an instruction
`fetch request is sent to the instruction fetch unit at the
`end of stage D1, even if a conditional branch outcome
`is still unknown at that point. Assume I1 is a branch
`instruction and I3 is an instruction executed immedi-
`ately after I1.
`In Figure 3(b), the delay between I1
`and I3 is four cycles. But the delay in Figure 3(a) is
`five. What is worse, it could become more than five if
some threads encounter branches at the same time. It
is obvious that single-thread performance is damaged.
In our architecture, however, while some threads are
blocked because of branches, other types of instructions
from other threads are executed. Consequently, the
parallel multithreading scheme has the potential to hide
the delay of branches, as well as other arithmetic opera-
tion delays, and to enhance machine throughput.

Table 1: Functional unit and issue/result latencies of
instructions

functional unit      category of        latency (cycles)
                     instructions       issue    result
Integer ALU          add/subtract       1        2
                     logical            1        2
                     compare            1        2
Barrel Shifter       shift              1        2
Integer Multiplier   multiply           1        6
                     divide             1        6
FP Adder             add/subtract       1        4
                     compare            1        4
                     absolute/negate    1        2
                     move               1        2
FP Multiplier        multiply           1        8
                     divide             1        24
FP Converter         convert            1        4
Load/Store           load               2        4
                     store              2        (4)
`
`2.1.3 Support of Concurrent Multithreading
`
`Each pair of general purpose and floating-point register
`sets, together with an instruction address save register
`and a thread status register, is conceptually grouped
`into a single entity called a context frame, and allocated
`to a thread. An instruction address save register and
`a thread status register are used as save areas for the
`program counter and program status words (which hold
`various pieces of thread-specific state), respectively.
`Since the processor has more context frames than
`thread slots, context switching is achieved rapidly by
`changing the logical binding between a logical proces-
`sor and a context frame. As long as the number of
`threads to be scheduled does not exceed the number
`of physical context frames, threads can be resident on
`the physical processor, reducing the overhead to save
`or restore contexts to/from memory.
`Another important element of the context frame is
`the access requirement buffer, which contains outstand-
`ing memory access requirements. When a thread in
`running state issues load/store instructions, these in-
`structions are copied and recorded in the access require-
`ment buffer, and deleted when they are performed.
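A rough sketch of the state a context frame holds (our own C rendering; the field names and sizes are illustrative, not taken from the paper):

#include <stdint.h>

#define NUM_REGS    32
#define MAX_PENDING 8    /* capacity of the access requirement buffer (assumed) */

typedef struct {
    uint32_t gpr[NUM_REGS];       /* general-purpose register set        */
    double   fpr[NUM_REGS];       /* floating-point register set         */
    uint32_t inst_addr_save;      /* instruction address save register   */
    uint32_t thread_status;       /* thread status register (saved PSW)  */
    struct {                      /* access requirement buffer:          */
        uint32_t inst;            /*   outstanding load/store            */
        uint32_t addr;            /*   instructions, re-decoded and      */
        int      valid;           /*   re-executed when the thread       */
    } access_req[MAX_PENDING];    /*   is resumed                        */
} ContextFrame;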
`The processor is pipelined and standby stations en-
`able out-of-order execution of instructions. This makes
`
`4
`
`SAMSUNG-1013
`Page 4 of 10
`
`
`
`signed to the thread slot which had the highest priority
`before the rotation.
The instruction schedule unit works in one of two
modes: implicit-rotation mode and explicit-rotation
mode. The mode is switched to the alternative mode
through a privileged instruction. In the implicit-
rotation mode, priority rotation occurs at a given num-
ber of cycles (the rotation interval), as shown in Figure 4.
`On the other hand, in the explicit-rotation mode, the
`rotation of priority is controlled by software. The rota-
`tion is done when a change-priority instruction is exe-
`cuted on the logical processor with the highest priority.
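A minimal sketch of the rotation itself (our own model): the slot that held the highest priority receives the lowest priority, and every other slot moves up one level:

#define SLOTS 4

/* priority[0] is the highest-priority thread slot, priority[SLOTS-1]
   the lowest.  Rotation demotes the previously highest slot to the
   lowest priority; in implicit-rotation mode this happens once every
   rotation interval, in explicit-rotation mode when a change-priority
   instruction is executed by the highest-priority logical processor. */
static void rotate_priority(int priority[SLOTS])
{
    int was_highest = priority[0];
    for (int p = 0; p < SLOTS - 1; p++)
        priority[p] = priority[p + 1];
    priority[SLOTS - 1] = was_highest;
}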
There are two reasons why our architecture supports
the explicit-rotation mode. One is to help the compiler
schedule the code of threads executed in parallel when
this is possible. The other is to parallelize loops which
are difficult to parallelize using other architectures.
These features are discussed in sections 2.3.2 and 2.3.3,
respectively.
`
`2.3 Parallel Execution of a Single Loop
`
`2.3.1 Overview
`
`Parallel execution of a single loop is available on our
`multithreaded machine, by assigning iterations to log-
`ical processors. For example, in the case of a physical
processor with n thread slots, the kth logical processor
executes the (ni+k)th iterations (i = 0, 1, ...).
`The explicit-rotation mode presented in section 2.2
`is one of the facilities supporting parallel execution of
`a loop. In this mode, a context switch due to data ab-
`sence is suppressed and a physical processor is occupied
`entirely by the threads from a loop. In the usual case,
`a change-priority instruction is inserted at the end of
`each iteration by the compiler.
`A fast-fork instruction generates the same number
`of threads as logical processors by setting its own next
`instruction address to program counters of other thread
slots and activating them. It also sets unique logical
`processor identifiers to special registers corresponding
`to each logical processor. Each thread, therefore, can
`be informed which iterations it executes.
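In software terms, the effect is a cyclic assignment of iterations to logical processors; the sketch below (our own illustration, with a hypothetical logical_processor_id() standing for the special register set by fast-fork) shows which iterations the kth logical processor executes:

/* n logical processors; logical processor k executes iterations
   k, k+n, k+2n, ...  logical_processor_id() stands for reading the
   special register written by the fast-fork instruction (hypothetical). */
extern int logical_processor_id(void);

void loop_body(long iteration);

void run_my_iterations(long total_iterations, int n)
{
    int k = logical_processor_id();
    for (long i = k; i < total_iterations; i += n)
        loop_body(i);
}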
In the case of doall-type loops, it is unnecessary
for logical processors to communicate with one
another. But the execution of doacross-type loops re-
`quires communication between logical processors. One
`solution would be communication through memory.
`But in order to reduce the communication overhead,
`we provide the processor with queue registers. They
`are queue-structured registers connected to functional
`units.
`The connection topology among logical processors
`through queue registers is important when considering
`the trade-offs between hardware cost and its effect. One
`simple and effective connection topology is a ring net-
`
`5
`
[Figure 4: Dynamic instruction selection policy in the instruction schedule unit (A, B, and C denote thread slots). Issued instructions are selected according to the priority order over the thread slots, from higher to lower, and the priorities are rotated once every rotation interval.]
`
`the restart mechanism of programs somewhat complex.
Before a thread is switched out, the logical processor
must wait until all issued instructions except load/store
instructions have been performed. A load/store instruction
is not always performed with short latency, and this is the
main source of context switching in our architecture.
`Therefore, the load/store instructions in the execution
`pipeline could be flushed. But the restarting of threads
`is possible because such outstanding memory access re-
`quests are also saved as a piece of the context. When
`the thread is resumed, instructions in the access re-
`quirement buffer are decoded again and re-executed,
`being followed by the instruction indicated by the in-
`struction address save register. This mechanism also
`ensures imprecise but restartable interrupts on page
`fault. Mechanisms similar to this can be applied to
`floating-point exception handling.
`Context switching is triggered by a data absence trap.
`Detailed conditions concerning such traps must be con-
`sidered with care, because incorrect choices can create
`starvation and thrashing problems. We will report our
`solution including cache/memory architecture in an-
`other paper in the future.
`
`2.2 Instruction Scheduling Strategy
`
`In this section, we present the dynamic instruction se-
lection policy in the instruction schedule unit. A simple
selection policy is preferable, so as not to lengthen the
instruction pipeline pitch.
`Figure 4 illustrates an example of the policy with
`multi-level priorities. A unique priority is assigned to
`each thread slot. Unless an issued instruction conflicts
`with other instructions from other thread slots to which
`higher priorities are assigned, it is selected by the in-
`struction schedule unit.
`In order to avoid starvation,
`the priorities are rotated. The lowest priority is as-
`
`
`
`
`compiler employs more judicious code scheduling tech-
`niques.
`For example, software pipelining[5, 6], which is an
`effective code scheduling technique for VLIW architec-
`tures, employs a resource reservation table to avoid re-
`source conflicts. Such a technique is also applicable to
`our architecture. But it might postpone the issue of
`instructions which would cause a resource conflict. So,
`straight use of the algorithm could miss opportunities
`to issue instructions. Consequently, we have developed
`a new algorithm which makes the most of the function
`of standby stations and instruction schedule units.
`Our algorithm employs not only a resource reserva-
`tion table but also a standby table whose entry corre-
`sponds to a standby station. Our algorithm differs from
`software pipelining in the point described below. When
`all of the instructions without dependencies at an issu-
`ing cycle have resource conflicts, a software pipeliner
`would generate a NOP code. Our code scheduler, how-
`ever, checks entries of the standby table for those in-
`structions. If an entry is not marked, the instruction
`is determined to be issued and the entry is marked.
`This means the instruction is stored in a standby sta-
`tion. The explicit-rotation mode enables the compiler
`to know which instruction is selected. The resource
reservation table is used not only to determine whether
there is a resource conflict, but also to tell the compiler
when the instruction in the standby station is executed.
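The following sketch (our own simplified rendering of one scheduling step; the table representations, sizes, and single-cycle reservations are assumptions) shows how a resource conflict is resolved with the standby table instead of a NOP:

#include <stdbool.h>

#define MAX_CYCLES 256
#define UNITS        8
#define SLOTS        4

static bool reserved[MAX_CYCLES][UNITS];  /* resource reservation table          */
static bool standby_used[SLOTS];          /* standby table: one entry per station */

/* Decide whether a dependency-free instruction of thread slot 'slot',
   needing functional unit 'unit', can be issued at 'cycle'.  A resource
   conflict does not force a NOP: if the slot's standby-table entry is
   free, the instruction is issued anyway, parked in the standby station,
   and the reservation table records when it will actually execute. */
static bool try_issue(int cycle, int unit, int slot)
{
    if (!reserved[cycle][unit]) {         /* no conflict: executes at once   */
        reserved[cycle][unit] = true;
        return true;
    }
    if (!standby_used[slot]) {            /* conflict, but the station is free */
        int c = cycle + 1;
        while (c < MAX_CYCLES && reserved[c][unit])
            c++;                          /* first cycle the unit is free    */
        if (c < MAX_CYCLES)
            reserved[c][unit] = true;     /* tells the compiler when the     */
        standby_used[slot] = true;        /* parked instruction executes;    */
        return true;                      /* the entry is cleared again by   */
    }                                     /* the caller once cycle c arrives */
    return false;                         /* would need a NOP this cycle     */
}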
`
2.3.3 Eager Execution of Sequential Loop Iterations
`
`Using an example, this section shows that our architec-
`ture can parallelize loops which are difficult to paral-
`lelize using vector or VLIW architectures.
`The source code fragment of a sample program is
`shown in Figure 6. This program traverses a linear
`linked list. Although it is a simple example, it demon-
`strates the application of our execution scheme using a
`loop structure that is fundamental to many application
`programs.
`
ptr = header;
while ( ptr != NULL ) {
    tmp = a * ((ptr->point)->x)
          + b * ((ptr->point)->y) + c;
    if ( tmp < 0 )
        break;
    ptr = ptr->next;
}
`
`Figure 6: A sample program (written in C)
`
`Each iteration is executed on a logical processor, and
`the value of ptr is passed through queue registers. The
`
`6
`
[Figure 5: Logical view of queue registers. The logical processors are connected in a ring: each logical processor writes into one queue register and reads from the queue register written by the preceding logical processor, and the queue registers appear alongside each register set.]
`
`work, as shown in Figure 5. Many doacross type of
`loops encountered in practice have one iteration differ-
`ence. When a loop has more than one iteration differ-
`ence, data should be relayed through several logical pro-
`cessors. This could create overhead. Some techniques,
`however, are available to convert loops so that the it-
`eration difference is reduced to one. Loop unrolling is
`one such technique[3].
Queue registers are enabled and disabled by special
instructions. When enabled, two queue registers, for
reading from the previous and writing to the next
logical processor respectively, are mapped into two
general-purpose or floating-point registers. The refer-
`ence to such registers is interpreted as the reference
`to queue registers, reducing data move overheads be-
`tween queue registers and general/floating-point regis-
`ters. Full/empty bits attached to queue registers are
`available as scoreboarding bits.
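A small model of a queue register with a full/empty bit (our own depth-one simplification of the queue-structured registers; in hardware a failed access simply interlocks the instruction rather than returning a status):

#include <stdbool.h>
#include <stdint.h>

typedef struct {
    uint32_t value;
    bool     full;   /* full/empty bit, usable like a scoreboarding bit */
} QueueRegister;

/* A write from the producing logical processor succeeds only when the
   register is empty; a read by the consumer succeeds only when it is
   full.  A failure here corresponds to the instruction being
   interlocked until the bit changes. */
static bool qr_write(QueueRegister *q, uint32_t v)
{
    if (q->full) return false;
    q->value = v;
    q->full = true;
    return true;
}

static bool qr_read(QueueRegister *q, uint32_t *v)
{
    if (!q->full) return false;
    *v = q->value;
    q->full = false;
    return true;
}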
`
`2.3.2 Static Code Scheduling
`
`As mentioned in section 1, in the case of computer
`graphics programs, it is often difficult to predict the
`control sequence of a thread before execution, be-
`cause the sequence is determined dynamically by data-
`dependent branches. In such cases, in order to obtain
`high throughput, the compiler could employ no other
`techniques than a simple list scheduling algorithm[4].
`The compiler reorders the code without consideration
`of other threads, and concentrates on shortening the
`processing time for each thread. Short processing time
`of a single thread means a high issuing rate of instruc-
`tions because the number of instructions is constant
`whether scheduled or not. Though this scheduling ap-
`proach does not have control over resource conflicts, it
`aims at flooding functional units with sufficient instruc-
`tions by parallel execution of such scheduled code.
`On the other hand, in the case of numerical com-
`putation programs, it is often possible to predict the
`execution-time behavior of a loop. Therefore, the
`
`
`
`
`3 Estimation
`
`3.1 Simulation Model and Assumptions
`
`In the following four sections, we present simulation re-
`sults of our parallel multithreading architecture. Since
`cache simulation has not yet been implemented in our
simulator, we assumed that all attempts to access the
caches were hits.
`We used two kinds of functional unit configurations
`(see Figure 2). One consists of seven heterogeneous
`units. The other consists of those units with another
`load/store unit. In practice, a data cache might consist
`of several banks in order to be accessed simultaneously
`by two load/store units. But our simulation assumed
`there was no bank conflict. Latencies of each instruc-
tion are listed in Table 1. None of the machines simulated
in this paper is equipped with delayed branch mecha-
nisms, branch prediction facilities, or overlapped regis-
ter windows.
`In order to estimate our architecture, we use the
speed-up ratio as a criterion. It is defined as the ratio
of the total cycles required by sequential execution to
those required by multithreaded execution. Sequential
execution means executing the sequential version of the
program on a RISC-based processor whose instruction
pipeline is shown in Figure 3(b).
`
3.2 Speed-up by Parallel Multithreading
`
`This section presents simulation results of our multi-
`threaded machine. As an application program, we have
`chosen a ray-tracing program, which can be easily par-
`allelized at every pixel. The program written in C was
`compiled by a commercial RISC compiler with an op-
`timization option. The object code was executed on
`a workstation, and traced instruction sequences were
`translated to be used for our simulator.
`Table 2 lists speed-up ratios when the rotation in-
`terval in the instruction schedule unit is eight cycles.
`In the case of a processor with one load/store unit,
a 1.83 times speed-up is gained by parallel execution
of two instruction streams, even though not all of the
hardware of a single-threaded processor is duplicated in
the processor. A performance improvement of a factor
`of 2.89 is still possible with four thread slots, though
`it is less effective (2.89/1.83=1.58) than the improve-
`ment attained by increasing from one to two thread
`slots (1.83). When eight thread slots are provided, the
`utilization of the busiest functional unit, the load/store
`unit, becomes 99%. This is the reason why the speed-
`up ratio is saturated at only 3.22 times. Addition of
`another load/store unit improves speed-up ratios by
`10.4∼79.8%.
`
[Figure 7: Parallel execution of the while loop. Successive iterations, receiving the values ptr, ptr', ptr'', ptr''', ptr'''', ptr''''', start on different logical processors in an overlapped fashion along the time axis; the thread that exits the loop kills the other threads.]
`
`compiler generates multiple versions for a variable ptr
`so that iterations can be initiated without having to
`wait for preceding iterations to complete. Such opti-
`mization with respect to a variable is a variation of the
`use of variables in dataflow machines. Figure 7 illus-
`trates the parallel execution scheme. Each thread re-
`ceives the value of ptr from the thread executed in the
`immediately preceding iteration, and passes the next
`value to the thread executed in the immediately suc-
`ceeding iteration. After that, the thread continues to
`execute the rest of the loop body.
`A thread can initiate an iteration which might not
`have been executed in the sequential execution. So, we
`call this scheme eager execution.
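For illustration, the per-thread code for the loop of Figure 6 might look roughly as follows (our own sketch; receive_ptr, send_ptr, and kill_other_threads are hypothetical stand-ins for queue-register accesses and the special instructions described below):

struct point { double x, y; };
struct node  { struct point *point; struct node *next; };

extern struct node *receive_ptr(void);         /* read the incoming queue register     */
extern void         send_ptr(struct node *p);  /* write the outgoing queue register    */
extern void         kill_other_threads(void);  /* special instruction; takes effect
                                                   only with the highest priority      */

/* Loop body executed by each logical processor after a fast-fork. */
void eager_loop(double a, double b, double c)
{
    for (;;) {
        struct node *ptr = receive_ptr();       /* ptr for this iteration          */
        if (ptr == NULL)                        /* loop already terminated         */
            break;
        send_ptr(ptr->next);                    /* let the next logical processor
                                                   start its iteration eagerly     */
        double tmp = a * ptr->point->x + b * ptr->point->y + c;
        if (tmp < 0)                            /* this thread takes the exit      */
            break;
    }
    kill_other_threads();
}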
`The instruction selection policy in instruction sched-
`ule units plays an important part in preserving the se-
`mantics of the sequential program. The execution of an
`iteration is not acknowledged until the thread gets the
highest priority. It is the thread with the highest pri-
ority that executes the earliest iteration remaining at
that point. By using multi-level priorities, the instruction
`selection policy guarantees the execution of succeeding
`iterations does not hinder the execution of preceding
`iterations.
`When a thread has exited from the loop, it tries to
`stop the operation of other thread slots and to kill other
`running threads. Such a special instruction is valid only
`for the thread with the highest priority. When a thread
`without the highest priority tries to execute this in-
`struction, it is put into an interlocked state until it is
`given the highest priority.
`If the variable tmp is global and alive after the loop,
`the compiler should use the special store instruction
`which is performed only by the thread with the highest
`priority. This guarantees that writes to the memory
`location of tmp are performed in order with respect to
`the source code. Some instructions are also provided to
`ensure consistency between contexts of threads before
`and after the execution of the loop.
`
`7
`
`SAMSUNG-1013
`Page 7 of 10
`
`
`
Table 2: Speed-up ratio by parallel multithreading

no. of       with one load/store unit     with two load/store units
thread       without       with           without       with
slots        standby       standby        standby       standby
             station       station        station       station
2            1.79          1.83           2.01          2.02
4            2.84          2.89           3.68          3.72
8            3.22          3.22           5.68          5.79
`
Table 3: Tradeoff between speed-up and employed parallelism (speed-up ratio)

total no. of issued       no. of issued instructions
instructions (D×S)        from each thread slot (D)
                          1        2        4        8
2                         2.02     1.31     —        —
4                         3.72     2.43     1.52     —
8                         5.79     4.37     2.79     1.66
`
`Standby stations improve the speed-up ratio by
`0∼2.2%. This low improvement is due to poor paral-
`lelism within an instruction stream. In the case of ap-
plication programs whose threads are rich in fine-grained
parallelism, greater improvements could be achieved.
`We also simulated a machine whose thread slots are
`provided with private instruction caches and instruc-
tion fetch units. Such an organization provides only a
slight speed-up: the achieved speed-up ratios are the same
as those in Table 2, except that 1.79 and 5.79 are replaced
by 1.80 and 5.80, respectively. This shows that the
`delay due to instruction fetch conflicts is hidden and
`sharing an instruction cache between thread slots is
`possible.
`We also examined the execution cycles with various
rotation intervals (2^n cycles, where n is 0∼8). But, in
`this application program, rotation interval did not have
`much influence on the performance, though using eight
`or sixteen cycles seems slightly superior.
`
3.3 Multithreading with Superscalar Design
`
`In our simulator, each thread slot can support multi-
`ple instruction issuing from a single instruction stream
`similar to superscalar architectures. In this section, us-
`ing the same program as in section 3.2, we discuss hy-
`brid architectures which employ both coarse- and fine-
`grained parallelism.
`A hybrid processor is represented here by (D,S)-
`processor, where D is the maximum number of instruc-
`tions which a thread slot issues each cycle, and S is
`the number of thread slots. For a fair comparison, the
hardware cost of a (d,s)-processor is almost the same as
that of a (1,d×s)-processor. For example, the instruction
fetch bandwidth is d×s words per cycle in both proces-
sors. Although a (d×2,s/2)-processor is provided with
half the register sets of a (d,s)-processor, it requires dou-
ble the number of register ports.
`dependencies of instructions in the instruction window
`of size D, and issues instructions without dependencies
`every cycle. The instruction window is filled every cy-
`
`8
`
`cle. Some techniques[7, 8] are proposed to boost the
`superscalar performance, but they are not provided for
`the thread slot because