`
`399
`
`Multithreading with Distributed Functional Units
`
`Bernard K. Gunther, Member, IEEE
`
`Abstract—Multithreaded processors multiplex the execution of a number of concurrent threads onto the hardware in order to hide
`latencies associated with memory access, synchronization, and arithmetic operations. Conventional multithreading aims to
`maximize throughput in a single instruction pipeline whose execution stages are served by a collection of centralized functional
`units. This paper examines a multithreaded microarchitecture where the heterogeneous functional unit set is expanded so that units
`may be distributed and partly shared across several instruction pipelines operating simultaneously, thereby allowing greater
`exploitation of interthread parallelism in improving utilization factors of costly resources. The multiple pipeline approach is studied
`specifically in the Concurro processor architecture—a machine supporting multiple thread contexts and capable of context switching
`asynchronously in response to dynamic data and resource availability.
`Detailed simulations of Concurro processors indicate that instruction throughputs for programs accessing main memory directly
`can be scaled, without recompilation, from one to over eight instructions per cycle simply by varying the number of pipelines and
`functional units. In comparison with an equivalent coherent-cache, single-chip multiprocessor, Concurro offers marginally better
`performance at less than half of the estimated implementation cost. With suitable prefetching, multiple instruction caches can be
`avoided, and multithreading is shown to obviate the need for sophisticated instruction dispatch mechanisms on parallel workloads.
`Distribution of functional units results in a 150% improvement over the centralized approach in utilization factors of arithmetic units,
`and enables saturation of the most critical processor resources.
`
`Index Terms—Distributed functional units, hardware utilization, latency tolerance, multiple context processors, multithreading,
`pipelined computers, pre-access instruction cache, simulation, synchronization.
`
`—————————— F ——————————
`
1 INTRODUCTION
The principal operating objective for a multithreaded
`processor is to have a sufficient number of parallel tasks
`multiplexed onto the hardware so as to eliminate or mini-
`mize machine idle time in the presence of long-latency op-
`erations. Main memory accesses, remote reads, or synchroni-
`zation operations take, typically, orders of magnitude longer
to complete than simple instructions; thus, useful gains in
`processor utilization can be achieved by overlapping long-
`latency operations with the execution of unrelated tasks. A
`multiple context processor incorporates hardware support
`for enhancing the efficiency of multithreading, with the em-
`phasis being on making fine grained and dynamic mul-
`tithreading viable. The processing states or contexts of sev-
`eral concurrently executing threads are maintained by the
`processor, and context switching is handled with some de-
`gree of automation to reduce overheads.
`Historically, multiple context processors such as the HEP
`[17], MASA [13], PRISC-1 [24], Monsoon [25], and Sparcle
`[1] employed multithreading as a means of sustaining ac-
`tivity in a single instruction pipeline. The execution stages
`of the pipeline in each of these machines are served by a
`pipelined arithmetic/logic and load/store unit complex, in
a microarchitecture we will refer to as one of centralized
`functional units. This paper examines the more general
`microarchitecture having distributed functional units in sup-
port of multiple instruction pipelines, that is, each func-
`tional unit is dedicated to not only a particular operation,
`but also to a subset of the processor's pipelines.
`An extreme case of functional unit distribution occurs in
`a single-chip multiprocessor, where each processing ele-
`ment (PE) on the chip is equipped with both private cache
`memory and dedicated functional units. Despite the con-
`ceptual simplicity of this organization, general-purpose
`multiprocessors of useful scale remain difficult to imple-
`ment on a single chip [23], and therefore suffer from rela-
`tively poor cost/performance ratios when die yield is taken
`into account. Adding multithreading capability for latency
`tolerance only exacerbates the implementation difficulties.
`The inefficiencies of the multiprocessor approach stem from
`the almost complete absence of resource sharing, particu-
`larly with regard to the caches and floating point arithmetic
`units. However, if any other architecture is to be a realistic
`alternative, it must attain the performance potential of a
`multiprocessor.
Variations on superscalar [26], [27], VLIW (very long in-
`struction word) [18], [29], and MIMD (multiple instruc-
`tion/multiple data) [15] techniques used in conjunction
`with multithreading or multistreaming have been proposed
`to take advantage of the parallelism inherent in distributed
`functional unit organizations. The VLIW approach is typi-
`fied by Keckler and Dally’s processor coupling mechanism
`[18], adopted in the M-machine. Several statically sched-
`uled VLIW threads are multiplexed at clock cycle granular-
`ity onto the functional units, with the independence of
`VLIW operations being exploited to achieve multiple re-
`sults in each cycle. Even though programs must be com-
`piled to suit a particular hardware configuration, some of the
scheduling burden is eliminated by allowing the execution of
`thread operations to proceed with limited asynchronicity.
`While the VLIW approach captures instruction-level paral-
`lelism, it does so at the cost of substantial hardware con-
`texts and attendant resource duplication, including multi-
`ple caches, which, ultimately, limit scalability and latency
`tolerance. Hirata et al. proposed a MIMD mode of opera-
`tion for a multithreaded architecture [15]. A number of
`thread slots (instruction queue and sequencer units) can
`dispatch instructions simultaneously to a shared, heteroge-
`neous functional unit set. A central register file is parti-
`tioned into context banks, a subset of which is dynamically
`assigned to the thread slots. Since each thread slot is as-
`signed to a single bank, and context switches occur on
`cache misses only, the more usual functional unit latencies
`remain exposed. The combination of coarse multithreading
`and complex register file access resulted in instruction
`throughputs remaining 40% below peak rates for even
`highly parallel workloads and infinite caches.
`The objectives of this paper are to demonstrate solutions
`to the hardware utilization, latency tolerance, and pro-
`grammability problems that have hampered previous de-
`signs, and to characterize Concurro, a multithreaded ar-
`chitecture possessing distributed functional units [10]. The
`primary design goal for Concurro is achieving a scalable
`microprocessor having the performance characteristics of a
`multiprocessor without sacrificing the utilization and pro-
`gramming advantages of conventional multiple context
`architectures. Multithreading and cost-sensitive resource
`sharing in a distributed functional unit architecture are the
`keys to meeting this goal. Although the distinction between
`uni- and multiprocessing is blurred in Concurro, this paper
`considers the machine in the uniprocessor sense only, and
`as such, it may be regarded as a PE in a larger multiproces-
`sor system.
`In the following section, we will examine the processor
`design space, arriving at the architecture and organization
`of Concurro processors. Section 3 establishes the simulation
`study used in subsequent sections to evaluate behavior and
`design trade-offs. Scalability and organization alternatives
`are assessed in Section 4. Section 5 examines the issues re-
`lating to instruction handling, while Section 6 is concerned
`with cost and performance aspects of the datapath. Sec-
`tion 7 draws together the main results of the study.
`
`2 THE CONCURRO ARCHITECTURE
`Concurro is a multiple context, von Neumann processor
`architecture. A Concurro processor is notionally organized
`as a shared memory multiprocessor, but with the tradi-
`tional PE boundaries dissolved. Any instruction sequence,
`with few exceptions, constitutes a valid thread. This general
`MIMD programming model allows the processor to be op-
`erated effectively in either single threaded or multithreaded
`conditions.
`
`2.1 Instruction Set and Organization
`Concurro adopts a 64-bit, register-oriented, load/store in-
`struction set architecture (ISA) with additional support for
`thread control and fine-grained synchronization. Apart
`from its benefits to optimizing compilers, the register-
`
`oriented ISA greatly reduces the implementation complex-
`ity of multiple context processors, and facilitates the ex-
`ploitation of concurrency at all levels. Each processor con-
`text consists of a program counter and 16 64-bit local, gen-
`eral-purpose registers, with all contexts having access to a
`set of 16 64-bit global registers, one of which is hard-wired
`to zero. The number of local registers is constrained pri-
`marily by instruction encoding limits, but also by imple-
`mentation cost and performance considerations. Sufficient
`bandwidth to the global registers can be achieved without
`excessive multiporting by maintaining a number of consis-
`tent copies of the global register file, each copy serving a
`subset of the contexts. The instruction set includes arithme-
`tic and logical operations on a variety of data types, ranging
`from bytes (8 bits) to double-precision floating point num-
`bers or long words (64 bits). Data may be accessed via an
`on-chip data cache or directly from the pipelined main
`memory, to suit the locality characteristics of particular
`workloads.
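As a rough illustration only (the struct layout and names below are not part of the architecture definition), the per-context state described above can be sketched in C as follows:

    /* Illustrative sketch of one Concurro context: a program counter plus
     * 16 local 64-bit registers, with 16 shared 64-bit globals of which
     * one register is hard-wired to zero.  Names are hypothetical. */
    #include <stdint.h>

    #define NUM_LOCALS  16
    #define NUM_GLOBALS 16

    typedef struct {
        uint64_t pc;                     /* program counter                   */
        uint64_t local[NUM_LOCALS];      /* private general-purpose registers */
        int      active;                 /* context currently holds a thread  */
    } context_t;

    /* One consistent copy of the global register file would serve a subset
     * of contexts; a single copy suffices for this sketch. */
    uint64_t global_reg[NUM_GLOBALS];

    uint64_t read_global(int r)
    {
        return (r == 0) ? 0 : global_reg[r];   /* global 0 reads as zero */
    }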
`The architecture allows any particular implementation to
`contain a fixed number of processor contexts, each of which
`can be considered a virtual processor, but programmers are
`discouraged from assuming either scheduling policy or
`context numbers in programs. Despite the programming
`difficulties associated with fixed context multithreading,
`such implementations are preferred in the interests of pro-
`moting effective use of the storage hierarchy—especially at
`the register level [5]. At all times, the processor is executing
`the root thread, which gives rise to other threads by activat-
`ing dormant contexts in fork operations. Any thread is per-
`mitted to fork threads of its own, provided that the supply
of contexts is not exhausted; otherwise, the fork operation
`blocks until a fresh context becomes available. Such block-
`ing, given a finite number of contexts, can potentially cause
`deadlock when fork dependencies arise, thus realistic proc-
`essor implementations would be expected to raise either a
`scheduling or time-out exception to allow software inter-
`vention in this case. Threads other than the root thread are
`permitted to terminate themselves explicitly, rather than
`terminating in join synchronizations, so as to minimize the
`frequency of context destruction and recreation. Inter-
`thread communication and synchronization is effected by
`special register and memory operations.
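The fork semantics above can be pictured with a small behavioral sketch over a hypothetical context table; a real implementation would raise a scheduling or time-out exception rather than let a blocked fork deadlock:

    /* Hedged sketch of the fork operation: activate a dormant context if one
     * exists, otherwise report that the fork must block.  The names and the
     * fixed context count are assumptions for the example. */
    #include <stdint.h>

    #define MAX_CONTEXTS 32

    struct context { uint64_t pc; int active; };
    static struct context ctx[MAX_CONTEXTS];

    /* Returns the activated context number, or -1 if no context is free and
     * the forking thread must block (or trap to software). */
    int try_fork(uint64_t start_address)
    {
        for (int i = 0; i < MAX_CONTEXTS; i++) {
            if (!ctx[i].active) {
                ctx[i].pc = start_address;  /* only the PC is defined initially */
                ctx[i].active = 1;
                return i;
            }
        }
        return -1;
    }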
`A block diagram of a Concurro processor is shown in
`Fig. 1. The processor contexts, labeled “C,” are clustered
`into small groups in order to localize and thus speed up
`resource contention during instruction dispatch. Each con-
`text group can be viewed as the core of a multithreaded PE,
`with some resources being shared among contexts in the
`group. Each context group independently dispatches in-
`structions to virtual functional units connected to the group's
`operand bus. In the case of complex or infrequently used
`functions, the virtual functional units are, in fact, interfaces
`to physical functional units, which are shared by a subset of
`the context groups as indicated in Fig. 2. Operand queues—
`or simple latches—on the virtual functional unit inputs de-
`couple instruction dispatch, by context groups, from in-
`struction issue, by the fewer physical functional units.
`Physical functional units are generally pipelined, capable of
accepting a new pair of 64-bit operands and an operation
`code in every cycle. Routing of results is facilitated by tag-
`ging each instruction with the context identifier and register
`number of its destination. Results are delivered to their
`destination register files via a group result multiplexor and
`result bus at the maximum rate of one long word per cycle
`per group.
`The instruction cache subsystem consists of an on-chip
`primary cache backed by an external secondary cache of
`substantially larger capacity. Concurro’s data access in-
`structions are supported by load and store buffers con-
`necting main memory, an on-chip data cache, and a syn-
`
`Fig. 1. Concurro processor block diagram.
`
`Fig. 2. Virtual functional unit interface to physical functional units.
`
`chronization controller. In order to isolate the processor's
`performance from the implementation of main memory, the
`simulation study in this paper assumes the main memory
`to be perfectly pipelined and capable of supplying one long
`word per cycle per memory port. Such memory character-
`istics can be approached by careful interleaving and access
`buffering [28], but the provision of adequate memory
`bandwidth will remain difficult in high performance ma-
`chines of all kinds.
`
`2.2 Processor Contexts
`Context units maintain the processor contexts and control the
`sequencing of threads. Each unit consists of a triple-ported
`register file, program counter, instruction buffer and prede-
`coder, and control logic. Following a fork instruction, the
`processor’s thread control block will place a thread starting
`address on a control bus, and enable a previously inactive
`context; initially, only a context’s program counter is defined.
`Load balancing across context groups is achieved by regis-
`tering the number of active contexts in each group, and ena-
`bling any new context in the group containing the fewest
`active members. Once enabled, a context will continue to
`execute instructions until the running thread terminates it-
`self, or until the operating system terminates the thread pre-
`emptively. Aside from the placement of synchronization
`points, software can exercise no control over the scheduling
`of threads following their creation.
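A minimal sketch of this load-balancing rule, with assumed group and context counts (the array and names are illustrative, not Concurro's control logic):

    /* Pick the context group with the fewest active members for a new thread. */
    #define NUM_GROUPS         8
    #define CONTEXTS_PER_GROUP 4

    static int active_count[NUM_GROUPS];  /* registered active contexts per group */

    int choose_group(void)
    {
        int best = -1;
        for (int g = 0; g < NUM_GROUPS; g++) {
            if (active_count[g] < CONTEXTS_PER_GROUP &&
                (best < 0 || active_count[g] < active_count[best]))
                best = g;
        }
        return best;    /* -1 when every group is full: the fork blocks */
    }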
`Since a large number of context units can be expected in
`a high performance machine, an inexpensive instruction
`scheduling mechanism is normally employed, although
`Section 5 considers more sophisticated alternatives. In-
`struction dependencies are enforced through a Cray-style
`register scoreboard [16]. The instruction at the head of the
`instruction buffer is predecoded to determine resource re-
`quirements, and if the instruction is free of data and struc-
`tural hazards (at the virtual functional unit level), the con-
`text unit contends for instruction dispatch in the current
`cycle. Thus, threads can switch between blocked and run-
`ning conditions on a cycle-by-cycle basis as instruction de-
`pendencies allow. A fair arbiter schedules asynchronously
`one of the ready instructions in the group for dispatch.
`Since blocked or absent threads do not contend for dis-
`patch, the processor is able to maximize instruction
`throughput under conditions of low program parallelism.
`This capability is further enhanced by the provision of a
`bypass around the register write back stage.
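The dispatch rule can be summarized behaviorally: a context is ready when its head instruction is free of scoreboard (data) hazards and its virtual functional unit can accept operands, and a round-robin (fair) arbiter grants one ready context in the group per cycle. The structures below are placeholders, not the actual control logic:

    /* Behavioral sketch of per-group instruction dispatch (Section 2.2). */
    #include <stdint.h>

    #define GROUP_SIZE 4

    typedef struct {
        uint32_t scoreboard;  /* bit r set => register r has a result pending   */
        uint32_t src_mask;    /* registers read or written by the head instruction */
        int      vfu_free;    /* required virtual functional unit can accept it */
        int      has_instr;   /* instruction buffer non-empty                   */
    } ctx_state_t;

    static int ready(const ctx_state_t *c)
    {
        return c->has_instr && c->vfu_free && (c->scoreboard & c->src_mask) == 0;
    }

    /* Round-robin arbitration among the contexts of one group. */
    int arbitrate(const ctx_state_t group[GROUP_SIZE], int last_grant)
    {
        for (int i = 1; i <= GROUP_SIZE; i++) {
            int cand = (last_grant + i) % GROUP_SIZE;
            if (ready(&group[cand]))
                return cand;  /* dispatch this context's instruction */
        }
        return -1;            /* no ready instruction this cycle     */
    }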
`A schematic of the processor pipeline appears in Fig. 3, for
`the case of a two-cycle execution latency. Owing to buffering
`in the contexts, the instruction cache access stage is normally
`not entered in every cycle; subsequent pipestages, though,
`are entered by every executed instruction. The number of
`execution stages is variable, depending on the required
`functional unit and its activity. However, all functional
`units contend and announce result write back one-half cy-
`cle prior to completing an operation in order to give context
`units the opportunity for bypassing the register write back
`stage in the following cycle.

Fig. 3. Instruction pipeline.

As each context unit (or context group, at least) may
contain a branch unit, it is desirable to minimize the cost of
these units. Accordingly, Concurro’s branch and fork targets
are specified in semi-absolute form: a low-order segment of
the absolute address coupled with a relative direction flag
to provide position-independence [8]. Semi-absolute branch
units offer speed and area advantages, since they can be
constructed without address adders. A static branch pre-
diction scheme—anticipating that only backward branches
will be taken—minimizes average branch latencies in the
absence of branch delay slots.
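As a purely illustrative reading of the semi-absolute format (the actual field widths and wrap rule are defined in [8], not here), a 16-bit low-order segment plus a direction flag could form a target as sketched below; only an increment or decrement of the high-order bits is ever needed, rather than a full-width address addition:

    /* One possible, assumed interpretation of a semi-absolute branch target. */
    #include <stdint.h>

    uint64_t semi_absolute_target(uint64_t pc, uint16_t segment, int backward)
    {
        uint64_t high = pc & ~(uint64_t)0xFFFF;   /* reuse high-order bits of PC */
        uint16_t low  = (uint16_t)pc;

        if (backward && segment > low)            /* target behind, but segment  */
            high -= 0x10000;                      /* numerically larger: wrap    */
        else if (!backward && segment < low)
            high += 0x10000;

        return high | segment;
    }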
`
`2.3 Cache Subsystems
`Caches remain useful even in a multithreaded processor,
`particularly where high reference locality can be relied
`upon or program parallelism is insufficient for effective
`latency hiding [21]. To maximize cache throughputs in the
`presence of multiple threads, however, both the instruction
`and data caches are nonblocking [19], allowing several
`missed accesses to be queued and serviced while hit ac-
`cesses take place.
`Satisfying the instruction bandwidth requirements poses
`a challenge in terms of containing implementation cost.
`This paper compares two extreme approaches—a single
`primary instruction cache and a private I-cache per context
`group. The multiple I-cache option, although the most
costly, is straightforward in design and guarantees the de-
`livery of a fresh instruction for every one dispatched by a
`group. The independence of fetches across different threads
`complicates the single I-cache option. To avoid multiport-
`ing the cache (and replicating controller logic), blocks of
`instructions can be prefetched in parallel from a single I-
`cache, onto which prefetch requests from depleted instruc-
`tion buffers are multiplexed. Adequate bandwidth is
`achieved by buffering as many prefetched instructions as
`can be dispatched by the entire machine in one cycle, plus
`an additional 25% to compensate for loss on jumps.
`A major drawback of parallel prefetching, however, is
`the possibility of degraded cache returns caused by una-
`ligned accesses, as illustrated in Fig. 4. When the block of
`four instruction words, W3-W6, is requested from a normal
`cache, the appearance of the cache line boundary truncates
`the returned block to the single word W3, even though the
`remainder of the block is likely to be present in the cache.
`The persistence of inadequate cache returns would ulti-
`mately starve the context units and lower performance.
`Forcing address alignment by padding basic blocks with
no-op instructions, or dual-porting the cache to access line
sets n and n + 1 simultaneously can be avoided simply by
enabling line-crossing accesses to a Random Pre-access
Memory (RPM) instruction store [9]. A RPM accesses, with
wrap-around, a contiguous block of words starting at any
address, therefore only words W3-W6 are accessed and re-
turned from a RPM-based cache. The cache controller at-
tempts to allocate adjacent cache lines to contiguous mem-
ory addresses, and marks tags of running lines to indicate
whether words accessed from the next set belong to the
requested prefetch block.

Fig. 4. Using line crossing to improve parallel prefetching performance
with unaligned addresses.
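The effect of line crossing on the prefetch example of Fig. 4 can be illustrated with a small sketch; the line and block sizes, and the assumption that the adjacent line holds the next memory addresses, are for the example only:

    /* Contrast between a conventional cache (block truncated at the line
     * boundary) and an RPM-based cache (block wraps into the next line). */
    #include <stdint.h>
    #include <string.h>

    #define LINE_WORDS  4   /* instruction words per cache line */
    #define BLOCK_WORDS 4   /* words requested per prefetch     */

    /* Conventional cache: returns only the words up to the line boundary. */
    int fetch_normal(const uint32_t line[LINE_WORDS], int offset,
                     uint32_t out[BLOCK_WORDS])
    {
        int n = LINE_WORDS - offset;              /* only W3 when offset == 3 */
        if (n > BLOCK_WORDS) n = BLOCK_WORDS;
        memcpy(out, &line[offset], (size_t)n * sizeof(uint32_t));
        return n;                                 /* words actually returned  */
    }

    /* RPM-based cache: a contiguous block starting at any word is returned,
     * crossing into the adjacent line. */
    int fetch_rpm(const uint32_t line[LINE_WORDS],
                  const uint32_t next_line[LINE_WORDS],
                  int offset, uint32_t out[BLOCK_WORDS])
    {
        for (int i = 0; i < BLOCK_WORDS; i++) {
            int w = offset + i;
            out[i] = (w < LINE_WORDS) ? line[w] : next_line[w - LINE_WORDS];
        }
        return BLOCK_WORDS;                       /* full block (W3-W6) returned */
    }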
`
`The bandwidth requirement for loads and stores is less
`severe than that for instruction fetches, allowing a more
`conventional organization of the data cache. Experimental
`results suggest that one 64-bit D-cache port for every four
`context groups is normally adequate; in the case of multiple
`ports, the address scattering caused by many threads can be
`exploited in a banked organization less expensive than
`genuine multiporting. The D-cache is backed directly by the
`main memory. Cache performance under conditions of
`heavy miss rates and long memory latency is enhanced by
`splitting miss servicing into separate request and reload
`phases, and pipelining the requests sent to main memory.
`For unit-stride access patterns this scheme delivers the
`greatest part of the main memory bandwidth to mul-
`tithreaded applications operating out of cache. Neverthe-
`less, degraded locality caused by multiple threads in com-
`petition may still force a compiler to adopt (or be instructed
`to adopt) uncached loads and stores.
`
`2.4 Synchronization Facilities
Synchronization delays are often long and unpredictable;
`therefore, in many multithreaded machines, busy waiting
`after a synchronization is avoided by suspending the syn-
`chronizing thread—removing it from the scheduling pool—
`and scheduling a context switch. While this strategy may be
`effective for coarse-grained programs, for fine-grained
`processing a regular context switch may be cheaper than
`suspending a thread when synchronizations can be ex-
`pected to complete within several scheduling quanta [2]. In
Concurro, the fine-grained approach is embraced, with
`both inter and intrathread synchronization being enforced
`via the register scoreboard mechanism; this enables a thread
`to continue execution past some synchronization instructions
`until it reaches its actual synchronization point, beyond
`which the asynchronous scheduler will ignore the thread
`while its synchronization completes. Two levels of synchro-
`nized data communications are available to programmers:
`register channel transfers [11], and on a larger namespace, I-
`structure [3], and M-structure [4] memory operations.
`Thirty-two register channels, each of which is a 64-bit reg-
`ister with an associated presence bit, are accessible through
`explicit read and write instructions, whose dispatch may be
`interlocked by the presence bit of a named channel. Threads
`can block on either channel writes or reads, depending on
`whether previous data had been consumed or fresh data
`written, respectively. The limited supply of register channels
`necessitates that they be allocated, like the general-purpose
`registers, with the objective of re-use. Accordingly, register
`channels are generally confined to thread parameter passing
`and synchronization in small thread ensembles.
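Behaviorally, a register channel amounts to a 64-bit register guarded by a presence bit. The sketch below models the blocking conditions with return codes; in the processor itself, dispatch of the channel instruction is simply withheld by the scoreboard (names are illustrative):

    /* Minimal model of a register channel: write blocks while the previous
     * datum is unconsumed, read blocks until fresh data is written. */
    #include <stdint.h>

    typedef struct {
        uint64_t data;
        int      present;   /* presence bit */
    } reg_channel_t;

    static reg_channel_t channel[32];

    int channel_write(int ch, uint64_t value)   /* 1 = accepted, 0 = would block */
    {
        if (channel[ch].present)
            return 0;
        channel[ch].data = value;
        channel[ch].present = 1;
        return 1;
    }

    int channel_read(int ch, uint64_t *value)   /* 1 = data read, 0 = would block */
    {
        if (!channel[ch].present)
            return 0;
        *value = channel[ch].data;
        channel[ch].present = 0;
        return 1;
    }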
`The synchronization controller, indicated in Fig. 1, is a
`microcoded control unit that cooperates with the D-cache to
`realize an I-structure/M-structure store. Synchronization
`can occur at the level of individual long word locations in
`the structure store, each of which appears either as empty
`or, when holding valid data, as full. Instead of dedicating a
`special tagged memory to the structure store, the D-cache
`holds structures, and their states are maintained through
`software convention. A non-full structure is split into two
`32-bit fields—a pointer field and a state field in the most
`significant bits. The contents of the state field are restricted
`to a set of values such that no floating point number or cor-
`rectly sign-extended word-sized integer generated by the
`processor can be interpreted as empty structure state. A
`structure is cleared by writing the empty value to its state
`field using an ordinary cached store instruction. The syn-
`chronization controller defers loads from empty structures
`by queuing the requests at the structure, with the common
`case of a single deferred load being optimized by storing
`the 10-bit result destination tag in the structure’s pointer
`field and setting the state field accordingly. A system reg-
`ister points to a free list of long words in the user's address
`space from which nodes (each a 32-bit link pointer and 10-
`bit tag) are obtained for deferred lists. A synchronizing
`store to an I-structure causes all deferred loads to be com-
`pleted before the structure is updated to full, whereas for an
`M-structure mutual exclusion is achieved by permitting
`only a single deferred load to complete, which in turn resets
`the structure to empty. Filled I-structures are read nonde-
`structively, with negligible overhead.
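A simplified behavioral sketch of the structure-store protocol, reduced to the optimized single-deferred-load case, appears below; the EMPTY and DEFERRED encodings and the field names are stand-ins for the software convention, not its actual values:

    /* Reduced model of I-structure load/store synchronization (Section 2.4). */
    #include <stdint.h>

    #define STATE_EMPTY    0xFFFFFFFFu   /* assumed "empty" encoding            */
    #define STATE_DEFERRED 0xFFFFFFFEu   /* assumed "one load deferred" encoding */

    typedef struct {
        uint32_t state;     /* most-significant field of a non-full structure */
        uint32_t pointer;   /* destination tag of a single deferred load      */
        uint64_t value;     /* datum once the structure is full               */
        int      full;
    } structure_t;

    /* Synchronizing load: returns 1 with the datum if full, otherwise defers. */
    int istruct_load(structure_t *s, uint32_t dest_tag, uint64_t *out)
    {
        if (s->full) {
            *out = s->value;          /* I-structures read nondestructively */
            return 1;
        }
        s->state   = STATE_DEFERRED;  /* remember the waiting consumer      */
        s->pointer = dest_tag;
        return 0;
    }

    /* Synchronizing store: satisfies the deferred load (if any) and fills the
     * structure; an M-structure variant would instead release one consumer
     * and reset the structure to empty. */
    void istruct_store(structure_t *s, uint64_t value,
                       void (*deliver)(uint32_t dest_tag, uint64_t datum))
    {
        if (!s->full && s->state == STATE_DEFERRED)
            deliver(s->pointer, value);
        s->value = value;
        s->full  = 1;
    }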
`
`3 EXPERIMENTAL METHOD
`3.1 Simulation Procedure
`Concurro processors and alternative organizations were
`evaluated through execution-driven simulation. A recon-
`figurable simulator, written in C, accepted executable bi-
`nary files and generated execution statistics and traces di-
`rectly under control of a script. The processor simulator
was designed to simulate at the level of detail necessary for
producing accurate results with regard to cycle counts. To
this end, fixed time-step, discrete-event simulation was per-
formed in such a way that processor components critical to
timing accuracy—bus switching and pipeline interlock, for
instance—were simulated at register transfer level, whereas
a behavioral simulation sufficed for components having
parameterized latency—such as the functional units. The
pipeline diagram in Fig. 3 indicates, in terms of cycle peri-
ods, the delays assigned to the processor pipestages, with
Table 1 listing the latency parameters for the execution
stages and memory.

TABLE 1
SOME BASELINE PROCESSOR CONFIGURATIONS

Parameter                     2G8C         4G16C        8G32C
--------------------------------------------------------------
Context groups                2            4            8
Maximum contexts              8            16           32
Main memory latency           25 cycles    25 cycles    25 cycles
I-cache size (on/off chip)    16/256 kB    16/256 kB    16/256 kB
Instruction preaccess         3 words      5 words      10 words
D-cache size                  32 kB        32 kB        32 kB
Memory, D-cache ports         1            1            2
Logical units                 1            2            4
Logical latency               1 cycle      1 cycle      1 cycle
Integer arithmetic units      1            2            4
Integer word-add latency      1 cycle      1 cycle      1 cycle
Shift units                   1            1            2
Shift latency                 1 cycle      1 cycle      1 cycle
F.P. add units                1            1            2
F.P. add latency              4 cycles     4 cycles     4 cycles
F.P. multiply units           1            1            2
F.P. multiply latency         5 cycles     5 cycles     5 cycles
F.P. divide latency           18 cycles    18 cycles    18 cycles
`
`All transfers of data onto buses and through pipeline
`registers were coordinated by a global two-phase simula-
`tion clock: With phase 2 active, data was normally set up to
`be valid by the end of the phase, and then latched during
`phase 1 of the cycle. This strategy not only resolved read-
`after-write dependencies throughout the simulator, but also
`served to catch potential errors associated with structural
`hazards and cycle time limitations. In the case of contention
`for shared resources, round-robin arbiters selected candi-
`dates asynchronously (through combinational logic) during
`the whole of phase 2, with choices determining subsequent
`arbiter states in phase 1. Grouping of contexts and instruc-
`tions limited arbitration complexity to one-of-eight decisions.
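This two-phase discipline can be sketched as a simulation loop in which every component first evaluates its combinational outputs and arbitration (phase 2) and then latches its state (phase 1 of the next cycle); the component interface shown is a placeholder, not the simulator's actual API:

    /* Sketch of the global two-phase simulation clock described above. */
    typedef struct component {
        void (*evaluate)(struct component *self);  /* phase 2: drive buses, arbitrate */
        void (*latch)(struct component *self);     /* phase 1: capture latched state  */
    } component_t;

    void run_cycles(component_t **comp, int n_comp, long cycles)
    {
        for (long c = 0; c < cycles; c++) {
            for (int i = 0; i < n_comp; i++)       /* phase 2 of cycle c   */
                comp[i]->evaluate(comp[i]);
            for (int i = 0; i < n_comp; i++)       /* phase 1 of cycle c+1 */
                comp[i]->latch(comp[i]);
        }
    }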
`Ports of the main memory were accessed via split ad-
`dress/data buses, allowing data from previous reads to be
`returned simultaneously with fresh read requests on every
`cycle. Signaling of bus activity enabled read data to be
`latched into load silos, as well as permitting writes on data
`bus cycles during which no read data was present. It was
`assumed that the memory controller would process ac-
`cesses in the order they were received, although the
`load/store functional unit reordered processor accesses for
`the purpose of maximizing utilization of the split buses.
`Processor configurations are primarily governed by the
`datapath bandwidth required to sustain throughput under
`parallel operation. Thus the number of distributed func-
`tional units and memory ports must increase above the
architectural minimum when multiple context groups are
`implemented. Table 1 indicates the salient parameters of
`three configurations used as baseline microarchitectures in
`the following sections. These configurations are intended to
`be consistent with scalability at best cost, and therefore rep-
`resent neither maximum performance nor optimum utili-
`zation points. Memory and functional unit latencies were
`based on a processor cycle time of 4 ns. The scaling of dis-
`tributed functional units was in most cases derived from
`experimental measures of instruction frequencies, rather
`than estimates. Although peak bandwidth demands can
`temporarily exceed the available bandwidth from certain
`functional units, stalls are avoided through operand queues
`on virtual functional units (Fig. 2), each queue buffering up
to ⌈G/F⌉ instructions, where G is the number of context
`groups and F is the number of physical functional units
`belonging to a given instruction class.
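In the 8G32C configuration of Table 1, for example, the two physical floating point multiply units serve G = 8 context groups, so each virtual multiply queue buffers up to ⌈8/2⌉ = 4 dispatched instructions.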
`For each benchmark, a single executable file was pro-
`duced and run without modification on all processor con-
`figurations. Binary compatibility was achieved by making
`few specific assumptions on context availability, and en-
`suring that thread fork overheads were adequately covered
`by computational work. The simulation results presented in
`later sections are normalized against the performance of the
`baseline configurations in Table 1.
`
`3.2 Benchmark Programs
`Six benchmark programs, consisting of small, realistic ap-
`plications and synthetic benchmark programs, were used
`for the experimental evaluation. The programs were com-
`piled manually into assembly code—a process which limited
`the practical complexity of the benchmarks, but allowed
`more aggressive optimizations. When architecturally similar
`machines are compared, the effects of manual compilation
`are largely factored out. The most low-level aspects of pro-
`gramming, such as basic block instruction scheduling and
`optimization of instruction selection, were automated in the
`assembler. Each benchmark is described briefly below:
`• carries—Highly parallel integer search that updates a
`M-structure histogram with the carry run lengths for
`all possible 16-bit binary additions; carries was
`adapted from an original C source, with the main
`loop body coded as a thread.
`• compress—Single threaded implementation of the
`UNIX “compress” utility, compressing a 22 kB file
`using adaptive Lempel-Ziv coding; compress was di-
`rectly translated from its C source.
• gauss—Solution of a 200 × 200 system of linear equa-
`tions through Gaussian elimination with partial-
`pivoting and back-substitution; gauss was manually
`coded in assembly language to take advantage of M-
`structures for locking matrix rows, and I-structures in
`the parallel back-substitution phase.
`• loops—Loops 1, 3, 5, 6, 7, 10, 11, and 12 of the Liver-
`more FORTRAN Kernels [22] interspersed by barrier
`synchronizations; loops was translated from FOR-
`TRAN source, and exploited I-structures in parallel-
`ized recurrence relations.
• matmult—Multiplication of two 100 × 100 double-
precision matrices, producing a separate product matrix;
matmult was manually coded, using a blocked algorithm
with 2 × 2 tiles.
`• tf—Text formatter that reads a 22 kB file and outputs
`the text as fully justified paragraphs; the C source for
`tf was recoded as a pipeline of four stages across four
`threads, each thread communicating via FIFO data
structures.

Fig. 5. Dynamic instruction mix of the benchmark programs.
`Table 2 lists the principal characteristics of the benchmarks.
`The dynamic thread counts indicate that all benchmarks ex-
`cept compress take advantage of multithreading; the average
`numbers of instructions executed between potentially suspen-
`sive synchronizations give rise to the mean thread run lengths
`shown. Cache and main memory activities for configuration
`8G32C are included to illustrate that direct access to main
`memory successfully averted data cache bottlenecks (at most
`two cache accesses per cycle are available in 8G32C). Fig. 5
`breaks down the dynamic instruction counts by operation
`type. Programs carries, compress, and tf represent integer
`workloads, characterized by cached data access and short ba-
`sic blocks. Scientific computation is represented by gauss, loops,
`and matmult, which are substantially vectorizable, floating
`point codes that bypass the data cache for vector and matrix
`access. For all programs, parallel loops are unrolled a maxi-
`mum of four iterations, and the statically rescheduled loop
`bodies are formed into threads for forking by a control thr