IEEE TRANSACTIONS ON COMPUTERS, VOL. 46, NO. 4, APRIL 1997

Multithreading with Distributed Functional Units

Bernard K. Gunther, Member, IEEE
Abstract—Multithreaded processors multiplex the execution of a number of concurrent threads onto the hardware in order to hide latencies associated with memory access, synchronization, and arithmetic operations. Conventional multithreading aims to maximize throughput in a single instruction pipeline whose execution stages are served by a collection of centralized functional units. This paper examines a multithreaded microarchitecture where the heterogeneous functional unit set is expanded so that units may be distributed and partly shared across several instruction pipelines operating simultaneously, thereby allowing greater exploitation of interthread parallelism in improving utilization factors of costly resources. The multiple pipeline approach is studied specifically in the Concurro processor architecture—a machine supporting multiple thread contexts and capable of context switching asynchronously in response to dynamic data and resource availability.

Detailed simulations of Concurro processors indicate that instruction throughputs for programs accessing main memory directly can be scaled, without recompilation, from one to over eight instructions per cycle simply by varying the number of pipelines and functional units. In comparison with an equivalent coherent-cache, single-chip multiprocessor, Concurro offers marginally better performance at less than half of the estimated implementation cost. With suitable prefetching, multiple instruction caches can be avoided, and multithreading is shown to obviate the need for sophisticated instruction dispatch mechanisms on parallel workloads. Distribution of functional units results in a 150% improvement over the centralized approach in utilization factors of arithmetic units, and enables saturation of the most critical processor resources.

Index Terms—Distributed functional units, hardware utilization, latency tolerance, multiple context processors, multithreading, pipelined computers, pre-access instruction cache, simulation, synchronization.
• The author is with the Advanced Computing Research Centre, School of Computer and Information Science, University of South Australia, The Levels, Adelaide SA 5095, Australia. E-mail: gunther@cis.unisa.edu.au.
Manuscript received 10 July, 1995; revised 17 Apr., 1996.
For information on obtaining reprints of this article, please send e-mail to: transcom@computer.org, and reference IEEECS Log Number C96330.

1 INTRODUCTION

The principal operating objective for a multithreaded processor is to have a sufficient number of parallel tasks multiplexed onto the hardware so as to eliminate or minimize machine idle time in the presence of long-latency operations. Main memory accesses, remote reads, or synchronization operations typically take orders of magnitude longer to complete than simple instructions, so useful gains in processor utilization can be achieved by overlapping long-latency operations with the execution of unrelated tasks. A multiple context processor incorporates hardware support for enhancing the efficiency of multithreading, with the emphasis being on making fine-grained and dynamic multithreading viable. The processing states or contexts of several concurrently executing threads are maintained by the processor, and context switching is handled with some degree of automation to reduce overheads.

Historically, multiple context processors such as the HEP [17], MASA [13], PRISC-1 [24], Monsoon [25], and Sparcle [1] employed multithreading as a means of sustaining activity in a single instruction pipeline. The execution stages of the pipeline in each of these machines are served by a pipelined arithmetic/logic and load/store unit complex, in a microarchitecture we will refer to as one of centralized functional units. This paper examines the more general microarchitecture having distributed functional units in support of multiple instruction pipelines; that is, each functional unit is dedicated not only to a particular operation, but also to a subset of the processor's pipelines.
An extreme case of functional unit distribution occurs in a single-chip multiprocessor, where each processing element (PE) on the chip is equipped with both private cache memory and dedicated functional units. Despite the conceptual simplicity of this organization, general-purpose multiprocessors of useful scale remain difficult to implement on a single chip [23], and therefore suffer from relatively poor cost/performance ratios when die yield is taken into account. Adding multithreading capability for latency tolerance only exacerbates the implementation difficulties. The inefficiencies of the multiprocessor approach stem from the almost complete absence of resource sharing, particularly with regard to the caches and floating point arithmetic units. However, if any other architecture is to be a realistic alternative, it must attain the performance potential of a multiprocessor.

Variations on superscalar [26], [27], VLIW (very long instruction word) [18], [29], and MIMD (multiple instruction/multiple data) [15] techniques used in conjunction with multithreading or multistreaming have been proposed to take advantage of the parallelism inherent in distributed functional unit organizations. The VLIW approach is typified by Keckler and Dally's processor coupling mechanism [18], adopted in the M-machine. Several statically scheduled VLIW threads are multiplexed at clock cycle granularity onto the functional units, with the independence of VLIW operations being exploited to achieve multiple results in each cycle. Even though programs must be compiled to suit a particular hardware configuration, some of the scheduling burden is eliminated by allowing the execution of thread operations to proceed with limited asynchronicity.
While the VLIW approach captures instruction-level parallelism, it does so at the cost of substantial hardware contexts and attendant resource duplication, including multiple caches, which ultimately limit scalability and latency tolerance. Hirata et al. proposed a MIMD mode of operation for a multithreaded architecture [15]. A number of thread slots (instruction queue and sequencer units) can dispatch instructions simultaneously to a shared, heterogeneous functional unit set. A central register file is partitioned into context banks, a subset of which is dynamically assigned to the thread slots. Since each thread slot is assigned to a single bank, and context switches occur on cache misses only, the more usual functional unit latencies remain exposed. The combination of coarse multithreading and complex register file access resulted in instruction throughputs remaining 40% below peak rates even for highly parallel workloads and infinite caches.

The objectives of this paper are to demonstrate solutions to the hardware utilization, latency tolerance, and programmability problems that have hampered previous designs, and to characterize Concurro, a multithreaded architecture possessing distributed functional units [10]. The primary design goal for Concurro is achieving a scalable microprocessor having the performance characteristics of a multiprocessor without sacrificing the utilization and programming advantages of conventional multiple context architectures. Multithreading and cost-sensitive resource sharing in a distributed functional unit architecture are the keys to meeting this goal. Although the distinction between uni- and multiprocessing is blurred in Concurro, this paper considers the machine in the uniprocessor sense only, and as such, it may be regarded as a PE in a larger multiprocessor system.

In the following section, we will examine the processor design space, arriving at the architecture and organization of Concurro processors. Section 3 establishes the simulation study used in subsequent sections to evaluate behavior and design trade-offs. Scalability and organization alternatives are assessed in Section 4. Section 5 examines the issues relating to instruction handling, while Section 6 is concerned with cost and performance aspects of the datapath. Section 7 draws together the main results of the study.
2 THE CONCURRO ARCHITECTURE

Concurro is a multiple context, von Neumann processor architecture. A Concurro processor is notionally organized as a shared memory multiprocessor, but with the traditional PE boundaries dissolved. Any instruction sequence, with few exceptions, constitutes a valid thread. This general MIMD programming model allows the processor to be operated effectively in either single threaded or multithreaded conditions.
2.1 Instruction Set and Organization

Concurro adopts a 64-bit, register-oriented, load/store instruction set architecture (ISA) with additional support for thread control and fine-grained synchronization. Apart from its benefits to optimizing compilers, the register-oriented ISA greatly reduces the implementation complexity of multiple context processors, and facilitates the exploitation of concurrency at all levels. Each processor context consists of a program counter and 16 64-bit local, general-purpose registers, with all contexts having access to a set of 16 64-bit global registers, one of which is hard-wired to zero. The number of local registers is constrained primarily by instruction encoding limits, but also by implementation cost and performance considerations. Sufficient bandwidth to the global registers can be achieved without excessive multiporting by maintaining a number of consistent copies of the global register file, each copy serving a subset of the contexts. The instruction set includes arithmetic and logical operations on a variety of data types, ranging from bytes (8 bits) to double-precision floating point numbers or long words (64 bits). Data may be accessed via an on-chip data cache or directly from the pipelined main memory, to suit the locality characteristics of particular workloads.
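As a concrete illustration of the context state just described, the following C sketch models one hardware context and one copy of the global register file. The struct layout, and the choice of register 0 as the hard-wired zero, are assumptions made for this sketch rather than details given in the text.

```c
#include <stdint.h>

#define NUM_LOCAL_REGS  16
#define NUM_GLOBAL_REGS 16

/* Architectural state of one processor context: a program counter plus
 * 16 local 64-bit general-purpose registers. */
typedef struct {
    uint64_t pc;
    uint64_t local[NUM_LOCAL_REGS];
} concurro_context_t;

/* One consistent copy of the 16 global registers serves a subset of the
 * contexts, avoiding excessive multiporting of a single shared file. */
typedef struct {
    uint64_t global[NUM_GLOBAL_REGS];
} global_regfile_copy_t;

/* The hard-wired zero register is assumed here to be global register 0. */
static uint64_t read_global(const global_regfile_copy_t *g, unsigned r)
{
    return (r == 0) ? 0 : g->global[r];
}
```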
The architecture allows any particular implementation to contain a fixed number of processor contexts, each of which can be considered a virtual processor, but programmers are discouraged from assuming either scheduling policy or context numbers in programs. Despite the programming difficulties associated with fixed context multithreading, such implementations are preferred in the interests of promoting effective use of the storage hierarchy—especially at the register level [5]. At all times, the processor is executing the root thread, which gives rise to other threads by activating dormant contexts in fork operations. Any thread is permitted to fork threads of its own, provided that the supply of contexts is not exhausted; otherwise, the fork operation blocks until a fresh context becomes available. Such blocking, given a finite number of contexts, can potentially cause deadlock when fork dependencies arise, so realistic processor implementations would be expected to raise either a scheduling or time-out exception to allow software intervention in this case. Threads other than the root thread are permitted to terminate themselves explicitly, rather than terminating in join synchronizations, so as to minimize the frequency of context destruction and recreation. Interthread communication and synchronization is effected by special register and memory operations.
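The fork semantics above can be pictured with a small C model of a fixed context pool. The pool size, the function names, and the boolean return convention are illustrative assumptions, not the hardware interface.

```c
#include <stdbool.h>
#include <stdint.h>

#define NUM_CONTEXTS 16     /* fixed by the implementation, e.g., 4G16C */

typedef struct {
    bool     active;        /* context currently holds a running thread    */
    uint64_t pc;            /* only the program counter is defined at fork */
} hw_context_t;

static hw_context_t ctx[NUM_CONTEXTS];

/* Fork: activate a dormant context at 'start'.  Returns false when the pool
 * is exhausted, which is the case where the hardware fork blocks (and a
 * realistic implementation would eventually raise a scheduling or time-out
 * exception so software can intervene). */
static bool try_fork(uint64_t start)
{
    for (int i = 0; i < NUM_CONTEXTS; i++) {
        if (!ctx[i].active) {
            ctx[i].active = true;
            ctx[i].pc     = start;
            return true;
        }
    }
    return false;
}

/* Threads other than the root terminate themselves explicitly,
 * returning their context to the dormant pool. */
static void terminate_self(int id)
{
    ctx[id].active = false;
}
```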
A block diagram of a Concurro processor is shown in Fig. 1. The processor contexts, labeled “C,” are clustered into small groups in order to localize and thus speed up resource contention during instruction dispatch. Each context group can be viewed as the core of a multithreaded PE, with some resources being shared among contexts in the group. Each context group independently dispatches instructions to virtual functional units connected to the group's operand bus. In the case of complex or infrequently used functions, the virtual functional units are, in fact, interfaces to physical functional units, which are shared by a subset of the context groups as indicated in Fig. 2. Operand queues—or simple latches—on the virtual functional unit inputs decouple instruction dispatch, by context groups, from instruction issue, by the fewer physical functional units. Physical functional units are generally pipelined, capable of accepting a new pair of 64-bit operands and an operation code in every cycle. Routing of results is facilitated by tagging each instruction with the context identifier and register number of its destination. Results are delivered to their destination register files via a group result multiplexor and result bus at the maximum rate of one long word per cycle per group.

Fig. 1. Concurro processor block diagram.

Fig. 2. Virtual functional unit interface to physical functional units.
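The result-routing tags mentioned above can be summarized with a short C sketch. The field widths are assumptions, chosen to be consistent with the 10-bit destination tag referred to later in Section 2.4 (enough for up to 32 contexts and 32 architectural registers).

```c
#include <stdint.h>

/* Tag attached to every dispatched instruction so that a shared physical
 * functional unit can return its result to the right place. */
typedef struct {
    uint8_t context_id;   /* which hardware context owns the destination */
    uint8_t reg_num;      /* destination register within that context    */
} result_tag_t;

/* What a functional unit hands to the group result multiplexor; at most one
 * such result per cycle is written back on each group's result bus. */
typedef struct {
    result_tag_t tag;
    uint64_t     value;   /* one long word */
} fu_result_t;
```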
The instruction cache subsystem consists of an on-chip primary cache backed by an external secondary cache of substantially larger capacity. Concurro's data access instructions are supported by load and store buffers connecting main memory, an on-chip data cache, and a synchronization controller. In order to isolate the processor's performance from the implementation of main memory, the simulation study in this paper assumes the main memory to be perfectly pipelined and capable of supplying one long word per cycle per memory port. Such memory characteristics can be approached by careful interleaving and access buffering [28], but the provision of adequate memory bandwidth will remain difficult in high performance machines of all kinds.
2.2 Processor Contexts

Context units maintain the processor contexts and control the sequencing of threads. Each unit consists of a triple-ported register file, program counter, instruction buffer and predecoder, and control logic. Following a fork instruction, the processor's thread control block places a thread starting address on a control bus and enables a previously inactive context; initially, only a context's program counter is defined. Load balancing across context groups is achieved by registering the number of active contexts in each group, and enabling any new context in the group containing the fewest active members. Once enabled, a context will continue to execute instructions until the running thread terminates itself, or until the operating system terminates the thread preemptively. Aside from the placement of synchronization points, software can exercise no control over the scheduling of threads following their creation.

Since a large number of context units can be expected in a high performance machine, an inexpensive instruction scheduling mechanism is normally employed, although Section 5 considers more sophisticated alternatives. Instruction dependencies are enforced through a Cray-style register scoreboard [16]. The instruction at the head of the instruction buffer is predecoded to determine resource requirements, and if the instruction is free of data and structural hazards (at the virtual functional unit level), the context unit contends for instruction dispatch in the current cycle. Thus, threads can switch between blocked and running conditions on a cycle-by-cycle basis as instruction dependencies allow. A fair arbiter asynchronously schedules one of the ready instructions in the group for dispatch. Since blocked or absent threads do not contend for dispatch, the processor is able to maximize instruction throughput under conditions of low program parallelism. This capability is further enhanced by the provision of a bypass around the register write back stage.
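A minimal sketch of the dispatch test described above is given below in C. The register-mask representation and the queue-space check are assumptions about how the scoreboard and virtual functional unit hazards might be expressed, not the actual control logic.

```c
#include <stdbool.h>
#include <stdint.h>

typedef struct {
    uint32_t busy_regs;    /* one bit per register still awaiting a result */
} scoreboard_t;

typedef struct {
    uint32_t src_mask;     /* registers the head instruction reads */
    uint32_t dst_mask;     /* register it writes                   */
    int      vfu;          /* virtual functional unit it requires  */
} decoded_insn_t;

/* Structural hazard check at the virtual functional unit level,
 * stubbed here to "always room" for the sake of the sketch. */
static bool vfu_can_accept(int vfu)
{
    (void)vfu;
    return true;
}

/* A context contends in its group's fair arbiter only when the predecoded
 * head instruction is free of data and structural hazards. */
static bool ready_to_dispatch(const scoreboard_t *sb, const decoded_insn_t *in)
{
    bool data_hazard       = (sb->busy_regs & (in->src_mask | in->dst_mask)) != 0;
    bool structural_hazard = !vfu_can_accept(in->vfu);
    return !data_hazard && !structural_hazard;
}
```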
A schematic of the processor pipeline appears in Fig. 3, for the case of a two-cycle execution latency. Owing to buffering in the contexts, the instruction cache access stage is normally not entered in every cycle; subsequent pipestages, though, are entered by every executed instruction. The number of execution stages is variable, depending on the required functional unit and its activity. However, all functional units contend for and announce result write back one-half cycle prior to completing an operation, in order to give context units the opportunity for bypassing the register write back stage in the following cycle.
Fig. 3. Instruction pipeline.

As each context unit (or context group, at least) may contain a branch unit, it is desirable to minimize the cost of these units. Accordingly, Concurro's branch and fork targets are specified in semi-absolute form: a low-order segment of the absolute address coupled with a relative direction flag to provide position independence [8]. Semi-absolute branch units offer speed and area advantages, since they can be constructed without address adders. A static branch prediction scheme—anticipating that only backward branches will be taken—minimizes average branch latencies in the absence of branch delay slots.
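One plausible reading of the semi-absolute target form is sketched below. The segment width and the exact rule for combining the low-order segment with the current PC are assumptions for illustration and may differ from the encoding defined in [8]; the point of the scheme is that only bit replacement and a high-order increment or decrement are needed, rather than a full-width address adder.

```c
#include <stdint.h>

#define SEG_BITS 16                           /* assumed segment width */
#define SEG_SIZE (UINT64_C(1) << SEG_BITS)
#define SEG_MASK (SEG_SIZE - 1)

/* Form a branch/fork target from the current PC, the instruction's low-order
 * address segment, and the relative direction flag (nonzero = backward). */
static uint64_t semi_absolute_target(uint64_t pc, uint64_t low_segment, int backward)
{
    uint64_t high = pc & ~SEG_MASK;           /* reuse high-order PC bits */
    uint64_t low  = pc & SEG_MASK;

    if (backward && low_segment > low)
        high -= SEG_SIZE;                     /* step back one segment    */
    else if (!backward && low_segment < low)
        high += SEG_SIZE;                     /* step forward one segment */

    return high | low_segment;
}
```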
2.3 Cache Subsystems

Caches remain useful even in a multithreaded processor, particularly where high reference locality can be relied upon or program parallelism is insufficient for effective latency hiding [21]. To maximize cache throughputs in the presence of multiple threads, however, both the instruction and data caches are nonblocking [19], allowing several missed accesses to be queued and serviced while hit accesses take place.

Satisfying the instruction bandwidth requirements poses a challenge in terms of containing implementation cost. This paper compares two extreme approaches—a single primary instruction cache and a private I-cache per context group. The multiple I-cache option, although the most costly, is straightforward in design and guarantees the delivery of a fresh instruction for every one dispatched by a group. The independence of fetches across different threads complicates the single I-cache option. To avoid multiporting the cache (and replicating controller logic), blocks of instructions can be prefetched in parallel from a single I-cache, onto which prefetch requests from depleted instruction buffers are multiplexed. Adequate bandwidth is achieved by buffering as many prefetched instructions as can be dispatched by the entire machine in one cycle, plus an additional 25% to compensate for loss on jumps.

A major drawback of parallel prefetching, however, is the possibility of degraded cache returns caused by unaligned accesses, as illustrated in Fig. 4. When the block of four instruction words, W3-W6, is requested from a normal cache, the appearance of the cache line boundary truncates the returned block to the single word W3, even though the remainder of the block is likely to be present in the cache. The persistence of inadequate cache returns would ultimately starve the context units and lower performance. Forcing address alignment by padding basic blocks with no-op instructions, or dual-porting the cache to access line sets n and n + 1 simultaneously, can be avoided simply by enabling line-crossing accesses to a Random Pre-access Memory (RPM) instruction store [9]. A RPM accesses, with wrap-around, a contiguous block of words starting at any address, therefore only words W3-W6 are accessed and returned from a RPM-based cache. The cache controller attempts to allocate adjacent cache lines to contiguous memory addresses, and marks tags of running lines to indicate whether words accessed from the next set belong to the requested prefetch block.

Fig. 4. Using line crossing to improve parallel prefetching performance with unaligned addresses.
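The difference between a conventional prefetch return and a line-crossing RPM return can be sketched as follows. The line size of four words matches the W3-W6 example in Fig. 4, while the function names and the contiguity flag are assumptions of this model rather than details of the RPM design in [9].

```c
#include <stddef.h>

#define WORDS_PER_LINE 4   /* assumed line size, matching the W3-W6 example */

/* Conventional cache: a prefetch starting at word index 'word' is truncated
 * at the line boundary, so a request for W3-W6 returns only W3. */
static size_t normal_cache_return(size_t word, size_t count)
{
    size_t to_line_end = WORDS_PER_LINE - (word % WORDS_PER_LINE);
    return count < to_line_end ? count : to_line_end;
}

/* RPM-based cache: the access wraps into the adjacent line, so the full
 * requested block is returned whenever the next line's tag marks it as
 * holding the contiguous following addresses. */
static size_t rpm_cache_return(size_t word, size_t count, int next_line_contiguous)
{
    size_t to_line_end = WORDS_PER_LINE - (word % WORDS_PER_LINE);
    if (count <= to_line_end || next_line_contiguous)
        return count;
    return to_line_end;    /* otherwise fall back to the truncated return */
}
```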
The bandwidth requirement for loads and stores is less severe than that for instruction fetches, allowing a more conventional organization of the data cache. Experimental results suggest that one 64-bit D-cache port for every four context groups is normally adequate; in the case of multiple ports, the address scattering caused by many threads can be exploited in a banked organization less expensive than genuine multiporting. The D-cache is backed directly by the main memory. Cache performance under conditions of heavy miss rates and long memory latency is enhanced by splitting miss servicing into separate request and reload phases, and pipelining the requests sent to main memory. For unit-stride access patterns this scheme delivers the greatest part of the main memory bandwidth to multithreaded applications operating out of cache. Nevertheless, degraded locality caused by multiple threads in competition may still force a compiler to adopt (or be instructed to adopt) uncached loads and stores.
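A minimal sketch, assuming power-of-two banking at long-word granularity, of how multiple D-cache ports could be realized by bank selection rather than true multiporting; the bank count and the address bits used are assumptions for illustration.

```c
#include <stdint.h>

#define NUM_DCACHE_BANKS 2   /* e.g., the two D-cache ports of configuration 8G32C */

/* With many threads scattering addresses, selecting the bank from low-order
 * long-word address bits spreads accesses across ports most of the time. */
static unsigned dcache_bank(uint64_t byte_addr)
{
    return (unsigned)((byte_addr >> 3) & (NUM_DCACHE_BANKS - 1));
}
```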
2.4 Synchronization Facilities

Synchronization delays are often long and unpredictable; therefore, in many multithreaded machines, busy waiting after a synchronization is avoided by suspending the synchronizing thread—removing it from the scheduling pool—and scheduling a context switch. While this strategy may be effective for coarse-grained programs, for fine-grained processing a regular context switch may be cheaper than suspending a thread when synchronizations can be expected to complete within several scheduling quanta [2]. In Concurro, the fine-grained approach is embraced, with both inter- and intrathread synchronization being enforced via the register scoreboard mechanism; this enables a thread to continue execution past some synchronization instructions until it reaches its actual synchronization point, beyond which the asynchronous scheduler will ignore the thread while its synchronization completes. Two levels of synchronized data communications are available to programmers: register channel transfers [11] and, on a larger namespace, I-structure [3] and M-structure [4] memory operations.
Thirty-two register channels, each of which is a 64-bit register with an associated presence bit, are accessible through explicit read and write instructions, whose dispatch may be interlocked by the presence bit of a named channel. Threads can block on either channel writes or reads, depending on whether previous data had been consumed or fresh data written, respectively. The limited supply of register channels necessitates that they be allocated, like the general-purpose registers, with the objective of re-use. Accordingly, register channels are generally confined to thread parameter passing and synchronization in small thread ensembles.
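An illustrative software model of a register channel follows. In hardware the presence bit interlocks dispatch of the read or write instruction itself; the sketch approximates that by returning false when the operation would block. The names and the return convention are assumptions of this model.

```c
#include <stdbool.h>
#include <stdint.h>

#define NUM_CHANNELS 32

typedef struct {
    uint64_t value;
    bool     present;          /* presence bit: data written, not yet read */
} reg_channel_t;

static reg_channel_t chan[NUM_CHANNELS];

/* Returns false when the operation must block (dispatch is interlocked). */
static bool channel_write(int c, uint64_t v)
{
    if (chan[c].present)
        return false;          /* previous datum not yet consumed */
    chan[c].value   = v;
    chan[c].present = true;
    return true;
}

static bool channel_read(int c, uint64_t *out)
{
    if (!chan[c].present)
        return false;          /* no fresh data written */
    *out            = chan[c].value;
    chan[c].present = false;
    return true;
}
```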
The synchronization controller, indicated in Fig. 1, is a microcoded control unit that cooperates with the D-cache to realize an I-structure/M-structure store. Synchronization can occur at the level of individual long word locations in the structure store, each of which appears either as empty or, when holding valid data, as full. Instead of dedicating a special tagged memory to the structure store, the D-cache holds structures, and their states are maintained through software convention. A non-full structure is split into two 32-bit fields—a pointer field and a state field in the most significant bits. The contents of the state field are restricted to a set of values such that no floating point number or correctly sign-extended word-sized integer generated by the processor can be interpreted as an empty structure state. A structure is cleared by writing the empty value to its state field using an ordinary cached store instruction. The synchronization controller defers loads from empty structures by queuing the requests at the structure, with the common case of a single deferred load being optimized by storing the 10-bit result destination tag in the structure's pointer field and setting the state field accordingly. A system register points to a free list of long words in the user's address space from which nodes (each a 32-bit link pointer and 10-bit tag) are obtained for deferred lists. A synchronizing store to an I-structure causes all deferred loads to be completed before the structure is updated to full, whereas for an M-structure mutual exclusion is achieved by permitting only a single deferred load to complete, which in turn resets the structure to empty. Filled I-structures are read nondestructively, with negligible overhead.
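The software convention for structure states can be made concrete with the sketch below. The 32-bit split between pointer and state fields and the 10-bit destination tag follow the text, but the particular EMPTY and DEFERRED bit patterns are placeholders; the architecture only requires that they be values the processor never produces as a floating point number or correctly sign-extended word integer.

```c
#include <stdint.h>

#define STATE_SHIFT 32
/* Placeholder state encodings occupying the most significant 32 bits. */
#define STATE_EMPTY    UINT64_C(0xFFF00001)
#define STATE_DEFERRED UINT64_C(0xFFF00002)

/* A structure holding valid data is "full": its upper word is anything
 * other than the reserved state patterns. */
static int structure_is_full(uint64_t word)
{
    uint32_t state = (uint32_t)(word >> STATE_SHIFT);
    return state != (uint32_t)STATE_EMPTY && state != (uint32_t)STATE_DEFERRED;
}

/* Clearing a structure is an ordinary cached store of the empty pattern. */
static uint64_t empty_structure(void)
{
    return STATE_EMPTY << STATE_SHIFT;
}

/* Common case of a single deferred load: park the 10-bit result destination
 * tag in the pointer field and mark the state accordingly.  Longer deferred
 * lists would instead link nodes (32-bit pointer + 10-bit tag) taken from
 * the free list named by the system register. */
static uint64_t defer_single_load(uint16_t dest_tag)
{
    return (STATE_DEFERRED << STATE_SHIFT) | (uint64_t)(dest_tag & 0x3FF);
}
```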
3 EXPERIMENTAL METHOD

3.1 Simulation Procedure

Concurro processors and alternative organizations were evaluated through execution-driven simulation. A reconfigurable simulator, written in C, accepted executable binary files and generated execution statistics and traces directly under control of a script. The processor simulator was designed to simulate at the level of detail necessary for producing accurate results with regard to cycle counts. To this end, fixed time-step, discrete-event simulation was performed in such a way that processor components critical to timing accuracy—bus switching and pipeline interlock, for instance—were simulated at register transfer level, whereas a behavioral simulation sufficed for components having parameterized latency—such as the functional units. The pipeline diagram in Fig. 3 indicates, in terms of cycle periods, the delays assigned to the processor pipestages, with Table 1 listing the latency parameters for the execution stages and memory.
TABLE 1
SOME BASELINE PROCESSOR CONFIGURATIONS

Parameter                    2G8C        4G16C       8G32C
Context groups               2           4           8
Maximum contexts             8           16          32
Main memory latency          25 cycles   25 cycles   25 cycles
I-cache size (on/off chip)   16/256 kB   16/256 kB   16/256 kB
Instruction preaccess        3 words     5 words     10 words
D-cache size                 32 kB       32 kB       32 kB
Memory, D-cache ports        1           1           2
Logical units                1           2           4
Logical latency              1 cycle     1 cycle     1 cycle
Integer arithmetic units     1           2           4
Integer word-add latency     1 cycle     1 cycle     1 cycle
Shift units                  1           1           2
Shift latency                1 cycle     1 cycle     1 cycle
F.P. add units               1           1           2
F.P. add latency             4 cycles    4 cycles    4 cycles
F.P. multiply units          1           1           2
F.P. multiply latency        5 cycles    5 cycles    5 cycles
F.P. divide latency          18 cycles   18 cycles   18 cycles
All transfers of data onto buses and through pipeline registers were coordinated by a global two-phase simulation clock: with phase 2 active, data was normally set up to be valid by the end of the phase, and then latched during phase 1 of the cycle. This strategy not only resolved read-after-write dependencies throughout the simulator, but also served to catch potential errors associated with structural hazards and cycle time limitations. In the case of contention for shared resources, round-robin arbiters selected candidates asynchronously (through combinational logic) during the whole of phase 2, with choices determining subsequent arbiter states in phase 1. Grouping of contexts and instructions limited arbitration complexity to one-of-eight decisions.
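The two-phase discipline can be summarized with a skeleton of the simulation loop. The component list and callback structure are assumptions about how such a fixed time-step simulator might be organized in C, not the actual simulator code.

```c
#include <stdint.h>

typedef struct component component_t;
struct component {
    void (*phase2)(component_t *self);   /* drive buses, arbitrate, compute */
    void (*phase1)(component_t *self);   /* latch pipeline registers        */
    component_t *next;
};

static void simulate(component_t *all, uint64_t cycles)
{
    for (uint64_t t = 0; t < cycles; t++) {
        /* Phase 2: every component sets up values to be valid by the end of
         * the phase; round-robin arbiters choose combinationally here. */
        for (component_t *c = all; c != NULL; c = c->next)
            c->phase2(c);
        /* Phase 1: latch the chosen values; arbiter state advances here,
         * which resolves read-after-write ordering across the model. */
        for (component_t *c = all; c != NULL; c = c->next)
            c->phase1(c);
    }
}
```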
Ports of the main memory were accessed via split address/data buses, allowing data from previous reads to be returned simultaneously with fresh read requests on every cycle. Signaling of bus activity enabled read data to be latched into load silos, as well as permitting writes on data bus cycles during which no read data was present. It was assumed that the memory controller would process accesses in the order they were received, although the load/store functional unit reordered processor accesses for the purpose of maximizing utilization of the split buses.
Processor configurations are primarily governed by the datapath bandwidth required to sustain throughput under parallel operation. Thus the number of distributed functional units and memory ports must increase above the architectural minimum when multiple context groups are implemented. Table 1 indicates the salient parameters of three configurations used as baseline microarchitectures in the following sections. These configurations are intended to be consistent with scalability at best cost, and therefore represent neither maximum performance nor optimum utilization points. Memory and functional unit latencies were based on a processor cycle time of 4 ns. The scaling of distributed functional units was in most cases derived from experimental measures of instruction frequencies, rather than estimates. Although peak bandwidth demands can temporarily exceed the available bandwidth from certain functional units, stalls are avoided through operand queues on virtual functional units (Fig. 2), each queue buffering up to ⌈G/F⌉ instructions, where G is the number of context groups and F is the number of physical functional units belonging to a given instruction class (for example, ⌈8/2⌉ = 4 entries per queue for the two floating point multiply units of configuration 8G32C).

For each benchmark, a single executable file was produced and run without modification on all processor configurations. Binary compatibility was achieved by making few specific assumptions on context availability, and ensuring that thread fork overheads were adequately covered by computational work. The simulation results presented in later sections are normalized against the performance of the baseline configurations in Table 1.
3.2 Benchmark Programs

Six benchmark programs, consisting of small, realistic applications and synthetic benchmark programs, were used for the experimental evaluation. The programs were compiled manually into assembly code—a process which limited the practical complexity of the benchmarks, but allowed more aggressive optimizations. When architecturally similar machines are compared, the effects of manual compilation are largely factored out. The most low-level aspects of programming, such as basic block instruction scheduling and optimization of instruction selection, were automated in the assembler. Each benchmark is described briefly below:

• carries—Highly parallel integer search that updates an M-structure histogram with the carry run lengths for all possible 16-bit binary additions; carries was adapted from an original C source, with the main loop body coded as a thread.
• compress—Single threaded implementation of the UNIX “compress” utility, compressing a 22 kB file using adaptive Lempel-Ziv coding; compress was directly translated from its C source.
• gauss—Solution of a 200 × 200 system of linear equations through Gaussian elimination with partial pivoting and back-substitution; gauss was manually coded in assembly language to take advantage of M-structures for locking matrix rows, and I-structures in the parallel back-substitution phase.
• loops—Loops 1, 3, 5, 6, 7, 10, 11, and 12 of the Livermore FORTRAN Kernels [22] interspersed by barrier synchronizations; loops was translated from FORTRAN source, and exploited I-structures in parallelized recurrence relations.
• matmult—Multiplication of two 100 × 100 double-precision matrices, producing a separate product matrix; matmult was manually coded, using a blocked algorithm with 2 × 2 tiles (a sketch of this tiling follows the list).
• tf—Text formatter that reads a 22 kB file and outputs the text as fully justified paragraphs; the C source for tf was recoded as a pipeline of four stages across four threads, each thread communicating via FIFO data structures.

Fig. 5. Dynamic instruction mix of the benchmark programs.
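As noted in the matmult entry, the benchmark uses a blocked algorithm with 2 × 2 tiles. The C sketch below shows the kind of tiling that description suggests; the benchmark itself was hand-coded in assembly, and its exact loop structure and data layout are not given in the text, so this is an illustration rather than the benchmark source.

```c
#include <stddef.h>

#define N 100   /* matrix dimension used by the matmult benchmark */

/* 2x2-tiled double-precision matrix multiply: each tile reuses two loads
 * from A and two from B across four accumulations, improving register reuse. */
static void matmult_2x2(const double A[N][N], const double B[N][N], double C[N][N])
{
    for (size_t i = 0; i < N; i += 2) {
        for (size_t j = 0; j < N; j += 2) {
            double c00 = 0.0, c01 = 0.0, c10 = 0.0, c11 = 0.0;
            for (size_t k = 0; k < N; k++) {
                c00 += A[i][k]     * B[k][j];
                c01 += A[i][k]     * B[k][j + 1];
                c10 += A[i + 1][k] * B[k][j];
                c11 += A[i + 1][k] * B[k][j + 1];
            }
            C[i][j]         = c00;
            C[i][j + 1]     = c01;
            C[i + 1][j]     = c10;
            C[i + 1][j + 1] = c11;
        }
    }
}
```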
Table 2 lists the principal characteristics of the benchmarks. The dynamic thread counts indicate that all benchmarks except compress take advantage of multithreading; the average numbers of instructions executed between potentially suspensive synchronizations give rise to the mean thread run lengths shown. Cache and main memory activities for configuration 8G32C are included to illustrate that direct access to main memory successfully averted data cache bottlenecks (at most two cache accesses per cycle are available in 8G32C). Fig. 5 breaks down the dynamic instruction counts by operation type. Programs carries, compress, and tf represent integer workloads, characterized by cached data access and short basic blocks. Scientific computation is represented by gauss, loops, and matmult, which are substantially vectorizable, floating point codes that bypass the data cache for vector and matrix access. For all programs, parallel loops are unrolled a maximum of four iterations, and the statically rescheduled loop bodies are formed into threads for forking by a control thread.
