The Case for a Single-Chip Multiprocessor

Kunle Olukotun, Basem A. Nayfeh, Lance Hammond, Ken Wilson, and Kunyung Chang

Computer Systems Laboratory
Stanford University
Stanford, CA 94305-4070
http://www-hydra.stanford.edu

Abstract

Advances in IC processing allow for more microprocessor design options. The increasing gate density and cost of wires in advanced integrated circuit technologies require that we look for new ways to use their capabilities effectively. This paper shows that in advanced technologies it is possible to implement a single-chip multiprocessor in the same area as a wide issue superscalar processor. We find that for applications with little parallelism the performance of the two microarchitectures is comparable. For applications with large amounts of parallelism at both the fine and coarse grained levels, the multiprocessor microarchitecture outperforms the superscalar architecture by a significant margin. Single-chip multiprocessor architectures have the advantage in that they offer localized implementation of a high-clock rate processor for inherently sequential applications and low latency interprocessor communication for parallel applications.

1 Introduction

Advances in integrated circuit technology have fueled microprocessor performance growth for the last fifteen years. Each increase in integration density allows for higher clock rates and offers new opportunities for microarchitectural innovation. Both of these are required to maintain microprocessor performance growth. Microarchitectural innovations employed by recent microprocessors include multiple instruction issue, dynamic scheduling, speculative execution and non-blocking caches. In the future, the trend seems to be towards CPUs with wider instruction issue and support for larger amounts of speculative execution. In this paper, we argue against this trend. We show that, due to fundamental circuit limitations and limited amounts of instruction level parallelism, the superscalar execution model will provide diminishing returns in performance for increasing issue width. Faced with this situation, building a complex wide issue superscalar CPU is not the most efficient use of silicon resources. We present the case that a better use of silicon area is a multiprocessor microarchitecture constructed from simpler processors.

Permission to make digital/hard copy of part or all of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage, the copyright notice, the title of the publication and its date appear, and notice is given that copying is by permission of ACM, Inc. To copy otherwise, to republish, to post on servers, or to redistribute to lists, requires prior specific permission and/or a fee.

ASPLOS VII 10/96 MA, USA
© 1996 ACM 0-89791-767-7/96/0010...$3.50

To understand the performance trade-offs between wide-issue processors and multiprocessors in a more quantitative way, we compare the performance of a six-issue dynamically scheduled superscalar processor with a 4 x two-issue multiprocessor. Our comparison has a number of unique features. First, we accurately account for and justify the latencies, especially the cache hit time, associated with the two microarchitectures. Second, we develop floor-plans and carefully allocate resources to the two microarchitectures so that they require an equal amount of die area. Third, we evaluate these architectures with a variety of integer, floating point and multiprogramming applications running in a realistic operating system environment.

The results show that on applications that cannot be parallelized, the superscalar microarchitecture performs 30% better than one processor of the multiprocessor architecture. On applications with fine grained thread-level parallelism the multiprocessor microarchitecture can exploit this parallelism so that the superscalar microarchitecture is at most 10% better. On applications with large grained thread-level parallelism and multiprogramming workloads the multiprocessor microarchitecture performs 50-100% better than the wide superscalar microarchitecture.

The remainder of this paper is organized as follows. In Section 2, we discuss the performance limits of superscalar design from a technology and implementation perspective. In Section 3, we make the case for a single chip multiprocessor from an applications perspective. In Section 4, we develop floor plans for a six-issue superscalar microarchitecture and a 4 x two-issue multiprocessor and examine their area requirements. We describe the simulation methodology used to compare these two microarchitectures in Section 5, and in Section 6 we present the results of our performance comparison. Finally, we conclude in Section 7.

AMD EX1022
U.S. Patent No. 6,895,519

2 The Limits of the Superscalar Approach

A recent trend in the microprocessor industry has been the design of CPUs with multiple instruction issue and the ability to execute instructions out of program order. This ability, called dynamic scheduling, first appeared in the CDC 6600 [21]. Dynamic scheduling uses hardware to track register dependencies between instructions; an instruction is executed, possibly out of program order, as soon as all of its dependencies are satisfied. In the CDC 6600 the register dependency checking was done with a hardware structure called the scoreboard. The IBM 360/91 used register renaming to improve the efficiency of dynamic scheduling using hardware structures called reservation stations [3]. It is possible to design a dynamically scheduled superscalar microprocessor using reservation stations; Johnson gives a thorough description of this approach [13]. However, the most recent implementations of dynamic superscalar processors have used a structure similar to the one shown in Figure 1. Here register renaming between architectural and physical registers is done explicitly, and instruction scheduling and register dependency tracking between instructions are performed in an instruction issue queue. Examples of microprocessors designed in this manner are the MIPS Technologies R10000 [24] and the HP PA-8000 [14]. In these processors the instruction queue is actually implemented as multiple instruction queues for different classes of instructions (e.g. integer, floating point, load/store).

Figure 1. A dynamic superscalar CPU.

The three major phases of instruction execution in a dynamic superscalar machine are also shown in Figure 1. They are fetch, issue and execute. In the rest of this section we describe these phases and the limitations that will arise in the design of a very wide instruction issue CPU.

The goal of the fetch phase is to present the rest of the CPU with a large and accurate window of decoded instructions. Three factors constrain instruction fetch: mispredicted branches, instruction misalignment, and cache misses. The ability to predict branches correctly is crucial to establishing a large, accurate window of instructions. Fortunately, by using a moderate amount of memory (64 Kbit), branch predictors such as the selective branch predictor proposed by McFarling are able to reduce misprediction rates to under 5% for most programs [15]. However, good branch prediction is not enough. As Conte pointed out, it is also necessary to align a packet of instructions for the decoder [7].
When the issue width is wider than four instructions there is a high probability that it will be necessary to fetch across a branch for a single packet of instructions since, in integer programs, one in every five instructions is a branch [12]. This will require fetching from two cache lines at once and merging the cache lines together to form a single packet of instructions. Conte describes a number of methods for achieving this. A technique that divides the instruction cache into banks and fetches from multiple banks at once is not too expensive to implement and provides performance that is within 3% of a perfect scheme on an 8-wide issue machine. Even with good branch prediction and alignment a significant cache miss rate will limit the ability of the fetcher to maintain an adequate window of instructions. There are still some applications such as large logic simulations, transaction processing and the OS kernel that have significant instruction cache miss rates even with fairly large 64 KB two-way set-associative caches [19]. Fortunately, it is possible to hide some of the instruction cache miss latency in a dynamically scheduled processor by executing instructions that are already in the instruction window. Rosenblum et al. have shown that over 60% of the instruction cache miss latency can be hidden on a database benchmark with a 64 KB two-way set-associative instruction cache [19]. Given good branch prediction and instruction alignment it is likely that the fetch phase of a wide-issue dynamic superscalar processor will not limit performance.

In the issue phase, a packet of renamed instructions is inserted into the instruction issue queue. An instruction is issued for execution once all of its operands are ready. There are two ways to implement renaming.
One could use an explicit table for mapping architectural registers to physical registers, as in the R10000 [24], or one could use a combination reorder buffer/instruction queue, as in the PA-8000 [14]. The advantage of the mapping table is that no comparisons are required for register renaming. The disadvantage of the mapping table is that the number of access ports required by the mapping table structure is O x W, where O is the number of operands per instruction and W is the issue width of the machine. An eight-wide issue machine with three operands per instruction requires a 24-port mapping table. Implementing renaming with a reorder buffer has its own set of drawbacks. It requires n x Q x O x W 1-bit comparators to determine which physical registers should supply operands for a new packet of instructions, where n is the number of bits required to encode a register identifier and Q is the size of the instruction issue queue. Clearly, the number of comparators grows with the size of the instruction queue and issue width. Once an instruction is in the instruction queue, all instructions that issue must update their dependencies. This requires another set of n x Q x O x W comparators. For example, a machine with eight-wide issue, three-operand instructions, a 64-entry instruction queue, and 6-bit comparisons requires 9,216 1-bit comparators. The net effect of all the comparison logic and encoding associated with the instruction issue queue is that it takes a large amount of area to implement. On the PA-8000, which is a four-issue machine with 56 instruction issue queue entries, the instruction issue queue takes up 20% of the die area. In addition, as issue widths increase, larger windows of instructions are required to find independent instructions that can issue in parallel and maintain the full issue bandwidth. The result is a quadratic increase in the size of the instruction issue queue.
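As a rough check on these counts, the short Python sketch below evaluates the O x W port formula and the n x Q x O x W comparator formula for the eight-wide example. The function names are illustrative and do not come from the paper. The final loop assumes, purely for illustration, that the queue size Q grows in proportion to the issue width (8 entries per issue slot); under that assumption the comparator count quadruples each time the issue width doubles.

```python
def mapping_table_ports(operands_per_instr: int, issue_width: int) -> int:
    """Access ports needed by an explicit rename mapping table: O x W."""
    return operands_per_instr * issue_width


def queue_comparators(tag_bits: int, queue_entries: int,
                      operands_per_instr: int, issue_width: int) -> int:
    """1-bit comparators for one set of issue-queue comparisons: n x Q x O x W."""
    return tag_bits * queue_entries * operands_per_instr * issue_width


if __name__ == "__main__":
    # Eight-wide issue, three operands per instruction.
    print(mapping_table_ports(3, 8))        # 24

    # 6-bit register identifiers, 64-entry queue, three operands, eight-wide.
    print(queue_comparators(6, 64, 3, 8))   # 9216

    # Hypothetical scaling rule (an assumption, not from the paper):
    # the queue holds 8 entries per issue slot, so Q = 8 * W.
    for width in (2, 4, 8):
        print(width, queue_comparators(6, 8 * width, 3, width))
```

With Q tied to W in this way, the counts for issue widths 2, 4 and 8 are 576, 2,304 and 9,216 comparators, which illustrates the quadratic growth described above.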
Moving to the circuit level, the instruction issue queue uses a broadcast mechanism to communicate the tags of the instructions that are issued, which requires wires that span the length of the structure. In future advanced integrated circuit technologies these wires will have increasingly long delays relative to the gates that drive them [9]. Given this situation, ultimately, the instruction issue queue will limit the cycle time of the processor. For these reasons we believe that the instruction issue queue will fundamentally limit the performance of wide issue superscalar machines.

In the execution phase, operand values are fetched from the register file or bypassed from earlier instructions to execute on the functional units. The wide superscalar execution model will encounter performance limits in the register file, in the bypass logic and in the functional units. Wider instruction issue requires a larger window of instructions, which implies more register renaming. Not only must the register file be larger to accommodate more renamed registers, but the number of ports required to satisfy the full instruction issue bandwidth also grows with issue width. Again, this causes a quadratic increase in the complexity of the register file with increases in issue width. Farkas et al. have investigated the effect of register file complexity on performance [10]. They find that an eight-issue machine only performs 20% better than a four-issue machine when the effect of cycle time is included in the performance estimates. The complexity of the bypass logic also grows quadratically with the number of execution units; however, a more limiting factor is the delay of the wires that interconnect the execution units. As far as the execution units themselves are concerned, the arithmetic functional units can be duplicated to support the issue width, but more ports must be added to the primary data cache to provide the necessary load/store bandwidth. The cheapest way to add ports to the data cache is by building a banked cache [20], but the added multiplexing and control required to implement a banked cache increases the access time of the cache. We investigate this issue in more detail in Section 4.2.

3 The Case for a Single-Chip Multiprocessor

The motivation for building a single chip multiprocessor comes from two sources: there is a technology push and an application pull.
We have already argued that technology issues, especially the delay of the complex issue queue and multi-port register files, will limit the performance returns from a wide superscalar execution model. This motivates the need for a decentralized microarchitecture to maintain the performance growth of microprocessors. From the applications perspective, the microarchitecture that works best depends on the amount and characteristics of the parallelism in the applications.

Wall has performed one of the most comprehensive studies of application parallelism [22]. The results of his study indicate that applications fall in two classes. The first class consists of applications with low to moderate amounts of parallelism: under ten instructions per cycle with aggressive branch prediction and large, but not infinite, window sizes. Most of these applications are integer applications. The second class consists of applications with large amounts of parallelism: greater than forty instructions per cycle with aggressive branch prediction and large window sizes. The majority of these applications are floating point applications and most of the parallelism is in the form of loop-level parallelism.

The application pull towards a single-chip multiprocessor arises because these two classes of applications require different execution models. Applications in the first class work best on processors that are moderately superscalar (2 issue) with very high clock rates because there is little parallelism to exploit. To make this more concrete we note that a 200 MHz MIPS R5000, which is a single issue machine when running integer programs, achieves a SPEC95 integer rating which is 70% of the rating of a 200 MHz MIPS R10000, which is a four-issue machine [6]. Both machines have the same size data and instruction caches, but the R5000 has a blocking data cache, while the R10000 has a non-blocking data cache.
Applications in the second class have large amounts of parallelism and see performance benefits from a variety of methods designed to exploit parallelism such as superscalar, VLIW or vector processing. However, the recent advances in parallel compilers make a multiprocessor an efficient and flexible way to exploit the parallelism in these programs [1].

Single-chip multiprocessors, designed so that the individual processors are simple and achieve very high clock rates, will work well on integer programs in the first class. The addition of low latency communication between processors on the same chip also allows the multiprocessor to exploit the parallelism of the floating point programs in the second class. In Section 6 we evaluate the performance of a single-chip multiprocessor for these two application classes.

There are a number of ways to use a multiprocessor. Today, the most common use is to execute multiple processes in parallel to increase throughput in a multiprogramming environment under the control of a multiprocessor-aware operating system. We note that there are a number of commercially available operating systems that have this capability (e.g. Silicon Graphics IRIX, Sun Solaris, Microsoft Windows NT). Furthermore, the increasingly widespread use of visualization and multimedia applications tends to increase the number of active processes or independent threads on a desktop machine or server at a particular point in time.

Another way to use a multiprocessor is to execute multiple threads in parallel that come from a single application. Two examples are transaction processing and hand parallelized floating point scientific applications [23]. In this case the threads communicate using shared memory, and these applications are designed to run on parallel machines with communication latencies in the hundreds of CPU clock cycles; therefore, the threads do not communicate in a very fine grained manner.
Another example of manually parallelized applications is fine-grained thread-level integer applications. According to the results from Wall's study, these applications exhibit moderate amounts of parallelism when the instruction window size is very large and the branch prediction is perfect, because the parallelism that exists is widely distributed. Because of the large window size and the perfect branch prediction required, it will be very difficult to extract this parallelism with a superscalar execution model. However, it is possible for a programmer who understands the nature of the parallelism in the application to parallelize the application into multiple threads. The parallelism exposed in this manner is fine-grained and cannot be exploited by a conventional multiprocessor architecture. The only way to exploit this type of parallelism is with a single-chip multiprocessor architecture.

A third way to use a multiprocessor is to accelerate the execution of sequential applications without manual intervention; this requires automatic parallelization technology. Recently, this automatic parallelization technology was shown to be effective on scientific applications [2], but it is not yet ready for general purpose integer applications. Like the manually parallelized integer applications, these applications could derive significant performance benefits from the low-latency interprocessor communication provided by a single-chip multiprocessor.

4 Two Microarchitectures

To compare the wide superscalar and multiprocessor design approaches, we have developed the microarchitectures for two machines that will represent the state of the art in processor design a few years from now. The superscalar microarchitecture (SS) is a logical extension of the current R10000 superscalar design, widened from the current four-way issue to a six-way issue implementation. The multiprocessor microarchitecture (MP) is a four-way single-chip multiprocessor composed of four identical 2-way superscalar processors. In order to fit four identical processors on a die of the same size, each individual processor is comparable to the Alpha 21064, which became available in 1992 [8].

                                     6-way SS               4x2-way MP
# of CPUs                            1                      4
Degree superscalar                   6                      4 x 2
# of architectural registers         32int / 32fp           4 x 32int / 32fp
# of physical registers              160int / 160fp         4 x 40int / 40fp
# of integer functional units        3                      4 x 1
# of floating pt. functional units   3                      4 x 1
# of load/store ports                8 (one per bank)       4 x 1
BTB size                             2048 entries           4 x 512 entries
Return stack size                    32 entries             4 x 8 entries
Instruction issue queue size         128 entries            4 x 8 entries
I cache                              32 KB, 2-way S.A.      4 x 8 KB, 2-way S.A.
D cache                              32 KB, 2-way S.A.      4 x 8 KB, 2-way S.A.
L1 hit time                          2 cycles (4 ns)        1 cycle (2 ns)
L1 cache interleaving                8 banks                N/A
Unified L2 cache                     256 KB, 2-way S.A.     256 KB, 2-way S.A.
L2 hit time / L1 penalty             4 cycles (8 ns)        5 cycles (10 ns)
Memory latency / L2 penalty          50 cycles (100 ns)     50 cycles (100 ns)

Table 1. Key characteristics of the two microarchitectures.

These two extremely different microarchitectures have nearly identical die sizes when built in identical process technologies. The processor size we select is based upon the kinds of processor chips that advances in silicon processing technology will allow in the next few years.
When manufactured in a 0.25 µm process, which should be possible by the end of 1997, each of the chips will have an area of 430 mm², about 30% larger than leading-edge microprocessors being shipped today. This represents typical die size growth over the course of a few years among the largest, fastest microprocessors [11].

We have argued that the simpler two-issue CPU used in the multiprocessor microarchitecture will have a higher clock rate than the six-issue CPU; however, for the purposes of this comparison we have assumed that the two processors have the same clock rate. To achieve the same clock rate the wide superscalar architecture would require deeper pipelining due to the large amount of instruction issue logic in the critical path. For simplicity, we ignore latency variations between the architectures due to the degree of pipelining. We assume the clock frequency of both machines is 500 MHz. At 500 MHz the main memory latencies experienced by the processor are large. We have modeled the main memory as a 50-cycle, 100 ns delay for both architectures, typical values in a workstation today with 60 ns DRAMs and 40 ns of delays due to buffering in the DRAM controller chips [25].

Table 1 shows the key characteristics of the two architectures. We explain and justify these characteristics in the following sections. The integer and floating point functional unit result and repeat latencies are the same as the R10000 [24].

4.1 6-Way Superscalar Architecture

The 6-way superscalar architecture is a logical extension of the current R10000 design. As the floorplan in Figure 2 and the area breakdown in Table 2 indicate, the logic necessary for out-of-order instruction issue and scheduling dominates the area of the chip, due to the quadratic area impact of supporting 6-way instruction issue. First, we increased the number of ports in the instruction buffers by 50% to support 6-way issue instead of 4-way, increasing the area of each buffer by about 30-40%.
Second, we increased the number of instruction buffer entries from 48 to 128 so that the processor examines a larger window of instructions for ILP to keep the execution units busy. This large instruction window also compensates for the fact that the simulations do not execute code that is optimized for a 6-way superscalar machine. The larger instruction window size and wider issue width cause a quadratic area increase of the instruction sequencing logic to 3-4 times its original size. Altogether, the logic necessary to handle out-of-order instruction issue occupies about 120 mm², about 30% of the die. In comparison, the actual execution units occupy only about 70 mm²; just 18% of the die is required to build triple R10000 execution units in a 0.25 µm process.

Due to the increased rate at which instructions are issued, we also enhanced the fetch logic by increasing the size of the branch target buffer to 2048 entries and the call-return stack to 32 entries. This increases the branch prediction accuracy of the processor and prevents the instruction fetch mechanism from becoming a bottleneck since the 6-way execution engine requires a much higher instruction fetch bandwidth than the 2-way processors used in the MP architecture.

Figure 2. Floorplan for the six-issue dynamic superscalar microprocessor. [Floorplan image not reproduced: 21 mm x 21 mm die; legible block labels are External Interface, Instruction Fetch, Inst. Decode & Rename, Instruction Cache (32 KB), TLB, Data Cache (32 KB), Reorder Buffer, Instruction Queues, and Out-of-Order Logic, and Floating Point Unit; the remaining labels are illegible.]

The on-chip memory hierarchy is similar to that of the Alpha 21164: a small, fast level one (L1) cache backed up by a large on-chip level two (L2) cache. The wide issue width requires the L1 cache to support wide instruction fetches from the instruction cache and multiple loads from the data cache during each cycle. The two-way set associative 32 KB L1 data cache is banked eight ways into eight small, single-ported, independent 4 KB cache banks, each of which handles one access every 2 ns processor cycle. However, the additional overhead of the bank control logic and crossbar required to arbitrate between the multiple requests sharing the 8 data cache banks adds another cycle to the latency of the L1 cache, and increases the area by 25%. Therefore, our modeled L1 cache has a hit time of 2 cycles. Backing up the 32 KB L1 caches is a large, unified, 256 KB L2 cache that takes 4 cycles to access. These latencies are simple extensions of the times obtained for the L1 caches of current Alpha microprocessors [4], using a 0.25 µm process technology.

4.2 4 x 2-way Superscalar Multiprocessor Architecture

The MP architecture is made up of four 2-way superscalar processors interconnected by a crossbar that allows the processors to share the L2 cache. On the die, the four processors are arranged in a grid with the L2 cache at one end, as shown in Figure 3.
Internally, each of the processors has a register renaming buffer that is much more limited than the one in the 6-way architecture, since each CPU only has an 8-entry instruction buffer. We also quartered the size of the branch prediction mechanisms in the fetch units, to 512 BTB entries and 8 call-return stack entries. After the area adjustments caused by these factors are accounted for, each of the four processors is less than one-fourth the size of the 6-way SS processor, as shown in Table 3. The number of execution units actually increases in the MP because the 6-way processor had three units of each type, while the 4-way MP must have four, one for each CPU. On the other hand, the issue logic becomes dramatically smaller, due to the decrease in instruction buffer ports and the smaller number of entries in each instruction buffer. The scaling factors of these two units balance each other out, leaving the entire processor very close to one-fourth of the size of the 6-way processor.

Figure 3. Floorplan for the four-way single-chip multiprocessor. [Floorplan image not reproduced: 21 mm x 21 mm die; legible block labels are Processors #1 to #4, each with its own 8K I-cache and 8K D-cache, and the External Interface; the remaining labels are illegible.]

The on-chip cache hierarchy of the multiprocessor is significantly different from the cache hierarchy of the 6-way superscalar processor. Each of the 4 processors has its own single-banked and single-ported 8 KB instruction and data caches that can both be accessed in a single 2 ns cycle. Since each cache can only be accessed by a single processor with a single load/store unit, no additional overhead is incurred to handle arbitration among independent memory-access units.
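The cycle counts and nanosecond figures quoted for the two cache hierarchies are related by the assumed 500 MHz clock, which gives a 2 ns cycle. The following Python sketch makes that conversion explicit for the latencies listed in Table 1. The dictionary layout and function name are illustrative choices and are not taken from the paper.

```python
CLOCK_MHZ = 500.0  # clock frequency assumed for both machines


def cycles_to_ns(cycles: int, clock_mhz: float = CLOCK_MHZ) -> float:
    """Convert a latency in processor cycles to nanoseconds."""
    return cycles * 1000.0 / clock_mhz


# Latencies in cycles, as listed in Table 1.
LATENCY_CYCLES = {
    "6-way SS":   {"L1 hit": 2, "L2 hit": 4, "memory": 50},
    "4x2-way MP": {"L1 hit": 1, "L2 hit": 5, "memory": 50},
}

if __name__ == "__main__":
    for machine, levels in LATENCY_CYCLES.items():
        for level, cycles in levels.items():
            print(f"{machine:11s} {level:7s} {cycles:3d} cycles = "
                  f"{cycles_to_ns(cycles):5.1f} ns")
```

The conversion reproduces the pairs given in Table 1, for example 2 cycles = 4 ns for the SS L1 hit and 50 cycles = 100 ns for main memory.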
However, since the four processors now share a single L2 cache, that cache requires an extra cycle of latency during every access to allow time for interprocessor arbitration and crossbar delay. We model this additional L2 delay by penalizing the MP an additional cycle on every L2 cache access, resulting in a 5 cycle L2 hit time.

5 Simulation Methodology

Accurately evaluating the performance of the two microarchitectures requires a way of simulating the environment in which we would expect these architectures to be used in real systems. In this section we describe the simulation environment and the applications used in this study.

5.1 Simulation Environment

We execute the applications in the SimOS simulation environment [18]. SimOS models the CPUs, memory hierarchy and I/O devices of uniprocessor and multiprocessor systems in sufficient detail to boot and run a commercial operating system. SimOS uses the MIPS-2 instruction set and runs the Silicon Graphics IRIX 5.3 operating system which has been tuned for multiprocessor performance.

                                  0.35 µm R10K    Size Extrapolated   % Growth Due to
CPU Component                     Original Size   to 0.25 µm          New Functionality   New Size   % Area
                                  (mm²)           (mm²)                                   (mm²)
256K On-Chip L2 Cache (a)         219             112                 0%                  112        26%
8-bank D Cache (32 KB)            26              13                  25%                 17         4%
8-bank I Cache (32 KB)            28              14                  25%                 18         4%
TLB Mechanism                     10              5                   200%                15         3%
External Interface Unit           27              14                  0%                  14         3%
Instruction Fetch Unit and BTB    18              9                   200%                28         6%
Instruction Decode Section        21              11                  250%                38         9%
Instruction Queues                28              14                  250%                50         12%
Reorder Buffer                    17              9                   300%                34         9%
Integer Functional Units          20              10                  200%                31         7%
FP Functional Units               24              12                  200%                37         9%
Clocking & Overhead               73              37                  0%                  37         9%
Total Size                        --              --                  --                  430        100%

Table 2. Size extrapolations for the 6-way superscalar from the MIPS R10000 processor.

                                  0.35 µm R10K    Size Extrapolated   % Growth Due to                % Area
CPU Component                     Original Size   to 0.25 µm          New Functionality   New Size   (of CPU / of
                                  (mm²)           (mm²)                                   (mm²)      entire chip)
D Cache (8 KB)                    26              13                  -75%                3          6% / 3%
I Cache (8 KB)                    28              14                  -75%                4          7% / 3%
TLB Mechanism                     10              5                   0%                  5          9% / 5%
Instruction Fetch Unit and BTB    18              9                   -25%                7          13% / 7%
Instruction Decode Section        21              11                  -50%                5          10% / 5%
Instruction Queues                28              14                  -70%                4          8% / 4%
Reorder Buffer                    17              9                   -80%                2          3% / 2%
Integer Functional Units          20              10                  0%                  10         20% / 10%
FP Functional Units               24              12                  0%                  12         23% / 12%
Per-CPU Subtotal                  --              --                  --                  53         100% / 50%
256K On-Chip L2 Cache (a)         219             112                 0%                  112        26%
External Interface Unit           27              14                  0%                  14         3%
Crossbar Between CPUs             --              --                  --                  50         12%
Clocking & Overhead               73              37                  0%                  37         9%
Total Size                        --              --                  --                  424        100%

Table 3. Size extrapolations in the 4 x 2-way MP from the MIPS R10000 processor.

a. estimated from current L1 caches
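The extrapolations in Tables 2 and 3 can be approximated with a few lines of Python. This is only a sketch, and it rests on an assumption that the text does not state explicitly: that area shrinks with the square of the feature size when moving from the 0.35 µm process to the 0.25 µm process. The function names are hypothetical. Under that assumption, the computed sizes agree with the table entries to within about 1 mm², the differences being attributable to rounding.

```python
OLD_FEATURE_UM = 0.35  # process of the original R10000 measurements
NEW_FEATURE_UM = 0.25  # target process


def shrink(area_mm2: float) -> float:
    """Scale an area to the new process, assuming an ideal quadratic shrink.

    The quadratic rule is an assumption of this sketch, not a statement
    taken from the paper.
    """
    return area_mm2 * (NEW_FEATURE_UM / OLD_FEATURE_UM) ** 2


def extrapolate(original_mm2: float, growth_pct: float) -> float:
    """Apply the shrink, then the '% growth due to new functionality' column."""
    return shrink(original_mm2) * (1.0 + growth_pct / 100.0)


# (original size in mm^2, growth in %) copied from Tables 2 and 3.
EXAMPLES = {
    "SS instruction queues": (28, 250),
    "SS TLB mechanism": (10, 200),
    "MP D cache (8 KB)": (26, -75),
    "L2 cache (256 KB)": (219, 0),
}

if __name__ == "__main__":
    for name, (original, growth) in EXAMPLES.items():
        print(f"{name:24s} {extrapolate(original, growth):6.1f} mm^2")
```

The same two functions apply to every row of both tables; only the growth percentage differs between the SS and MP versions of a component.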
SimOS actually simulates the operating system; therefore, all the memory references made by the operating system and the applications are generated. This feature is particularly important for the study of multiprogramming workloads where the time spent executing kernel code makes up a significant fraction of the non-idle execution time.

A unique feature of SimOS that makes studies such as this feasible is that SimOS supports multiple CPU simulators that use a common instruction set architecture. This allows trade-offs to be made between the simulation speed and accuracy. The fastest CPU simulator, called Embra, uses binary-to-binary translation techniques and is used for booting the operating system and positioning the workload so that we can focus on interesting regions of execution. The medium performance CPU simulator, called Mipsy, is two orders of magnitude slower than Embra. Mipsy is an instruction set simulator that models all instructions with a one cycle result latency and a one cycle repeat rate. Mipsy interprets all user and privileged instructions and feeds memory references to a memory system simulator. The slowest, most detailed CPU simulator is MXS, which supports dynamic scheduling, speculative execution and non-blocking memory references. MXS is over four orders of magnitude slower than Embra.

Integer applications
  compress    compresses and uncompresses file in memory
  eqntott     translates logic equations into truth tables
  m88ksim     Motorola 88000 CPU simulator
  MPsim       VCS compiled Verilog simulation of a multiprocessor
Floating point applications
  applu       solver for parabolic/elliptic partial differential equations
  apsi        solves problems of temperature, wind, velocity, and distribution of pollutants
  swim        shallow water model with 1K x 1K grid
  tomcatv     mesh-generation with Thompson solver
Multiprogramming application
  pmake       parallel make of gnuchess using C compiler

Table 4. The applications.

The cache and memory system component of our simulator is completely event-driven and interfaces to the SimOS processor model which drives it. Processor memory references cause threads to be generated which keep track of the state of each memory reference and the resource usage in the memory system. A call-back mech
