`Evaluation of Design Alternatives for a Multiprocessor
`Microprocessor
`
`Basem A. Nayfeh, Lance Hammond and Kunle Olukotun
`Computer Systems Laboratory
`Stanford University
`Stanford, CA 94305-4070
`{ bnayfeh, lance, kunle} @ogun.stanford.edu
`
`Abstract
`
In the future, advanced integrated circuit processing and packaging
technology will allow for several design options for multiprocessor
microprocessors. In this paper we consider three architectures:
shared-primary cache, shared-secondary cache, and shared-memory.
We evaluate these three architectures using a complete system
simulation environment which models the CPU, memory hierarchy and
I/O devices in sufficient detail to boot and run a commercial
operating system. Within our simulation environment, we measure
performance using representative hand- and compiler-generated
parallel applications, and a multiprogramming workload. Our results
show that when applications exhibit fine-grained sharing, both
shared-primary and shared-secondary architectures perform similarly
when the full costs of sharing the primary cache are included.
`
`1 Introduction
`
With the use of advanced integrated circuit (IC) processing and
packaging technology, several options for the design of
high-performance microprocessors are available. A design option that
is becoming increasingly attractive is a multiprocessor
architecture. Multiprocessors offer high performance on single
applications by exploiting loop-level parallelism and provide high
throughput and low interactive response time on multiprogramming
workloads [2][15]. With the multiprocessor design option, a small
number of processors are interconnected on a single die or on a
multichip module (MCM) substrate. The abundance of wires available
on-chip or on-MCM makes it possible to construct interprocessor
communication mechanisms which have much lower latency and higher
bandwidth than a single bus-based multiprocessor architecture.
Given the communication implementation options available for
improving interprocessor communication performance, it is important
to understand which mechanism provides the best overall performance
on important application classes. The objective of this paper is to
characterize the benefits and costs of realistic implementations of
two proposed cache-sharing mechanisms that exploit the increased
wire density: shared level-1 (L1) cache and shared level-2 (L2)
cache. To provide a point of reference, the performance of these
architectures is compared to that of a conventional single bus-based
shared-memory multiprocessor. All three architectures are simulated
using a complete system simulation environment which models the
CPU, memory hierarchy and I/O devices in sufficient detail to boot
and run the Silicon Graphics IRIX 5.3 operating system. Within our
simulation environment, we evaluate the performance of the three
architectures using representative hand- and compiler-generated
parallel applications, and a multiprogramming workload. Both kernel
and user level references are included in our results.
`
We present two sets of results: one set with a simple CPU model
that does not include latency hiding or the true latencies of the
shared-L1 architecture, and a second set with a very detailed and
completely accurate CPU model. The results from the simple CPU
model are used to classify the parallel applications into three
broad classes: applications with a high degree of interprocessor
communication, applications with a moderate degree of interprocessor
communication, and applications with little or no interprocessor
communication. For applications in the first class we find that the
shared-L1 architecture usually outperforms the other two
architectures substantially. For applications in the second class
the shared-L1 architecture performs less than 10% better than the
other architectures. Finally, for applications in the third class,
contrary to conventional wisdom, the performance of the shared-L1
is still slightly better than that of the other architectures. The
second set of results includes the effects of dynamic scheduling,
speculative execution and non-blocking memory references. These
results show that when the additional latencies associated with
sharing the L1 cache are included in the simulation model, the
performance advantage of the shared-L1 architecture can diminish
substantially.
`
The rest of this paper is organized as follows. Section 2 introduces
`the three multiprocessor architectures and the architectural assump-
`tions used throughout the paper. Section 3 describes the simulation
`environment and benchmark applications that are used to study
`these architectures. Simulation results of the performance of the
`three multiprocessor architectures are presented in Section 4. In
`Section 5 we discuss related work and we conclude the paper in
`Section 6.
`
`
`2 Three Multiprocessor Architectures
`
`The distinguishing characteristic of shared-memory multiprocessor
`architectures is the level of the memory hierarchy at which the
CPUs are interconnected. In general, a multiprocessor architecture
whose interconnect is closer to the CPUs in the memory hierarchy
`will be able to exploit fine-grained parallelism more efficiently than
`a multiprocessor architecture whose interconnect is further away
`from the CPUs in the memory hierarchy. Conversely, the perfor-
`mance of the closely interconnected multiprocessor will tend to be
`worse than the loosely interconnected multiprocessor when the
`CPUs are executing independent applications. With this in mind,
`the challenge in the design of a small-scale multiprocessor micro-
`processor is to achieve good performance on fine-grained parallel
`applications without sacrificing the performance of independent
`parallel jobs. To develop insight about the most appropriate level
`for connecting the CPUs in a multiprocessor microprocessor we
`will compare the performance of three multiprocessor architec-
`tures: shared-L1 cache, shared-L2 cache, and a conventional single-
`bus shared main memory. We will see that these architectures are
`natural ways to connect multiple processors using different levels of
`the electronic packaging hierarchy. Before we discuss the features
`that distinguish the three multiprocessor architectures, we will dis-
`cuss the characteristics of the CPU, which is used with all three
`memory architectures.
`
`2.1 CPU
`
`This study uses a 2-way issue processor that includes the support
`for dynamic scheduling, speculative execution, and non-blocking
`caches that one would expect to find in a modern microprocessor
`design. The processor executes instructions using a collection of
`fully pipelined functional units whose latencies are shown in
Table 1. The load latency of the CPU is specific to the multiproces-
`sor architecture. To eliminate structural hazards there are two cop-
`ies of every functional unit except for the memory data port.
`
[Table 1 lists the latencies in CPU cycles of the fully pipelined
functional units: ALU, Branch, Load, Store, integer Multiply and
Divide, and the SP/DP floating-point Add/Sub, Multiply, and Divide
units.]

Table 1 CPU functional unit latencies.
`
`Other characteristics of the processor are 16 Kbyte two-way set
`associative instruction and data caches, a 32 entry centralized win-
`dow instruction issue scheme and a 32 entry reorder buffer to main-
`tain precise interrupts and recover from mispredicted branches.
Branches are predicted with a 1024 entry branch target buffer. The
`non-blocking L1 data cache supports up to four outstanding misses.
`
`68
`
`-
`
The CPU is modeled using the MXS simulator [4], which is capable
`of modeling modern microarchitectures in detail. In this simulator
`the MIPS-2 instruction set is executed using a decoupled pipeline
`consisting of fetch, execute and graduate stages. In the fetch stage
`up to two instructions are fetched from the cache and placed into
`the instruction window. Every cycle up to two instructions from the
`window whose data dependencies have been satisfied move to the
`execute stage. After execution, instructions are removed from the
`instruction window and wait in the reorder buffer until they can
graduate, i.e., update the permanent machine state in program order.
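
The decoupled pipeline can be pictured with a short sketch. The code
below is our own illustration, not the MXS source: it ignores
functional unit latencies and data dependencies, and simply shows
the per-cycle fetch, execute and graduate structure with the 2-wide
fetch and issue limits.

    /* Illustrative three-stage decoupled pipeline (not MXS code). */
    #include <stdio.h>

    #define FETCH_WIDTH 2
    #define ISSUE_WIDTH 2
    #define N_INST 8

    enum stage { WAITING, IN_WINDOW, EXECUTED };

    int main(void) {
        enum stage st[N_INST] = { WAITING };
        int fetched = 0, graduated = 0, cycles = 0;

        while (graduated < N_INST) {
            cycles++;
            /* Fetch: up to two instructions enter the window. */
            for (int i = 0; i < FETCH_WIDTH && fetched < N_INST; i++)
                st[fetched++] = IN_WINDOW;

            /* Execute: up to two instructions whose dependencies are
             * satisfied leave the window (here, all are ready).      */
            int issued = 0;
            for (int i = graduated; i < fetched && issued < ISSUE_WIDTH; i++)
                if (st[i] == IN_WINDOW) { st[i] = EXECUTED; issued++; }

            /* Graduate: update permanent state in program order only. */
            while (graduated < N_INST && st[graduated] == EXECUTED)
                graduated++;
        }
        printf("%d instructions in %d cycles\n", N_INST, cycles);
        return 0;
    }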
`
`2.2
`
`Shared-L1 Cache Multiprocessor
`
`By the end of the century it will be possible to place multiple pro-
`cessors on a single die. A natural way to interconnect these proces-
`sors will be at the first level cache as illustrated in Figure 1. The
figure shows four CPUs that share a common, 4-way banked write-
`back L1 cache through a crossbar switching mechanism. This archi-
`tecture is similar to the M-machine [8]. The primary advantage of
`this architecture compared to other multiprocessor architectures is
`that it provides the lowest latency interprocessor communication
`possible using a shared-memory address space. Low latency inter-
`processor communication makes it possible to achieve high perfor-
`mance on parallel applications with fine-grained parallelism.
`Parallel application performance is also improved by processors
`that prefetch shared data into the cache for each other, eliminating
cache misses for processors that use the data later. Other advantages
`of a shared-L1 cache are that it eliminates the complex cache-
`coherence logic usually associated with cache-coherent multipro-
`cessors and implicitly provides a sequentially consistent memory
`without sacrificing performance. This makes the hardware imple-
mentation simpler and programming easier.
`
There are some disadvantages to the shared-L1 cache architecture.
The access time of the L1 cache is increased by the time required
to pass through the crossbar between the processors and the cache.
We assume that the added overhead of the crossbar switching
mechanism and cache bank arbitration logic would make the total
latency of the L1 cache three cycles, even though the cache banks
would be pipelined to allow single-cycle accesses. In addition, all
of the memory references performed by the processors enter the
shared cache, so there is some probability of extra delays due to
bank conflicts between memory references from different processors.
A third disadvantage is the converse of the shared-data advantage:
processors working with different data can conflict in the shared
cache, causing the miss rate to increase.
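
The bank-conflict effect is easy to make concrete. The sketch below
is our own simplified model rather than part of the simulator: the
bank is selected by low-order bits of the 32-byte line address, so
two processors touching lines that map to the same bank in the same
cycle must be serialized by the arbitration logic.

    /* Simplified bank selection for a 4-way banked shared L1 cache. */
    #include <stdio.h>
    #include <stdint.h>

    #define N_BANKS   4
    #define LINE_SIZE 32

    static int bank_of(uint32_t addr) {
        return (int)((addr / LINE_SIZE) % N_BANKS); /* line-interleaved */
    }

    int main(void) {
        /* Two processors issue references in the same cycle. */
        uint32_t p0 = 0x1000, p1 = 0x2020;
        if (bank_of(p0) == bank_of(p1))
            printf("conflict: one reference stalls for a cycle\n");
        else
            printf("banks %d and %d proceed in parallel\n",
                   bank_of(p0), bank_of(p1));
        return 0;
    }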
`
Given the clock rates and complexity of the CPU-cache interface of
future microprocessors, a single-die implementation of the shared-L1
cache is essential in order to maintain a low L1 cache latency. If
chip boundaries were crossed, either the L1 latency would be
increased to five or more cycles or the clock rate of the processors
would be severely degraded. Either of these would have a significant
impact on processor performance. The major drawback to the
single-die implementation today would be the large area and high
cost of the die. However, the increasing density of integrated
circuit technology will soon make it possible to put four processors
on a chip with a reasonable die area. We estimate the die area
required for four processors of the complexity of the DEC Alpha
21064A [6] (a dual-issue statically scheduled superscalar processor
with 32 KB of on-chip cache) and the crossbar interconnect to be
320 mm² in 0.35 micron technology. This is the area of the largest
microprocessor chips produced today. In a 0.25 micron CMOS
technology, which will be available by the end of 1997, the area is
reduced to 160 mm², a medium-sized chip.

Figure 1. Shared primary cache multiprocessor. [Diagram: four CPUs
connect through a crossbar to a shared, 4-way banked write-back L1
cache; a 128-bit L2 bus links it to a 2 MB, 2-way set-associative
L2 cache, which in turn connects to a 128-bit system bus.]
`
The L2 cache and main memory are uniprocessor-like in this system
since they are not involved in interprocessor communication. This
makes them relatively simple. They are designed with low latencies
and heavy pipelining. The degree of pipelining is primarily limited
by the 128-bit L2 bus and the 32-byte cache line size that we
assume; the transfer time of two cycles sets the lower bound on L2
cache occupancy. For the purposes of this paper we assume memory
latencies and bandwidths that could be attained in a 200 MHz
microprocessor with commodity SRAM L2 cache memory and multibanked
DRAM main memory: an L2 with 10-cycle latency and 2-cycle occupancy
(no overhead), and a main memory with a 50-cycle latency and a
6-cycle occupancy [7]. No cache-coherence mechanisms between the
four processors on the chip are required at these levels of the
memory hierarchy, since they are below the level of sharing. Only
logic to keep the L2 cache coherent with other, completely separate
processors on the system bus is required.
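
These assumptions can be summarized as a small parameter table. The
sketch below is only our encoding of the numbers quoted above for
the shared-L1 system; it also shows the arithmetic behind the
two-cycle occupancy bound: a 32-byte line on a 128-bit (16-byte)
bus requires two bus transfers.

    /* Assumed shared-L1 timing parameters (200 MHz, in cycles). */
    #include <stdio.h>

    struct level { int latency, occupancy; };

    int main(void) {
        struct level l1  = {  3, 1 };  /* crossbar + bank arbitration */
        struct level l2  = { 10, 2 };  /* commodity SRAM              */
        struct level mem = { 50, 6 };  /* multibanked DRAM            */

        int line_bytes = 32, bus_bytes = 128 / 8;
        printf("L2 occupancy lower bound: %d cycles\n",
               line_bytes / bus_bytes);             /* 32/16 = 2 */
        printf("L1/L2/memory latency: %d/%d/%d cycles\n",
               l1.latency, l2.latency, mem.latency);
        printf("L2/memory occupancy: %d/%d cycles\n",
               l2.occupancy, mem.occupancy);
        return 0;
    }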
`
`2.3 Shared-L2 Cache Multiprocessor
`
The second multiprocessor architecture we consider shares data
through the L2 cache instead of the L1 cache. A possible
implementation of this scheme is illustrated in Figure 2. Here four
processors and the shared-L2 cache interface are separate dies which
are interconnected using MCM packaging [16]. The four processors
and their associated write-through L1 caches are completely
independent. This eliminates the extra access time of the shared-L1
cache, returning the latency of the L1 cache to 1 cycle. However,
the shared-L2 cache interface increases the L2 cache latency from
10 cycles to 14 cycles. These extra cycles are due to crossbar
overhead and the delay of additional chip boundary crossings [17].
`
`69
`
`512K 2-Wayy 512K 2-Wayg512K 2-Wayg
`L2 Bank
`L2 Bank
`L2 Bank
`
`512K 2-Wa
`L2 Bank
`
`Figure 2, Shared secondary cache multiprocessor.
`
The write-back L2 cache has four independent banks to increase its
bandwidth and enable it to support four independent access streams.
To reduce the pin count of the crossbar chip, which must support
interfaces to the four processors as well as the four cache banks,
the L2 cache datapath is 64 bits instead of the 128 bits used in
the shared-L1 cache architecture. This has the side effect of
increasing the occupancy of the L2 cache from two to four cycles
for a 32-byte cache line transfer. Since we assume the L2 cache is
designed to supply the critical word first, this does not have a
significant performance impact. While the additional latency of the
crossbar will reduce L2 cache performance compared to the shared-L1
case, only memory accesses that miss in the L1 cache will have to
contend with the reduced-performance L2 cache. For the purposes of
sharing, the 14-cycle communication latency will allow relatively
fine-grained communication on multiprocessor programs, but this
latency is still much greater than the three-cycle sharing latency
of the shared-L1 cache architecture.
`
The shared-L2 architecture implemented with separate chips results
in a large number of interchip wires in the system. However, the
performance-critical path between a processor and its L1 cache
remains on chip. The less frequently used path between the L1 and
L2 caches is more tolerant of a few cycles of additional overhead
from crossing die boundaries since it is already 10 cycles long.
Thus, a system in which smaller dies are packaged on an MCM may
have performance that is close to a shared-L2 cache implemented on
a single die while potentially being less expensive to build.
Figure 2 shows that the four processor dies and the crossbar die
are packaged on an MCM, while the four separate 64-bit datapath
interfaces to the cache banks go off of the MCM to separate SRAMs.
Even with the narrower L2 cache datapaths, the crossbar chip will
still require several hundred signal pins for the interfaces to the
processors and cache banks. This high pin count is only feasible
today using chips with area pads that are packaged using MCM
technology [17].
`
The main memory for this architecture is identical to the main
memory from the shared-L1 case, since the system below the L2 cache
is essentially a uniprocessor memory hierarchy. For the purposes of
this paper we assume 50 cycles of latency and 6 cycles of occupancy
per access. With this configuration, some hardware must also be
installed to keep the L1 caches coherent, at least for shared
regions of memory. The simplest way to do this is to assume that
the L1 caches use a write-through policy for shared data and that
there is a directory entry associated with each L2 cache line. When
there is a change to a cache line caused by a write or a
replacement, all processors caching the line must receive
invalidates or updates [17]. This implementation of cache coherence
saves a considerable amount of snooping control logic on the
processors. If this control logic could be eliminated, the
processors could be made simpler than current microprocessors,
which support snoopy cache coherence.
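
A minimal sketch of this directory scheme follows, assuming one
presence bit per processor in each L2 line's directory entry and
invalidation (rather than update) messages; the names and encoding
are ours, not an implementation proposed by the paper.

    /* Directory entry kept with each L2 cache line. */
    #include <stdio.h>
    #include <stdint.h>

    #define N_CPUS 4

    struct dir_entry { uint8_t present; }; /* bit i: CPU i's L1 has the line */

    /* On a write (the L1s write through for shared data) or an L2
     * replacement, every other cached copy must be invalidated.      */
    static void invalidate_sharers(struct dir_entry *e, int writer) {
        for (int cpu = 0; cpu < N_CPUS; cpu++)
            if (cpu != writer && (e->present & (1u << cpu))) {
                printf("invalidate L1 copy in CPU %d\n", cpu);
                e->present &= (uint8_t)~(1u << cpu);
            }
    }

    int main(void) {
        struct dir_entry e = { 0xF };    /* all four L1s share the line */
        invalidate_sharers(&e, 0);       /* CPU 0 writes the line       */
        printf("presence bits now 0x%X\n", e.present); /* prints 0x1 */
        return 0;
    }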
`
`2.4 Shared-Memory Multiprocessor
`
The final architecture we consider is a traditional bus-based
multiprocessor. The processors and their individual L1 caches run
at full, single-cycle cache speeds, much like the shared-L2 system.
In addition, each processor has its own separate bank of L2 cache
that it can access at the full speed of the SRAMs, much like the
shared-L1 system (latency = 10 cycles, occupancy = 2 cycles).

Figure 3. Shared-memory multiprocessor. [Diagram: four processors,
each with a 16K, 2-way L1 cache and a private 512K, 2-way
set-associative L2 cache, attached to a 128-bit system bus.]

However, in order to communicate, each processor must access main
memory through the shared system bus, with its high latencies
(still, latency = 50 cycles, occupancy = 6 cycles). This will tend
to limit the degree of communication that is possible: each
exchange will take 50 or more cycles. Even with systems designed to
support cache-to-cache transfers of shared data, the typical times
seen will still have a latency of approximately 50 cycles, since
all three of the other processors on the bus must check their cache
tags for a match, agree which processor should source the data, and
then recover the necessary data from the correct cache. Since this
will usually require accesses to the off-chip L2 caches controlled
by the other processors while these caches are busy with local
cache traffic, and because we must wait for the slowest processor's
response in order to ensure coherency, typical times will often be
comparable to memory access times in bus-based systems [7][9].

This architecture represents the capabilities and limitations of
current printed circuit board based systems. It is worth noting
that the processors must support full snoopy cache coherence of
both their L1 and L2 caches. This level of support is included in
the latest designs from most leading manufacturers of
microprocessors.

                         Latency    Occupancy
    Shared-L1
      Level 1 Cache         3           1
      Level 2 Cache        10           2
      Main Memory          50           6
    Shared-L2
      Level 1 Cache         1           1
      Level 2 Cache        14           4
      Main Memory          50           6
    Shared-Mem.
      Level 1 Cache         1           1
      Level 2 Cache        10           2
      Main Memory          50           6
      Cache-to-Cache      ~50           6

Table 2 A summary of the ideal memory latencies of the three
multiprocessor architectures in CPU clock cycles (1 cycle = 5 ns).

Table 2 shows the contention-free access latencies for the three
multiprocessor architectures. A common theme is the increased
access time to the level of the memory hierarchy at which the
processors communicate. A direct result of this is that the further
away from the processor communication takes place, the less impact
it will have on uniprocessor performance.
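
The effect on uniprocessor work can be illustrated with a
back-of-the-envelope average memory access time (AMAT) calculation
from the Table 2 numbers. The miss rates below are assumed for
illustration only; they are not measurements from our workloads.

    /* AMAT from Table 2 latencies under assumed miss rates. */
    #include <stdio.h>

    int main(void) {
        double m1 = 0.05, m2 = 0.20; /* assumed L1 and L2 local miss rates */

        /* AMAT = L1 latency + m1 * (L2 latency + m2 * memory latency) */
        double shared_l1  = 3.0 + m1 * (10.0 + m2 * 50.0); /* 4.00 */
        double shared_l2  = 1.0 + m1 * (14.0 + m2 * 50.0); /* 2.20 */
        double shared_mem = 1.0 + m1 * (10.0 + m2 * 50.0); /* 2.00 */

        printf("shared-L1:  %.2f cycles\n", shared_l1);
        printf("shared-L2:  %.2f cycles\n", shared_l2);
        printf("shared-mem: %.2f cycles\n", shared_mem);
        return 0;
    }

Sharing at the L1 penalizes every cache access, while sharing at
the L2 penalizes only L1 misses, which is why the looser couplings
cost uniprocessor workloads less.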
`
`3 Methodology
`
`Accurately evaluating the performance of the three multiprocessor
`architectures requires a way of simulating the environment in which
`we would expect these architectures to be used in real systems. In
this section we describe the simulation environment and the appli-
`cations used in this study.
`
`3.1 Simulation Environment
`
`To generate the parallel memory references we use the SimOS sim-
`ulation environment [20]. SimOS models the CPUs, memory hier-
`archy and I/O devices of uniprocessor and multiprocessor systems
`in sufficient detail to boot and run a commercial operating system.
`SimOS uses the MIPS-2 instruction set and runs the Silicon Graph-
ics IRIX 5.3 operating system, which has been tuned for
multiprocessor performance. Because SimOS actually simulates the
operating system, it can generate all the memory references made by
the operating system and the applications. This feature is
particularly important for the study of multiprogramming workloads,
where the time spent executing kernel code makes up a significant
fraction of the non-idle execution time.
`
A unique feature of SimOS that makes studies such as this feasible
`is that SimOS supports multiple CPU simulators that use a common
`instruction set architecture. This allows trade-offs to be made
`
`4
`
`
`
`between the simulation speed and accuracy. The fastest CPU simu-
`lator, called Embra, uses binary-to-binary translation techniques
`and is used for booting the operating system and positioning the
`workload so we can focus on interesting regions of the execution
`time. The medium performance CPU simulator, called Mipsy, is
`two orders of magnitude slower than Embra. Mipsy is an instruc-
`tion set simulator that models all instructions with a one cycle result
`latency and a one cycle repeat rate. Mipsy interprets all user and
privileged instructions and feeds memory references to the memory
`system simulator. The slowest, most detailed CPU simulator is
`MXS, which supports dynamic scheduling, speculative execution
`and non-blocking memory references. MXS is over four orders of
`magnitude slower than Embra.
`
The cache and memory system component of our simulator is
completely event-driven and interfaces to the SimOS processor
model, which drives it. Processor memory references cause threads
to be generated which keep track of the state of each memory
reference and the resource usage in the memory system. A call-back
mechanism is used to inform the processor of the status of all
outstanding references, and to inform the processor when a
reference completes. These mechanisms allow for very detailed cache
and memory system models, which include cycle-accurate measures of
contention and resource usage throughout the system.
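
The interface between the processor model and the event-driven
memory system can be pictured as a completion callback. The sketch
below is our own minimal rendering of the idea with hypothetical
names; it is not the SimOS source.

    /* Minimal memory-reference interface with a completion callback. */
    #include <stdio.h>
    #include <stdint.h>

    typedef void (*mem_callback)(int cpu, uint32_t addr, int cycles);

    /* The memory system accepts a reference and later informs the
     * processor of its completion. A real model would queue an event
     * and fire the callback when the simulated access finishes.      */
    static void memory_reference(int cpu, uint32_t addr, mem_callback done) {
        int latency = 10;        /* e.g., an L1 miss serviced by the L2 */
        done(cpu, addr, latency);
    }

    static void on_complete(int cpu, uint32_t addr, int cycles) {
        printf("CPU %d: reference 0x%x done after %d cycles\n",
               cpu, (unsigned)addr, cycles);
    }

    int main(void) {
        memory_reference(0, 0x1000, on_complete);
        return 0;
    }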
`
`3.2 Applications
`
`We would expect a multiprocessor microprocessor architecture to
`be used in both high-performance workstations and servers. There-
`fore, we have chosen workloads that realistically represent the
`behavior of these computing environments. The parallel applica-
`tions we use fall into three classes: hand parallelized scientific and
`engineering applications, compiler parallelized scientific and engi-
`neering applications and a multiprogramming workload.
`
`To simulate each application we first boot the operating system
`using the fastest CPU simulator and then checkpoint the system
`immediately before the application begins execution. The check-
point saves the internal state of the CPU and main memory and
provides a common starting point for simulating the three
architectures. Checkpoints also help to reduce the total simulation
time by eliminating the OS boot time.
`
`3.2.1 Hand-Parallelized Applications
`
Most parallel applications are ones which have been developed for
`conventional multiprocessors. The majority of these applications
`come from scientific and engineering computing environments and
`are usually floating point intensive. In selecting applications we
`have attempted to include applications with both fine- and coarse-
`grained data sharing behavior.
`
Eqntott is an integer program from the SPEC92 benchmark suite [27]
that translates logic equations into truth tables. To parallelize
this benchmark, we modified a single routine: the bit vector
comparison that is responsible for about 90% of the computation in
the benchmark. Most of the program runs on one master processor,
but when the comparison routine is reached the bit vector is
divided up among the four processors so that each processor can
check a quarter of the vector in parallel. The amount of work per
vector is small, so the parallelism in this benchmark is
fine-grained.
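
The decomposition can be sketched as follows. This is our
reconstruction of the idea, not the eqntott source: each processor
compares one quarter of the two bit vectors, and the master
combines the per-quarter results in order, so the first differing
quarter decides the comparison.

    /* Sketch of the parallelized bit-vector comparison. */
    #include <stdio.h>

    #define N_PROCS 4
    #define N_BITS  128

    /* Compare one slice; 0 if equal, else sign of first difference. */
    static int compare_slice(const short *a, const short *b, int lo, int hi) {
        for (int i = lo; i < hi; i++)
            if (a[i] != b[i]) return (a[i] < b[i]) ? -1 : 1;
        return 0;
    }

    int main(void) {
        short a[N_BITS] = {0}, b[N_BITS] = {0};
        b[100] = 1;

        int chunk = N_BITS / N_PROCS, result[N_PROCS];
        for (int p = 0; p < N_PROCS; p++)     /* one slice per processor */
            result[p] = compare_slice(a, b, p * chunk, (p + 1) * chunk);

        for (int p = 0; p < N_PROCS; p++)     /* master combines in order */
            if (result[p] != 0) {
                printf("vectors differ in quarter %d\n", p);
                return 0;
            }
        printf("vectors are equal\n");
        return 0;
    }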
`
`71
`
MP3D [14] is a 3-dimensional particle simulator application and is
one of the original SPLASH benchmarks described in [22]. MP3D
places heavy demands on the memory system because it was written
with vector rather than parallel processors in mind. The
communication volume is large, and the communication patterns are
very unstructured and read-write in nature. As such, it is not
considered to be a well-tuned parallel application, but it could
serve as an example of how applications initially written for
vector machines perform as they are ported to shared-memory
multiprocessors. In our experiments we simulated MP3D with 35,000
particles and 20 time steps.
`
Ocean is a well written and highly optimized parallel application
that is part of the SPLASH2 benchmark suite [26]. Ocean simulates
the influence of eddy and boundary currents on the large-scale flow
in the ocean using a multigrid solver method. The ocean is divided
into an n x n grid and each processor is assigned a square
sub-grid. Each processor communicates with its neighbors at the
boundaries of the sub-grid. Each processor's working set is
basically the size of the processor's partition of a grid, and is
mostly disjoint from the working sets of the other processors. For
the results in this paper we use an input data set that has
130 x 130 grid points.
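
The index arithmetic of the decomposition is simple. The sketch
below is our own, with assumed names, for a 2 x 2 arrangement of
the four processors over the 128 x 128 interior of the grid; only
the rows and columns on sub-grid boundaries are communicated
between neighbors.

    /* Block decomposition of an n x n grid over four processors. */
    #include <stdio.h>

    int main(void) {
        int n = 128, dim = 2;          /* 2 x 2 processor grid */
        int side = n / dim;            /* sub-grid edge length */

        for (int p = 0; p < dim * dim; p++) {
            int r = p / dim, c = p % dim;
            printf("proc %d: rows %d-%d, cols %d-%d\n", p,
                   r * side, (r + 1) * side - 1,
                   c * side, (c + 1) * side - 1);
        }
        return 0;
    }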
`
Volpack is a graphics application that implements a parallel volume
rendering algorithm using a very efficient technique called
shear-warp factorization [12]. The parallel algorithm uses an
image-based task decomposition in which each processor computes a
portion of the final image in parallel. There are three steps to
the parallel algorithm. In the first step a lookup table for
shading the voxels (volume elements) is computed in parallel; in
the second step each processor computes a portion of the
intermediate image by selecting tasks from a task queue. Each task
entails computing the voxels of contiguous scan lines that
intersect the assigned portion of the intermediate image. In the
last step, the intermediate image is warped in parallel. To
minimize load imbalance, the algorithm uses dynamic task stealing
among the processors. The application uses a 128³ voxel medical
data set with a task size of two scan lines. The small task size is
selected to maximize processor data sharing and minimize
synchronization time.
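
The task queue can be sketched with a shared counter. The code
below is illustrative only; the real implementation synchronizes
task claims with atomic operations (such as the load-linked and
store-conditional primitives discussed in Section 4) and steals
from other processors' queues, which we collapse here into a single
shared queue.

    /* Central task queue of two-scan-line tasks. */
    #include <stdio.h>

    #define SCANLINES 128
    #define TASK_SIZE 2
    #define N_TASKS  (SCANLINES / TASK_SIZE)

    static int next_task = 0;      /* shared; needs atomic increment */

    static int grab_task(void) {   /* returns -1 when no work remains */
        return (next_task < N_TASKS) ? next_task++ : -1;
    }

    int main(void) {
        int done[4] = {0};
        /* Round-robin stands in for four processors grabbing work. */
        for (int t, p = 0; (t = grab_task()) >= 0; p = (p + 1) % 4)
            done[p]++;
        printf("tasks per processor: %d %d %d %d\n",
               done[0], done[1], done[2], done[3]);
        return 0;
    }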
`
`3.2.2 Compiler Parallelized Applications
`
Recent advances in parallel compiler technology have extended the
range of applications that can be successfully parallelized [1].
These advances include algorithms for interprocedural analysis of
data dependencies, array privatization and C pointer analysis.
Interprocedural analysis allows the compiler to find parallelism
over wide regions of the program, and array privatization makes it
possible to parallelize loops that use arrays as temporary work
areas in the body of the loop. Array privatization makes these
loops parallel by giving each parallel loop iteration an
independent copy of the array. A significant amount of data
dependence analysis is required for a compiler to perform array
privatization. Aliases occur because C programs use pointers, and
pointers can refer to the same object. Such aliases prevent
parallelization, and without further information the compiler must
assume all pointers are aliases of each other. Using C pointer
analysis, the compiler is able to identify the pointer aliases that
actually occur in the program. This greatly increases the potential
for parallelization.
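
Concretely, array privatization transforms loops like the
constructed example below (not taken from the benchmarks). Because
every iteration writes all of tmp before reading it, giving each
parallel iteration a private copy removes the apparent dependence
through the shared array.

    /* Before: iterations appear serialized by the shared array tmp. */
    void before(float a[][64], int n) {
        float tmp[64];                     /* one copy, reused      */
        for (int i = 0; i < n; i++) {
            for (int j = 0; j < 64; j++) tmp[j] = a[i][j] * 2.0f;
            for (int j = 0; j < 64; j++) a[i][j] = tmp[j] + 1.0f;
        }
    }

    /* After: tmp is privatized, so iterations are independent and
     * can be distributed across the four processors.               */
    void after(float a[][64], int n) {
        for (int i = 0; i < n; i++) {
            float tmp[64];                 /* private per iteration */
            for (int j = 0; j < 64; j++) tmp[j] = a[i][j] * 2.0f;
            for (int j = 0; j < 64; j++) a[i][j] = tmp[j] + 1.0f;
        }
    }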
`
`5
`
`
`
`
`
`tem; the componentof interest in this investigation. The time spent
`waiting for a spin lock or for barrier synchronization is included in
`the CPU time. The speed of the load-limked and store-conditional
`memory operations used to implement these synchronization primi-
`tives affects the amount of time the processors spend synchroniz-
`ing. Consequently,
`the level of data sharing in the memory
`hierarchy which affects the speed of these memory operations and
`the load balance in the application changes the amount of CPU time
`shownin the graphs.
`
`To provide insight into application behavior we present miss rate
`data on the working set and sharing characteristics of the applica-
`tions. We break down the miss rates for the L1 data cache and uni-
`fied L2 cache for all three architectures. Hach cache miss rate is
`
`broken into two components: replacement miss rate (LIR, L2R)
`and invalidation miss rate (L11, L2I). Replacement misses are com-
`posed of cold, capacity and conflict misses. These misses. are
`affected by the organization of the cache and the level of the mem-
`ory hierarchy at which the processors share data. Invalidation
`misses are due to communication and are most affected bythe level
`of sharing in the architecture, although the cache line size will
`affect the number of false sharing misses. All the cache miss rates
`wepresent are local miss rates: they are measured as misses per ref-
`erence to the cache.
`
`4.1 Hand Parallelized Applications
`
`® =
`
`8
`8
`3s
`E2
`z
`
`In our experiments we use two SPEC92 floating point benchmarks
`that have been parallelized by the Stanford University Intermediate
`Format (SUIF) compiler system [2]: ear and the FFT kernel from
`nasa7. Ear is a C program that models the inner ear, while the FFT
`kernel is written in FORTRAN.The FFT kernel can be parallelized
`using vector-like compiler analysis, but ear requires pointer disam-
`biguation because it is a C application and interprocedural analysis
`in orderto detect the parallelism. Using these techniques, over 90%
`of the execution time of these programs can be parallelized by the
`SUIF compiler.
`
`The parallel behavior of the two automatically parallelized pro-
`grams varies widely. In FFT the compiler is able to find outer loops
`that are parallelized across procedure boundaries so that the granu-
`larity of parallelism is fairly large. However, ear consists of very
`short running loops that perform a small amount of work per loop
`iteration; consequently,it is has an extremely small grain size.
`
`3.2.3 Multiprogramming and OS Workload
`
`One of the main uses of multiprocessor architectures will be for
`increasing the throughput of a multiprogramming workload that
consists of both user activity and OS activity. To model this type of
`compute environment we use a program development workload that
`consists of the compile phase of the Modified Andrew Benchmark
`[18]. The Modified Andrew benchmark uses the gcc compiler to
`compile 17 files. Our benchmark eliminates the two phases of the
`benchmark that install the object files in a library and delete the
`object files. We use a parallel make utility that allows up to four
`compilation processes to run at a time. Two of these parallel makes
`are launched at a time.
`
`4 Results
`
In this section, we present the results for the three architectures
running the applications described in Section 3. First we present
the results for all seven of the applications using the Mipsy CPU
model. These results are simple to understand because the CPU model
stalls for all memory operations that take longer than a cycle.
Thus, all the time spent in the memory system contributes directly
to the total execution time. The Mipsy results for the shared-L1
architecture are optimistic because we do not include bank
contention and we assume a 1-cycle hit time. The motivation for
this is to avoid penalizing the shared-L1 architecture on a CPU
simulator that has no support for the latency hiding mechanisms of
non-blocking caches or dynamic instruction scheduling. To
investigate the effect of these latency hiding mechanisms, the
three-cycle hit latency, and bank contention on the performance of
the shared-L1 architecture, we use the MXS simulator. We present
results from the MXS simulator for three applications with
different sharing and working-set characteristics.
`
Execution time in the graphs that follow is divided into CPU time
and time spent in the memory system, the component of interest in
this investigation. The time spent waiting for a spin lock or for
barrier synchronization is included in the CPU time. The speed of
the load-linked and store-conditional memory operations used to
implement these synchronization primitives affects the amount of
time the processors spend synchronizing. Consequently, the level of
data sharing in the memory hierarchy, which affects the speed of
these memory operations, and the load balance in the application
change the amount of CPU time shown in the graphs.

To provide insight into application behavior we present miss rate
data on the working set and sharing characteristics of the
applications. We break down the miss rates for the L1 data cache
and unified L2 cache for all three architectures. Each cache miss
rate is broken into two components: replacement miss rate (L1R,
L2R) and invalidation miss rate (L1I, L2I). Replacement misses are
composed of cold, capacity and conflict misses. These misses are
affected by the organization of the cache and the level of the
memory hierarchy at which the processors share data. Invalidation
misses are due to communication and are most affected by the level
of sharing in the architecture, although the cache line size will
affect the number of false sharing misses. All the cache miss rates
we present are local miss rates: they are measured as misses per
reference to the cache.

4.1 Hand Parallelized Applications

We present the M