Evaluation of Design Alternatives for a Multiprocessor Microprocessor

Basem A. Nayfeh, Lance Hammond and Kunle Olukotun
Computer Systems Laboratory
Stanford University
Stanford, CA 94305-4070
{bnayfeh, lance, kunle}@ogun.stanford.edu

Abstract

In the future, advanced integrated circuit processing and packaging technology will allow several design options for multiprocessor microprocessors. In this paper we consider three architectures: shared-primary cache, shared-secondary cache, and shared-memory. We evaluate these three architectures using a complete system simulation environment that models the CPU, memory hierarchy and I/O devices in sufficient detail to boot and run a commercial operating system. Within our simulation environment, we measure performance using representative hand- and compiler-generated parallel applications, and a multiprogramming workload. Our results show that when applications exhibit fine-grained sharing, both the shared-primary and shared-secondary architectures perform similarly when the full costs of sharing the primary cache are included.
1 Introduction

With the use of advanced integrated circuit (IC) processing and packaging technology, several options for the design of high-performance microprocessors become available. A design option that is becoming increasingly attractive is a multiprocessor architecture. Multiprocessors offer high performance on single applications by exploiting loop-level parallelism, and they provide high throughput and low interactive response time on multiprogramming workloads [2][15]. With the multiprocessor design option, a small number of processors are interconnected on a single die or on a multichip module (MCM) substrate. The abundance of wires available on-chip or on-MCM makes it possible to construct interprocessor communication mechanisms with much lower latency and higher bandwidth than a single bus-based multiprocessor architecture. Given the communication implementation options available for improving interprocessor communication performance, it is important to understand which mechanism provides the best overall performance on important application classes. The objective of this paper is to characterize the benefits and costs of realistic implementations of two proposed cache-sharing mechanisms that exploit the increased wire density: shared level-1 (L1) cache and shared level-2 (L2) cache. To provide a point of reference, the performance of these architectures is compared to that of a conventional single bus-based shared-memory multiprocessor. All three architectures are simulated using a complete system simulation environment that models the CPU, memory hierarchy and I/O devices in sufficient detail to boot and run the Silicon Graphics IRIX 5.3 operating system. Within our simulation environment, we evaluate the performance of the three architectures using representative hand- and compiler-generated parallel applications, and a multiprogramming workload. Both kernel and user level references are included in our results.

We present two sets of results: one obtained with a simple CPU model that includes neither latency hiding nor the true latencies of the shared-L1 architecture, and a second obtained with a very detailed and completely accurate CPU model. The results from the simple CPU model are used to classify the parallel applications into three broad classes: applications with a high degree of interprocessor communication, applications with a moderate degree of interprocessor communication, and applications with little or no interprocessor communication. For applications in the first class we find that the shared-L1 architecture usually outperforms the other two architectures substantially. For applications in the second class the shared-L1 architecture performs less than 10% better than the other architectures. Finally, for applications in the third class, contrary to conventional wisdom, the performance of the shared-L1 architecture is still slightly better than that of the other architectures. The second set of results includes the effects of dynamic scheduling, speculative execution and non-blocking memory references. These results show that when the additional latencies associated with sharing the L1 cache are included in the simulation model, the performance advantage of the shared-L1 architecture can diminish substantially.

The rest of this paper is organized as follows. Section 2 introduces the three multiprocessor architectures and the architectural assumptions used throughout the paper. Section 3 describes the simulation environment and benchmark applications that are used to study these architectures. Simulation results for the three multiprocessor architectures are presented in Section 4. In Section 5 we discuss related work, and we conclude the paper in Section 6.
Permission to make digital/hard copy of part or all of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage, the copyright notice, the title of the publication and its date appear, and notice is given that copying is by permission of ACM, Inc. To copy otherwise, to republish, to post on servers, or to redistribute to lists, requires prior specific permission and/or a fee.

ISCA '96 5/96 PA, USA
© 1996 ACM 0-89791-786-3/96/0005...$3.50
2 Three Multiprocessor Architectures

The distinguishing characteristic of shared-memory multiprocessor architectures is the level of the memory hierarchy at which the CPUs are interconnected. In general, a multiprocessor architecture whose interconnect is closer to the CPUs in the memory hierarchy will be able to exploit fine-grained parallelism more efficiently than a multiprocessor architecture whose interconnect is further away from the CPUs in the memory hierarchy. Conversely, the performance of the closely interconnected multiprocessor will tend to be worse than that of the loosely interconnected multiprocessor when the CPUs are executing independent applications. With this in mind, the challenge in the design of a small-scale multiprocessor microprocessor is to achieve good performance on fine-grained parallel applications without sacrificing the performance of independent parallel jobs. To develop insight about the most appropriate level for connecting the CPUs in a multiprocessor microprocessor, we will compare the performance of three multiprocessor architectures: shared-L1 cache, shared-L2 cache, and a conventional single-bus shared main memory. We will see that these architectures are natural ways to connect multiple processors using different levels of the electronic packaging hierarchy. Before we discuss the features that distinguish the three multiprocessor architectures, we will discuss the characteristics of the CPU, which is used with all three memory architectures.
2.1 CPU

This study uses a 2-way issue processor that includes the support for dynamic scheduling, speculative execution, and non-blocking caches that one would expect to find in a modern microprocessor design. The processor executes instructions using a collection of fully pipelined functional units whose latencies are shown in Table 1. The load latency of the CPU is specific to the multiprocessor architecture. To eliminate structural hazards there are two copies of every functional unit except for the memory data port.
[Table 1 is not legible in this copy. It lists the latencies, in cycles, of the fully pipelined functional units: ALU, Branch, Load, Store, Multiply, Divide, SP Add/Sub, SP Multiply, SP Divide, DP Add/Sub, DP Multiply and DP Divide.]

Table 1 CPU functional unit latencies.
Other characteristics of the processor are 16 Kbyte two-way set associative instruction and data caches, a 32 entry centralized window instruction issue scheme, and a 32 entry reorder buffer to maintain precise interrupts and recover from mispredicted branches. Branches are predicted with a 1024 entry branch target buffer. The non-blocking L1 data cache supports up to four outstanding misses.

The CPU is modeled using the MXS simulator [4], which is capable of modeling modern microarchitectures in detail. In this simulator the MIPS-2 instruction set is executed using a decoupled pipeline consisting of fetch, execute and graduate stages. In the fetch stage up to two instructions are fetched from the cache and placed into the instruction window. Every cycle, up to two instructions from the window whose data dependencies have been satisfied move to the execute stage. After execution, instructions are removed from the instruction window and wait in the reorder buffer until they can graduate, i.e., update the permanent machine state in program order.
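As a concrete illustration of this fetch/execute/graduate flow, the sketch below steps a toy 2-wide machine with a 32-entry window and reorder buffer through one cycle. It only demonstrates the ordering rules described above: execution may proceed out of order while graduation is strictly in program order. It is not the MXS simulator, and every structure and name in it is our own.

```c
#include <stdbool.h>

#define WIDTH   2
#define ENTRIES 32

typedef struct { int id; bool ready; } Instr;

typedef struct {
    Instr win[ENTRIES]; int nwin;   /* instruction window                   */
    Instr rob[ENTRIES]; int nrob;   /* reorder buffer (executed, unretired)  */
    int next_fetch;                 /* next instruction id in program order  */
    int next_grad;                  /* oldest instruction not yet graduated  */
} Cpu;

static void one_cycle(Cpu *c)
{
    /* Graduate: retire up to WIDTH instructions strictly in program order,
     * so permanent machine state is always updated in order. */
    for (int g = 0; g < WIDTH; g++) {
        int idx = -1;
        for (int i = 0; i < c->nrob; i++)
            if (c->rob[i].id == c->next_grad) { idx = i; break; }
        if (idx < 0) break;              /* oldest instruction still in flight */
        for (int i = idx + 1; i < c->nrob; i++) c->rob[i - 1] = c->rob[i];
        c->nrob--;
        c->next_grad++;
    }

    /* Execute: up to WIDTH window entries whose data dependencies are
     * satisfied execute, possibly out of order, then move to the ROB. */
    for (int issued = 0, i = 0; i < c->nwin && issued < WIDTH; ) {
        if (c->win[i].ready && c->nrob < ENTRIES) {
            c->rob[c->nrob++] = c->win[i];
            for (int j = i + 1; j < c->nwin; j++) c->win[j - 1] = c->win[j];
            c->nwin--;
            issued++;
        } else {
            i++;
        }
    }

    /* Fetch: place up to WIDTH new instructions into the window.  A real
     * model would mark 'ready' only once source operands are available. */
    for (int f = 0; f < WIDTH && c->nwin < ENTRIES; f++) {
        Instr in = { c->next_fetch++, true };
        c->win[c->nwin++] = in;
    }
}

int main(void)
{
    Cpu c = { .nwin = 0, .nrob = 0, .next_fetch = 0, .next_grad = 0 };
    for (int cycle = 0; cycle < 100; cycle++)
        one_cycle(&c);
    return 0;
}
```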
2.2 Shared-L1 Cache Multiprocessor

By the end of the century it will be possible to place multiple processors on a single die. A natural way to interconnect these processors will be at the first level cache, as illustrated in Figure 1. The figure shows four CPUs that share a common, 4-way banked write-back L1 cache through a crossbar switching mechanism. This architecture is similar to the M-machine [8]. The primary advantage of this architecture compared to other multiprocessor architectures is that it provides the lowest latency interprocessor communication possible using a shared-memory address space. Low latency interprocessor communication makes it possible to achieve high performance on parallel applications with fine-grained parallelism. Parallel application performance is also improved by processors that prefetch shared data into the cache for each other, eliminating cache misses for processors that use the data later. Other advantages of a shared-L1 cache are that it eliminates the complex cache-coherence logic usually associated with cache-coherent multiprocessors and implicitly provides a sequentially consistent memory without sacrificing performance. This makes the hardware implementation simpler and programming easier.
There are some disadvantages to the shared-L1 cache architecture. The access time of the L1 cache is increased by the time required to pass through the crossbar between the processors and the cache. We assume that the added overhead of the crossbar switching mechanism and cache bank arbitration logic would make the total latency of the L1 cache three cycles, even though the cache banks would be pipelined to allow single-cycle accesses. In addition, all of the memory references performed by the processors enter the shared cache, so there is some probability of extra delays due to bank conflicts between memory references from different processors. A third disadvantage is the converse of the shared-data advantage: processors working with different data can conflict in the shared cache, causing the miss rate to increase.
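The bank-conflict effect can be made concrete with a small sketch. Assuming, purely for illustration, that cache lines are interleaved across the four banks by their low-order line-index bits (the paper does not specify the mapping) and that the lowest-numbered CPU wins arbitration, two processors whose same-cycle references map to the same bank must be serialized:

```c
#include <stdint.h>
#include <stdbool.h>

#define NCPU  4
#define NBANK 4
#define LINE  32                       /* 32-byte cache lines */

/* Interleave cache lines across banks: bank = line index mod NBANK. */
static int bank_of(uint32_t addr) { return (int)((addr / LINE) % NBANK); }

/* Grant each bank to at most one CPU per cycle (lowest CPU id wins here);
 * the losers must retry and therefore see extra delay. */
static void arbitrate(const uint32_t addr[NCPU], bool granted[NCPU])
{
    bool bank_busy[NBANK] = { false };
    for (int cpu = 0; cpu < NCPU; cpu++) {
        int b = bank_of(addr[cpu]);
        granted[cpu] = !bank_busy[b];
        bank_busy[b] = true;           /* later CPUs conflict on this bank */
    }
}

int main(void)
{
    /* CPUs 0 and 2 touch lines that map to the same bank, so CPU 2 waits. */
    uint32_t addr[NCPU] = { 0x1000, 0x1020, 0x1100, 0x1060 };
    bool granted[NCPU];
    arbitrate(addr, granted);
    return granted[2] ? 1 : 0;         /* exit 0: CPU 2 was stalled as expected */
}
```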
Given the clock rates and complexity of the CPU-cache interface of future microprocessors, a single die implementation of the shared-L1 cache is essential in order to maintain a low L1 cache latency. If chip boundaries were crossed, either the L1 latency would be increased to five or more cycles or the clock rate of the processors would be severely degraded. Either of these would have a significant impact on processor performance. The major drawback to the single die implementation today would be the large area and high cost of the die. However, the increasing density of integrated circuit technology will soon make it possible to put four processors on a chip with a reasonable die area. We estimate the die area required for four processors of the complexity of the DEC Alpha 21064A [6] (a dual-issue statically scheduled superscalar processor with 32 KB of on-chip cache) and the crossbar interconnect to be 320 mm² in 0.35 micron technology. This is the area of the largest microprocessor chips produced today. In a 0.25 micron CMOS technology, which will be available by the end of 1997, the area is reduced to 160 mm², which is a medium-sized chip.

[Figure 1 is not legible in this copy; only its caption and a few labels survive (a 128 bit system bus, a 128 bit L2 bus, and a 2 MB 2-way associative Level 2 cache).]

Figure 1. Shared primary cache multiprocessor.
The L2 cache and main memories are uniprocessor-like in this system, since they are not involved in interprocessor communication. This makes them relatively simple. They are designed with low latencies and heavy pipelining. The degree of pipelining is primarily limited by the 128 bit L2 bus and the 32-byte cache line size that we assume. The transfer time of two cycles sets the lower bound on L2 cache occupancy. For the purposes of this paper we assume memory latencies and bandwidths that could be attained in a 200 MHz microprocessor with commodity SRAM L2 cache memory and multibanked DRAM main memory: an L2 with 10-cycle latency and 2-cycle occupancy (no overhead), and a main memory with a 50-cycle latency and a 6-cycle occupancy [7]. No cache-coherence mechanisms between the four processors on the chip are required at these levels of the memory hierarchy, since they are below the level of sharing. Only logic to keep the L2 cache coherent with other, completely separate processors on the system bus is required.
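Latency and occupancy play different roles here: every request waits the full latency, but a pipelined unit can accept a new request every occupancy cycles (the 2-cycle L2 occupancy is simply a 32-byte line crossing the 128-bit L2 bus in two transfers). The short sketch below is our own illustration of that distinction, not part of the simulators used in the paper; it shows back-to-back L1 misses spaced by the 2-cycle L2 occupancy while each still sees the 10-cycle latency, and an access that misses in both levels paying both latencies.

```c
typedef struct { int latency; int occupancy; } Stage;

/* Returns the completion time of a request arriving at 'arrival', updating
 * 'next_free' (the first cycle the stage can accept another request). */
static int service(const Stage *s, int arrival, int *next_free)
{
    int start = arrival > *next_free ? arrival : *next_free;
    *next_free = start + s->occupancy;   /* pipelined: busy for occupancy only */
    return start + s->latency;           /* requester waits the full latency   */
}

int main(void)
{
    /* Contention-free numbers assumed in the text: 10/2-cycle L2, 50/6-cycle
     * main memory, in CPU cycles. */
    Stage l2 = { 10, 2 }, mem = { 50, 6 };
    int l2_free = 0, mem_free = 0;

    /* Two back-to-back L1 misses that hit in the L2: the second is delayed
     * only by the 2-cycle occupancy, not by the full 10-cycle latency. */
    int t0 = service(&l2, 0, &l2_free);   /* completes at cycle 10 */
    int t1 = service(&l2, 1, &l2_free);   /* completes at cycle 12 */

    /* An access that misses in the L2 as well pays both latencies. */
    int t2 = service(&mem, service(&l2, 4, &l2_free), &mem_free);   /* cycle 64 */

    return (t0 == 10 && t1 == 12 && t2 == 64) ? 0 : 1;
}
```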
2.3 Shared-L2 Cache Multiprocessor

The second multiprocessor architecture we consider shares data through the L2 cache instead of the L1 cache. A possible implementation of this scheme is illustrated in Figure 2. Here the four processors and the shared-L2 cache interface are separate dies which are interconnected using MCM packaging [16]. The four processors and their associated write-through L1 caches are completely independent. This eliminates the extra access time of the shared-L1 cache, returning the latency of the L1 cache to 1 cycle. However, the shared-L2 cache interface increases the L2 cache latency from 10 cycles to 14 cycles. These extra cycles are due to crossbar overhead and the delay for additional chip boundary crossings [17].
[Figure 2 is not legible in this copy; its surviving labels show four 512K 2-way L2 cache banks.]

Figure 2. Shared secondary cache multiprocessor.
The write-back L2 cache has four independent banks to increase its bandwidth and enable it to support four independent access streams. To reduce the pin count of the crossbar chip, which must support interfaces to the four processors as well as the four cache banks, the L2 cache datapath is 64 bits instead of the 128 bits used in the shared-L1 cache architecture. This does have the side effect of increasing the occupancy of the L2 cache from two to four cycles for a 32-byte cache line transfer. Since we assume the L2 cache is designed to supply the critical word first, this does not have a significant performance impact. While the additional latency of the crossbar will reduce L2 cache performance compared to the shared-L1 case, only memory accesses that miss in the L1 cache will have to contend with the reduced-performance L2 cache. For the purposes of sharing, the 14 cycle communication latency will allow relatively fine-grained communication on multiprocessor programs, but this latency is still much greater than the three cycle sharing latency of the shared-L1 cache architecture.
The shared-L2 architecture implemented with separate chips results in a large number of interchip wires in the system. However, the performance critical path between a processor and its L1 cache remains on chip. The less-frequently used path between the L1 and L2 caches is more tolerant of a few cycles of additional overhead from crossing die boundaries, since it is already 10 cycles long. Thus, a system in which smaller dies are packaged on an MCM may have performance that is close to a shared-L2 cache implemented on a single die while potentially being less expensive to build. Figure 2 shows that the four processor dies and the crossbar die are packaged on an MCM, while the four separate 64 bit datapath interfaces to the cache banks would go off of the MCM to separate SRAMs. Even with the narrower L2 cache datapaths the crossbar chip will still require several hundred signal pins for the interfaces to the processors and cache banks. This high pin count is only feasible today using chips with area pads that are packaged using MCM technology [17].
The main memory for this architecture is identical to the main memory from the shared-L1 case, since the system below the L2 cache is essentially a uniprocessor memory hierarchy. For the purposes of this paper we assume 50 cycles of latency and 6 cycles of occupancy per access. With this configuration, some hardware must also be installed to keep the L1 caches coherent, at least for shared regions of memory. The simplest way to do this is to assume that the L1 cache uses a write-through policy for shared data and that there is a directory entry associated with each L2 cache line. When there is a change to a cache line caused by a write or a replacement, all processors caching the line must receive invalidates or updates [17]. This implementation of cache coherency saves a considerable amount of snooping control logic on the processors. If this control logic could be eliminated, the processors could be made simpler than current microprocessors which support snoopy cache coherence.
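A minimal sketch of the invalidation variant of this scheme follows. It assumes one presence bit per processor attached to each L2 line and shows only the bookkeeping on a read and on a write-through write; it is our own simplification for illustration, not the hardware design evaluated in the paper.

```c
#include <stdint.h>
#include <stdio.h>

#define NCPU     4
#define L2_LINES 65536            /* 2 MB of 32-byte lines */

static uint8_t sharers[L2_LINES]; /* one presence bit per CPU, per L2 line */

/* A read fill simply records the new L1 copy in the directory entry. */
static void l1_read(int cpu, int line)
{
    sharers[line] |= (uint8_t)(1u << cpu);
}

/* A write-through write from 'cpu' updates the L2 bank and invalidates the
 * line in every other L1 that holds a copy. */
static void l1_write(int cpu, int line)
{
    uint8_t others = sharers[line] & (uint8_t)~(1u << cpu);
    for (int p = 0; p < NCPU; p++)
        if (others & (1u << p))
            printf("invalidate line %d in CPU %d's L1\n", line, p);
    sharers[line] = (uint8_t)(1u << cpu);   /* only the writer keeps a copy */
}

int main(void)
{
    l1_read(0, 42);
    l1_read(2, 42);
    l1_write(1, 42);   /* CPUs 0 and 2 receive invalidations */
    return 0;
}
```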
2.4 Shared-Memory Multiprocessor

The final architecture we consider is a traditional bus-based multiprocessor. The processors and their individual L1 caches run at full, single-cycle cache speeds, much like the shared-L2 system. In addition, each processor has its own separate bank of L2 cache that it can access at the full speed of the SRAMs, much like the shared-L1 system (latency = 10 cycles, occupancy = 2 cycles).

[Figure 3 is not legible in this copy; its surviving labels show four processors, each with a 16K 2-way L1 cache and a 512K 2-way L2 cache, connected by a 128 bit system bus.]

Figure 3. Shared-memory multiprocessor.

However, in order to communicate, each processor must access main memory through the shared system bus, with its high latencies (as before, latency = 50 cycles, occupancy = 6 cycles). This will tend to limit the degree of communication that is possible, since each exchange will take 50 or more cycles. Even with systems designed to support cache-to-cache transfers of shared data, the typical times seen will still have a latency of approximately 50 cycles, since all three of the other processors on the bus must check their cache tags for a match, agree which processor should source the data, and then recover the necessary data from the correct cache. Since this will usually require accesses to the off-chip L2 caches controlled by the other processors while these caches are busy with local cache traffic, and because we must wait for the slowest processor's response in order to ensure coherency, typical times will often be comparable to memory access times in bus-based systems [7][9].

This architecture represents the capabilities and limitations of current printed circuit board based systems. It is worth noting that the processors must support full snoopy cache coherence of both their L1 and L2 caches. This level of support is included in the latest designs from most leading manufacturers of microprocessors.

[The body of Table 2 is not legible in this copy; the entries below are restored from the figures quoted in Sections 2.2 through 2.4.]

                               Latency    Occupancy
  Shared-L1     Level 1 Cache        3            1
                Level 2 Cache       10            2
                Main Memory         50            6
  Shared-L2     Level 1 Cache        1            1
                Level 2 Cache       14            4
                Main Memory         50            6
  Shared-Mem.   Level 1 Cache        1            1
                Level 2 Cache       10            2
                Main Memory         50            6
                Cache-to-Cache     ~50

Table 2 A summary of the ideal memory latencies of the three multiprocessor architectures in CPU clock cycles (1 cycle = 5 ns).

Table 2 shows the contention-free access latencies for the three multiprocessor architectures. A common theme is the increased access time to the level of the memory hierarchy at which the processors communicate. A direct result of this is that the further away from the processor communication takes place, the less impact it will have on uniprocessor performance.
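One way to see the uniprocessor cost of moving the sharing point closer to the CPUs is to fold the Table 2 latencies into a simple average memory access time. The sketch below does this for a blocking hierarchy; the miss rates in it are made-up placeholders (only the latencies come from Table 2), and the calculation is our own illustration rather than anything the paper reports.

```c
#include <stdio.h>

/* Average memory access time for a simple blocking two-level hierarchy. */
static double amat(double l1_lat, double l1_miss,
                   double l2_lat, double l2_miss, double mem_lat)
{
    return l1_lat + l1_miss * (l2_lat + l2_miss * mem_lat);
}

int main(void)
{
    double l1_miss = 0.05, l2_miss = 0.20;   /* hypothetical miss rates */

    /* Latencies taken from Table 2; only the sharing level differs. */
    printf("shared-L1 : %.2f cycles\n", amat(3, l1_miss, 10, l2_miss, 50));
    printf("shared-L2 : %.2f cycles\n", amat(1, l1_miss, 14, l2_miss, 50));
    printf("shared-mem: %.2f cycles\n", amat(1, l1_miss, 10, l2_miss, 50));
    return 0;
}
```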
3 Methodology

Accurately evaluating the performance of the three multiprocessor architectures requires a way of simulating the environment in which we would expect these architectures to be used in real systems. In this section we describe the simulation environment and the applications used in this study.
3.1 Simulation Environment

To generate the parallel memory references we use the SimOS simulation environment [20]. SimOS models the CPUs, memory hierarchy and I/O devices of uniprocessor and multiprocessor systems in sufficient detail to boot and run a commercial operating system. SimOS uses the MIPS-2 instruction set and runs the Silicon Graphics IRIX 5.3 operating system, which has been tuned for multiprocessor performance. Because SimOS actually simulates the operating system, it can generate all the memory references made by the operating system and the applications. This feature is particularly important for the study of multiprogramming workloads, where the time spent executing kernel code makes up a significant fraction of the non-idle execution time.
A unique feature of SimOS that makes studies such as this feasible is that SimOS supports multiple CPU simulators that use a common instruction set architecture. This allows trade-offs to be made between simulation speed and accuracy. The fastest CPU simulator, called Embra, uses binary-to-binary translation techniques and is used for booting the operating system and positioning the workload so we can focus on interesting regions of the execution time. The medium performance CPU simulator, called Mipsy, is two orders of magnitude slower than Embra. Mipsy is an instruction set simulator that models all instructions with a one cycle result latency and a one cycle repeat rate. Mipsy interprets all user and privileged instructions and feeds memory references to the memory system simulator. The slowest, most detailed CPU simulator is MXS, which supports dynamic scheduling, speculative execution and non-blocking memory references. MXS is over four orders of magnitude slower than Embra.
The cache and memory system component of our simulator is completely event-driven and interfaces to the SimOS processor model, which drives it. Processor memory references cause threads to be generated which keep track of the state of each memory reference and the resource usage in the memory system. A call-back mechanism is used to inform the processor of the status of all outstanding references, and to inform the processor when a reference completes. These mechanisms allow for very detailed cache and memory system models, which include cycle accurate measures of contention and resource usage throughout the system.
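The call-back style can be illustrated with a few lines of code. The sketch below is not SimOS code; it simply shows the interface shape the paragraph describes, with the memory model scheduling a completion event that later wakes the requesting processor. All names in it are hypothetical, and a real simulator would use a priority queue rather than a linear scan.

```c
#include <stdint.h>
#include <stdio.h>

#define MAX_EVENTS 64

typedef void (*callback_t)(void *arg, long when);

typedef struct { long when; callback_t fn; void *arg; } Event;

static Event queue[MAX_EVENTS];
static int   nevents;
static long  now;

static void schedule(long when, callback_t fn, void *arg)
{
    Event e = { when, fn, arg };
    queue[nevents++] = e;
}

/* Called by the simulated CPU; the memory model decides the latency. */
static void issue_load(int cpu, long latency, callback_t done)
{
    schedule(now + latency, done, (void *)(intptr_t)cpu);
}

/* Completion callback: informs the processor that its reference is done. */
static void load_done(void *arg, long when)
{
    printf("cycle %ld: load data returned, wake CPU %d\n",
           when, (int)(intptr_t)arg);
}

int main(void)
{
    issue_load(0, 10, load_done);   /* e.g. an L2 hit  */
    issue_load(1, 60, load_done);   /* e.g. an L2 miss */

    /* Drain events in time order. */
    while (nevents > 0) {
        int best = 0;
        for (int i = 1; i < nevents; i++)
            if (queue[i].when < queue[best].when) best = i;
        Event e = queue[best];
        queue[best] = queue[--nevents];
        now = e.when;
        e.fn(e.arg, e.when);
    }
    return 0;
}
```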
3.2 Applications

We would expect a multiprocessor microprocessor architecture to be used in both high-performance workstations and servers. Therefore, we have chosen workloads that realistically represent the behavior of these computing environments. The parallel applications we use fall into three classes: hand parallelized scientific and engineering applications, compiler parallelized scientific and engineering applications, and a multiprogramming workload.
To simulate each application we first boot the operating system using the fastest CPU simulator and then checkpoint the system immediately before the application begins execution. The checkpoint saves the internal state of the CPU and main memory and provides a common starting point for simulating the three architectures. Checkpoints also help to reduce the total simulation time by eliminating the OS boot time.
3.2.1 Hand-Parallelized Applications

Most parallel applications are ones which have been developed for conventional multiprocessors. The majority of these applications come from scientific and engineering computing environments and are usually floating point intensive. In selecting applications we have attempted to include applications with both fine- and coarse-grained data sharing behavior.
Eqntott is an integer program from the SPEC92 benchmark suite [27] that translates logic equations into truth tables. To parallelize this benchmark, we modified a single routine: the bit vector comparison that is responsible for about 90% of the computation in the benchmark. Most of the program runs on one master processor, but when the comparison routine is reached the bit vector is divided up among the four processors so that each processor can check a quarter of the vector in parallel. The amount of work per vector is small, so the parallelism in this benchmark is fine-grained.
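The sketch below shows the shape of this transformation, assuming a pthreads-style fork of the comparison across four workers and a comparison routine that returns the sign of the first differing element. It mirrors the strategy described above rather than the actual eqntott source; all names and types are our own.

```c
#include <pthread.h>
#include <stddef.h>

#define NPROC 4

typedef struct {
    const short *a, *b;       /* the two bit vectors being compared       */
    size_t lo, hi;            /* this worker's quarter: [lo, hi)           */
    int result;               /* -1, 0, +1 for the first difference found  */
} Chunk;

static void *compare_chunk(void *p)
{
    Chunk *c = p;
    c->result = 0;
    for (size_t i = c->lo; i < c->hi; i++) {
        if (c->a[i] != c->b[i]) {
            c->result = c->a[i] < c->b[i] ? -1 : 1;
            break;
        }
    }
    return NULL;
}

/* Parallel replacement for the original sequential comparison routine. */
static int compare_vectors(const short *a, const short *b, size_t n)
{
    pthread_t tid[NPROC];
    Chunk chunk[NPROC];

    for (int t = 0; t < NPROC; t++) {
        chunk[t].a = a; chunk[t].b = b;
        chunk[t].lo = n * (size_t)t / NPROC;
        chunk[t].hi = n * (size_t)(t + 1) / NPROC;
        pthread_create(&tid[t], NULL, compare_chunk, &chunk[t]);
    }

    /* The lowest-indexed quarter containing a difference decides. */
    int result = 0;
    for (int t = 0; t < NPROC; t++) {
        pthread_join(tid[t], NULL);
        if (result == 0) result = chunk[t].result;
    }
    return result;
}

int main(void)
{
    short x[8] = { 0, 1, 0, 1, 0, 1, 0, 1 };
    short y[8] = { 0, 1, 0, 1, 1, 1, 0, 1 };
    return compare_vectors(x, y, 8) < 0 ? 0 : 1;   /* x < y at index 4 */
}
```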
MP3D [14] is a 3-dimensional particle simulator application and is one of the original SPLASH benchmarks described in [22]. MP3D places heavy demands on the memory system because it was written with vector rather than parallel processors in mind. The communication volume is large, and the communication patterns are very unstructured and read-write in nature. As such, it is not considered to be a well-tuned parallel application, but it could serve as an example of how applications initially written for vector machines perform as they are ported to shared-memory multiprocessors. In our experiments we simulated MP3D with 35,000 particles and 20 time steps.
Ocean is a well written and highly optimized parallel application that is part of the SPLASH2 benchmark suite [26]. Ocean simulates the influence of eddy and boundary currents on the large-scale flow in the ocean using a multigrid solver method. The ocean is divided into an n x n grid and each processor is assigned a square sub-grid. Each processor communicates with its neighbors at the boundaries of the sub-grid. Each processor's working set is basically the size of the processor's partition of a grid, and is mostly disjoint from the working sets of the other processors. For the results in this paper we use an input data set that has 130 x 130 grid points.
Volpack is a graphics application that implements a parallel volume rendering algorithm using a very efficient technique called shear-warp factorization [12]. The parallel algorithm uses an image-based task decomposition in which each processor computes a portion of the final image in parallel. There are three steps to the parallel algorithm. In the first step a lookup table for shading the voxels (volume elements) is computed in parallel; in the second step each processor computes a portion of the intermediate image by selecting tasks from a task queue. Each task entails computing the voxels of contiguous scan lines that intersect the assigned portion of the intermediate image. In the last step, the intermediate image is warped in parallel. To minimize load imbalance, the algorithm uses dynamic task stealing among the processors. The application uses a 128³ voxel medical data set with a task size of two scan lines. The small task size is selected to maximize processor data sharing and minimize synchronization time.
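A sketch of this style of dynamic scheduling is shown below. For brevity it uses a single shared work counter instead of the per-processor queues with stealing that the text describes, so it illustrates only the load-balancing idea; the two-scan-line task size matches the text, but everything else is our own.

```c
#include <pthread.h>
#include <stdatomic.h>

#define NPROC     4
#define SCANLINES 128
#define TASK_SIZE 2                    /* two contiguous scan lines per task */

static atomic_int next_task;           /* index of the next unclaimed task   */
static const int  ntasks = SCANLINES / TASK_SIZE;

static void render_scanline(int line) { (void)line; /* compute one scan line */ }

static void *worker(void *arg)
{
    (void)arg;
    for (;;) {
        int t = atomic_fetch_add(&next_task, 1);   /* grab the next task */
        if (t >= ntasks) break;                    /* no work left        */
        for (int s = 0; s < TASK_SIZE; s++)
            render_scanline(t * TASK_SIZE + s);
    }
    return NULL;
}

int main(void)
{
    pthread_t tid[NPROC];
    for (int p = 0; p < NPROC; p++) pthread_create(&tid[p], NULL, worker, NULL);
    for (int p = 0; p < NPROC; p++) pthread_join(tid[p], NULL);
    return 0;
}
```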
3.2.2 Compiler Parallelized Applications

Recent advances in parallel compiler technology have extended the range of applications that can be successfully parallelized [1]. These advances include algorithms for interprocedural analysis of data dependencies, array privatization and C pointer analysis. Interprocedural analysis allows the compiler to find parallelism over wide regions of the program, and array privatization makes it possible to parallelize loops that use arrays as temporary work areas in the body of the loop. Array privatization makes these loops parallel by giving each parallel loop iteration an independent copy of the array. A significant amount of data dependence analysis is required for a compiler to perform array privatization. Aliases occur because C programs use pointers, and pointers can refer to the same object. Such aliases prevent parallelization, and without further information the compiler must assume all pointers are aliases of each other. Using C pointer analysis, the compiler is able to identify the pointer aliases that actually occur in the program. This greatly increases the potential for parallelization.
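As a concrete illustration of array privatization, consider the made-up loop below: the scratch array is only a temporary work area, so the original form carries a false dependence across iterations, while the privatized form gives each iteration its own copy and lets the outer loop run in parallel. The example and the OpenMP annotation are ours, not taken from the SUIF-parallelized benchmarks.

```c
#define N 1024
#define M 64

/* Before: every iteration reuses one shared scratch array, so the compiler
 * must assume the iterations cannot be reordered or overlapped. */
static void before(double a[N][M], double out[N])
{
    double scratch[M];
    for (int i = 0; i < N; i++) {
        for (int j = 0; j < M; j++) scratch[j] = 2.0 * a[i][j];
        out[i] = 0.0;
        for (int j = 0; j < M; j++) out[i] += scratch[j];
    }
}

/* After privatization: the scratch array is private to each iteration, so
 * the outer loop carries no dependence and can be spread across the four
 * processors (shown here with an OpenMP annotation for concreteness). */
static void after(double a[N][M], double out[N])
{
    #pragma omp parallel for
    for (int i = 0; i < N; i++) {
        double scratch[M];                       /* one copy per iteration */
        for (int j = 0; j < M; j++) scratch[j] = 2.0 * a[i][j];
        out[i] = 0.0;
        for (int j = 0; j < M; j++) out[i] += scratch[j];
    }
}

static double A[N][M], OUT[N];

int main(void)
{
    before(A, OUT);   /* sequential version   */
    after(A, OUT);    /* parallelized version */
    return 0;
}
```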
In our experiments we use two SPEC92 floating point benchmarks that have been parallelized by the Stanford University Intermediate Format (SUIF) compiler system [2]: ear and the FFT kernel from nasa7. Ear is a C program that models the inner ear, while the FFT kernel is written in FORTRAN. The FFT kernel can be parallelized using vector-like compiler analysis, but ear requires pointer disambiguation, because it is a C application, and interprocedural analysis in order to detect the parallelism. Using these techniques, over 90% of the execution time of these programs can be parallelized by the SUIF compiler.

The parallel behavior of the two automatically parallelized programs varies widely. In FFT the compiler is able to find outer loops that are parallelized across procedure boundaries, so the granularity of parallelism is fairly large. However, ear consists of very short running loops that perform a small amount of work per loop iteration; consequently, it has an extremely small grain size.

3.2.3 Multiprogramming and OS Workload

One of the main uses of multiprocessor architectures will be for increasing the throughput of a multiprogramming workload that consists of both user activity and OS activity. To model this type of compute environment we use a program development workload that consists of the compile phase of the Modified Andrew Benchmark [18]. The Modified Andrew Benchmark uses the gcc compiler to compile 17 files. Our benchmark eliminates the two phases of the benchmark that install the object files in a library and delete the object files. We use a parallel make utility that allows up to four compilation processes to run at a time. Two of these parallel makes are launched at a time.

4 Results

In this section, we present the results for the three architectures running the applications described in Section 3. First we present the results for all seven of the applications using the Mipsy CPU model. These results are simple to understand because the CPU model stalls for all memory operations that take longer than a cycle. Thus, all the time spent in the memory system contributes directly to the total execution time. The Mipsy results for the shared-L1 architecture are optimistic because we do not include bank contention and we assume a 1-cycle hit time. The motivation for this is to avoid penalizing the shared-L1 architecture on a CPU simulator that has no support for the latency hiding mechanisms of non-blocking caches or dynamic instruction scheduling. To investigate the effect of these latency hiding mechanisms, the three cycle hit latency, and bank contention on the performance of the shared-L1 architecture, we use the MXS simulator. We present results from the MXS simulator for three applications with different sharing and working-set characteristics.

We present the M[...]tem; the component of interest in this investigation. The time spent waiting for a spin lock or for barrier synchronization is included in the CPU time. The speed of the load-linked and store-conditional memory operations used to implement these synchronization primitives affects the amount of time the processors spend synchronizing. Consequently, the level of data sharing in the memory hierarchy, which affects the speed of these memory operations, and the load balance in the application change the amount of CPU time shown in the graphs.

To provide insight into application behavior we present miss rate data on the working set and sharing characteristics of the applications. We break down the miss rates for the L1 data cache and the unified L2 cache for all three architectures. Each cache miss rate is broken into two components: the replacement miss rate (L1R, L2R) and the invalidation miss rate (L1I, L2I). Replacement misses are composed of cold, capacity and conflict misses. These misses are affected by the organization of the cache and the level of the memory hierarchy at which the processors share data. Invalidation misses are due to communication and are most affected by the level of sharing in the architecture, although the cache line size will affect the number of false sharing misses. All the cache miss rates we present are local miss rates: they are measured as misses per reference to the cache.

4.1 Hand Parallelized Applications
