`
`
`
`
`
Computer Architecture
`A Quantitative Approach
`
`Third Edition
`
`John L. Hennessy
`Stanford University
`
`David A. Patterson
University of California at Berkeley
`
`With Contributions by
`
`David Goldberg
`Xerox Palo Alto Research Center
`
`Krste Asanovic
Department of Electrical Engineering and Computer Science
Massachusetts Institute of Technology
`
`
`MORGAN KAUFMANN PUBLISHERS
`
`AN IMPRINT OF ELSEVIER SCIENCE
`AMSTERDAM BOSTON
`LONDON
`NEW YORK
`OXFORD
`PARIS
`SAN DIEGO
`SAN FRANCISCO
`SINGAPORE
`SYDNEY
`TOKYO
`
`
`
`Senior Editor Denise E. M. Penrose
`Assistant Publishing Services Manager Edward Wade
`Senior Production Editor Cheri Palmer
`Editorial Coordinator Alyson Day
`Cover Design Ross Carron Design
`Cover Image Greg Pease / gettyimages
`Text Design Rebecca Evans & Associates
Technical Illustration Lineworks, Inc.
`Composition Nancy Logan
`Copyeditor Ken DellaPenta
Proofreader Jennifer McClain
`Indexer Ty Koontz
`Printer Courier Corporation
Designations used by companies to distinguish their products are often claimed as trademarks or registered trademarks. In all instances in which Morgan Kaufmann Publishers is aware of a claim, the product names appear in initial capital or all capital letters. Readers, however, should contact the appropriate companies for more complete information regarding trademarks and registration.
`
`Morgan Kaufmann Publishers
`An Imprint of Elsevier Science
`340 Pine Street, Sixth Floor, San Francisco, CA 94104-3205, USA
`www.mkp.com
`
`© 1990, 1996, 2003 by Elsevier Science (USA)
`All rights reserved
`
`Published 1990. Third edition 2003
`Printed in the United States of America
07 06 05 04 03    10 9 8 7 6 5 4 3 2 1
`
No part of this publication may be reproduced, stored in a retrieval system, or transmitted in any form or by any means—electronic, mechanical, photocopying, or otherwise—without the prior written permission of the publisher.
`
ADVICE, PRAISE, & ERRORS: Any correspondence related to this publication or intended for the authors should be addressed to ca3comments@mkp.com. Information regarding error sightings is also encouraged. Any error sightings that are accepted for correction in subsequent printings will be rewarded by the authors with a payment of $1.00 (U.S.) per correction upon availability of the new printing. Bugs can be sent to cabugs@mkp.com. (Please include your full name and permanent mailing address.)
`
`Library of Congress Control Number: 2001099789
`
`ISBN 1-55860-596-7 (cloth)
`ISBN 1-55860-724-2 (paper)
`
This book is printed on acid-free paper.
`
`
`
`
6.1 Introduction
`
As the quotations that open this chapter show, the view that advances in uniprocessor architecture were nearing an end has been widely held at varying times. To counter this view, we observe that during the period 1985-2000, uniprocessor performance growth, driven by the microprocessor, was at its highest rate since the first transistorized computers in the late 1950s and early 1960s.
On balance, though, we believe that parallel processors will definitely have a bigger role in the future. This view is driven by three observations. First, since microprocessors are likely to remain the dominant uniprocessor technology, the logical way to improve performance beyond a single processor is by connecting multiple microprocessors together. This combination is likely to be more cost-effective than designing a custom processor. Second, it is unclear whether the pace of architectural innovation that has been based for more than 15 years on increased exploitation of instruction-level parallelism can be sustained indefinitely. As we saw in Chapters 3 and 4, modern multiple-issue processors have become incredibly complex, and the performance increases achieved through increasing complexity, increasing silicon, and increasing power seem to be diminishing. Third, there appears to be slow but steady progress on the major obstacle to widespread use of parallel processors, namely, software. This progress is probably faster in the server and embedded markets, as we discussed in Chapters 3 and 4. Server and embedded applications exhibit natural parallelism that can be exploited without some of the burdens of rewriting a gigantic software base. This is more of a challenge in the desktop space.
We, however, are extremely reluctant to predict the death of advances in uniprocessor architecture. Indeed, we believe that the rapid rate of performance growth will continue at least for the next five years. Whether this pace of innovation can be sustained longer is difficult to predict but hard to bet against. Nonetheless, if the pace of progress in uniprocessors does slow down, multiprocessor architectures will become increasingly attractive.
That said, we are left with two problems. First, multiprocessor architecture is a large and diverse field, and much of the field is in its youth, with ideas coming and going and, until very recently, more architectures failing than succeeding. Given that we are already on page 528, full coverage of the multiprocessor design space and its trade-offs would require another volume. (Indeed, Culler, Singh, and Gupta [1999] cover only multiprocessors in their 1000-page book!) Second, such coverage would necessarily entail discussing approaches that may not stand the test of time, something we have largely avoided to this point. For these reasons, we have chosen to focus on the mainstream of multiprocessor design: multiprocessors with small to medium numbers of processors (< 128). Such designs vastly dominate in terms of both units and dollars. We will pay only slight attention to the larger-scale multiprocessor design space (≥ 128 processors). At present, the future architecture of such multiprocessors is unsettled, and even the viability of that marketplace is in doubt. We will return to this topic briefly at the end of the chapter, in Section 6.15.
`
`
`
`
A Taxonomy of Parallel Architectures
`
We begin this chapter with a taxonomy so that you can appreciate both the breadth of design alternatives for multiprocessors and the context that has led to the development of the dominant form of multiprocessors. We briefly describe the alternatives and the rationale behind them; a longer description of how these different models were born (and often died) can be found in the historical perspective at the end of the chapter.
The idea of using multiple processors both to increase performance and to improve availability dates back to the earliest electronic computers. About 30 years ago, Flynn [1966] proposed a simple model of categorizing all computers that is still useful today. He looked at the parallelism in the instruction and data streams called for by the instructions at the most constrained component of the multiprocessor, and placed all computers into one of four categories:
`
1. Single instruction stream, single data stream (SISD)—This category is the uniprocessor.
2. Single instruction stream, multiple data streams (SIMD)—The same instruction is executed by multiple processors using different data streams. Each processor has its own data memory (hence multiple data), but there is a single instruction memory and control processor, which fetches and dispatches instructions. The multimedia extensions we considered in Chapter 2 are a limited form of SIMD parallelism. Vector architectures are the largest class of processors of this type.
3. Multiple instruction streams, single data stream (MISD)—No commercial multiprocessor of this type has been built to date, but may be in the future. Some special-purpose stream processors approximate a limited form of this (there is only a single data stream that is operated on by successive functional units).
4. Multiple instruction streams, multiple data streams (MIMD)—Each processor fetches its own instructions and operates on its own data. The processors are often off-the-shelf microprocessors.
`
This is a coarse model, as some multiprocessors are hybrids of these categories. Nonetheless, it is useful to put a framework on the design space.
As discussed in the historical perspectives, many of the early multiprocessors were SIMD, and the SIMD model received renewed attention in the 1980s, but, except for vector processors, was gone by the mid-1990s. MIMD has clearly emerged as the architecture of choice for general-purpose multiprocessors. Two factors are primarily responsible for the rise of the MIMD multiprocessors:

1. MIMDs offer flexibility. With the correct hardware and software support, MIMDs can function as single-user multiprocessors focusing on high performance for one application, as multiprogrammed multiprocessors running many tasks simultaneously, or as some combination of these functions.
`
`
`
`
2. MIMDs can build on the cost-performance advantages of off-the-shelf microprocessors. In fact, nearly all multiprocessors built today use the same microprocessors found in workstations and single-processor servers.
`
With an MIMD, each processor is executing its own instruction stream. In many cases, each processor executes a different process. Recall from the last chapter that a process is a segment of code that may be run independently, and that the state of the process contains all the information necessary to execute that program on a processor. In a multiprogrammed environment, where the processors may be running independent tasks, each process is typically independent of the processes on other processors.
It is also useful to be able to have multiple processors executing a single program and sharing the code and most of their address space. When multiple processes share code and data in this way, they are often called threads. Today, the term thread is often used in a casual way to refer to multiple loci of execution that may run on different processors, even when they do not share an address space.
To take advantage of an MIMD multiprocessor with n processors, we must usually have at least n threads or processes to execute. The independent threads are typically identified by the programmer or created by the compiler. Since the parallelism in this situation is contained in the threads, it is called thread-level parallelism.
Threads may vary from large-scale, independent processes—for example, independent programs running in a multiprogrammed fashion on different processors—to parallel iterations of a loop, automatically generated by a compiler and each executing for perhaps less than a thousand instructions. Although the size of a thread is important in considering how to exploit thread-level parallelism efficiently, the important qualitative distinction is that such parallelism is identified at a high level by the software system and that the threads consist of hundreds to millions of instructions that may be executed in parallel. In contrast, instruction-level parallelism is identified primarily by the hardware, although with software help in some cases, and is found and exploited one instruction at a time.
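
To make the fine-grained end of this spectrum concrete, the following is a minimal sketch of ours (not from the book) in C with POSIX threads: the iterations of a loop are divided among a fixed number of threads, one per processor, much as a programmer or parallelizing compiler would do. The names (worker, slice, N_THREADS) are illustrative.

    #include <pthread.h>
    #include <stdio.h>

    #define N_THREADS 4              /* one thread per processor (assumed) */
    #define N 1000000

    static double a[N], b[N], c[N];

    struct slice { int lo, hi; };    /* half-open iteration range [lo, hi) */

    static void *worker(void *arg)
    {
        struct slice *s = (struct slice *)arg;
        for (int i = s->lo; i < s->hi; i++)
            c[i] = a[i] + b[i];      /* independent iterations: parallelism
                                        identified by the software system  */
        return NULL;
    }

    int main(void)
    {
        pthread_t tid[N_THREADS];
        struct slice s[N_THREADS];
        int chunk = N / N_THREADS;

        for (int t = 0; t < N_THREADS; t++) {
            s[t].lo = t * chunk;
            s[t].hi = (t == N_THREADS - 1) ? N : (t + 1) * chunk;
            pthread_create(&tid[t], NULL, worker, &s[t]);
        }
        for (int t = 0; t < N_THREADS; t++)
            pthread_join(tid[t], NULL);
        printf("c[0] = %f\n", c[0]);
        return 0;
    }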
Existing MIMD multiprocessors fall into two classes, depending on the number of processors involved, which in turn dictates a memory organization and interconnect strategy. We refer to the multiprocessors by their memory organization because what constitutes a small or large number of processors is likely to change over time.
The first group, which we call centralized shared-memory architectures, has at most a few dozen processors in 2000. For multiprocessors with small processor counts, it is possible for the processors to share a single centralized memory and to interconnect the processors and memory by a bus. With large caches, the bus and the single memory, possibly with multiple banks, can satisfy the memory demands of a small number of processors. By replacing a single bus with multiple buses, or even a switch, a centralized shared-memory design can be scaled to a few dozen processors. Although scaling beyond that is technically conceivable, sharing a centralized memory, even organized as multiple banks, becomes less attractive as the number of processors sharing it increases.
`
`
`
`
`
`
Because there is a single main memory that has a symmetric relationship to all processors and a uniform access time from any processor, these multiprocessors are often called symmetric (shared-memory) multiprocessors (SMPs), and this style of architecture is sometimes called uniform memory access (UMA). This type of centralized shared-memory architecture is currently by far the most popular organization. Figure 6.1 shows what these multiprocessors look like. The architecture of such multiprocessors is the topic of Section 6.3.
The second group consists of multiprocessors with physically distributed memory. To support larger processor counts, memory must be distributed among the processors rather than centralized; otherwise the memory system would not be able to support the bandwidth demands of a larger number of processors without incurring excessively long access latency. With the rapid increase in processor performance and the associated increase in a processor's memory bandwidth requirements, the scale of multiprocessor for which distributed memory is preferred over a single, centralized memory continues to decrease in number (which is another reason not to use small and large scale). Of course, the larger number of processors raises the need for a high-bandwidth interconnect, of which we will see examples in Chapter 8. Both direct interconnection networks (i.e., switches) and indirect networks (typically multidimensional meshes) are used. Figure 6.2 shows what these multiprocessors look like.
`
[Figure: four processors, each with one or more levels of cache, connected by a bus to a shared main memory and I/O system]

Figure 6.1 Basic structure of a centralized shared-memory multiprocessor. Multiple processor-cache subsystems share the same physical memory, typically connected by a bus. In larger designs, multiple buses, or even a switch, may be used, but the key architectural property—uniform access time to all memory from all processors—remains.
`
`
`
`
[Figure: several nodes, each containing a processor + cache, memory, and I/O, connected to one another by an interconnection network]

Figure 6.2 The basic architecture of a distributed-memory multiprocessor consists of individual nodes containing a processor, some memory, typically some I/O, and an interface to an interconnection network that connects all the nodes. Individual nodes may contain a small number of processors, which may be interconnected by a small bus or a different interconnection technology, which is less scalable than the global interconnection network.
`
Distributing the memory among the nodes has two major benefits. First, it is a cost-effective way to scale the memory bandwidth if most of the accesses are to the local memory in the node. Second, it reduces the latency for accesses to the local memory. These two advantages make distributed memory attractive at smaller processor counts as processors get ever faster and require more memory bandwidth and lower memory latency. The key disadvantage for a distributed-memory architecture is that communicating data between processors becomes somewhat more complex and has higher latency, at least when there is no contention, because the processors no longer share a single, centralized memory. As we will see shortly, the use of distributed memory leads to two different paradigms for interprocessor communication.
Typically, I/O as well as memory is distributed among the nodes of the multiprocessor, and the nodes may be small SMPs (two to eight processors). Although the use of multiple processors in a node together with a memory and a network interface may be quite useful from a cost-efficiency viewpoint, it is not fundamental to how these multiprocessors work, and so we will focus on the one-processor-per-node design for most of this chapter.
`
`Models for Communication and Memory Architecture
`
As discussed earlier, any large-scale multiprocessor must use multiple memories that are physically distributed with the processors. There are two alternative architectural approaches that differ in the method used for communicating data among processors.
`
`
`
`
In the first method, communication occurs through a shared address space. That is, the physically separate memories can be addressed as one logically shared address space, meaning that a memory reference can be made by any processor to any memory location, assuming it has the correct access rights. These multiprocessors are called distributed shared-memory (DSM) architectures. The term shared memory refers to the fact that the address space is shared; that is, the same physical address on two processors refers to the same location in memory. Shared memory does not mean that there is a single, centralized memory. In contrast to the symmetric shared-memory multiprocessors, also known as UMAs (uniform memory access), the DSM multiprocessors are also called NUMAs (nonuniform memory access), since the access time depends on the location of a data word in memory.
Alternatively, the address space can consist of multiple private address spaces that are logically disjoint and cannot be addressed by a remote processor. In such multiprocessors, the same physical address on two different processors refers to two different locations in two different memories. Each processor-memory module is essentially a separate computer; therefore, these parallel processors have been called multicomputers. A multicomputer can even consist of completely separate computers connected on a local area network, which today are popularly called clusters. For applications that require little or no communication and can make use of separate memories, such clusters of processors, whether using a standardized or customized interconnect, can form a very cost-effective approach (see Section 8.10).
With each of these organizations for the address space, there is an associated communication mechanism. For a multiprocessor with a shared address space, that address space can be used to communicate data implicitly via load and store operations; hence the name shared memory for such multiprocessors. For a multiprocessor with multiple address spaces, communication of data is done by explicitly passing messages among the processors. Therefore, these multiprocessors are often called message-passing multiprocessors.
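
As a concrete illustration of the first mechanism, here is a minimal sketch of ours (not from the book) in C: one thread communicates a value to another through ordinary stores and loads to the shared address space. The C11 atomic flag stands in for whatever synchronization the machine's memory consistency model requires; all names are illustrative.

    #include <stdatomic.h>
    #include <pthread.h>
    #include <stdio.h>

    static int data;               /* the value being communicated          */
    static atomic_int ready;       /* flag meaning "data is valid"          */

    static void *producer(void *arg)
    {
        data = 42;                 /* communication is just a store ...     */
        atomic_store_explicit(&ready, 1, memory_order_release);
        return NULL;
    }

    static void *consumer(void *arg)
    {
        while (!atomic_load_explicit(&ready, memory_order_acquire))
            ;                      /* spin until the producer's flag is set */
        printf("received %d\n", data);   /* ... and a load                  */
        return NULL;
    }

    int main(void)
    {
        pthread_t p, c;
        pthread_create(&p, NULL, producer, NULL);
        pthread_create(&c, NULL, consumer, NULL);
        pthread_join(p, NULL);
        pthread_join(c, NULL);
        return 0;
    }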
In message-passing multiprocessors, communication occurs by sending messages that request action or deliver data, just as with the network protocols discussed in Section 8.2. For example, if one processor wants to access or operate on data in a remote memory, it can send a message to request the data or to perform some operation on the data. In such cases, the message can be thought of as a remote procedure call (RPC). When the destination processor receives the message, either by polling for it or via an interrupt, it performs the operation or access on behalf of the remote processor and returns the result with a reply message. This type of message passing is also called synchronous, since the initiating processor sends a request and waits until the reply is returned before continuing. Software systems have been constructed to encapsulate the details of sending and receiving messages, including passing complex arguments or return values, presenting a clean RPC facility to the programmer.
Communication can also occur from the viewpoint of the writer of data rather than the reader, and this can be more efficient when the processor producing data knows which other processors will need the data. In such cases, the data can be sent directly to the consumer process without having to be requested first. It is often possible to perform such message sends asynchronously, allowing the sender process to continue immediately. Often the receiver will want to block if it tries to receive the message before it has arrived; in other cases, the reader may check whether a message is pending before actually trying to perform a blocking receive. Also, the sender must be prepared to block if the receiver has not yet consumed an earlier message and no buffer space is available. The message-passing facilities offered in different multiprocessors are fairly diverse. To ease program portability, standard message-passing libraries (for example, message-passing interface, or MPI) have been proposed. Such libraries sacrifice some performance to achieve a common interface.
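
For contrast with the shared-memory sketch above, here is an equally minimal sketch (our example, not the book's) of explicit message passing using the MPI library just mentioned: process 0 sends a value to process 1 with a blocking send, the synchronous style described earlier. Compile with mpicc and run with mpirun -np 2.

    #include <mpi.h>
    #include <stdio.h>

    int main(int argc, char **argv)
    {
        int rank, value;

        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);

        if (rank == 0) {
            value = 42;
            /* explicit, blocking send to process 1 with message tag 0 */
            MPI_Send(&value, 1, MPI_INT, 1, 0, MPI_COMM_WORLD);
        } else if (rank == 1) {
            MPI_Recv(&value, 1, MPI_INT, 0, 0, MPI_COMM_WORLD,
                     MPI_STATUS_IGNORE);
            printf("process 1 received %d\n", value);
        }

        MPI_Finalize();
        return 0;
    }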
`
`Performance Metrics for Communication Mechanisms
`
Three performance metrics are critical in any communication mechanism:
`
1. Communication bandwidth—Ideally the communication bandwidth is limited by processor, memory, and interconnection bandwidths, rather than by some aspect of the communication mechanism. The bisection bandwidth (see Section 8.5) is determined by the interconnection network. The bandwidth in or out of a single node, which is often as important as bisection bandwidth, is affected both by the architecture within the node and by the communication mechanism. How does the communication mechanism affect the communication bandwidth of a node? When communication occurs, resources within the nodes involved in the communication are tied up or occupied, preventing other outgoing or incoming communication. When this occupancy is incurred for each word of a message, it sets an absolute limit on the communication bandwidth. This limit is often lower than what the network or memory system can provide. Occupancy may also have a component that is incurred for each communication event, such as an incoming or outgoing request. In the latter case, the occupancy limits the communication rate, and the impact of the occupancy on overall communication bandwidth depends on the size of the messages.
2. Communication latency—Ideally the latency is as low as possible. As we will see in Chapter 8,

Communication latency = Sender overhead + Time of flight + Transmission time + Receiver overhead

(a small numeric sketch of this equation follows this list). Time of flight is fixed and transmission time is determined by the interconnection network. The software and hardware overheads in sending and receiving messages are largely determined by the communication mechanism and its implementation. Why is latency crucial? Latency affects both performance and how easy it is to program a multiprocessor. Unless latency is hidden, it directly affects performance either by tying up processor resources or by causing the processor to wait. Overhead and occupancy are closely related, since many forms of overhead also tie up some part of the node, incurring an occupancy cost, which in turn limits bandwidth. Key features of a communication mechanism may directly affect overhead and occupancy. For example, how is the destination address for a remote communication named, and how is protection implemented? When naming and protection mechanisms are provided by the processor, as in a shared address space, the additional overhead is small. Alternatively, if these mechanisms must be provided by the operating system for each communication, this increases the overhead and occupancy costs of communication, which in turn reduce bandwidth and increase latency.
3. Communication latency hiding—How well can the communication mechanism hide latency by overlapping communication with computation or with other communication? Although measuring this is not as simple as measuring the first two metrics, it is an important characteristic that can be quantified by measuring the running time on multiprocessors with the same communication latency but different support for latency hiding. We will see examples of latency-hiding techniques for shared memory in Sections 6.8 and 6.10. Although hiding latency is certainly a good idea, it poses an additional burden on the software system and ultimately on the programmer. Furthermore, the amount of latency that can be hidden is application dependent. Thus, it is usually best to reduce latency wherever possible.
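
The numeric sketch promised in item 2 above follows; all numbers are assumptions of ours, not measurements from the book. It evaluates the latency equation and shows how a per-word occupancy caps per-node bandwidth:

    #include <stdio.h>

    int main(void)
    {
        /* latency components in microseconds (assumed values) */
        double sender_overhead   = 1.0;
        double time_of_flight    = 0.5;  /* fixed by physical distance    */
        double transmission_time = 2.0;  /* message size / link bandwidth */
        double receiver_overhead = 1.5;

        double latency = sender_overhead + time_of_flight
                       + transmission_time + receiver_overhead;
        printf("communication latency = %.1f us\n", latency);  /* 5.0 us */

        /* if a node is occupied 0.2 us per word transferred, its bandwidth
           is capped at 1/0.2 = 5 M words/s, regardless of what the network
           or memory system could otherwise provide */
        double occupancy_per_word = 0.2;                       /* assumed */
        printf("max node bandwidth = %.1f Mwords/s\n",
               1.0 / occupancy_per_word);
        return 0;
    }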
`
Each of these performance measures is affected by the characteristics of the communications needed in the application. The size of the data items being communicated is the most obvious, since it affects both latency and bandwidth in a direct way, as well as affecting the efficacy of different latency-hiding approaches. Similarly, the regularity in the communication patterns affects the cost of naming and protection, and hence the communication overhead. In general, mechanisms that perform well with smaller as well as larger data communication requests, and irregular as well as regular communication patterns, are more flexible and efficient for a wider class of applications. Of course, in considering any communication mechanism, designers must consider cost as well as performance.
`
Advantages of Different Communication Mechanisms
`
Each of these two primary communication mechanisms has its advantages. For shared-memory communication, advantages include

■ Compatibility with the well-understood mechanisms in use in centralized multiprocessors, which all use shared-memory communication. The OpenMP consortium (see www.openmp.org for a description) has proposed a standardized programming interface for shared-memory multiprocessors; a minimal example appears after this list.
■ Ease of programming when the communication patterns among processors are complex or vary dynamically during execution. Similar advantages simplify compiler design.
`
`
`
`
■ The ability to develop applications using the familiar shared-memory model, focusing attention only on those accesses that are performance critical.
■ Lower overhead for communication and better use of bandwidth when communicating small items. This arises from the implicit nature of communication and the use of memory mapping to implement protection in hardware, rather than through the I/O system.
■ The ability to use hardware-controlled caching to reduce the frequency of remote communication by supporting automatic caching of all data, both shared and private. As we will see, caching reduces both latency and contention for accessing shared data. This advantage also comes with a disadvantage, which we mention below.
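
The OpenMP example promised in the first bullet above is a minimal sketch of ours: a single pragma asks the compiler to split the loop's iterations across the processors of a shared-memory multiprocessor, with communication handled implicitly through the shared address space. Compile with a flag such as -fopenmp.

    #include <stdio.h>

    #define N 1000000
    static double a[N], b[N];

    int main(void)
    {
        double sum = 0.0;

        /* iterations are divided among threads; a and b are shared, and
           the per-thread partial sums are combined by the reduction clause */
        #pragma omp parallel for reduction(+:sum)
        for (int i = 0; i < N; i++)
            sum += a[i] * b[i];

        printf("dot product = %f\n", sum);
        return 0;
    }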
`
The major advantages for message-passing communication include the following:

■ The hardware can be simpler, especially by comparison with a scalable shared-memory implementation that supports coherent caching of remote data.
■ Communication is explicit, which means it is simpler to understand; in shared-memory models, it can be difficult to know when communication is occurring and when it is not, as well as how costly the communication is.
■ Explicit communication focuses programmer attention on this costly aspect of parallel computation, sometimes leading to improved structure in a multiprocessor program.
■ Synchronization is naturally associated with sending messages, reducing the possibility for errors introduced by incorrect synchronization.
■ It makes it easier to use sender-initiated communication, which may have some advantages in performance.
`
Of course, the desired communication model can be created on top of a hardware model that supports either of these mechanisms. Supporting message passing on top of shared memory is considerably easier: Because messages essentially send data from one memory to another, sending a message can be implemented by doing a copy from one portion of the address space to another. The major difficulties arise from dealing with messages that may be misaligned and of arbitrary length in a memory system that is normally oriented toward transferring aligned blocks of data organized as cache blocks. These difficulties can be overcome either with small performance penalties in software or with essentially no penalties, using a small amount of hardware support.
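
A minimal sketch of this point (ours, with illustrative names): on a shared-memory machine, a message "send" reduces to a copy from one region of the address space to another. A real implementation would need atomic operations or memory fences around the full flag; the spin loops here merely sketch the blocking behavior described earlier.

    #include <string.h>
    #include <stddef.h>

    struct mailbox {
        volatile int full;         /* set while a message is waiting      */
        size_t len;
        char buf[256];
    };

    /* "send": block until the buffer drains, then copy the message in */
    void send_msg(struct mailbox *mb, const void *msg, size_t len)
    {
        while (mb->full)
            ;                      /* receiver has not consumed the last one */
        memcpy(mb->buf, msg, len); /* the send is a memory-to-memory copy    */
        mb->len = len;
        mb->full = 1;
    }

    /* "receive": block until a message arrives, then copy it out */
    size_t recv_msg(struct mailbox *mb, void *out)
    {
        while (!mb->full)
            ;                      /* block (spin) until a message arrives */
        size_t len = mb->len;
        memcpy(out, mb->buf, len);
        mb->full = 0;
        return len;
    }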
Supporting shared memory efficiently on top of hardware for message passing is much more difficult. Without explicit hardware support for shared memory, all shared-memory references need to involve the operating system to provide address translation and memory protection, as well as to translate memory references into message sends and receives. Loads and stores usually move small amounts of data, so the high overhead of handling these communications in software severely limits the range of applications for which the performance of software-based shared memory is acceptable. An ongoing area of research is the exploration of when a software-based model is acceptable and whether a software-based mechanism is usable for the highest level of communication in a hierarchically structured system. One possible direction is the use of virtual memory mechanisms to share objects at the page level, a technique called shared virtual memory; we discuss this approach in Section 6.10.
In distributed-memory multiprocessors, the memory model and communication mechanisms distinguish the multiprocessors. Originally, distributed-memory multiprocessors were built with message passing, since it was clearly simpler and many designers and researchers did not believe that a shared address space could be built with distributed memory. Shared-memory communication has been supported in virtually every multiprocessor designed since 1995. What hardware communication mechanisms will be supported in the very largest multiprocessors, called massively parallel processors (MPPs), which typically have far more than 100 processors, is unclear; shared memory, message passing, and hybrid approaches are all contenders. Despite the symbolic importance of the MPPs, such multiprocessors are a small portion of the market and have little or no influence on the mainstream multiprocessors with tens of processors. We will return to a discussion of the possibilities and trends for MPPs in the concluding remarks and historical perspective at the end of this chapter.
SMPs, which we focus on in Section 6.3, vastly dominate DSM multiprocessors in terms of market size (both units and dollars) and will probably be the architecture of choice for on-chip multiprocessors. For moderate-scale multiprocessors (> 8 processors), long-term technical trends favor distributing memory, which is also likely to be the dominant approach when on-chip SMPs are used as the building blocks in the future. These distributed shared-memory multiprocessors are a natural extension of the centralized multiprocessors that dominate the market, so we discuss these architectures in Section 6.5. In contrast, multicomputers or message-passing multiprocessors build on advances in network technology and are described in Chapter 8. Since the technologies employed were well described in the last chapter, we focus our attention on shared-memory approaches in the rest of this chapter.
`
Challenges of Parallel Processing
`
Two important hurdles, both explainable with Amdahl's Law, make parallel processing challenging. The first has to do with the limited parallelism available in programs, and the second arises from the relatively high cost of communications. Limitations in available parallelism make it difficult to achieve good speedups in any parallel processor, as our first example shows.
`
`
`Example
`
Suppose you want to achieve a speedup of 80 with 100 processors. What fraction of the original computation can be sequential?
`
`
`
`
Answer Amdahl's Law is

Speedup = 1 / (Fraction_enhanced / Speedup_enhanced + (1 − Fraction_enhanced))

For simplicity in this example, assume that the program operates in only two modes: parallel with all processors fully used, which is the enhanced mode, or serial with only one processor in use. With this simplification, the speedup in enhanced mode is simply the number of processors, while the fraction of enhanced mode is the time spent in parallel mode. Substituting into the previous equation:

80 = 1 / (Fraction_parallel / 100 + (1 − Fraction_parallel))

Simplifying this equation yields

0.8 × Fraction_parallel + 80 × (1 − Fraction_parallel) = 1
80 − 79.2 × Fraction_parallel = 1
Fraction_parallel = (80 − 1) / 79.2
Fraction_parallel = 0.9975
`
Thus to achieve a speedup of 80 with 100 processors, only 0.25% of the original computation can be sequential. Of course, to achieve linear speedup (speedup of n with n processors), the entire program must usually be parallel with no serial portions. (One exception to this is superlinear speedup, which occurs due to the increased memory and cache available when the processor count is increased.)
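
As a quick check of this arithmetic, the following small sketch of ours solves Speedup = 1 / (Fraction_parallel/n + (1 − Fraction_parallel)) for Fraction_parallel:

    #include <stdio.h>

    /* parallel fraction needed for a given speedup on n processors,
       from rearranging Amdahl's Law: f = (1 - 1/S) / (1 - 1/n) */
    double parallel_fraction_needed(double speedup, double n)
    {
        return (1.0 - 1.0 / speedup) / (1.0 - 1.0 / n);
    }

    int main(void)
    {
        /* a speedup of 80 on 100 processors requires a parallel fraction
           of 0.9975, i.e., only 0.25% of the computation may be serial */
        printf("%.4f\n", parallel_fraction_needed(80.0, 100.0));
        return 0;
    }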