Computer Architecture
A Quantitative Approach

Third Edition

John L. Hennessy
Stanford University

David A. Patterson
University of California at Berkeley

With Contributions by

David Goldberg
Xerox Palo Alto Research Center

Krste Asanovic
Department of Electrical Engineering and Computer Science
Massachusetts Institute of Technology

MORGAN KAUFMANN PUBLISHERS
AN IMPRINT OF ELSEVIER SCIENCE
AMSTERDAM  BOSTON  LONDON  NEW YORK  OXFORD  PARIS
SAN DIEGO  SAN FRANCISCO  SINGAPORE  SYDNEY  TOKYO
Senior Editor  Denise E. M. Penrose
Assistant Publishing Services Manager  Edward Wade
Senior Production Editor  Cheri Palmer
Editorial Coordinator  Alyson Day
Cover Design  Ross Carron Design
Cover Image  Greg Pease / gettyimages
Text Design  Rebecca Evans & Associates
Technical Illustration  Lineworks, Inc.
Composition  Nancy Logan
Copyeditor  Ken DellaPenta
Proofreader  Jennifer McClain
Indexer  Ty Koontz
Printer  Courier Corporation

Designations used by companies to distinguish their products are often claimed as trademarks or registered trademarks. In all instances in which Morgan Kaufmann Publishers is aware of a claim, the product names appear in initial capital or all capital letters. Readers, however, should contact the appropriate companies for more complete information regarding trademarks and registration.

Morgan Kaufmann Publishers
An Imprint of Elsevier Science
340 Pine Street, Sixth Floor, San Francisco, CA 94104-3205, USA
www.mkp.com

© 1990, 1996, 2003 by Elsevier Science (USA)
All rights reserved

Published 1990. Third edition 2003
Printed in the United States of America
07 06 05 04 03    10 9 8 7 6 5 4 3 2 1

No part of this publication may be reproduced, stored in a retrieval system, or transmitted in any form or by any means—electronic, mechanical, photocopying, or otherwise—without the prior written permission of the publisher.

ADVICE, PRAISE, & ERRORS: Any correspondence related to this publication or intended for the authors should be addressed to ca3comments@mkp.com. Information regarding error sightings is also encouraged. Any error sightings that are accepted for correction in subsequent printings will be rewarded by the authors with a payment of $1.00 (U.S.) per correction upon availability of the new printing. Bugs can be sent to cabugs@mkp.com. (Please include your full name and permanent mailing address.)

Library of Congress Control Number: 2001099789

ISBN 1-55860-596-7 (cloth)
ISBN 1-55860-724-2 (paper)

This book is printed on acid-free paper.
Chapter Six  Multiprocessors and Thread-Level Parallelism

6.1  Introduction
`
As the quotations that open this chapter show, the view that advances in uniprocessor architecture were nearing an end has been widely held at varying times. To counter this view, we observe that during the period 1985-2000, uniprocessor performance growth, driven by the microprocessor, was at its highest rate since the first transistorized computers in the late 1950s and early 1960s.
On balance, though, we believe that parallel processors will definitely have a bigger role in the future. This view is driven by three observations. First, since microprocessors are likely to remain the dominant uniprocessor technology, the logical way to improve performance beyond a single processor is by connecting multiple microprocessors together. This combination is likely to be more cost-effective than designing a custom processor. Second, it is unclear whether the pace of architectural innovation that has been based for more than 15 years on increased exploitation of instruction-level parallelism can be sustained indefinitely. As we saw in Chapters 3 and 4, modern multiple-issue processors have become incredibly complex, and the performance increases achieved through increasing complexity, increasing silicon, and increasing power seem to be diminishing. Third, there appears to be slow but steady progress on the major obstacle to widespread use of parallel processors, namely, software. This progress is probably faster in the server and embedded markets, as we discussed in Chapters 3 and 4. Server and embedded applications exhibit natural parallelism that can be exploited without some of the burdens of rewriting a gigantic software base. This is more of a challenge in the desktop space.
We, however, are extremely reluctant to predict the death of advances in uniprocessor architecture. Indeed, we believe that the rapid rate of performance growth will continue at least for the next five years. Whether this pace of innovation can be sustained longer is difficult to predict but hard to bet against. Nonetheless, if the pace of progress in uniprocessors does slow down, multiprocessor architectures will become increasingly attractive.
That said, we are left with two problems. First, multiprocessor architecture is a large and diverse field, and much of the field is in its youth, with ideas coming and going and, until very recently, more architectures failing than succeeding. Given that we are already on page 528, full coverage of the multiprocessor design space and its trade-offs would require another volume. (Indeed, Culler, Singh, and Gupta [1999] cover only multiprocessors in their 1000-page book!) Second, such coverage would necessarily entail discussing approaches that may not stand the test of time, something we have largely avoided to this point. For these reasons, we have chosen to focus on the mainstream of multiprocessor design: multiprocessors with small to medium numbers of processors (< 128). Such designs vastly dominate in terms of both units and dollars. We will pay only slight attention to the larger-scale multiprocessor design space (≥ 128 processors). At the present, the future architecture of such multiprocessors is unsettled, and even the viability of that marketplace is in doubt. We will return to this topic briefly at the end of the chapter, in Section 6.15.
A Taxonomy of Parallel Architectures

We begin this chapter with a taxonomy so that you can appreciate both the breadth of design alternatives for multiprocessors and the context that has led to the development of the dominant form of multiprocessors. We briefly describe the alternatives and the rationale behind them; a longer description of how these different models were born (and often died) can be found in the historical perspective at the end of the chapter.
The idea of using multiple processors both to increase performance and to improve availability dates back to the earliest electronic computers. About 30 years ago, Flynn [1966] proposed a simple model of categorizing all computers that is still useful today. He looked at the parallelism in the instruction and data streams called for by the instructions at the most constrained component of the multiprocessor, and placed all computers into one of four categories:
1. Single instruction stream, single data stream (SISD)—This category is the uniprocessor.

2. Single instruction stream, multiple data streams (SIMD)—The same instruction is executed by multiple processors using different data streams. Each processor has its own data memory (hence multiple data), but there is a single instruction memory and control processor, which fetches and dispatches instructions. The multimedia extensions we considered in Chapter 2 are a limited form of SIMD parallelism. Vector architectures are the largest class of processors of this type.

3. Multiple instruction streams, single data stream (MISD)—No commercial multiprocessor of this type has been built to date, but may be in the future. Some special-purpose stream processors approximate a limited form of this (there is only a single data stream that is operated on by successive functional units).

4. Multiple instruction streams, multiple data streams (MIMD)—Each processor fetches its own instructions and operates on its own data. The processors are often off-the-shelf microprocessors.
This is a coarse model, as some multiprocessors are hybrids of these categories. Nonetheless, it is useful to put a framework on the design space.
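To make the SIMD/MIMD distinction concrete, the sketch below is ours, not the authors'; the array size, thread count, and function names are arbitrary illustrative choices. The first routine is a single instruction stream whose one operation is applied across many data elements (the style that vector units or the multimedia extensions of Chapter 2 exploit), while the second starts two threads that each fetch their own instruction stream and operate on their own data.

    /* Illustrative sketch (not from the text): SIMD style versus MIMD style. */
    #include <pthread.h>
    #define N 1024
    float a[N], b[N], c[N];
    double partial_sum;

    /* SIMD style: one instruction stream, multiple data streams.  The same
     * add is applied to every element; SIMD hardware can execute several
     * iterations of this loop per instruction. */
    void simd_style(void)
    {
        for (int i = 0; i < N; i++)
            c[i] = a[i] + b[i];
    }

    /* MIMD style: multiple instruction streams, multiple data streams.
     * Two threads run entirely different code on different data. */
    void *scale_lower_half(void *arg)
    {
        for (int i = 0; i < N / 2; i++) a[i] *= 2.0f;
        return 0;
    }

    void *sum_upper_half(void *arg)
    {
        for (int i = N / 2; i < N; i++) partial_sum += b[i];
        return 0;
    }

    void mimd_style(void)
    {
        pthread_t t1, t2;
        pthread_create(&t1, 0, scale_lower_half, 0);
        pthread_create(&t2, 0, sum_upper_half, 0);
        pthread_join(t1, 0);
        pthread_join(t2, 0);
    }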
As discussed in the historical perspectives, many of the early multiprocessors were SIMD, and the SIMD model received renewed attention in the 1980s but, except for vector processors, was gone by the mid-1990s. MIMD has clearly emerged as the architecture of choice for general-purpose multiprocessors. Two factors are primarily responsible for the rise of the MIMD multiprocessors:
1. MIMDs offer flexibility. With the correct hardware and software support, MIMDs can function as single-user multiprocessors focusing on high performance for one application, as multiprogrammed multiprocessors running many tasks simultaneously, or as some combination of these functions.
2. MIMDs can build on the cost-performance advantages of off-the-shelf microprocessors. In fact, nearly all multiprocessors built today use the same microprocessors found in workstations and single-processor servers.
With an MIMD, each processor is executing its own instruction stream. In many cases, each processor executes a different process. Recall from the last chapter that a process is a segment of code that may be run independently, and that the state of the process contains all the information necessary to execute that program on a processor. In a multiprogrammed environment, where the processors may be running independent tasks, each process is typically independent of the processes on other processors.
It is also useful to be able to have multiple processors executing a single program and sharing the code and most of their address space. When multiple processes share code and data in this way, they are often called threads. Today, the term thread is often used in a casual way to refer to multiple loci of execution that may run on different processors, even when they do not share an address space.
To take advantage of an MIMD multiprocessor with n processors, we must usually have at least n threads or processes to execute. The independent threads are typically identified by the programmer or created by the compiler. Since the parallelism in this situation is contained in the threads, it is called thread-level parallelism.

Threads may vary from large-scale, independent processes—for example, independent programs running in a multiprogrammed fashion on different processors—to parallel iterations of a loop, automatically generated by a compiler and each executing for perhaps less than a thousand instructions. Although the size of a thread is important in considering how to exploit thread-level parallelism efficiently, the important qualitative distinction is that such parallelism is identified at a high level by the software system and that the threads consist of hundreds to millions of instructions that may be executed in parallel. In contrast, instruction-level parallelism is identified primarily by the hardware, although with software help in some cases, and is found and exploited one instruction at a time.
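As a concrete sketch of programmer-identified thread-level parallelism (our illustration, using POSIX threads; the thread count, array size, and names are arbitrary), the fragment below divides the iterations of a single loop among NPROC worker threads, each of which then executes many thousands of instructions independently:

    /* Thread-level parallelism: the programmer divides loop iterations among
     * NPROC threads.  Assumes POSIX threads; sizes are illustrative only. */
    #include <pthread.h>
    #define NPROC 8
    #define N (1 << 20)

    static double x[N], y[N];

    static void *worker(void *arg)
    {
        long id = (long)arg;
        long lo = id * (N / NPROC), hi = lo + N / NPROC;
        for (long i = lo; i < hi; i++)
            y[i] = 3.0 * x[i] + y[i];   /* many independent instructions per thread */
        return 0;
    }

    int main(void)
    {
        pthread_t t[NPROC];
        for (long id = 0; id < NPROC; id++)
            pthread_create(&t[id], 0, worker, (void *)id);
        for (long id = 0; id < NPROC; id++)
            pthread_join(t[id], 0);
        return 0;
    }

Each thread here would run on one processor of an MIMD multiprocessor; a parallelizing compiler could generate essentially the same decomposition automatically.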
Existing MIMD multiprocessors fall into two classes, depending on the number of processors involved, which in turn dictate a memory organization and interconnect strategy. We refer to the multiprocessors by their memory organization because what constitutes a small or large number of processors is likely to change over time.
The first group, which we call centralized shared-memory architectures, has at most a few dozen processors in 2000. For multiprocessors with small processor counts, it is possible for the processors to share a single centralized memory and to interconnect the processors and memory by a bus. With large caches, the bus and the single memory, possibly with multiple banks, can satisfy the memory demands of a small number of processors. By replacing a single bus with multiple buses, or even a switch, a centralized shared-memory design can be scaled to a few dozen processors. Although scaling beyond that is technically conceivable, sharing a centralized memory, even organized as multiple banks, becomes less attractive as the number of processors sharing it increases.
Because there is a single main memory that has a symmetric relationship to all processors and a uniform access time from any processor, these multiprocessors are often called symmetric (shared-memory) multiprocessors (SMPs), and this style of architecture is sometimes called uniform memory access (UMA). This type of centralized shared-memory architecture is currently by far the most popular organization. Figure 6.1 shows what these multiprocessors look like. The architecture of such multiprocessors is the topic of Section 6.3.
The second group consists of multiprocessors with physically distributed memory. To support larger processor counts, memory must be distributed among the processors rather than centralized; otherwise the memory system would not be able to support the bandwidth demands of a larger number of processors without incurring excessively long access latency. With the rapid increase in processor performance and the associated increase in a processor's memory bandwidth requirements, the scale of multiprocessor for which distributed memory is preferred over a single, centralized memory continues to decrease in number (which is another reason not to use small and large scale). Of course, the larger number of processors raises the need for a high bandwidth interconnect, of which we will see examples in Chapter 8. Both direct interconnection networks (i.e., switches) and indirect networks (typically multidimensional meshes) are used. Figure 6.2 shows what these multiprocessors look like.
Figure 6.1  Basic structure of a centralized shared-memory multiprocessor. Multiple processor-cache subsystems share the same physical memory, typically connected by a bus. In larger designs, multiple buses, or even a switch may be used, but the key architectural property—uniform access time to all memory from all processors—remains.
Figure 6.2  The basic architecture of a distributed-memory multiprocessor consists of individual nodes containing a processor, some memory, typically some I/O, and an interface to an interconnection network that connects all the nodes. Individual nodes may contain a small number of processors, which may be interconnected by a small bus or a different interconnection technology, which is less scalable than the global interconnection network.
Distributing the memory among the nodes has two major benefits. First, it is a cost-effective way to scale the memory bandwidth if most of the accesses are to the local memory in the node. Second, it reduces the latency for accesses to the local memory. These two advantages make distributed memory attractive at smaller processor counts as processors get ever faster and require more memory bandwidth and lower memory latency. The key disadvantage for a distributed-memory architecture is that communicating data between processors becomes somewhat more complex and has higher latency, at least when there is no contention, because the processors no longer share a single, centralized memory. As we will see shortly, the use of distributed memory leads to two different paradigms for interprocessor communication.
Typically, I/O as well as memory is distributed among the nodes of the multiprocessor, and the nodes may be small SMPs (two to eight processors). Although the use of multiple processors in a node together with a memory and a network interface may be quite useful from a cost-efficiency viewpoint, it is not fundamental to how these multiprocessors work, and so we will focus on the one-processor-per-node design for most of this chapter.
Models for Communication and Memory Architecture

As discussed earlier, any large-scale multiprocessor must use multiple memories that are physically distributed with the processors. There are two alternative architectural approaches that differ in the method used for communicating data among processors.
In the first method, communication occurs through a shared address space. That is, the physically separate memories can be addressed as one logically shared address space, meaning that a memory reference can be made by any processor to any memory location, assuming it has the correct access rights. These multiprocessors are called distributed shared-memory (DSM) architectures. The term shared memory refers to the fact that the address space is shared; that is, the same physical address on two processors refers to the same location in memory. Shared memory does not mean that there is a single, centralized memory. In contrast to the symmetric shared-memory multiprocessors, also known as UMAs (uniform memory access), the DSM multiprocessors are also called NUMAs (nonuniform memory access), since the access time depends on the location of a data word in memory.
Alternatively, the address space can consist of multiple private address spaces that are logically disjoint and cannot be addressed by a remote processor. In such multiprocessors, the same physical address on two different processors refers to two different locations in two different memories. Each processor-memory module is essentially a separate computer; therefore, these parallel processors have been called multicomputers. A multicomputer can even consist of completely separate computers connected on a local area network, which today are popularly called clusters. For applications that require little or no communication and can make use of separate memories, such clusters of processors, whether using a standardized or customized interconnect, can form a very cost-effective approach (see Section 8.10).
With each of these organizations for the address space, there is an associated communication mechanism. For a multiprocessor with a shared address space, that address space can be used to communicate data implicitly via load and store operations; hence the name shared memory for such multiprocessors. For a multiprocessor with multiple address spaces, communication of data is done by explicitly passing messages among the processors. Therefore, these multiprocessors are often called message-passing multiprocessors.
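To make the first mechanism concrete, the fragment below (our illustration, not from the text, written with C11 atomics so that the flag handshake is well defined) communicates a value from a producer on one processor to a consumer on another purely through ordinary loads and stores to the shared address space; no explicit send or receive appears anywhere.

    /* Implicit shared-memory communication (illustrative sketch). */
    #include <stdatomic.h>

    double shared_value;          /* the data being communicated     */
    atomic_int ready;             /* handshake flag, initially zero  */

    void producer(double v)       /* runs on processor P0 */
    {
        shared_value = v;                                     /* ordinary store */
        atomic_store_explicit(&ready, 1, memory_order_release);
    }

    double consumer(void)         /* runs on processor P1 */
    {
        while (!atomic_load_explicit(&ready, memory_order_acquire))
            ;                                                 /* wait for the data */
        return shared_value;                                  /* ordinary load  */
    }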
In message-passing multiprocessors, communication occurs by sending messages that request action or deliver data, just as with the network protocols discussed in Section 8.2. For example, if one processor wants to access or operate on data in a remote memory, it can send a message to request the data or to perform some operation on the data. In such cases, the message can be thought of as a remote procedure call (RPC). When the destination processor receives the message, either by polling for it or via an interrupt, it performs the operation or access on behalf of the remote processor and returns the result with a reply message. This type of message passing is also called synchronous, since the initiating processor sends a request and waits until the reply is returned before continuing. Software systems have been constructed to encapsulate the details of sending and receiving messages, including passing complex arguments or return values, presenting a clean RPC facility to the programmer.
Communication can also occur from the viewpoint of the writer of data rather than the reader, and this can be more efficient when the processor producing data knows which other processors will need the data. In such cases, the data can be sent directly to the consumer process without having to be requested first. It is often possible to perform such message sends asynchronously, allowing the sender process to continue immediately. Often the receiver will want to block if it tries to receive the message before it has arrived; in other cases, the reader may check whether a message is pending before actually trying to perform a blocking receive. Also the sender must be prepared to block if the receiver has not yet consumed an earlier message and no buffer space is available. The message-passing facilities offered in different multiprocessors are fairly diverse. To ease program portability, standard message-passing libraries (for example, message-passing interface, or MPI) have been proposed. Such libraries sacrifice some performance to achieve a common interface.
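As a minimal sketch of what such a library interface looks like to the programmer (our example; the message size and tag are arbitrary, and a standard MPI implementation is assumed), process 0 below explicitly sends a block of data that process 1 explicitly receives:

    /* Explicit message passing with MPI (illustrative sketch). */
    #include <mpi.h>

    int main(int argc, char **argv)
    {
        int rank;
        double buf[64];
        MPI_Status status;

        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);

        if (rank == 0)        /* sender: data goes directly to the consumer */
            MPI_Send(buf, 64, MPI_DOUBLE, 1, 0, MPI_COMM_WORLD);
        else if (rank == 1)   /* receiver: blocks until the message arrives */
            MPI_Recv(buf, 64, MPI_DOUBLE, 0, 0, MPI_COMM_WORLD, &status);

        MPI_Finalize();
        return 0;
    }

MPI_Send returns once its buffer can be reused; nonblocking variants such as MPI_Isend support the asynchronous, sender-initiated style described above.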
Performance Metrics for Communication Mechanisms

Three performance metrics are critical in any communication mechanism:
1. Communication bandwidth—Ideally the communication bandwidth is limited by processor, memory, and interconnection bandwidths, rather than by some aspect of the communication mechanism. The bisection bandwidth (see Section 8.5) is determined by the interconnection network. The bandwidth in or out of a single node, which is often as important as bisection bandwidth, is affected both by the architecture within the node and by the communication mechanism. How does the communication mechanism affect the communication bandwidth of a node? When communication occurs, resources within the nodes involved in the communication are tied up or occupied, preventing other outgoing or incoming communication. When this occupancy is incurred for each word of a message, it sets an absolute limit on the communication bandwidth. This limit is often lower than what the network or memory system can provide. Occupancy may also have a component that is incurred for each communication event, such as an incoming or outgoing request. In the latter case, the occupancy limits the communication rate, and the impact of the occupancy on overall communication bandwidth depends on the size of the messages.

2. Communication latency—Ideally the latency is as low as possible. As we will see in Chapter 8,

   Communication latency = Sender overhead + Time of flight + Transmission time + Receiver overhead

   Time of flight is fixed and transmission time is determined by the interconnection network. (A worked example with assumed numbers follows this list.) The software and hardware overheads in sending and receiving messages are largely determined by the communication mechanism and its implementation. Why is latency crucial? Latency affects both performance and how easy it is to program a multiprocessor. Unless latency is hidden, it directly affects performance either by tying up processor resources or by causing the processor to wait. Overhead and occupancy are closely related, since many forms of overhead also tie up some part of the node, incurring an occupancy cost, which in turn limits bandwidth. Key features of a communication mechanism may directly affect overhead and occupancy. For example, how is the destination address for a remote communication named, and how is protection implemented? When naming and protection mechanisms are provided by the processor, as in a shared address space, the additional overhead is small. Alternatively, if these mechanisms must be provided by the operating system for each communication, this increases the overhead and occupancy costs of communication, which in turn reduce bandwidth and increase latency.

3. Communication latency hiding—How well can the communication mechanism hide latency by overlapping communication with computation or with other communication? Although measuring this is not as simple as measuring the first two metrics, it is an important characteristic that can be quantified by measuring the running time on multiprocessors with the same communication latency but different support for latency hiding. We will see examples of latency-hiding techniques for shared memory in Sections 6.8 and 6.10. Although hiding latency is certainly a good idea, it poses an additional burden on the software system and ultimately on the programmer. Furthermore, the amount of latency that can be hidden is application dependent. Thus, it is usually best to reduce latency wherever possible.
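As a worked illustration of the latency equation in the second metric above, the short program below plugs in assumed values; they are chosen only to make the arithmetic concrete, not measured from any machine.

    /* Worked example for the communication latency equation (assumed numbers). */
    #include <stdio.h>

    int main(void)
    {
        double sender_overhead   = 1.0;   /* microseconds, assumed */
        double time_of_flight    = 0.5;   /* fixed by distance     */
        double transmission_time = 0.2;   /* set by the network    */
        double receiver_overhead = 1.5;   /* assumed               */

        double latency = sender_overhead + time_of_flight
                       + transmission_time + receiver_overhead;

        printf("Communication latency = %.1f microseconds\n", latency);
        printf("About %.0f instruction slots at 1 instruction/ns\n", latency * 1000.0);
        return 0;
    }

With these assumed values the latency is 3.2 microseconds, so an unhidden remote access on a processor completing one instruction per nanosecond costs on the order of a few thousand instruction issue slots, which is why overhead and occupancy matter so much.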
Each of these performance measures is affected by the characteristics of the communications needed in the application. The size of the data items being communicated is the most obvious, since it affects both latency and bandwidth in a direct way, as well as affecting the efficacy of different latency-hiding approaches. Similarly, the regularity in the communication patterns affects the cost of naming and protection, and hence the communication overhead. In general, mechanisms that perform well with smaller as well as larger data communication requests, and irregular as well as regular communication patterns, are more flexible and efficient for a wider class of applications. Of course, in considering any communication mechanism, designers must consider cost as well as performance.
Advantages of Different Communication Mechanisms

Each of these two primary communication mechanisms has its advantages. For shared-memory communication, advantages include
■ Compatibility with the well-understood mechanisms in use in centralized multiprocessors, which all use shared-memory communication. The OpenMP consortium (see www.openmp.org for description) has proposed a standardized programming interface for shared-memory multiprocessors.

■ Ease of programming when the communication patterns among processors are complex or vary dynamically during execution. Similar advantages simplify compiler design.

■ The ability to develop applications using the familiar shared-memory model, focusing attention only on those accesses that are performance critical.

■ Lower overhead for communication and better use of bandwidth when communicating small items. This arises from the implicit nature of communication and the use of memory mapping to implement protection in hardware, rather than through the I/O system.

■ The ability to use hardware-controlled caching to reduce the frequency of remote communication by supporting automatic caching of all data, both shared and private. As we will see, caching reduces both latency and contention for accessing shared data. This advantage also comes with a disadvantage, which we mention below.
The major advantages for message-passing communication include the following:
■ The hardware can be simpler, especially by comparison with a scalable shared-memory implementation that supports coherent caching of remote data.

■ Communication is explicit, which means it is simpler to understand; in shared-memory models, it can be difficult to know when communication is occurring and when it is not, as well as how costly the communication is.

■ Explicit communication focuses programmer attention on this costly aspect of parallel computation, sometimes leading to improved structure in a multiprocessor program.

■ Synchronization is naturally associated with sending messages, reducing the possibility for errors introduced by incorrect synchronization.

■ It makes it easier to use sender-initiated communication, which may have some advantages in performance.
Of course, the desired communication model can be created on top of a hardware model that supports either of these mechanisms. Supporting message passing on top of shared memory is considerably easier: Because messages essentially send data from one memory to another, sending a message can be implemented by doing a copy from one portion of the address space to another. The major difficulties arise from dealing with messages that may be misaligned and of arbitrary length in a memory system that is normally oriented toward transferring aligned blocks of data organized as cache blocks. These difficulties can be overcome either with small performance penalties in software or with essentially no penalties, using a small amount of hardware support.
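To make the copying observation concrete, here is a minimal sketch (our own; the one-slot mailbox, the fixed size limit, and the C11 atomics are illustrative choices) of a blocking send and receive layered on shared memory:

    /* Message passing built on shared memory: a send is just a copy plus a
     * flag update (illustrative sketch; the mailbox's full flag starts at 0). */
    #include <stdatomic.h>
    #include <stddef.h>
    #include <string.h>

    #define MAX_MSG 256

    struct mailbox {
        atomic_int full;             /* 0 = empty, 1 = message present */
        size_t     len;
        char       data[MAX_MSG];    /* shared between sender and receiver */
    };

    void send_msg(struct mailbox *m, const void *buf, size_t len)
    {
        while (atomic_load_explicit(&m->full, memory_order_acquire))
            ;                        /* block: previous message not yet consumed */
        memcpy(m->data, buf, len);   /* the "message" is a memory-to-memory copy */
        m->len = len;
        atomic_store_explicit(&m->full, 1, memory_order_release);
    }

    size_t recv_msg(struct mailbox *m, void *buf)
    {
        while (!atomic_load_explicit(&m->full, memory_order_acquire))
            ;                        /* block until a message arrives */
        size_t len = m->len;
        memcpy(buf, m->data, len);
        atomic_store_explicit(&m->full, 0, memory_order_release);
        return len;
    }

The memcpy calls are exactly the memory-to-memory copies described above; handling arbitrary lengths and misaligned buffers efficiently is where the cache-block orientation of the memory system complicates matters.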
Supporting shared memory efficiently on top of hardware for message passing is much more difficult. Without explicit hardware support for shared memory, all shared-memory references need to involve the operating system to provide address translation and memory protection, as well as to translate memory references into message sends and receives. Loads and stores usually move small amounts of data, so the high overhead of handling these communications in software severely limits the range of applications for which the performance of software-based shared memory is acceptable. An ongoing area of research is the exploration of when a software-based model is acceptable and whether a software-based mechanism is usable for the highest level of communication in a hierarchically structured system. One possible direction is the use of virtual memory mechanisms to share objects at the page level, a technique called shared virtual memory; we discuss this approach in Section 6.10.
In distributed-memory multiprocessors, the memory model and communication mechanisms distinguish the multiprocessors. Originally, distributed-memory multiprocessors were built with message passing, since it was clearly simpler and many designers and researchers did not believe that a shared address space could be built with distributed memory. Shared-memory communication has been supported in virtually every multiprocessor designed since 1995. What hardware communication mechanisms will be supported in the very largest multiprocessors, called massively parallel processors (MPPs), which typically have far more than 100 processors, is unclear; shared memory, message passing, and hybrid approaches are all contenders. Despite the symbolic importance of the MPPs, such multiprocessors are a small portion of the market and have little or no influence on the mainstream multiprocessors with tens of processors. We will return to a discussion of the possibilities and trends for MPPs in the concluding remarks and historical perspective at the end of this chapter.
SMPs, which we focus on in Section 6.3, vastly dominate DSM multiprocessors in terms of market size (both units and dollars) and will probably be the architecture of choice for on-chip multiprocessors. For moderate-scale multiprocessors (> 8 processors), long-term technical trends favor distributing memory, which is also likely to be the dominant approach when on-chip SMPs are used as the building blocks in the future. These distributed shared-memory multiprocessors are a natural extension of the centralized multiprocessors that dominate the market, so we discuss these architectures in Section 6.5. In contrast, multicomputers or message-passing multiprocessors build on advances in network technology and are described in Chapter 8. Since the technologies employed were well described in the last chapter, we focus our attention on shared-memory approaches in the rest of this chapter.
Challenges of Parallel Processing

Two important hurdles, both explainable with Amdahl's Law, make parallel processing challenging. The first has to do with the limited parallelism available in programs, and the second arises from the relatively high cost of communications. Limitations in available parallelism make it difficult to achieve good speedups in any parallel processor, as our first example shows.
Example

Suppose you want to achieve a speedup of 80 with 100 processors. What fraction of the original computation can be sequential?
Answer

Amdahl's Law is

    Speedup = 1 / ( Fraction_enhanced / Speedup_enhanced + (1 - Fraction_enhanced) )

For simplicity in this example, assume that the program operates in only two modes: parallel with all processors fully used, which is the enhanced mode, or serial with only one processor in use. With this simplification, the speedup in enhanced mode is simply the number of processors, while the fraction of enhanced mode is the time spent in parallel mode. Substituting into the previous equation:

    80 = 1 / ( Fraction_parallel / 100 + (1 - Fraction_parallel) )

Simplifying this equation yields

    0.8 x Fraction_parallel + 80 x (1 - Fraction_parallel) = 1
    80 - 79.2 x Fraction_parallel = 1
    Fraction_parallel = (80 - 1) / 79.2
    Fraction_parallel = 0.9975
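The algebra can be checked numerically; the small program below (our addition, not part of the text) solves the same equation for the required parallel fraction:

    /* Check of the Amdahl's Law example: what fraction must be parallel
     * to reach a given speedup on p processors? */
    #include <stdio.h>

    /* speedup = 1 / (f/p + (1 - f))  implies  f = (1 - 1/speedup) / (1 - 1/p) */
    static double required_parallel_fraction(double speedup, double p)
    {
        return (1.0 - 1.0 / speedup) / (1.0 - 1.0 / p);
    }

    int main(void)
    {
        double f = required_parallel_fraction(80.0, 100.0);
        printf("parallel = %.4f, sequential = %.4f\n", f, 1.0 - f);
        /* prints: parallel = 0.9975, sequential = 0.0025 */
        return 0;
    }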
Thus to achieve a speedup of 80 with 100 processors, only 0.25% of original computation can be sequential. Of course, to achieve linear speedup (speedup of n with n processors), the entire program must usually be parallel with no serial portions. (One exception to this is superlinear speedup that occurs due to the increased memory and c
