`
`Reference 42
`
`PATENT OWNER DIRECTSTREAM LLC
`EX. 2154, p. 1
`
`
`
`Multiprocessors
`
`Thus far we have treated methods for speeding up a single instruction stream.
`Although there is but a single program in execution, the designs discussed
`earlier exploit concurrency within the instruction stream and within individ-
`ual instructions. In this chapter we turn to the discussion of multiprocessors-
`computer systems composed of several independent processors. The mo-
`tivation for moving towards multiple processors is strictly a matter of
`performance because device technology places an upper bound on the speed
`of any single processor. To exceed that bound requires multiple processors.
`The central themes of this chapter are multiprocessor structures and
`performance. Our objective is to show several interesting techniques for or-
`ganizing multiple processors into highly parallel systems and to give insight
`into the potential performance improvements and bottlenecks of such sys-
`tems. Chapter 7 treats software strategies for using the available parallelism
`of these systems.
`
`278
`
`PATENT OWNER DIRECTSTREAM LLC
`EX. 2154, p. 2
`
`
`
`Sec. 6.1
`
`Background
`
`279
`
`6.1 Background
`Our earlier discussions of high-performance machines study two important
`classes of parallelism. Pipeline machines produce high performance by plac-
`ing several stages of a pipeline in operation simultaneously. Machines for
`continuum calculations have multiple processors, each executing the same
`program. In both cases, a single program is used to operate on vectors or
`arrays of data. Flynn [1966] termed this type of parallelism single-instruction
`stream, multiple-data stream (SIMD) parallelism. Recall, for example, an ex-
`treme implementation of this idea in the form of the GF-11, ih which each of
`576 processors executes identical instructions broadcast to them by a single
`control unit.
`Another SIMD machine with massive parallelism is the Connection Ma-
`chine [Hillis 1986] with 64K 1-bit processors. The architect is truly fortunate
`when an application can be executed on machines that are built around the
`lock-step parallelism required for SIMD machines because the architecture
`efficiently executes programs well suited to SIMD execution.
`High performance on such machines requires rewriting conventional al-
`gorithms to manipulate many data simultaneously by means of instructions
`broadcast to all processors. Although programming for these machines can
`be difficult in principle, in the ideal case, a serial algorithm can be converted
`to an SIMD algorithm by replacing each inner loop with a single broadcast
`instruction that implements the complete loop. The fact that an important,
`but limited, class of problems fits this model extremely well has provided the
`impetus for the design and construction of these machines.
`Clearly, some large problems do not lend themselves to efficient exe-
`cution in an SIMD architecture. The operations required for such problems
`cannot easily be organized into repetitive operations on uniformly structured
`data. They tend to be unstructured and unpredictable. Addressing patterns
`tend to be data dependent, so the architecture cannot easily preload data by
`anticipating future accesses.
`The architect who must attain high performance for such problems inevi-
`tably looks for a solution in a multiprocessor structure. Such an architecture
`is composed of several independent computers, each capable of executing its
`own program. Flynn [1966] calls this type of architecture multiple-instruction
`stream, multiple-data stream, (MIMD) architecture. The processors of a multi-
`processor are interconnected in some fashion to permit programs to exchange
`data and synchronize activities.
`A model of such an architecture is shown in Fig. 6.1. In this figure each
`processor has registers, arithmetic and logic units, and access to memory and
`input/output modules. In Fig. 6. l(a) we show the memory and input/output
`systems as separate subsystems shared among all of the processors. Figure
`6.l(b) shows the memory and input/output units attached to individual pro-
`
`PATENT OWNER DIRECTSTREAM LLC
`EX. 2154, p. 3
`
`
`
`280
`
`Multiprocessors
`
`Chap. 6
`
`Processor 1
`
`Processor 2
`
`Processor N
`
`Processor 1
`
`Processor 2
`
`Processor N
`
`Memory
`
`Memory
`
`Memory
`
`1/0
`
`1/0
`
`Interconnection
`Network
`
`Interconnection
`Network
`
`(a)
`
`Memory
`
`Memory
`
`Memory
`
`1/0
`
`1/0
`
`1/0
`
`(b)
`Fig. 6.1 Two multiprocessor structures:
`(a) All memory and 110 are remote and shared; and
`(b) All memory and 1/0 are local and private.
`
`cessors. No sharing of memory and input/output is permitted in Fig. 6.l(b). In
`both cases, because the system contains multiple processors, each capable of
`executing an independent program, the system fits Flynn's MIMD model.
`In both systems depicted in Fig. 6.1 the processors cooperate by ex-
`changing data through the interconnection system and by synchronizing
`activities. The shared memory in Fig. 6.l(a) provides a convenient means for
`information interchange and synchronization since any pair of processors
`can communicate through a shared location. The structure in Fig. 6.l(b)
`supports communication through point-to-point exchange of information.
`Obviously, multiprocessors can have any reasonable combination of shared
`global memory or private local memory. Fig. 6.1 shows the extremes in the
`
`PATENT OWNER DIRECTSTREAM LLC
`EX. 2154, p. 4
`
`
`
`Sec. 6.1
`
`Background
`
`281
`
`design space, and practical designs lie at the extremes or anywhere in be-
`tween.
`The main purpose of a high-speed multiprocessor is to complete a job
`faster by using several machines concurrently than can be done by using a
`single copy of the same machine. In some applications, the main purpose for
`using multiple processors is for reliability rather than high performance. The
`idea is that if any single processor fails, its workload can be performed by
`other processors in the system. Since the design principles of such systems
`are quite different from the principles that guide the design of high-
`performance systems, we do not address design for reliability in this text, but
`rather we limit our attention to issues related to performance.
`When a multiprocessor is operating at peak performance, all processors
`are engaged in useful work. No processor is idle, and no processor is executing
`an instruction that would not be executed if the same algorithm were exe-
`cuting on a single processor. In this state of peak performance, all N pro-
`cessors of a multiprocessor are contributing to effective performance, and the
`processing rate is increased by a factor of N.
`Peak performance is a very special state that is rarely achievable. There
`are several factors that introduce inefficiency. Among the factors are:
`• The delays introduced by interprocessor communications;
`• The overhead in synchronizing the work of one processor with another;
`• Lost efficiency when one or more processors run out of tasks;
`• Lost efficiency due to wasted effort by one or more processors; and
`• The processing costs for controlling the system and scheduling oper-
`ations.
`Both scheduling and synchronization are sources of overhead on serial
`machines. In citing these factors together with the other factors, we are citing
`how they degrade multiprocessor performance beyond the effects that may
`already be present on individual processors.
`A high-performance vector processor is free from many of the problems,
`but it does suffer from lost performance because it is unable to keep all of the
`processing units busy. This latter problem arises particularly when a
`computation is not easily implemented as a sequence of vector operations
`performed on highly structured, densely stored data.
`The architect who designs and builds a multiprocessor must pay close
`attention to the sources of inefficiency exposed here. They can lead to serious
`degradation in performance. For example, if the combined inefficiencies pro-
`duce an effective processing rate of only ten percent of the peak rate, then ten
`processors are required in a multiprocessor system just to do the work of a
`single processor.
`
`PATENT OWNER DIRECTSTREAM LLC
`EX. 2154, p. 5
`
`
`
`282
`
`Multiprocessors
`
`Chap.6
`
`Fortunately, for a small number of processors, careful design can hold the
`inefficiency to a low figure, but inefficiencies tend to climb as the number of
`processors increase. There is a point where adding additional processors can
`lengthen, not shorten, computation time.
`The fact that inefficiency tends to grow with the number of processors is
`the underlying reason why many commercial offerings of multiprocessors
`have a small number of processors, such as 4, 8, or 16. The fastest machines
`are built from the fastest devices available and have relatively few processors.
`Consider, for example, the Cray XMP, a four-processor version of the Cray
`I. Another example is the IBM 309X family for which from one to six pro-
`cessor systems are available. Both of these implementations start with very
`high-speed devices and use architectural techniques such as cache and pipe-
`lining to produce very high-performance single processors for their respective
`markets.
`Users of these machines may have workloads or individual problems
`whose needs exceed the capacity of a single machine. Additional performance
`is not readily available from faster versions of the same machine because the
`machines are already at the limits imposed by architecture and device tech-
`nology. An effective way to attain small multiples of performance im-
`provement is to group together two or four identical processors.
`Some computer architects take note of a cost characteristic mentioned in
`Chapter 1. The discussion there indicates that high-speed device technology
`is much more expensive than lower-speed technology.
`Moreover, with today's devices the cost of fast devices tends to grow faster
`than the performance benefit of the increased device speed. Hence, the cost
`per unit of computing power tends to be greater for high-end machines than
`for low-end machines, although this trend is technology dependent and could
`change over time. Nevertheless, when lower-speed technology has a cost
`advantage, we have an opportunity to create a cost-effective high-
`performance system by combining hundreds or thousands of slow-speed pro-
`cessors built with low-cost devices.
`The cost advantage of using low-cost technology is balanced by the deg-
`radation in efficiency that inevitably occurs as the number of processors
`increases. If the degradation due to the large number of processors exceeds
`the cost advantage of the low-cost technology, then there is no particular
`advantage to using hundreds of slow processors over using a few very fast
`processors.
`Moreover, the complexity of programming a machine with hundreds of
`processors far exceeds the complexity of programming a single processor or a
`computer system with just a few processors. Consequently, although eco-
`nomics might enhance the attractiveness of a machine with hundreds of
`low-speed computers, the advantage of this structure disappears if efficiency
`is not held high.
`
`PATENT OWNER DIRECTSTREAM LLC
`EX. 2154, p. 6
`
`
`
`Sec. 6.2
`
`Multiprocessor Performance
`
`283
`
`Thus, there is no particular magic in the parallelism of a multiprocessor.
`The parallelism yields a useful benefit when it successfully produces higher
`performance. When the parallelism cannot be tapped effectively, it simply
`adds to the system cost and complexity. In such a case, the end user is best
`served by reducing the parallelism to a point where the parallelism available
`can be used effectively. Whether there are ten, 1,000, or one million pro-
`cessors, it is bad practice to squander processing power. The argument that
`"processors are cheap" is irrelevant if, by using fewer processors, per-
`formance goes up.
`In the next section we address the question of efficiency more carefully,
`especially considering the ratio of the time spent executing useful in-
`structions compared to the time spent communicating with other processors.
`
`6.2 Multiprocessor Performance
`The point of this section is to analyze the performance benefits of multiple
`processors in the face of overhead incurred to create parallelism. The models
`studied are variations of models introduced by Indurkhya, Stone, and Xi-
`Cheng [1986].
`This section shows that performance benefits strongly depend on the
`ratio RIC, where R is the length of a run-time quantum and C is the length of
`communications overhead produced by that quantum. The ratio expresses
`how much overhead is incurred per unit of computation. When the ratio is
`very low, it becomes unprofitable to use parallelism. When the ratio is very
`high, parallelism is potentially profitable. Note that a large ratio can be
`obtained by partitioning a computing job into relatively few large pieces, and
`that the amount of parallelism for such a ratio might be much smaller than
`the maximum available.
`The ratio RIC is a measure of task granularity:
`• In coarse-grain parallelism, RIC is relatively high, so each unit of
`computation produces a relatively small amount of communication; and
`• In fine-grain parallelism, RIC is very low, so there is a relatively large
`amount of communication and other overhead per unit of computation.
`Coarse-grain parallelism arises when individual tasks are large and over-
`head can be amortized over many computational cycles. Fine-grain
`parallelism usually provides opportunities to perform execution on many
`more processors than can fruitfully support coarse-grained parallelism. The
`idea of fine-grain parallelism is to partition a program into increasingly
`smaller tasks that can run in parallel. At the ultimate limit, each individual
`task may be as small as a single operation. More commonly, however, a
`fine-grained task contains a small number of instructions.
`
`PATENT OWNER DIRECTSTREAM LLC
`EX. 2154, p. 7
`
`
`
`284
`
`Multiprocessors
`
`Chap.6
`
`The programmer seeking maximum performance is strongly tempted to
`partition a problem into the finest possible granularity to create the max-
`imum amount of parallelism. But if the maximum parallelism also has the
`maximum overhead, it is not clear that maximum parallelism leads to the
`fastest solution.
`The main reason for the presentation of the performance models in this
`section is to show the pervasive role of the RIC ratio on performance. The
`discussion that follows shows how a fine-grain partition that happens to have
`a low RIC ratio produces poorer performance than a much coarser partition
`with a higher RIC ratio. Hence the much higher parallelism of the fine-grain
`partition need not produce higher net speed.
`The purpose of presenting a number of different performance models to
`make this point is that no one model is truly representative of multi-
`processors or of multiprocessor algorithms. We consider a number of differ-
`ent variations of the basic model to cover a variety of program behaviors and
`multiprocessor architectures. In every case, the role of RIC is the same. Small
`ratios lead to poor performance because of high overhead. Large ratios usu-
`ally reflect poor exploitation of parallelism. For maximum performance, it is
`necessary to balance parallelism against overhead. The only difference from
`model to model is the point where the two factors become balanced.
`Architects have long debated the relative qualities of fine and course
`granularity. For SIMD machines, the GF-11 is a coarse-grained machine
`whose individual processors can sustain a peak rate as high as 20 Mflops. The
`Connection Machine is an SIMD machine whose 1-bit processors are better
`suited to fine-grained tasks and whose performance stems from the massive
`number of processors rather than from the computational power of
`individual processor.
`What reasoning led the architects of one machine to seek such a vastly
`different solution than did the architects of the other machine? The range of
`applications is the primary motivation for the difference. The Connection
`Machine is designed to exploit parallelism of tasks such as image analysis, in
`which a significant portion of the work is characterized by fine-grained tasks.
`The GF-11, which is designed for much larger-grained tasks, would be bur-
`dened by overhead if the tasks carried the additional overhead attributable to
`fine granularity. Thus the architects of each machine attempted to match
`granularity to the applications for the machine.
`At one end of the multiprocessor scale are the Cray multiprocessors, such
`as the Cray XMP-a four-processor system in which each processor is a Cray I
`supercomputer. Under ideal circumstances, communication in this system
`occurs only at the end of major phases, which might well be every few million
`or few billion instructions.
`Smaller granularity is evident on microprocessor-based multiprocessors
`such as the Cosmic Cube and a number of commercial versions of this
`
`PATENT OWNER DIRECTSTREAM LLC
`EX. 2154, p. 8
`
`
`
`Sec. 6.2
`
`Multiprocessor Performance
`
`285
`
`hypercube-based design. These machines typically use 64 to 256 copies of a
`high-performance 32-bit microprocessor. The different granularity biases the
`machines somewhat to different application programs.
`The remainder of this section is devoted to performance models. In each
`model, observe how the ratio RIC determines the strategy that achieves the
`optimum performance. To simplify the models, we have generally ignored the
`effects of synchronization and contention except as crudely approximated by
`the models. In practical systems, the effects ignored here tend to lower
`performance from that predicted by these models. In most instances, the best
`way to compensate for the unmodeled effects is to increase the granularity of
`tasks.
`
`6.2.1 The Basic Model-Two Processors with Unoverlapped
`Communications
`For the first model, consider an application program that contains M tasks.
`Our objective is to execute this program at maximum speed on a system with
`N processors. For simplicity, we first consider a system with just two pro-
`cessors and then let the number of processors increase. To model per-
`formance we need to characterize the combination of execution time and
`overhead that will be incurred.
`Let us make the following assumptions to obtain our initial results. Sub-
`sequently we relax the assumptions and see how the performance changes.
`Specifically, we assume that:
`1. Each task executes in R uni ts of time; and
`2. Each task communicates with every other task at an overhead cost of C
`units of time when the communicating tasks are not on the same
`processor, and at no cost when the communicating tasks are coresident.
`We have various choices of how to execute such an application on a
`two-processor system. We can assign all tasks to one processor and ignore the
`second processor, which is a solution that minimizes communication
`overhead but fails to take advantage of available parallelism, or we can
`partition the tasks to the two processors in any combination. If the tasks are
`split across the processors, then the total execution time is a combination of
`the time spent in execution and the time spent engaged in overhead activities.
`Although we use the notation C as if C were exclusively due to communica-
`tion, it is convenient to lump overhead from all sources into C.
`To some extent, overhead can be overlapped with computation, especially
`if processors can perform communication through input/output ports while
`executing concurrently. However, not all sources of overhead can be hidden
`by overlapping with computation. Processors can contend for shared data or
`
`PATENT OWNER DIRECTSTREAM LLC
`EX. 2154, p. 9
`
`
`
`286
`
`Multiprocessors
`
`Chap.6
`
`shared communication paths, and they may be idle during synchronization
`periods. Therefore, we assume that some portion of overhead operations
`lengthen total processing time because overhead cannot be fully overlapped
`with computation. In this case the equation that describes total processing
`time is the following:
`Execution time= R Max(M -k,k) + C(M - k)k
`(6.1)
`Equation (6.1) expresses execution time as the sum of two terms, one attrib-
`uted to run time and one to communication and other overhead. The run time
`for two processors is the larger of the run times experienced and is therefore
`the larger of R (M - k) or Rk when k tasks are assigned to one processor and
`M - k to the other. The second term models overhead to be proportional to
`the number of pair-wise communications that must take place as a function
`of how tasks are partitioned to the two processors. Note that the first term is a
`linear function of k, and the second term is a quadratic function of k.
`What is the minimum execution time for Eq. (6.1) as a function of k? That
`is, how shall we assign tasks to two processors to produce the minimum
`execution time? Figure 6.2 shows a graphic way of finding a solution. The
`answer for this model is to assign all tasks to one processor if RIC is below
`Ml2, or split the tasks evenly between two processors if RIC exceeds that
`threshold. That is, either k = 0ork=M12. (If k is odd, then make k as close to
`Ml2 as possible.)
`Figure 6.2 shows the two different cases that arise for the different values
`of the RIC ratio. The first term of Eq. (6 .1) is piece-wise linear, and Fig. 6 .2(a)
`shows that this term looks like the letter V because it is symmetric at about
`the point k = M 12. In this figure, when the piece-wise linear term is added to
`the quadratic term, the resulting figure has a minimum at Ml2.
`In Fig. 6.2(b), the minimum occurs at k = 0. The minimum has to be at an
`extreme point in the region 0::; k::; M 12 because the quadratic curve k(M - k)
`is concave downward, and, after adding a linear term to this curve, the
`concavity is unchanged. A curve that is concave downward has its minimum
`at one of its endpoints. The endpoint of the curve at k = 0 (or at k = M) is the
`minimum when RIC< M 12; otherwise the minimum occurs at k = M 12.
`
`6.2.2 Extension to N Processors
`Now let's consider what happens when there are N processors. In this case,
`we assign k; tasks to the ith processor. The generalization of Eq. (6.1) becomes
`Execution time= R Max (k;) + Lk;(M - k;)
`
`(6.2)
`
`I
`
`= R Max (k;) +
`
`M 2
`
`-
`
`PATENT OWNER DIRECTSTREAM LLC
`EX. 2154, p. 10
`
`
`
`Sec. 6.2
`
`Multiprocessor Performance
`
`287
`
`Q) 80
`E
`i= 70
`r::
`.Q 60
`:;
`50
`>< w 40
`..... 30
`20
`10
`0
`
`60
`
`50
`Q) E
`i= 40
`r:: 0 :s 30
`u Q) >< w co 20
`
`10
`
`10
`
`30
`20
`Partition Parameter k
`(a)
`
`40
`
`M = 50
`RIC= 40
`
`Run Time
`
`0
`
`10
`
`20
`30
`Partition Parameter k
`(b)
`Fig. 6.2 Parallel execution time for two different RIC ratios:
`(a) Optimum partition parameter k = O; and
`(b) Optimum partition parameter k = M/2.
`
`40
`
`The first term counts the longest running time among the N execution times.
`To that time is added the overhead from the second term. That term counts
`the number of distinct pair-wise links between k; tasks and M - k; tasks, each
`of which contributes an amount C to the total time. The second term in Eq.
`(6.2) is quadratic just as in Eq. (6.1).
`
`PATENT OWNER DIRECTSTREAM LLC
`EX. 2154, p. 11
`
`
`
`288
`
`Multiprocessors
`
`Chap.6
`
`If the reasoning used to analyze Eq. (6.1) holds for this equation, then we
`expect that the minimum value is for an extreme assignment, and indeed this
`is the case. Either all tasks are assigned to a single processor, or they are
`distributed "evenly" across all processors. By "evenly," we mean that if Mis a
`multiple of N, then each processor receives MIN tasks. Otherwise, all but one
`processor receives the integer ceiling of MIN tasks, and one processor receives
`whatever is left over. This assignment does not necessarily use all N pro-
`cessors. For example, when there are 19 tasks and six processors, the assign-
`ment places four tasks on four processors and three tasks on a fifth processor,
`leaving no tasks assigned to the sixth processor.
`To show that the even distribution produces a local minimum, assume
`that k1 has the maximum number of tasks assigned to it, and show that an
`assignment in which two processors receive fewer than k1 tasks can be
`changed to an assignment with a lower cost, as computed by Eq. (6.2).
`For example, assume that both k2 and k3 satisfy k1 > k2 ;:::::: k3 ;:::::: 1. Consider
`the assignment that shifts one task from the third processor to the second
`processor and examine how the cost changes as per Eq. (6.2). The first term
`does not change because the change does not affect the maximum number of
`tasks assigned to a processor. The value of the second term is reduced, how-
`ever, by the amount C(k2 - k3 + 1). This assignment produces higher per-
`formance, and we can iterate this improvement process until no more than
`one processor has less than the maximum number of tasks assigned to it.
`Equation (6.2) has a threshold for an assignment, just as Eq. (6.1) has, and
`by a remarkable coincidence the thresholds are identical! We must compare
`the even assignment of tasks to the assignment that places all tasks on one
`processor. The latter assignment is preferred when RIC is sufficiently small.
`The difference in costs of the "even" distribution to N processors and a
`1-processor assignment is given by
`RM CM 2 CM 2
`+ -- - -- - RM
`Time difference= -
`(6.3)
`N
`2N
`2
`where the first three terms form the cost of the even distribution of tasks and
`the last term is the cost of assigning all tasks to one processor.
`To simplify the analysis, we have ignored values of M that are not exact
`multiples of N. To solve for the threshold value of RIC, we set the value of Eq.
`(6.3) to 0. By removing a factor of Mand then grouping terms by coefficients R
`and C, we can remove another factor of (1 -
`l!N). This yields the equation
`- R = 0
`Time difference=
`(6.4)
`
`or
`
`R M
`c
`2
`
`(6.5)
`
`PATENT OWNER DIRECTSTREAM LLC
`EX. 2154, p. 12
`
`
`
`Sec. 6.2
`
`Multiprocessor Performance
`
`289
`
`This model shows that if RIC is greater than the threshold M 12, then an even
`distribution of tasks to as many processors as are available will produce the
`best time. On the other hand, if RIC is below that threshold, then no matter
`how many processors are available, no assignment produces a faster time
`than the assignment that uses only one processor. Here is a situation in which
`the role of overhead becomes quite clear.
`Unless overhead is kept below a certain percentage of execution time,
`parallel execution cannot be beneficial. If this model holds for a parallel
`algorithm and architecture, then the control of overhead costs is absolutely
`essential for parallelism to be successful.
`Although this analysis has looked at performance rather than costs, RIC
`determines the point at which parallelism is cost-effective. Even when RIC is
`sufficiently high to warrant parallelism, the performance gain is diminished
`by the second term of Eq. (6.2). The speedup attributable to parallelism is the
`ratio of the time to run on one processor to the time expressed by Eq. (6.2).
`This is approximately
`
`(6.6)
`
`- 1))
`
`RM
`Speedup = -----'-"'-----
`_ CM 2
`(RM+ CM 2
`)
`2
`N
`2N
`R
`+ CM(l 2- l!N))
`RN
`c
`+ M(N
`2
`If the first term of the denominator is large compared with the second, then
`the speedup is proportional to N. This requires M and N to be small and for
`RIC to be large. If parallelism is increased to the extent that the denominator
`is dominated by its second term because N is very large, the speedup is
`proportional to RICM, which does not depend on the number of processors.
`Hence, as N increases, the speedup approaches a constant asymptote.
`At this point each processor added to the system brings extra cost while
`yielding negligible performance benefit. Even though performance can im-
`prove incrementally as processors are added, the diminishing returns in per-
`formance are not worth the added cost. The number of processors should not
`be increased beyond some maximum that is a function of cost and the ratio
`RIC.
`This model is a general picture of how granularity and overhead affect the
`performance gain of a multiprocessor, and it gives some indication of the
`
`PATENT OWNER DIRECTSTREAM LLC
`EX. 2154, p. 13
`
`
`
`290
`
`Multiprocessors
`
`Chap.6
`
`importance of minimizing overhead and selecting the right granularity. It is
`only one model, however, and it cannot encompass the full spectrum of actual
`applications.
`Let us alter the model in various ways and observe how the findings
`change. In general, we discover that RIC plays a critical role, regardless of the
`model. In some cases, there is the same type of threshold in which the best
`solutions are extreme. That is, use all available processors or just one pro-
`cessor, depending on the value of RIC. In some models, the extreme solutions
`are not the best. The best solutions for these models distribute work among
`several processors, but do not use all processors because the use of too many
`leads to performance degradation and extra cost. Moreover, in the general
`case, work need not be distributed evenly to achieve the optimum per-
`formance.
`
`6.2.3 A Stochastic Model
`Consider what happens when all tasks are not equal in execution time. The
`leading term in Eq. (6.2) is smallest when all processors run for equal lengths
`of time, so the objective is to scatter tasks among processors so that all
`processors are occupied for equal times. If this is not possible, the maximum
`running time among the processors should be as short as possible.
`The second term in Eq. (6.2) is smallest when tasks are distributed as
`unevenly as possible. Consequently, among all ways of distributing tasks to
`processors so that processors have nearly equal running times, find a distri-
`bution in which the number of tasks assigned to each processor is as uneven
`as possible. That is, find schemes that assign as few or as many tasks per
`processor as possible, subject to the requirement that the total workload on a
`processor be equal to a given amount.
`In this model, the best assignment need not be the most evenly distrib-
`uted workload. If the workload is slightly uneven, it may become possible to
`assign tasks to processors in such a way that overhead is greatly diminished.
`That is, a small increase in the linear first term of Eq.(6.2) can be more than
`balanced by a large decrease in the quadratic second term.
`A stochastic variation of the deterministic model presented here appears
`in Indurkhya, Stone, and Xi-Cheng [1986]. Instead of having all execution and
`communication times as fixed constants, the model assumes that the times
`are independent and identically distributed random variables with a mean R
`for the running times and a mean C for the communication times. To solve the
`model, lndurkhya et al. appeal to the Central Limit Theorem and the addi-
`tional assumption that
`
`PATENT OWNER DIRECTSTREAM LLC
`EX. 2154, p. 14
`
`
`
`Sec. 6.2
`
`Multiprocessor Performance
`
`291
`
`The E in Eq. (6.7) denotes the expected value. Equation (6.7) says that the
`maximum of a set of expected values of sums of independent and identically
`distributed random variables r;, the running times of the tasks, is equal to the
`expected value of the maximum of the sums. With these two assumptions, the
`model reduces to the deterministic model expressed by Eq.(6.2), and the
`results are identical.
`The assumption underlying Eq. (6.7) is actually false, as is stated by
`Indurkhya et al., but the point is that when the equation breaks down, it is
`close enough to being correct that the results produced are reasonably accu-
`rate. If one of the summations in Eq. (6.7) has many more summands than
`any other, then almost surely it has the maximum expected value, and its
`expected value is the value of both sides of Eq. (6.7). If two or more sum-
`mations have almost the same number of terms, and this number is max-
`imum among all equations, then it is possible for the left-hand side of Eq.
`(6.7) to select one summation and the right-hand of Eq. (6.7) to select another
`summation, but the values of summations will be fairly close, so that Eq. (6.7)
`is approximately if not exactly correct.
`Nicol [1986] explored the model more deeply and discovered that the
`results reported by Indurkhya et al. can be proved to be true in some instances
`without relying on Eq. (6.7). Indeed, the model appears to be robust in the
`sense that small perturbations in the underlying assumptions do not alter the
`gross conclusions from the model, a