Multiprocessors

Thus far we have treated methods for speeding up a single instruction stream. Although there is but a single program in execution, the designs discussed earlier exploit concurrency within the instruction stream and within individual instructions. In this chapter we turn to multiprocessors: computer systems composed of several independent processors. The motivation for moving toward multiple processors is strictly a matter of performance, because device technology places an upper bound on the speed of any single processor. To exceed that bound requires multiple processors.

The central themes of this chapter are multiprocessor structures and performance. Our objective is to show several interesting techniques for organizing multiple processors into highly parallel systems and to give insight into the potential performance improvements and bottlenecks of such systems. Chapter 7 treats software strategies for using the available parallelism of these systems.
6.1 Background

Our earlier discussions of high-performance machines studied two important classes of parallelism. Pipeline machines produce high performance by placing several stages of a pipeline in operation simultaneously. Machines for continuum calculations have multiple processors, each executing the same program. In both cases, a single program is used to operate on vectors or arrays of data. Flynn [1966] termed this type of parallelism single-instruction stream, multiple-data stream (SIMD) parallelism. Recall, for example, an extreme implementation of this idea in the form of the GF-11, in which each of 576 processors executes identical instructions broadcast to them by a single control unit.

Another SIMD machine with massive parallelism is the Connection Machine [Hillis 1986] with 64K 1-bit processors. The architect is truly fortunate when an application can be executed on machines built around the lock-step parallelism required for SIMD machines, because such an architecture efficiently executes programs well suited to SIMD execution.
High performance on such machines requires rewriting conventional algorithms to manipulate many data simultaneously by means of instructions broadcast to all processors. Although programming for these machines can be difficult, in principle, in the ideal case, a serial algorithm can be converted to an SIMD algorithm by replacing each inner loop with a single broadcast instruction that implements the complete loop. The fact that an important, but limited, class of problems fits this model extremely well has provided the impetus for the design and construction of these machines.

Clearly, some large problems do not lend themselves to efficient execution in an SIMD architecture. The operations required for such problems cannot easily be organized into repetitive operations on uniformly structured data. They tend to be unstructured and unpredictable. Addressing patterns tend to be data dependent, so the architecture cannot easily preload data by anticipating future accesses.
The architect who must attain high performance for such problems inevitably looks for a solution in a multiprocessor structure. Such an architecture is composed of several independent computers, each capable of executing its own program. Flynn [1966] calls this type of architecture multiple-instruction stream, multiple-data stream (MIMD) architecture. The processors of a multiprocessor are interconnected in some fashion to permit programs to exchange data and synchronize activities.
A model of such an architecture is shown in Fig. 6.1. In this figure each processor has registers, arithmetic and logic units, and access to memory and input/output modules. In Fig. 6.1(a) we show the memory and input/output systems as separate subsystems shared among all of the processors. Figure 6.1(b) shows the memory and input/output units attached to individual processors.
[Figure 6.1: two block diagrams. Each shows Processor 1 through Processor N, memory modules, I/O modules, and an interconnection network, arranged as described in the caption below.]

Fig. 6.1 Two multiprocessor structures:
(a) All memory and I/O are remote and shared; and
(b) All memory and I/O are local and private.

No sharing of memory and input/output is permitted in Fig. 6.1(b). In both cases, because the system contains multiple processors, each capable of executing an independent program, the system fits Flynn's MIMD model.

In both systems depicted in Fig. 6.1 the processors cooperate by exchanging data through the interconnection system and by synchronizing activities. The shared memory in Fig. 6.1(a) provides a convenient means for information interchange and synchronization, since any pair of processors can communicate through a shared location. The structure in Fig. 6.1(b) supports communication through point-to-point exchange of information. Obviously, multiprocessors can have any reasonable combination of shared global memory and private local memory. Fig. 6.1 shows the extremes in the
design space, and practical designs lie at the extremes or anywhere in between.
The main purpose of a high-speed multiprocessor is to complete a job faster by using several machines concurrently than can be done by using a single copy of the same machine. In some applications, the main purpose for using multiple processors is reliability rather than high performance. The idea is that if any single processor fails, its workload can be performed by other processors in the system. Since the design principles of such systems are quite different from the principles that guide the design of high-performance systems, we do not address design for reliability in this text, but rather limit our attention to issues related to performance.
When a multiprocessor is operating at peak performance, all processors are engaged in useful work. No processor is idle, and no processor is executing an instruction that would not be executed if the same algorithm were executing on a single processor. In this state of peak performance, all N processors of a multiprocessor are contributing to effective performance, and the processing rate is increased by a factor of N.

Peak performance is a very special state that is rarely achievable. Several factors introduce inefficiency. Among them are:

• The delays introduced by interprocessor communications;
• The overhead in synchronizing the work of one processor with another;
• Lost efficiency when one or more processors run out of tasks;
• Lost efficiency due to wasted effort by one or more processors; and
• The processing costs for controlling the system and scheduling operations.
Both scheduling and synchronization are sources of overhead on serial machines as well. In citing these factors together with the others, we are citing how they degrade multiprocessor performance beyond the effects that may already be present on individual processors.

A high-performance vector processor is free from many of these problems, but it does suffer lost performance when it is unable to keep all of its processing units busy. This problem arises particularly when a computation is not easily implemented as a sequence of vector operations performed on highly structured, densely stored data.
The architect who designs and builds a multiprocessor must pay close attention to the sources of inefficiency exposed here, for they can lead to serious degradation in performance. For example, if the combined inefficiencies produce an effective processing rate of only ten percent of the peak rate, then ten processors are required in a multiprocessor system just to do the work of a single processor.
Fortunately, for a small number of processors, careful design can hold the inefficiency to a low figure, but inefficiencies tend to climb as the number of processors increases. There is a point where adding processors can lengthen, not shorten, computation time.

The fact that inefficiency tends to grow with the number of processors is the underlying reason why many commercial multiprocessor offerings have a small number of processors, such as 4, 8, or 16. The fastest machines are built from the fastest devices available and have relatively few processors. Consider, for example, the Cray X-MP, a four-processor version of the Cray-1. Another example is the IBM 309X family, for which systems with one to six processors are available. Both of these implementations start with very high-speed devices and use architectural techniques such as caches and pipelining to produce very high-performance single processors for their respective markets.
Users of these machines may have workloads or individual problems whose needs exceed the capacity of a single machine. Additional performance is not readily available from faster versions of the same machine because the machines are already at the limits imposed by architecture and device technology. An effective way to attain small multiples of performance improvement is to group together two or four identical processors.
Some computer architects take note of a cost characteristic mentioned in Chapter 1: high-speed device technology is much more expensive than lower-speed technology. Moreover, with today's devices the cost of fast devices tends to grow faster than the performance benefit of the increased device speed. Hence, the cost per unit of computing power tends to be greater for high-end machines than for low-end machines, although this trend is technology dependent and could change over time. Nevertheless, when lower-speed technology has a cost advantage, we have an opportunity to create a cost-effective high-performance system by combining hundreds or thousands of slow processors built with low-cost devices.
The cost advantage of using low-cost technology is balanced by the degradation in efficiency that inevitably occurs as the number of processors increases. If the degradation due to the large number of processors exceeds the cost advantage of the low-cost technology, then there is no particular advantage to using hundreds of slow processors over using a few very fast processors.

Moreover, the complexity of programming a machine with hundreds of processors far exceeds the complexity of programming a single processor or a computer system with just a few processors. Consequently, although economics might enhance the attractiveness of a machine with hundreds of low-speed computers, the advantage of this structure disappears if efficiency is not held high.
Thus, there is no particular magic in the parallelism of a multiprocessor. The parallelism yields a useful benefit when it successfully produces higher performance. When the parallelism cannot be tapped effectively, it simply adds to the system cost and complexity. In such a case, the end user is best served by reducing the parallelism to the point where the parallelism available can be used effectively. Whether there are ten, 1,000, or one million processors, it is bad practice to squander processing power. The argument that "processors are cheap" is irrelevant if, by using fewer processors, performance goes up.

In the next section we address the question of efficiency more carefully, especially the ratio of the time spent executing useful instructions to the time spent communicating with other processors.

6.2 Multiprocessor Performance

The point of this section is to analyze the performance benefits of multiple processors in the face of the overhead incurred to create parallelism. The models studied are variations of models introduced by Indurkhya, Stone, and Xi-Cheng [1986].

This section shows that the performance benefits depend strongly on the ratio R/C, where R is the length of a run-time quantum and C is the length of the communications overhead produced by that quantum. The ratio expresses how much overhead is incurred per unit of computation. When the ratio is very low, it is unprofitable to use parallelism. When the ratio is very high, parallelism is potentially profitable. Note that a large ratio can be obtained by partitioning a computing job into relatively few large pieces, and that the amount of parallelism for such a ratio might be much smaller than the maximum available.
The ratio R/C is a measure of task granularity:

• In coarse-grain parallelism, R/C is relatively high, so each unit of computation produces a relatively small amount of communication; and
• In fine-grain parallelism, R/C is very low, so there is a relatively large amount of communication and other overhead per unit of computation.
Coarse-grain parallelism arises when individual tasks are large and overhead can be amortized over many computational cycles. Fine-grain parallelism usually provides opportunities to spread execution over many more processors than coarse-grain parallelism can fruitfully use. The idea of fine-grain parallelism is to partition a program into ever smaller tasks that can run in parallel. At the ultimate limit, each individual task may be as small as a single operation. More commonly, however, a fine-grained task contains a small number of instructions.
The programmer seeking maximum performance is strongly tempted to partition a problem into the finest possible granularity to create the maximum amount of parallelism. But if the maximum parallelism also carries the maximum overhead, it is not clear that maximum parallelism leads to the fastest solution.
The main reason for presenting the performance models in this section is to show the pervasive role of the R/C ratio in performance. The discussion that follows shows how a fine-grain partition that happens to have a low R/C ratio produces poorer performance than a much coarser partition with a higher R/C ratio. Hence the much higher parallelism of the fine-grain partition need not produce higher net speed.
The purpose of presenting a number of different performance models to make this point is that no single model is truly representative of multiprocessors or of multiprocessor algorithms. We consider a number of variations of the basic model to cover a variety of program behaviors and multiprocessor architectures. In every case, the role of R/C is the same. Small ratios lead to poor performance because of high overhead. Large ratios usually reflect poor exploitation of parallelism. For maximum performance, it is necessary to balance parallelism against overhead. The only difference from model to model is the point where the two factors become balanced.
Architects have long debated the relative qualities of fine and coarse granularity. Among SIMD machines, the GF-11 is a coarse-grained machine whose individual processors can sustain a peak rate as high as 20 Mflops. The Connection Machine is an SIMD machine whose 1-bit processors are better suited to fine-grained tasks and whose performance stems from the massive number of processors rather than from the computational power of an individual processor.
What reasoning led the architects of one machine to seek such a vastly different solution from that of the architects of the other? The range of applications is the primary motivation for the difference. The Connection Machine is designed to exploit the parallelism of tasks such as image analysis, in which a significant portion of the work is characterized by fine-grained tasks. The GF-11, which is designed for much larger-grained tasks, would be burdened by overhead if its tasks carried the additional overhead attributable to fine granularity. Thus the architects of each machine attempted to match granularity to the applications for the machine.
At one end of the multiprocessor scale are the Cray multiprocessors, such as the Cray X-MP, a four-processor system in which each processor is a Cray-1 supercomputer. Under ideal circumstances, communication in this system occurs only at the end of major phases, which might well be every few million or few billion instructions.
Smaller granularity is evident on microprocessor-based multiprocessors such as the Cosmic Cube and a number of commercial versions of this
hypercube-based design. These machines typically use 64 to 256 copies of a high-performance 32-bit microprocessor. The different granularities bias the machines somewhat toward different application programs.
The remainder of this section is devoted to performance models. In each model, observe how the ratio R/C determines the strategy that achieves the optimum performance. To simplify the models, we have generally ignored the effects of synchronization and contention except as crudely approximated by the models. In practical systems, the effects ignored here tend to lower performance below that predicted by the models. In most instances, the best way to compensate for the unmodeled effects is to increase the granularity of tasks.

6.2.1 The Basic Model: Two Processors with Unoverlapped Communications

For the first model, consider an application program that contains M tasks. Our objective is to execute this program at maximum speed on a system with N processors. For simplicity, we first consider a system with just two processors and then let the number of processors increase. To model performance we need to characterize the combination of execution time and overhead that will be incurred.

Let us make the following assumptions to obtain our initial results; subsequently we relax the assumptions and see how the performance changes. Specifically, we assume that:

1. Each task executes in R units of time; and
2. Each task communicates with every other task at an overhead cost of C units of time when the communicating tasks are not on the same processor, and at no cost when the communicating tasks are coresident.
We have various choices of how to execute such an application on a two-processor system. We can assign all tasks to one processor and ignore the second, a solution that minimizes communication overhead but fails to take advantage of available parallelism, or we can partition the tasks between the two processors in any combination. If the tasks are split across the processors, then the total execution time is a combination of the time spent in execution and the time spent on overhead activities. Although we use the notation C as if C were exclusively due to communication, it is convenient to lump overhead from all sources into C.
To some extent, overhead can be overlapped with computation, especially if processors can perform communication through input/output ports while executing concurrently. However, not all sources of overhead can be hidden by overlapping with computation. Processors can contend for shared data or
shared communication paths, and they may be idle during synchronization periods. Therefore, we assume that some portion of the overhead operations lengthens total processing time because overhead cannot be fully overlapped with computation. In this case the equation that describes total processing time is the following:

    Execution time = R Max(M - k, k) + C(M - k)k        (6.1)
Equation (6.1) expresses execution time as the sum of two terms, one attributed to run time and one to communication and other overhead. The run time for two processors is the larger of the two run times experienced, and is therefore the larger of R(M - k) and Rk when k tasks are assigned to one processor and M - k to the other. The second term models overhead as proportional to the number of pair-wise communications that must take place, given how the tasks are partitioned between the two processors. Note that the first term is a piece-wise linear function of k, and the second term is a quadratic function of k.
What is the minimum execution time for Eq. (6.1) as a function of k? That is, how shall we assign tasks to two processors to produce the minimum execution time? Figure 6.2 shows a graphic way of finding a solution. The answer for this model is to assign all tasks to one processor if R/C is below M/2, or to split the tasks evenly between the two processors if R/C exceeds that threshold. That is, either k = 0 or k = M/2. (If M is odd, then make k as close to M/2 as possible.)
Figure 6.2 shows the two different cases that arise for different values of the R/C ratio. The first term of Eq. (6.1) is piece-wise linear, and Fig. 6.2(a) shows that this term looks like the letter V because it is symmetric about the point k = M/2. In this figure, when the piece-wise linear term is added to the quadratic term, the resulting curve has a minimum at M/2.

In Fig. 6.2(b), the minimum occurs at k = 0. The minimum has to be at an extreme point of the region 0 ≤ k ≤ M/2 because the quadratic curve k(M - k) is concave downward, and adding a linear term to the curve leaves the concavity unchanged. A curve that is concave downward has its minimum at one of its endpoints. The endpoint at k = 0 (or at k = M) is the minimum when R/C < M/2; otherwise the minimum occurs at k = M/2.
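
To make the threshold concrete, the following Python sketch (ours, with arbitrary parameter values) evaluates Eq. (6.1) for every split k and reports the optimum partition:

# Illustrative sketch: evaluate Eq. (6.1), T(k) = R*Max(M - k, k) + C*(M - k)*k,
# for every split k and report the optimum partition. Parameter values are
# chosen for illustration; they are not from the text.

def execution_time(k, M, R, C):
    """Time when k tasks go to one processor and M - k to the other."""
    return R * max(M - k, k) + C * (M - k) * k

def best_partition(M, R, C):
    """Search 0 <= k <= M/2; by symmetry the other half repeats."""
    return min(range(M // 2 + 1), key=lambda k: execution_time(k, M, R, C))

M = 50
for R, C in [(10.0, 1.0), (40.0, 1.0)]:     # R/C = 10 < M/2;  R/C = 40 > M/2
    k = best_partition(M, R, C)
    print(f"R/C = {R / C:4.0f}: optimum k = {k}, "
          f"time = {execution_time(k, M, R, C):.0f}, "
          f"one-processor time = {R * M:.0f}")

With M = 50 the threshold is M/2 = 25: the run with R/C = 10 reports k = 0, and the run with R/C = 40 reports k = 25, just as the analysis predicts.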

6.2.2 Extension to N Processors

Now let's consider what happens when there are N processors. In this case, we assign k_i tasks to the ith processor. The generalization of Eq. (6.1) becomes

    Execution time = R Max_i(k_i) + (C/2) Σ_i k_i(M - k_i)
                   = R Max_i(k_i) + (C/2)(M² - Σ_i k_i²)        (6.2)
[Figure 6.2: two plots of execution time versus the partition parameter k, 0 through 40, for M = 50. The panels differ in their R/C ratio (one panel is annotated R/C = 40), and each shows the run-time term together with the total execution time.]

Fig. 6.2 Parallel execution time for two different R/C ratios:
(a) Optimum partition parameter k = 0; and
(b) Optimum partition parameter k = M/2.

The first term counts the longest running time among the N execution times. To that time is added the overhead from the second term, which counts the pair-wise links between the k_i tasks on processor i and the M - k_i tasks elsewhere; each distinct link contributes an amount C to the total time, and the factor of 1/2 corrects for counting each link twice in the sum. The second term in Eq. (6.2) is quadratic, just as in Eq. (6.1).
If the reasoning used to analyze Eq. (6.1) holds for this equation, then we expect the minimum to occur at an extreme assignment, and indeed this is the case. Either all tasks are assigned to a single processor, or they are distributed "evenly" across all processors. By "evenly," we mean that if M is a multiple of N, then each processor receives M/N tasks. Otherwise, all but one of the processors that receive tasks gets the integer ceiling of M/N tasks, and one processor receives whatever is left over. This assignment does not necessarily use all N processors. For example, when there are 19 tasks and six processors, the assignment places four tasks on each of four processors and three tasks on a fifth processor, leaving no tasks assigned to the sixth processor.
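
A small sketch of this assignment rule, with the 19-task example as input (the function name and structure are ours, for illustration):

import math

def even_assignment(M, N):
    """Give ceil(M/N) tasks to as many processors as possible, the
    remainder to one more, and none to any processors left over."""
    chunk = math.ceil(M / N)
    counts = []
    remaining = M
    while remaining > 0:
        counts.append(min(chunk, remaining))
        remaining -= counts[-1]
    return counts + [0] * (N - len(counts))

print(even_assignment(19, 6))   # [4, 4, 4, 4, 3, 0], matching the example above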
To show that the even distribution produces a local minimum, assume that k_1 is the maximum number of tasks assigned to any processor, and show that an assignment in which two other processors receive fewer than k_1 tasks can be changed into an assignment with lower cost, as computed by Eq. (6.2).

For example, assume that both k_2 and k_3 satisfy k_1 > k_2 ≥ k_3 ≥ 1. Consider the assignment that shifts one task from the third processor to the second processor, and examine how the cost changes according to Eq. (6.2). The first term does not change, because the shift does not affect the maximum number of tasks assigned to any processor. The value of the second term, however, is reduced by the amount C(k_2 - k_3 + 1). The new assignment therefore produces higher performance, and we can iterate this improvement until no more than one processor has less than the maximum number of tasks assigned to it.
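
As a quick check on this exchange argument, the following sketch evaluates Eq. (6.2) for an assignment before and after shifting one task; the values of M, R, and C are arbitrary:

def cost(ks, M, R, C):
    """Eq. (6.2): R*Max(k_i) + (C/2)*(M**2 - sum of squared k_i)."""
    return R * max(ks) + (C / 2) * (M ** 2 - sum(k * k for k in ks))

M, R, C = 12, 5.0, 1.0                      # arbitrary illustrative values
before = [6, 3, 3]                          # k1 > k2 >= k3 >= 1
after = [6, 4, 2]                           # one task moved from processor 3 to 2
saving = cost(before, M, R, C) - cost(after, M, R, C)
print(saving, C * (3 - 3 + 1))              # both print 1.0, as predicted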
Equation (6.2) has a threshold for an assignment, just as Eq. (6.1) has, and by a remarkable coincidence the thresholds are identical! We must compare the even assignment of tasks to the assignment that places all tasks on one processor. The latter assignment is preferred when R/C is sufficiently small. The difference in cost between the "even" distribution to N processors and a one-processor assignment is given by

    Time difference = RM/N + CM²/2 - CM²/(2N) - RM        (6.3)

where the first three terms form the cost of the even distribution of tasks and the last term is the cost of assigning all tasks to one processor.
To simplify the analysis, we have ignored values of M that are not exact multiples of N. To solve for the threshold value of R/C, we set the value of Eq. (6.3) to 0. Removing a factor of M, grouping terms by the coefficients R and C, and then removing another factor of (1 - 1/N) yields the equation

    Time difference = CM/2 - R = 0        (6.4)

or

    R/C = M/2        (6.5)
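
A brief numerical check (with arbitrary M, N, and C) confirms that the sign of Eq. (6.3) flips exactly at R/C = M/2:

def time_difference(M, N, R, C):
    """Eq. (6.3): cost of the even N-way distribution minus the
    cost of assigning all M tasks to a single processor."""
    return R * M / N + C * M * M / 2 - C * M * M / (2 * N) - R * M

M, N, C = 40, 8, 1.0                        # threshold is M/2 = 20
for R in (15.0, 20.0, 25.0):
    print(R / C, time_difference(M, N, R, C))
# Positive at R/C = 15 (one processor wins), zero at the threshold,
# negative at R/C = 25 (the even distribution wins).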
This model shows that if R/C is greater than the threshold M/2, then an even distribution of tasks to as many processors as are available produces the best time. On the other hand, if R/C is below that threshold, then no matter how many processors are available, no assignment produces a faster time than the assignment that uses only one processor. Here is a situation in which the role of overhead becomes quite clear.

Unless overhead is kept below a certain fraction of execution time, parallel execution cannot be beneficial. If this model holds for a parallel algorithm and architecture, then the control of overhead costs is absolutely essential for parallelism to be successful.
Although this analysis has looked at performance rather than cost, R/C determines the point at which parallelism is cost-effective. Even when R/C is sufficiently high to warrant parallelism, the performance gain is diminished by the second term of Eq. (6.2). The speedup attributable to parallelism is the ratio of the time to run on one processor to the time expressed by Eq. (6.2). This is approximately

    Speedup = RM / (RM/N + CM²/2 - CM²/(2N))
            = R / (R/N + CM(1 - 1/N)/2)
            = (R/C)N / (R/C + M(N - 1)/2)        (6.6)
If the first term of the denominator is large compared with the second, then the speedup is proportional to N. This requires M and N to be small and R/C to be large. If parallelism is increased to the extent that the denominator is dominated by its second term because N is very large, the speedup is proportional to R/(CM), which does not depend on the number of processors. Hence, as N increases, the speedup approaches a constant asymptote.
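
The asymptote is easy to see numerically. This sketch evaluates the final form of Eq. (6.6) for growing N, with illustrative values of M and R/C:

def speedup(N, M, r_over_c):
    """Eq. (6.6): (R/C)*N / (R/C + M*(N - 1)/2)."""
    return r_over_c * N / (r_over_c + M * (N - 1) / 2)

M, RC = 100, 400.0                          # R/C well above M/2, so parallelism pays
for N in (2, 8, 32, 128, 1024):
    print(N, round(speedup(N, M, RC), 2))
print("asymptote:", 2 * RC / M)             # speedup -> 2(R/C)/M = 8.0

The printed speedups climb toward 8 but never reach it, no matter how many processors are added.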
At this point each processor added to the system brings extra cost while yielding negligible performance benefit. Even though performance can improve incrementally as processors are added, the diminishing returns in performance are not worth the added cost. The number of processors should not be increased beyond some maximum that is a function of cost and the ratio R/C.
This model gives a general picture of how granularity and overhead affect the performance gain of a multiprocessor, and it gives some indication of the
importance of minimizing overhead and selecting the right granularity. It is only one model, however, and it cannot encompass the full spectrum of actual applications.
Let us alter the model in various ways and observe how the findings change. In general, we discover that R/C plays a critical role regardless of the model. In some cases there is the same type of threshold, at which the best solutions are extreme: use all available processors or just one, depending on the value of R/C. In some models the extreme solutions are not the best. The best solutions for these models distribute work among several processors but do not use all of them, because using too many leads to performance degradation and extra cost. Moreover, in the general case, work need not be distributed evenly to achieve the optimum performance.

6.2.3 A Stochastic Model

Consider what happens when the tasks are not all equal in execution time. The leading term in Eq. (6.2) is smallest when all processors run for equal lengths of time, so the objective is to scatter tasks among the processors so that all processors are occupied for equal times. If this is not possible, the maximum running time among the processors should be as short as possible.

The second term in Eq. (6.2) is smallest when tasks are distributed as unevenly as possible. Consequently, among all ways of distributing tasks to processors so that the processors have nearly equal running times, find a distribution in which the number of tasks assigned to each processor is as uneven as possible. That is, find schemes that assign as few or as many tasks per processor as possible, subject to the requirement that the total workload on a processor equal a given amount.
In this model, the best assignment need not be the most evenly distributed workload. If the workload is made slightly uneven, it may become possible to assign tasks to processors in such a way that overhead is greatly diminished. That is, a small increase in the linear first term of Eq. (6.2) can be more than balanced by a large decrease in the quadratic second term.
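
A small constructed example (ours, not from the text) shows the effect. With one large task and several unit tasks, a perfectly balanced schedule loses to a slightly uneven one, because the uneven one slashes the quadratic overhead term:

def total_time(assignment, C):
    """Makespan plus the overhead term (C/2)*(M**2 - sum of squared counts),
    generalizing Eq. (6.2) to tasks of unequal running time."""
    M = sum(len(tasks) for tasks in assignment)
    makespan = max(sum(tasks) for tasks in assignment)
    return makespan + (C / 2) * (M ** 2 - sum(len(t) ** 2 for t in assignment))

C = 1.0
balanced = [[5, 1], [1, 1, 1, 1, 1, 1]]     # equal running times: 6 and 6
lopsided = [[5], [1, 1, 1, 1, 1, 1, 1]]     # running times 5 and 7
print(total_time(balanced, C))              # 6 + 12C = 18.0
print(total_time(lopsided, C))              # 7 +  7C = 14.0: the uneven split wins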
A stochastic variation of the deterministic model presented here appears in Indurkhya, Stone, and Xi-Cheng [1986]. Instead of treating all execution and communication times as fixed constants, the model assumes that the times are independent and identically distributed random variables with mean R for the running times and mean C for the communication times. To solve the model, Indurkhya et al. appeal to the Central Limit Theorem and the additional assumption that

    Max_i E( Σ_{j on processor i} r_j ) = E( Max_i Σ_{j on processor i} r_j )        (6.7)
The E in Eq. (6.7) denotes the expected value. Equation (6.7) says that the maximum of a set of expected values of sums of independent and identically distributed random variables r_j (the running times of the tasks) is equal to the expected value of the maximum of the sums. With these two assumptions, the model reduces to the deterministic model expressed by Eq. (6.2), and the results are identical.

The assumption underlying Eq. (6.7) is actually false, as Indurkhya et al. state, but the point is that where the equation breaks down, it is close enough to being correct that the results produced are reasonably accurate. If one of the summations in Eq. (6.7) has many more summands than any other, then almost surely it has the maximum expected value, and its expected value is the value of both sides of Eq. (6.7). If two or more summations have almost the same number of terms, and this number is the maximum among all the summations, then it is possible for the left-hand side of Eq. (6.7) to select one summation and the right-hand side to select another, but the values of the summations will be fairly close, so that Eq. (6.7) is approximately if not exactly correct.
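
A quick Monte Carlo sketch (ours, assuming exponentially distributed running times with mean R = 1) illustrates this behavior: for equal-sized summations the two sides of Eq. (6.7) differ noticeably, while for lopsided summations they nearly coincide:

import random

def compare(counts, trials=20000):
    """Max of expected sums vs. expected max of sums, for exponentially
    distributed task times with mean 1 (an assumed distribution)."""
    max_of_means = max(counts)              # E(sum of n mean-1 times) = n
    expected_max = sum(
        max(sum(random.expovariate(1.0) for _ in range(n)) for n in counts)
        for _ in range(trials)
    ) / trials
    return max_of_means, expected_max

print(compare([10, 10]))    # equal-sized sums: the two sides differ noticeably
print(compare([20, 2]))     # lopsided sums: the two sides nearly coincide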
Nicol [1986] explored the model more deeply and discovered that the results reported by Indurkhya et al. can be proved to be true in some instances without relying on Eq. (6.7). Indeed, the model appears to be robust in the sense that small perturbations in the underlying assumptions do not alter the gross conclusions drawn from the model.