`
`Reference 9
`
`PATENT OWNER DIRECTSTREAM, LLC
`EX. 2121, p. 1
`
`
`
`Amdahl’s Law in the Multicore Era
`
`Mark D. Hill and Michael R. Marty
`
`Everyone knows Amdahl's Law, but quickly forgets it.
`-Dr. Thomas Puzak, IBM, 2007
`
TABLE 1. Upper Bound on Speedup, f=0.99

# Base Core Equivalents   Base Amdahl   Symmetric   Asymmetric   Dynamic
16                        14            14          14           < 16
64                        39            39          49           < 60
256                       72            80          166          < 223
1024                      91            161         531          < 782

execution. The right-most three columns show the larger upper
bounds on speedup enabled using richer cores deployed in
symmetric, asymmetric, and dynamic multicore designs.
`
Implications of our work include:
• Not surprisingly, researchers and architects should
aggressively attack serial bottlenecks in the multicore era.
• Increasing core performance, even if it appears locally
inefficient, can be globally efficient by reducing the idle
time of the rest of the chip’s resources.
• Chips with more resources tend to encourage designs
with richer cores.
• Asymmetric multicore designs offer greater potential
speedup than symmetric designs, provided challenges
(e.g., scheduling) can be addressed.
• Dynamic designs, which temporarily harness cores
together to speed sequential execution, have the potential
to achieve the best of both worlds.
Overall, we show that Amdahl’s Law beckons multicore
designers to view the performance of the entire chip rather than
zeroing in on core efficiencies. In the following sections, we
first review Amdahl’s Law. We then present simple hardware
models for symmetric, asymmetric, and dynamic multicore
chips.
`
Amdahl's Law Background
Most computer scientists learned Amdahl's Law in school
[5]. Let speedup be the original execution time divided by an
enhanced execution time. The modern version of Amdahl's
`
Abstract
We apply Amdahl’s Law to multicore chips using symmetric
cores, asymmetric cores, and dynamic techniques that allow
cores to work together on sequential execution. To Amdahl’s
simple software model, we add a simple hardware model based
on fixed chip resources.
A key result we find is that, even as we enter the multicore
era, researchers should still seek methods of speeding sequential
execution. Moreover, methods that appear locally inefficient
(e.g., tripling sequential performance with a 9x resource cost)
can still be globally efficient as they reduce the sequential phase
when the rest of the chip’s resources are idle.
To reviewers: This paper’s accessible form is between a
research contribution and a perspective. It seeks to stimulate
discussion, controversy, and future work. In addition, it seeks to
temper the current pendulum swing from the past’s under-emphasis
on parallel research to a future with too little sequential research.
`
Today we are at an inflection point in the computing land-
scape as we enter the multicore era. All computing vendors
have announced chips with multiple processor cores. More-
over, vendor roadmaps promise to repeatedly double the num-
ber of cores per chip. These future chips are variously called
chip multiprocessors, multicore chips, and many-core chips.
Designers of multicore chips must subdue more degrees of
freedom than single-core designs. Questions include: How
many cores? Should cores use simple pipelines or powerful
multi-issue ones? Should cores use the same or different
micro-architectures? In addition, designers must concurrently
manage power from both dynamic and static sources.
While answers to these questions are challenges for
today’s multicore chips with 2-8 cores, they will get much more
challenging in the future. Sources as varied as Intel and Berke-
ley predict a hundred [6] if not a thousand cores [2].
It is our thesis that Amdahl's Law has important conse-
quences for the future of our multicore era. Since most of us
learned Amdahl's Law in school, all of our points are “known”
at some level. Our goal is to ensure we remember their implica-
tions and avoid the pitfalls that Puzak fears.
Table 1 foreshadows the results we develop for applica-
tions that are 99% parallelizable. For varying numbers of base
cores, the second column gives the upper bounds on speedup
as predicted by Amdahl’s Law. In this paper, we develop a
simple hardware model that reflects potential tradeoffs in
devoting chip resources towards either parallel or sequential
`
`
`
`
`Architects should always increase core resources when
`perf(r) > r, because doing so speeds up both sequential and par-
`allel execution. When perf(r) < r, however, the tradeoff begins:
`increasing core performance aids sequential execution, but hurts
`parallel execution.
Our equations allow perf(r) to be an arbitrary function, but
all the graphs below assume perf(r) = √r. In other words, we
assume that efforts that devote r BCE resources will result in
performance √r. Thus, architectures can double performance at
a cost of 4 BCEs, triple it for 9 BCEs, etc. We tried other similar
functions, e.g., r^(1/1.5), but found no important changes to our
results.
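As a minimal sketch (ours; the function name is not from the paper), this performance model can be written directly:

```python
import math

def perf(r):
    """Sequential performance of a core built from r BCEs, under the
    paper's assumption perf(r) = sqrt(r); a single-BCE core has
    performance perf(1) = 1."""
    return math.sqrt(r)

# Doubling performance costs 4 BCEs; tripling costs 9 BCEs.
print(perf(1), perf(4), perf(9))  # 1.0 2.0 3.0
```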
`
`Symmetric Multicore Chips
`A symmetric multicore chip requires that all its cores have
`the same cost. A symmetric multicore chip with a resource bud-
`get of n = 64 BCEs, for example, can support 64 cores of 1 BCE
`each, 16 cores of 4 BCEs each, or, in general, n/r cores of r
`BCEs each (rounded down to an integer number of cores). Fig-
`ures 1 and 2 show cartoons of two possible symmetric multicore
`chips for n = 16. The figures illustrate area, not power, as the
`chip’s limiting resource and omit important structures such as
`memory interfaces, shared caches, and interconnects.
Under Amdahl's Law, the speedup of a symmetric multicore
chip (relative to using one single-BCE core) depends on the soft-
ware fraction that is parallelizable (f), total chip resources in
BCEs (n), and the BCE resources (r) devoted to increase the per-
formance of each core. The chip uses one core to execute
sequentially at performance perf(r). It uses all n/r cores to exe-
cute in parallel at performance perf(r)·(n/r). Overall, we get:

Speedup_symmetric(f, n, r) = 1 / ( (1 - f)/perf(r) + f·r/(perf(r)·n) )
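In code, the symmetric speedup and the search for the best core size look like this (a sketch under the paper's perf(r) = √r assumption; the names are ours, and the rounding of n/r to an integer core count is ignored):

```python
import math

def perf(r):
    # Paper's assumed performance function for a core of r BCEs.
    return math.sqrt(r)

def speedup_symmetric(f, n, r):
    # One r-BCE core runs the sequential fraction at perf(r);
    # n/r such cores run the parallel fraction at perf(r) * (n/r).
    return 1 / ((1 - f) / perf(r) + f * r / (perf(r) * n))

# For f = 0.9 on a 16-BCE chip, the best design is 8 cores of 2 BCEs each:
best_r = max(range(1, 17), key=lambda r: speedup_symmetric(0.9, 16, r))
print(best_r, round(speedup_symmetric(0.9, 16, best_r), 1))  # 2 6.7
```

This matches the f=0.9, n=16 maximum discussed for the upper-left graph of Figure 4.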
To understand this equation, let’s begin with the upper-left
graph of Figure 4. It assumes a symmetric multicore chip of n =
16 BCEs and perf(r) = √r. The x-axis gives the resources used to
increase the performance of each core: a value of 1 says the chip
has 16 base cores, while 16 uses all resources for a single core.
Lines assume different values for the fraction parallel (f=0.5,
0.9, ..., 0.999). The y-axis gives the speedup of the symmetric
multicore chip (relative to running on one single-BCE base
core). The maximum speedup for f=0.9, for example, is 6.7
using 8 cores of cost 2 BCEs each. The remaining left-hand
graphs give speedups for symmetric multicore chips with chip
resources of n = 64, 256, and 1024 BCEs.
`Result 1: Amdahl’s Law applies to multicore chips, as achieving
`good speedups requires f’s that are very near 1. Thus, finding
`parallelism is critical.
`Implication 1: Researchers should target increasing f via archi-
`tectural support, compiler techniques, programming model
`improvements, etc.
`Implication 1 is both most obvious and most important.
`Recall, however, that speedups much less than n can still be cost
`effective.1
Result 2: Using more BCEs per core, r > 1, can be optimal, even
when performance grows by only √r. For a given f, the maxi-
`
Law states that if one enhances a fraction f of a computation by a
speedup S, then the overall speedup is:

Speedup_enhanced(f, S) = 1 / ( (1 - f) + f/S )
`
Amdahl's Law applies broadly and has important corollaries
such as:
• Attack the common case: if f is small, your optimizations
will have little effect.
• But the aspects you ignore also limit speedup: as S goes
to infinity, Speedup goes to 1/(1-f).
Four decades ago, Amdahl originally defined his law for the
special case of using n processors (cores today) in parallel when
he argued for the “Validity of the Single Processor Approach to
Achieving Large Scale Computing Capabilities” [1]. He simplisti-
cally assumed that a fraction f of a program's execution time was
infinitely parallelizable with no overhead, while the remaining
fraction, 1-f, was totally sequential. Without presenting an equa-
tion, he noted that the speedup on n processors is governed by:

Speedup_parallel(f, n) = 1 / ( (1 - f) + f/n )
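This special case is a one-liner in code; the following sketch (ours) evaluates it for the f = 0.99 configurations of Table 1:

```python
def speedup_parallel(f, n):
    # Fraction f runs perfectly in parallel on n cores; 1 - f is sequential.
    return 1 / ((1 - f) + f / n)

for n in (16, 64, 256, 1024):
    print(n, round(speedup_parallel(0.99, n)))
# 16 14, 64 39, 256 72, 1024 91 -- the "Base Amdahl" column of Table 1
```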
`
`Finally, he argued that typical values of 1-f were large
`enough to favor single processors.
While Amdahl's arguments were simple, they held, and
mainframes with one or a few processors dominated the comput-
ing landscape. They also largely held in the minicomputer and
personal computer eras that followed. As recent technology
trends usher us into the multicore era, we investigate whether
Amdahl’s Law is still relevant.
`
`A Simple Cost Model for Multicore Chips
`To apply Amdahl's Law to a multicore chip, we need a cost
`model for the number and performance of cores that the chip can
`support. Herein we develop a simple hardware model in the
`spirit of Amdahl's simple software model.
`We first assume that a multicore chip of given size and tech-
`nology generation can contain at most n base core equivalents
`(BCEs), where a single BCE implements the baseline core. This
`limit comes from the resources a chip designer is willing to
`devote to processor cores (with L1 caches). It does not include
`chip resources expended on shared caches, interconnections,
`memory controllers, etc. Rather we simplistically assume that
`these non-processor resources are roughly constant in the multi-
`core variations we consider.
`We are agnostic on what limits a chip to n BCEs. It may be
`power; it may be area; it may be some combination of power,
`area, and other factors.
`We second assume that (micro-) architects have techniques
`for using the resources of multiple BCEs to create a richer core
`with greater sequential performance. Let the performance of a
`single-BCE core be 1. We specifically assume that architects can
`expend the resources of r BCEs to create a rich core with
`sequential performance perf(r).
`
`
`
`
Figure 1. Symmetric: Sixteen 1-BCE cores.
Figure 2. Symmetric: Four 4-BCE cores.
Figure 3. Asymmetric: One 4-BCE core, 12 1-BCE cores.
Note: These cartoons omit important structures and assume area, not power, is a chip’s limiting resource.
mum speedup can occur at 1 big core, at n base cores, or with an
intermediate number of middle-sized cores. Consider n=16.
With f=0.5, one core (of cost 16 BCEs) gives the best speedup of
4. With f=0.975, 16 single-BCE cores provide a speedup of 11.6.
With n=64 and f=0.9, 9 cores of 7.1 BCEs each provide an
overall speedup of 13.3.
Implication 2: Researchers should seek methods of increasing
core performance even at a high cost.
Result 3: Moving to denser chips increases the likelihood that
cores should be non-minimal. Even at f=0.99, minimal base
cores are optimal at chip sizes n=16 and 64, but more powerful
cores help at n=256 and 1024.
Implication 3: Even as Moore’s Law allows larger chips,
researchers should look for ways to design more powerful cores.

symmetric speedups. The symmetric curves typically show
either immediate performance improvement or performance loss
as the chip uses more powerful cores, depending on the level of
parallelism. In contrast, asymmetric chips reach a maximum
speedup in the middle of the extremes.
Result 4: Asymmetric multicore chips can offer maximum
speedups that are much greater than symmetric multicore chips
(and never worse). For f=0.975 and n=256, for example, the best
asymmetric speedup is 125.0 whereas the best symmetric
speedup is 51.2. For n=1024 and the same f, the difference
increases to 364.5 versus 102.5. This result follows from
Amdahl’s idealized software assumptions, wherein software is
either completely sequential or completely parallel.
Implication 4: Researchers should continue to investigate
asymmetric multicore chips. However, real chips must deal with
many challenges, such as scheduling different phases of parallel-
ism with real overheads. Furthermore, chips may need multiple
larger cores for multiprogramming and workloads that exhibit
overheads not captured by Amdahl’s model.
`Result 5: Denser multicore chips increase both the speedup ben-
`efit of going asymmetric (see above) and the optimal perfor-
`mance of the single large core. For f=0.975 and n=1024, for
`example, best speedup is obtained with one core of 345 BCEs
`and 679 single-BCE cores.
Implication 5: Researchers should investigate methods of
speeding sequential performance even if they appear locally
inefficient, e.g., perf(r) = √r. This is because these methods can
be globally efficient as they reduce the sequential phase when
the chip’s other n-r cores are idle.
`
`Asymmetric Multicore Chips
`An alternative to a symmetric multicore chip is an asymmet-
`ric multicore chip where one or more cores are more powerful
`than the others [3, 8, 9, 12]. With the simplistic assumptions of
`Amdahl's Law, it makes most sense to devote extra resources to
`increase the capability of only one core, as shown in Figure 3.
`With a resource budget of n=64 BCEs, for example, an asym-
`metric multicore chip can have one 4-BCE core and 60 1-BCE
`cores, one 9-BCE core and 55 1-BCE cores, etc. In general, the
`chip can have 1+n-r cores since the single larger core uses r
`resources and leaves n-r resources for the 1-BCE cores.
`Amdahl's Law has a different effect on an asymmetric multi-
`core chip. This chip uses the one core with more resources to
`execute sequentially at performance perf(r). In the parallel frac-
`tion, however, it gets performance perf(r) from the large core
`and performance 1 from each of the n-r base cores. Overall, we
`get:
`
Speedup_asymmetric(f, n, r) = 1 / ( (1 - f)/perf(r) + f/(perf(r) + n - r) )
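The equation is easy to check numerically; this sketch (ours, again assuming perf(r) = √r) recovers the best asymmetric speedup of 125.0 cited in Result 4 for f=0.975 and n=256:

```python
import math

def perf(r):
    # Paper's assumed performance function for a core of r BCEs.
    return math.sqrt(r)

def speedup_asymmetric(f, n, r):
    # One r-BCE core plus n - r base cores: sequential work runs at perf(r);
    # parallel work runs on all cores at combined rate perf(r) + (n - r).
    return 1 / ((1 - f) / perf(r) + f / (perf(r) + n - r))

# Best speedup over all large-core sizes r for f = 0.975 on a 256-BCE chip:
best = max(speedup_asymmetric(0.975, 256, r) for r in range(1, 257))
print(round(best, 1))  # 125.0
```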
`
`The asymmetric speedup curves are shown in Figure 4.
`These curves are markedly different from the corresponding
`
`1. A system is cost-effective if its speedup exceeds its costup
`[13]. Multicore costup is the multicore system cost divided by
`the single-core system cost. Since this costup is often much less
`than n, speedups less than n can be cost effective.
`
`
`
`
Figure 4. Speedup of Symmetric and Asymmetric Multicore chips.
(Eight panels: symmetric and asymmetric speedup versus r BCEs for n = 16, 64, 256, and 1024, each with curves for f = 0.5, 0.9, 0.975, 0.99, and 0.999.)
`
`
`
`
`locally inefficient, as with asymmetric chips, the methods can be
`globally efficient. While these methods may be difficult to apply
`under Amdahl’s extreme assumptions, they could flourish for
`software with substantial phases of intermediate-level parallel-
`ism.
`
`Conclusions
`To Amdahl’s simple software model, we add a simple hard-
`ware model and compute speedups for symmetric, asymmetric,
`and dynamic multicore chips:
`
Speedup_symmetric(f, n, r) = 1 / ( (1 - f)/perf(r) + f·r/(perf(r)·n) )

Speedup_asymmetric(f, n, r) = 1 / ( (1 - f)/perf(r) + f/(perf(r) + n - r) )

Speedup_dynamic(f, n, r) = 1 / ( (1 - f)/perf(r) + f/n )
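As a cross-check, the three formulas reproduce the f=0.99 bounds of Table 1. This verification sketch is ours: it assumes the paper's perf(r) = √r, ignores the rounding of n/r to an integer core count, and takes the dynamic column at its limiting case r = n:

```python
import math

def perf(r):
    # Paper's assumed sequential performance of a core built from r BCEs.
    return math.sqrt(r)

def sym(f, n, r):
    # Symmetric: n/r cores of r BCEs each (integer rounding ignored).
    return 1 / ((1 - f) / perf(r) + f * r / (perf(r) * n))

def asym(f, n, r):
    # Asymmetric: one r-BCE core plus n - r single-BCE cores.
    return 1 / ((1 - f) / perf(r) + f / (perf(r) + n - r))

def dyn(f, n, r):
    # Dynamic: r BCEs fused for sequential work, all n cores in parallel.
    return 1 / ((1 - f) / perf(r) + f / n)

f = 0.99
for n in (16, 64, 256, 1024):
    amdahl = 1 / ((1 - f) + f / n)
    best_sym = max(sym(f, n, r) for r in range(1, n + 1))
    best_asym = max(asym(f, n, r) for r in range(1, n + 1))
    best_dyn = dyn(f, n, n)  # bound: all n BCEs harnessed sequentially
    print(n, round(amdahl), round(best_sym), round(best_asym), round(best_dyn, 1))
# n=16:   14  14   14   15.5
# n=64:   39  39   49   59.8
# n=256:  72  80   166  222.6
# n=1024: 91  161  531  781.7
```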
`
`Results first reaffirm that seeking greater parallelism is criti-
`cal to obtaining good speedups. We then show how core designs
`that are locally inefficient can be globally efficient.
`Of course, our model seeks insight by making many simpli-
`fying assumptions. The real world is much more complex.
`Amdahl’s simple software model may not hold reasonably for
`future software. Future mainstream parallel software may also
`behave differently from today’s highly tuned parallel scientific
`and database codes. Our simple hardware model does not
`account for the complex tradeoffs between cores, cache capacity,
`interconnect resources, and off-chip bandwidth. Nevertheless,
`we find value in the controversy and discussion that drafts of this
`paper have already stimulated.
`We thank Shailender Chaudhry, Robert Cypher, Anders Lan-
`din, José F. Martínez, Kevin Moore, Andy Phelps, Thomas
`Puzak, Partha Ranganathan, Karu Sankaralingam, Mike Swift,
`Marc Tremblay, Sam Williams, David Wood, and the Wisconsin
`Multifacet group for their comments and/or proofreading. This
`work is supported in part by the National Science Foundation
`(NSF), with grants EIA/CNS-0205286, CCR-0324878, and
`CNS-0551401, as well as donations from Intel and Sun Micro-
`systems. Hill has significant financial interest in Sun Microsys-
`tems. The views expressed herein are not necessarily those of the
`NSF, Intel, or Sun Microsystems.
`
Figure 5. Dynamic: Sixteen 1-BCE cores (drawn in both sequential mode and parallel mode).
`
`Dynamic Multicore Chips
`What if architects could have their cake and eat it too? Con-
`sider dynamically combining up to r cores together to boost per-
`formance of only the sequential component, as shown in
`Figure 5. This could be possible with thread-level speculation,
`helper threads, etc. [4, 7, 10, 11]. In sequential mode, this
`dynamic multicore chip can execute with performance perf(r)
`when the dynamic techniques can use r BCEs. In parallel mode,
`a dynamic multicore gets performance n using all base cores in
`parallel. Overall, we get:
`
Speedup_dynamic(f, n, r) = 1 / ( (1 - f)/perf(r) + f/n )
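This too is easy to check numerically (our sketch, with perf(r) = √r): harnessing all r = n = 256 BCEs in sequential mode at f = 0.99 approaches the bound of 223 listed in Table 1:

```python
import math

def perf(r):
    # Paper's assumed performance function when r BCEs are combined.
    return math.sqrt(r)

def speedup_dynamic(f, n, r):
    # Sequential phase: up to r BCEs fused into one core running at perf(r).
    # Parallel phase: all n base cores, combined performance n.
    return 1 / ((1 - f) / perf(r) + f / n)

print(round(speedup_dynamic(0.99, 256, 256), 1))  # 222.6
```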
`
Of course, dynamic techniques may have some additional
resource overhead (e.g., area) not reflected in the equation, as
well as additional runtime overhead when combining and split-
ting cores. Figure 6 displays dynamic speedups when using r
cores in sequential mode for perf(r) = √r. Light grey lines give
the corresponding values for asymmetric speedup. The graphs
show that performance always gets better as more BCE
resources can be exploited to improve the sequential component.
Practical considerations, however, may keep r much smaller
than its maximum of n.
Result 6: Dynamic multicore chips can offer speedups that can
be greater, and are never worse, than asymmetric chips. With
Amdahl’s sequential-parallel assumption, however, achieving
much greater speedup than asymmetric chips requires that
dynamic techniques harness large numbers of BCEs in sequen-
tial mode. For f=0.999 and n=1024, for example, the dynamic
speedup attained when 256 BCEs can be utilized in sequential
mode is 963, whereas the comparable asymmetric speedup is
748. This result follows because we assume that dynamic chips
can both gang together all resources for sequential execution and
free them for parallel execution.
Implication 6: Researchers should continue to investigate meth-
ods that approximate a dynamic multicore chip, such as thread-
level speculation and helper threads. Even if the methods appear
`
`
`
`
`References
`[1] G. M. Amdahl. Validity of the Single-Processor Approach
`to Achieving Large Scale Computing Capabilities. In
`AFIPS Conference Proceedings, pages 483–485, 1967.
`[2] Krste Asanovic, Ras Bodik, Bryan Christopher Catanzaro,
`Joseph James Gebis, Parry Husbands, Kurt Keutzer,
`David A. Patterson, William Lester Plishker, John Shalf,
`Samual Webb Williams, and Katherine A. Yelick. The
`Landscape of Parallel Computing Research: A View from
Berkeley. Technical Report UCB/EECS-2006-183, EECS
Department, University of
`California, Berkeley, 2006.
`[3] Saisanthosh Balakrishnan, Ravi Rajwar, Michael Upton,
`and Konrad Lai. The Impact of Performance Asymmetry in
`Emerging Multicore Architectures. In ISCA 32, June 2005.
`[4] Lance Hammond, Mark Willey, and Kunle Olukotun. Data
`Speculation Support for a Chip Multiprocessor. In ASPLOS
`8, pages 58–69, October 1998.
`[5] John L. Hennessy and David A. Patterson. Computer
`Architecture: A Quantitative Approach. Morgan
`Kaufmann, third edition, 2003.
[6] From a Few Cores to Many: A Tera-scale Computing
Research Overview.
ftp://download.intel.com/research/platform/terascale/terascale_overview_paper.pdf,
2006.
[7] Engin Ipek, Meyrem Kirman, Nevin Kirman, and Jose F.
Martinez. Core Fusion: Accommodating Software Diversity
in Chip Multiprocessors. In ISCA 34, June 2007.
[8] J. A. Kahle, M. N. Day, H. P. Hofstee, C. R. Johns, T. R.
Maeurer, and D. Shippy. Introduction to the Cell
Multiprocessor. IBM Journal of Research and
Development, 49(4), 2005.
`[9] Rakesh Kumar, Keith I. Farkas, Norman P. Jouppi,
`Parthasarathy Ranganathan, and Dean M. Tullsen. Single-
`ISA Heterogeneous Multi-Core Architectures: The
`Potential for Processor Power Reduction. In MICRO 36,
`December 2003.
`[10] Jose Renau, Karin Strauss, Luis Ceze, Wei Liu, Smruti
`Sarangi, James Tuck, and Josep Torrellas. Energy-Efficient
`Thread-Level Speculation on a CMP. IEEE Micro, 26(1),
`Jan/Feb 2006.
`[11] G.S. Sohi, S. Breach, and T.N. Vijaykumar. Multiscalar
`Processors. In ISCA 22, pages 414–425, June 1995.
`[12] M. Aater Suleman, Yale N. Patt, Eric A. Sprangle, Anwar
`Rohillah, Anwar Ghuloum, and Doug Carmean. ACMP:
`Balancing Hardware Efficiency
`and Programmer
`Efficiency. Technical Report HPS Technical Report, TR-
`HPS-2007-001, University of Texas, Austin, February
`2007.
`[13] David A. Wood and Mark D. Hill. Cost-Effective Parallel
`Computing. IEEE Computer, pages 69–72, February 1995.
`
Figure 6. Speedup of Dynamic Multicore Chips (light lines show asymmetric speedup).
(Four panels: dynamic speedup versus r BCEs for n = 16, 64, 256, and 1024, each with curves for f = 0.5, 0.9, 0.975, 0.99, and 0.999.)
`