COVER FEATURE

Heterogeneous Chip Multiprocessors

Heterogeneous (or asymmetric) chip multiprocessors present unique opportunities for improving system throughput, reducing processor power, and mitigating Amdahl's law. On-chip heterogeneity allows the processor to better match execution resources to each application's needs and to address a much wider spectrum of system loads, from low to high thread parallelism, with high efficiency.
Rakesh Kumar and Dean M. Tullsen, University of California, San Diego
Norman P. Jouppi and Parthasarathy Ranganathan, HP Labs
With the announcement of multicore microprocessors from Intel, AMD, IBM, and Sun Microsystems, chip multiprocessors have recently expanded from an active area of research to a hot product area. If Moore's law continues to apply in the chip multiprocessor (CMP) era, we can expect to see a geometrically increasing number of cores with each advance in feature size.

A critical question in CMPs is the size and strength of the replicated core. Many server applications focus primarily on throughput per cost and power. Ideally, a CMP targeted for these applications would use a large number of small, low-power cores. Much of the initial research in CMPs focused on these types of applications.1,2 However, desktop users are more interested in the performance of a single application or a few applications at a given time. A CMP designed for desktop users would more likely focus on a smaller number of larger, higher-power cores with better single-thread performance. How should designers choose between these conflicting requirements in core complexity?
Table 1. Power and relative performance of Alpha cores scaled to 0.10 µm. Performance is expressed normalized to EV4 performance.

Core   Peak power (W)   Average power (W)   Performance (norm. IPC)
EV4         4.97              3.73                  1.00
EV5         9.83              6.88                  1.30
EV6        17.80             10.68                  1.87
EV8        92.88             46.44                  2.14
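The efficiency gap in Table 1 is easy to quantify. The following sketch is ours, not from the article; it simply restates the table's numbers to compute normalized performance per watt of average power for each core:

```python
# Performance per watt from Table 1: (average power in watts,
# performance normalized to EV4) for each Alpha core.
cores = {
    "EV4": (3.73, 1.00),
    "EV5": (6.88, 1.30),
    "EV6": (10.68, 1.87),
    "EV8": (46.44, 2.14),
}

for name, (watts, perf) in cores.items():
    print(f"{name}: {perf / watts:.3f} normalized IPC per watt")

# EV4 delivers roughly six times the performance per watt of EV8,
# even though EV8 has about twice EV4's single-thread performance.
```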
In reality, application needs are not always so simply characterized, and many types of applications can benefit from either the speed of a large core or the efficiency of a small core at various points in their execution. Further, the best fit of application to processor can also depend on the system context, for example, whether a laptop is running off a battery or off wall power.

Thus, we believe the best choice in core complexity is "all of the above": a heterogeneous chip multiprocessor with both high- and low-complexity cores. Recent research in heterogeneous, or asymmetric, CMPs has identified significant advantages over homogeneous CMPs in terms of power and throughput and in addressing the effects of Amdahl's law on the performance of parallel applications.
HETEROGENEITY'S POTENTIAL

Table 1 shows the power dissipation and performance of four generations of Alpha microprocessor cores scaled to the same 0.10 µm feature size and assumed to run at the same clock frequency. Figure 1 shows the relative sizes of these cores. All cores put together comprise less than 15 percent more area than EV8, the largest core, by itself. This data is representative of the past 20 years of microprocessor evolution, and similar data exists for x86 processors.3

Although the number of transistors per microprocessor core has increased greatly, with attendant increases in area, power, and design complexity, this complexity has caused only a modest increase in application performance, as opposed to the performance gains due to faster clock rates from technology scaling. Thus, while the more complex cores provide higher single-thread performance, this comes with a loss of area and power efficiency.
Figure 1. Relative sizes of the Alpha cores (EV4, EV5, EV6, EV8) scaled to 0.10 µm. EV8 is 80 times bigger but provides only two to three times more single-threaded performance.
Figure 2. Applu benchmark resource requirements: (a) performance of applu on the four cores, plotted as IPS (millions) versus committed instructions (millions); (b) oracle core switching for energy; and (c) oracle core switching for the energy-delay product. Switchings are infrequent, hence total switching overhead is minimal.
Further, in addition to varying in their resource requirements, applications can have significantly different resource requirements during different phases of their execution. This is illustrated for the applu benchmark in Figure 2.

Some application phases might have a large amount of instruction-level parallelism (ILP), which can be exploited by a core that can issue many instructions per cycle, that is, a wide-issue superscalar CPU. The same core, however, might be very inefficient for an application phase with little ILP, consuming significantly more power (even after the application of gating- or voltage/frequency-scaling-based techniques) than a simpler core that is better matched to the application's characteristics. Therefore, in addition to changes in performance over time, significant changes occur in the relative performance of the candidate cores.

In Figure 2, sometimes the difference in performance between the biggest and smallest core is less than a factor of two, sometimes more than a factor of 10. Thus, the best core for executing an application phase can vary considerably during a program's execution. Fortunately, much of the benefit of heterogeneous execution can be obtained with relatively infrequent switching between cores, on the order of context-switch intervals. This greatly reduces the overhead of switching between cores to support heterogeneous execution.
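A minimal sketch of the oracle switching used in Figures 2b and 2c follows: at each interval it picks the core that minimizes energy per instruction or the energy-delay product. The per-phase (instructions per second, watts) profiles here are hypothetical, not measurements from the article.

```python
# Oracle core selection per phase (a sketch). phase_profile maps each
# core to (ips, watts) for the phase about to execute.
CORES = ("EV4", "EV5", "EV6", "EV8")

def pick_core(phase_profile, objective="energy"):
    def cost(core):
        ips, watts = phase_profile[core]
        epi = watts / ips          # energy per instruction (joules)
        delay = 1.0 / ips          # seconds per instruction
        return epi if objective == "energy" else epi * delay  # ED product
    return min(CORES, key=cost)

# A low-ILP phase: the big core barely runs faster but burns far more power.
phase = {"EV4": (4.0e8, 3.73), "EV5": (4.4e8, 6.88),
         "EV6": (4.8e8, 10.68), "EV8": (5.0e8, 46.44)}
print(pick_core(phase, "energy"))  # -> EV4
print(pick_core(phase, "ed"))      # -> EV4; a high-ILP phase could flip this
```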
Heterogeneous multicore architectures have been around for a while, in the form of system-on-chip designs, for example. However, in such systems, each core performs a distinct task. More recently, researchers have proposed multi-ISA multicore architectures.4 Such processors have cores that execute instructions belonging to different ISAs, and they typically address vector/data-level parallelism and instruction-level parallelism simultaneously. Not all cores can execute any given instruction. In contrast, in single-ISA heterogeneous CMPs, each core executes the same ISA, hence each application or application phase can be mapped to any of the cores. Single-ISA heterogeneous CMPs are an example of a multicore-aware processor architecture. The "Multicore-Aware Processor Architecture" sidebar provides additional information about this type of architecture.
POWER ADVANTAGES

Using single-ISA heterogeneous CMPs can significantly reduce processor power dissipation. As processors continue to increase in performance and speed, processor power consumption and heat dissipation have become key challenges in the design of future high-performance systems. Increased power consumption and heat dissipation typically lead to higher costs for thermal packaging, fans, electricity, and even air conditioning. Higher-power systems can also have a greater incidence of failures.

Industry currently uses two broad classes of techniques for power reduction: gating-based and voltage/frequency-scaling-based. Both of these techniques exploit program behavior for power reduction and are applied at a single-core level. However, any technique applied at this level suffers from limitations. For example, consider clock gating. Gating circuitry itself has power and area overhead, hence it typically cannot be applied at the lowest granularity. Thus, some dynamic power is still dissipated even for inactive blocks. In addition, data must still be transmitted long distances over unused portions of the chip that have been gated off, which consumes a substantial amount of power. Also, gating helps reduce only the dynamic power; large unused portions of the chip still dissipate leakage power. Voltage/frequency-scaling-based techniques suffer from similar limitations.

Given the ability to dynamically switch between cores and power down unused cores to eliminate leakage, recent work has shown reductions in processor energy-delay product as high as 84 percent (a sixfold improvement) for individual applications and 63 percent overall.5 Reductions in energy-delay², the product of energy and the square of the delay, are as high as 75 percent (a fourfold improvement) and 50 percent overall.

Ed Grochowski and colleagues3 found that using asymmetric cores could easily improve energy per instruction by four to six times. In comparison, given today's already low core voltages of around 1 volt, voltage scaling could provide at most a two to four times improvement in energy per instruction. Various types of gating could provide up to a two times improvement in energy per instruction, while controlling speculation could reduce energy per instruction by up to 40 percent. These techniques are not mutually exclusive, and voltage scaling is largely orthogonal to heterogeneity. However, heterogeneity provided the single largest potential improvement in energy efficiency per instruction.
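The metrics behind these figures are simple products. This sketch is ours, with illustrative numbers only; it also checks how the quoted percentage reductions translate into the "fold" improvements in the text:

```python
# Energy-delay metrics for a fixed workload: energy in joules,
# run_time (delay) in seconds.
def edp(energy, run_time):
    """Energy-delay product: weights energy and delay equally."""
    return energy * run_time

def ed2p(energy, run_time):
    """Energy-delay^2 product: weights performance more heavily."""
    return energy * run_time ** 2

# Example: core A finishes in 2 s using 10 J; core B in 1 s using 18 J.
print(edp(10, 2), edp(18, 1))    # 20 vs. 18: B wins on EDP
print(ed2p(10, 2), ed2p(18, 1))  # 40 vs. 18: B wins more clearly on ED^2

# An 84 percent EDP reduction leaves 0.16x the original EDP:
print(1 / (1 - 0.84))  # ~6.2, the "sixfold improvement" in the text
print(1 / (1 - 0.75))  # 4.0, the "fourfold" ED^2 improvement
```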

Multicore-Aware Processor Architecture

As we move from a world dominated by uniprocessors to one likely to be dominated by multiprocessors on a chip, we have the opportunity to approach the architecture of these systems in new ways. Developers need to architect each piece of the system not to stand alone, but to be a part of the whole. In many cases, this requires very different thinking than what prevailed for uniprocessor architectures.

Heterogeneous CMP design is just one example of multicore-aware architecture. In the uniprocessor era, we designed one architecture for the universe of applications. Thus, for applications that demand high instruction throughput, applications with variable control flow and latencies, applications that access large data sets, or applications with heavy floating-point throughput, the best processor is superscalar and dynamically scheduled, with large caches, multiple floating-point units, and so on. However, few, if any, applications actually need all those resources. Thus, such an architecture is highly overprovisioned for any single application.

In a chip multiprocessor, however, no single core need be ideal for the universe of applications. Employing heterogeneity exploits this principle. Conversely, a homogeneous design actually exacerbates the overprovisioning problem by creating a single universal design, then replicating that overprovisioned design across the chip.

Heterogeneous CMP design, however, is not the only example of multicore-aware architecture: Two other examples are a conjoined-core architecture and CMP/interconnect codesign.

Blurring the lines between cores

Some level of processor overprovisioning is necessary for market and other considerations, whether on homogeneous or heterogeneous CMPs, because it increases each core's flexibility. What we really want, though, is the same level of overprovisioning available to any single thread without multiplying the cost by the number of cores. In conjoined-core chip multiprocessing,1 for example, adjacent cores share overprovisioned structures while requiring only minor modifications to the floorplan.

In the uniprocessor era, the lines between cores were always distinct, and the cores could share only very remote resources. With conjoined cores, those lines aren't necessary on a CMP. Figure A shows the floorplan of two adjacent cores of a CMP sharing a floating-point unit and level-1 caches. Each core can access the shared structures in turn during fixed allocated cycles (for example, one core gets access to the shared structure every odd cycle while the other core gets access every even cycle), or sharing can be based on certain dynamic conditions visible to both cores. Sharing should occur without communication between cores, which is expensive.

Conjoining reduces the number of overprovisioned structures by half. In the Figure A example, conjoining results in having only four floating-point units on an eight-core CMP, one per conjoined-core pair. Each core gets full access to the floating-point unit unless the other core needs access to it at the same time. Applying intelligent, complexity-effective sharing mechanisms can minimize the performance impact of reduced bandwidth between the shared structure and a core.

The chief advantage of having conjoined cores is a significant reduction in per-core real estate with minimal impact on per-core performance, providing a higher computational capability per unit area. Conjoining can reduce the area devoted to cores by half, with no more than a 10 to 12 percent degradation in single-thread performance.1 This can be used to decrease the area of the entire die, increase the yield, or support more cores for a given fixed die size. Ancillary benefits include a reduction in leakage power from fewer transistors for a given computational capability.
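As a minimal sketch of the fixed odd/even-cycle allocation just described (our illustration; the request streams are hypothetical, and real hardware would arbitrate in the issue logic, not software):

```python
# Two conjoined cores time-multiplexing one shared FPU: core A owns
# even cycles, core B owns odd cycles, with no inter-core communication.
from collections import deque

def run_shared_fpu(reqs_a, reqs_b, cycles):
    queues = (deque(reqs_a), deque(reqs_b))
    for cycle in range(cycles):
        owner = cycle % 2                  # static, communication-free policy
        if queues[owner]:
            op = queues[owner].popleft()
            print(f"cycle {cycle}: core {'AB'[owner]} issues {op}")
        # otherwise the owning core had no FP work; the slot goes unused

run_shared_fpu(["fmul", "fadd"], ["fdiv"], cycles=4)
```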
CMP/interconnect codesign

In the uniprocessor era, high-performance processors connected by high-performance interconnects yielded high-performance systems. On a CMP, the design issues are more complex because the cores, caches, and interconnect all reside on the same chip and compete for the same area and power budget. Thus, the design choices for the cores, caches, and interconnects can interact to a much greater degree.
For example, an aggressive interconnect design consumes power and area resources that can then constrain the number, size, and design of cores and caches. Similarly, the number and type of cores, as well as on-chip memory, can also dictate requirements on the interconnect. Increasing the number of cores places conflicting demands on the interconnect, requiring higher bandwidth while decreasing available real estate.

A recent study shows how critical the interconnect can be for multicore design.2 On an eight-core processor, for example, even under conservative assumptions, the interconnect can consume the power equivalent of one core, take the area equivalent of three cores, and add delay that accounts for over half the L2 access latency.

Such high overheads imply that it isn't possible to design a good interconnect in isolation from the design of the CPU cores and memory. Cores, caches, and interconnect should all be codesigned for the best performance or energy efficiency. A good example of the need for codesign is that decreasing interconnection bandwidth can sometimes improve performance because of the constrained window on total resources, for example, if it enables larger caches that decrease interconnect pressure. In the same way, excessively large caches can also decrease performance when they constrain the interconnect to too small an area.

Hence, designing a CMP architecture is a system-level optimization problem. Some common architectural beliefs do not hold when interconnection overheads are properly accounted for. For example, shared caches are not as desirable compared to private caches if the cost of the associated crossbar is carefully factored in.2

Any new CMP architecture proposal should consider the interconnect as a first-class citizen, and all CMP research proposals should include careful interconnect modeling for correct and meaningful results and analysis.

References
1. R. Kumar, N.P. Jouppi, and D. Tullsen, "Conjoined-Core Chip Multiprocessing," Proc. Int'l Symp. Microarchitecture, IEEE CS Press, 2004, pp. 195-206.
2. R. Kumar, V. Zyuban, and D. Tullsen, "Interconnection in Multicore Architectures: Understanding Mechanisms, Overheads, and Scaling," Proc. Int'l Symp. Computer Architecture, IEEE CS Press, 2005, pp. 408-419.
Figure A. CMP floorplan: (1) the original core and (2) a conjoined-core pair, both showing floating-point unit routing. Routing and register files are schematic and not drawn to scale.
THROUGHPUT ADVANTAGES

For two reasons, given a fixed circuit area, using heterogeneous multicore architectures instead of homogeneous CMPs can provide significant performance advantages for a multiprogrammed workload.6 First, a heterogeneous multicore architecture can match each application to the core best suited to meet its performance demands. Second, it can provide improved area-efficient coverage of the entire spectrum of workload demands seen in a real machine, from low thread-level parallelism, where powerful cores provide low latency for a few applications, to high thread-level parallelism, where simple cores can host a large number of applications simultaneously.

So, for example, in a homogeneous architecture with four large cores, it would be possible to replace one large core with five smaller cores, for a total of eight cores. In the best case, intelligently scheduling jobs on the smaller cores that would have seen no significant benefit from the larger core would yield the performance of eight large cores in the space of four.

Overall, a representative heterogeneous processor using two core types achieves as much as a 63 percent performance improvement over an equivalent-area homogeneous processor. Over a range of moderate load levels (five to eight threads, for example), the heterogeneous CMP has an average gain of 29 percent. For an open system with random job arrivals, the heterogeneous architecture has a much lower average response time and remains stable for arrival rates 43 percent higher than for a homogeneous architecture. Dynamic thread-to-core assignment policies have also been demonstrated that realize most of the potential performance gain. One simple assignment policy outperformed naive core assignment by 31 percent.6

Heterogeneity also can be beneficial in systems with multithreaded cores. Despite the additional scheduling complexity that simultaneous multithreading (SMT) cores pose due to an explosion in the possible assignment permutations, effective assignment policies can be formulated that do derive significant benefit from heterogeneity.6

Figure 3 shows the performance of several heuristics for heterogeneous systems with multithreaded cores. These heuristics prune the assignment space by making assumptions regarding the relative benefits of running on a simpler core versus running on a simultaneous multithreaded core. Various sampling policies are used to choose good assignments from the reduced assignment space. We observed that learning policies, which assume the current configuration has merit when choosing the next one to sample, perform particularly well.6

Figure 3. Performance of heuristics for equal-area heterogeneous architectures with multithreaded cores, plotted as weighted speedup versus number of threads for eqv-EV5-homogeneous, eqv-EV6-homogeneous, and 3EV6+ & 5EV5 (pref-core and pref-similar) configurations. EV6+ is an SMT variant of EV6. Pref-core biases sampling toward a new core over a new SMT context. Pref-similar biases sampling toward the last good configuration.

The graph in Figure 3 also demonstrates the primary benefit of heterogeneity. Using four big cores yields good few-threads performance, and using many small cores (more than 20) yields high peak throughput. Only the heterogeneous architecture provides high performance across all levels of thread parallelism.

With these policies, this architecture provides even better coverage of a spectrum of load levels. It provides the low latency of powerful processors at low threading levels, but is also comparable to a larger array of small processors at high thread occupancy.

These thread assignment heuristics can be quite useful even for homogeneous CMPs in which each core is multithreaded. Such CMPs face many of the same problems regarding explosion of the assignment space. In some sense, such CMPs can be thought of as heterogeneous CMPs for scheduling purposes, where the heterogeneity stems from different marginal performance and power characteristics for each SMT context.

As computing objectives keep switching back and forth between single-thread performance and throughput, we believe that single-ISA heterogeneous multicore architectures provide a convenient and seamless way to address both concerns simultaneously.
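As a sketch of the sampling idea behind these heuristics (ours; measure() is a hypothetical stand-in for running an assignment for one interval and reading performance counters):

```python
import random

def sample_assignments(threads, contexts, measure, n_samples=8):
    """Try a few random one-to-one thread-to-context mappings and keep
    the one with the best total weighted speedup. Assumes
    len(threads) <= len(contexts); pruning which contexts are offered
    is where pref-core / pref-similar style biases would enter."""
    best, best_ws = None, float("-inf")
    for _ in range(n_samples):
        mapping = dict(zip(threads, random.sample(contexts, len(threads))))
        ws = sum(measure(t, c) for t, c in mapping.items())
        if ws > best_ws:
            best, best_ws = mapping, ws
    return best

# Demo with stubbed measurements: thread "t0" benefits most from "EV6+".
perf = {("t0", "EV6+"): 1.9, ("t0", "EV5"): 1.2,
        ("t1", "EV6+"): 1.3, ("t1", "EV5"): 1.1}
print(sample_assignments(["t0", "t1"], ["EV6+", "EV5"],
                         lambda t, c: perf[(t, c)], n_samples=4))
```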
MITIGATING AMDAHL'S LAW

Amdahl's law7 states that the speedup of a parallel application is limited by the fraction of the application that is serial. In modern CMPs, the overall power dissipation is an important limit. Murali Annavaram and coauthors8 point out a useful application for heterogeneous CMPs in power-constrained environments.

During serial portions of execution, the chip's power budget is applied toward using a single large core to allow the serial portion to execute as quickly as possible. During the parallel portions, the chip's power budget is used more efficiently by running the parallel portion on a large number of small, area- and power-efficient cores. Thus, executing serial portions of an application on a fast but relatively inefficient core and executing parallel portions of an algorithm on many small cores can maximize the ratio of performance to power dissipation.

Using a simple prototype built from a discrete four-way multiprocessor, Annavaram and colleagues show a 38 percent wall-clock speedup for a parallel application given a fixed power budget. Single-chip heterogeneous multiprocessors with larger numbers of processors should be able to obtain even larger improvements in speed/power product on parallel applications.
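The arithmetic behind this scheme is a power-budgeted variant of Amdahl's law. The following sketch uses made-up numbers: a budget that can power either one big core with twice the small core's serial performance, or eight small cores.

```python
# Execution time under a fixed power budget (a sketch). The serial
# fraction runs on one core; the parallel fraction runs on n_parallel
# cores of the same per-core performance.
def runtime(serial_frac, serial_perf, parallel_perf, n_parallel):
    serial_t = serial_frac / serial_perf
    parallel_t = (1 - serial_frac) / (parallel_perf * n_parallel)
    return serial_t + parallel_t

f = 0.2  # hypothetical serial fraction
het = runtime(f, serial_perf=2.0, parallel_perf=1.0, n_parallel=8)
hom = runtime(f, serial_perf=1.0, parallel_perf=1.0, n_parallel=8)
print(het, hom)  # 0.2 vs. 0.3: the big core halves the serial bottleneck
```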
WHAT HETEROGENEITY MEANS FOR SOFTWARE

To take full advantage of heterogeneous CMPs, the system software must use the execution characteristics of each application to predict its future processing needs and then schedule it to a core that matches those needs if one is available. The predictions can minimize the performance loss to the system as a whole rather than that of a single application.

Recent work has shown that effective schedulers for heterogeneous architectures can be implemented and integrated with current commercial operating systems.9 An experimental platform running Gentoo Linux with a 2.6.7 kernel modified to support heterogeneity-aware scheduling resulted in a 40 percent power savings for a performance loss of less than 1 percent for memory-bound applications. Less than a 3.5 percent performance degradation was observed even for CPU-intensive applications.

To achieve the best performance, it might be necessary to compile programs for heterogeneous CMPs slightly differently. Compiling single-threaded applications might involve either compiling for the lowest common denominator or compiling for the simplest core. For example, for heterogeneous CMPs in which one core is statically scheduled and one is dynamically scheduled, the compiler should schedule the code for the statically scheduled core because it is more sensitive to the exact instruction order than a dynamically scheduled core is. Having multiple statically scheduled cores with different levels of resources would present a more interesting problem.

Programming or compiling parallel applications might require more awareness of the heterogeneity. Application developers typically assume that computational cores provide equal performance; heterogeneity breaks this assumption. As a result, shared-memory workloads that are compiled assuming symmetric cores might have less predictable performance on heterogeneous CMPs.10 For such workloads, either the operating system kernel or the application must be heterogeneity-aware.

Mechanisms for communicating the complete processor information to software and the design of software to tolerate heterogeneity need more investigation. As future systems include support for greater virtualization, similar issues must be addressed at the virtual machine layer as well. These software changes will likely enhance the already demonstrated advantages of heterogeneous CMPs.
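A sketch of one plausible heterogeneity-aware policy follows (ours, not the modified Linux 2.6.7 scheduler from the cited work): threads that benefit most from the big cores, as estimated from past-interval profiles, get them first, which naturally steers memory-bound threads toward small cores.

```python
def schedule(threads, big_cores, small_cores, ipc_big, ipc_small):
    """ipc_big/ipc_small: hypothetical per-thread IPC observed (or
    sampled) on each core type during earlier intervals."""
    big, small = list(big_cores), list(small_cores)
    # Rank threads by big-core benefit; memory-bound threads rank low.
    ranked = sorted(threads, key=lambda t: ipc_big[t] / ipc_small[t],
                    reverse=True)
    placement = {}
    for t in ranked:
        pool = big if big else small      # big cores go to the top gainers
        placement[t] = pool.pop()
    return placement

print(schedule(["a", "b", "c"], big_cores=["B0"], small_cores=["S0", "S1"],
               ipc_big={"a": 2.0, "b": 1.1, "c": 1.8},
               ipc_small={"a": 1.0, "b": 1.0, "c": 1.0}))
# "a" (biggest gain) gets the lone big core; "b" and "c" get small cores.
```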
FURTHER RESEARCH QUESTIONS

Many areas of future research remain for heterogeneous CMPs. For example, research to date has been performed using off-the-shelf cores or models of existing off-the-shelf cores. If designers are given the flexibility to design asymmetric CMP cores from scratch instead of selecting from predetermined cores, how should they design them to complement each other best in a heterogeneous CMP? How do the benefits of heterogeneity vary with the number of core types as a function of the available die area and the total power budget?

In previous studies, simple cores were a strict subset of the more complex cores.5,6 What benefits are possible if all the resources in the cores are not monotonically increasing? Further, as workloads in enterprise environments evolve toward a model that consolidates multiple services on the same hardware infrastructure,11 heterogeneous architectures offer the potential to match core diversity to the diversity in the varying service-level agreements for the different services. What implications does this have for the choice and design of the individual cores?

These are just some of the open research questions. However, we believe answering these and similar questions could show that heterogeneity is even more advantageous than has already been demonstrated.

Future work should also address the impact of heterogeneity on the cost of design and verification. Processor design and verification costs are already high. Designing and integrating more than one type of core on the die would aggravate this problem. However, the magnitude of the power savings and the throughput advantages that heterogeneous CMPs provide might justify these costs, at least for limited on-chip heterogeneity.

What is the sensitivity of heterogeneous CMP performance to the number of distinct core types? Are two types enough? How do these costs compare with the cost of other contemporary approaches such as different voltage levels? The answers to these questions might ultimately determine both the feasibility and the extent of on-chip diversity for heterogeneous CMPs.
Increasing transistor counts constrained by power limits indicate that many of the current processor directions are inadequate. Monolithic processors consume too much power for the marginal performance they provide. Replicating existing processors results in a linear increase in power but only a sublinear increase in performance. Replicating smaller processors suffers from the same limitations and, in addition, cannot handle high-demand and high-priority applications.

Single-ISA heterogeneous (or asymmetric) multicore architectures address all these concerns, resulting in significant power and performance benefits. The potential benefits from heterogeneity have already been shown to be greater than the potential benefits from the individual techniques of further voltage scaling, clock gating, or speculation control.

Much research remains to be done on the best types and degrees of heterogeneity. However, the advantages of heterogeneous CMPs for both throughput and power have been demonstrated conclusively. We believe that once homogeneous CMPs reach a total of four cores, the benefits of heterogeneity will outweigh the benefits of additional homogeneous cores in many applications.
References
1. K. Olukotun et al., "The Case for a Single-Chip Multiprocessor," Proc. 7th Int'l Conf. Architectural Support for Programming Languages and Operating Systems (ASPLOS VII), ACM Press, 1996, pp. 2-11.
2. L. Barroso et al., "Piranha: A Scalable Architecture Based on Single-Chip Multiprocessing," Proc. 27th Ann. Int'l Symp. Computer Architecture, IEEE CS Press, 2000, pp. 282-293.
3. E. Grochowski et al., "Best of Both Latency and Throughput," Proc. Int'l Conf. Computer Design, IEEE CS Press, 2004, pp. 236-243.
4. D. Pham et al., "The Design and Implementation of a First-Generation Cell Processor," Proc. Int'l Symp. Solid-State Circuits and Systems, IEEE CS Press, 2005, pp. 184-186.
5. R. Kumar et al., "Single-ISA Heterogeneous Multicore Architectures: The Potential for Processor Power Reduction," Proc. Int'l Symp. Microarchitecture, IEEE CS Press, 2003, pp. 81-92.
6. R. Kumar et al., "Single-ISA Heterogeneous Multicore Architectures for Multithreaded Workload Performance," Proc. Int'l Symp. Computer Architecture, IEEE CS Press, 2004, pp. 64-75.
7. G. Amdahl, "Validity of the Single Processor Approach to Achieving Large-Scale Computing Capabilities," Readings in Computer Architecture, M.D. Hill, N.P. Jouppi, and G.S. Sohi, eds., Morgan Kaufmann, 2000, pp. 79-81.
8. M. Annavaram, E. Grochowski, and J. Shen, "Mitigating Amdahl's Law through EPI Throttling," Proc. Int'l Symp. Computer Architecture, IEEE CS Press, 2005, pp. 298-309.
9. S. Ghiasi, T. Keller, and F. Rawson, "Scheduling for Heterogeneous Processors in Server Systems," Proc. Computing Frontiers, ACM Press, 2005, pp. 199-210.
10. S. Balakrishnan et al., "The Impact of Performance Asymmetry in Emerging Multicore Architectures," Proc. Int'l Symp. Computer Architecture, IEEE CS Press, 2005, pp. 506-517.
11. P. Ranganathan and N.P. Jouppi, "Enterprise IT Trends and Implications on System Architecture Research," Proc. Int'l Conf. High-Performance Computer Architecture, IEEE CS Press, 2005, pp. 253-256.
Rakesh Kumar is a PhD student in the Department of Computer Science and Engineering at the University of California, San Diego. His research interests include multicore and multithreaded architectures, low-power architectures, and on-chip interconnects. Kumar received a BS in computer science and engineering from the Indian Institute of Technology, Kharagpur. He is a member of the ACM. Contact him at rakumar@cs.ucsd.edu.

Dean M. Tullsen is an associate professor in the Department of Computer Science and Engineering at the University of California, San Diego. His research interests include instruction- and thread-level parallelism and multithreaded and multicore architectures. Tullsen received a PhD in computer science and engineering from the University of Washington. He is a member of the IEEE and the ACM. Contact him at tullsen@cs.ucsd.edu.

Norman P. Jouppi is a Fellow at HP Labs in Palo Alto, Calif. His research interests include multicore architectures, memory systems, and cluster interconnects. Jouppi received a PhD in electrical engineering from Stanford University. He is a Fellow of the IEEE and currently serves as chair of ACM SIGARCH. Contact him at norm.jouppi@hp.com.

Parthasarathy Ranganathan is a principal research scientist at HP Labs in Palo Alto, Calif. His research interests include power-efficient design, computer architectures, and system evaluation. Ranganathan received a PhD in electrical and computer engineering from Rice University. He is a member of the IEEE and the ACM. Contact him at partha.ranganathan@hp.com.