`
`
`PATENT OWNER DIRECTSTREAM, LLC
`EX. 2125, p. 1
`
`
`
`Dynamically Heterogeneous Cores Through 3D Resource Pooling
`
Houman Homayoun   Vasileios Kontorinis   Amirali Shayan   Ta-Wei Lin   Dean M. Tullsen

University of California San Diego
`
Abstract
This paper describes an architecture for dynamically
heterogeneous processors, leveraging 3D stacking technology.
Unlike prior work in the 2D plane, the extra dimension makes
it possible to share resources at a fine granularity between
vertically stacked cores. As a result, each core can grow or
shrink resources, as needed by the code running on the core.
This architecture, therefore, enables runtime customization
of cores at a fine granularity and enables efficient execution
at both high and low levels of thread parallelism.
This architecture achieves performance gains from 9-41%,
depending on the number of executing threads, and gains
significant advantage in energy efficiency of up to 43%.
`
`1. Introduction
Prior research [17, 19] has shown that heterogeneous
multicore architectures provide significant advantages in
enabling energy-efficient or area-efficient computing. They
allow each thread to run on a core that matches its resource
needs more closely than a single one-size-fits-all core.
However, that approach still constrains the ability to
optimally map executing threads to cores because it relies
on static heterogeneity, fixed at design time.
Other research attempts to provide dynamic heterogeneity,
but each effort faces a fundamental problem: either the
pipeline is tightly constructed and the resources we might
want to share are too far away to be effectively shared, or the
shared resources are clustered and the pipeline is inefficient.
As a result, most provide resource sharing or aggregation at
a very coarse granularity. Core Fusion [13] and TFlex [16]
allow architects to double or quadruple the size of cores, for
example, but do not allow a core to borrow renaming registers
from another core if that is all that is needed to accelerate
execution. Thus, the heterogeneity is constrained to narrow
cores or wide cores, and does not allow customization
to the specific needs of the running thread. The WiDGET
architecture [36] can only share execution units, and thus
enables only modest pipeline inflation. The conjoined core
architecture [18] shares resources between adjacent cores,
but sharing is limited by the topology of the core design to
only those structures around the periphery of the pipeline.
`This work demonstrates that 3D stacked processor archi-
`tectures eliminate the fundamental barrier to dynamic het-
`erogeneity. Because of the extra design dimension, we can
`design a tight, optimized pipeline, yet still cluster, or pool,
`resources we might like to share between multiple cores.
`3D die stacking makes it possible to create chip mul-
`tiprocessors using multiple layers of active silicon bonded
`with low-latency, high-bandwidth, and very dense vertical
`interconnects. 3D die stacking technology provides very
`fast communication, as low as a few picoseconds [21], be-
`tween processing elements residing on different layers of
`the chip. Tightly integrating dies in the third dimension has
already been shown to have several advantages. First, it
enables the integration of heterogeneous components such as
logic and DRAM memory [21], or analog and digital circuits
[21], fabricated in different technologies (for instance,
integration of a 65nm and a 130nm design). Second, it
increases routability [28]. Third, it substantially reduces
wire length, which translates to lower communication latency
and reduced power consumption [21, 23, 28].
The dynamically heterogeneous 3D processors we propose
in this paper provide several key benefits. First, they
enable software to run on hardware optimized for the
execution characteristics of the running code, even for
software the original processor designers did not envision.
Second, they enable us to design the processor with compact,
lightweight cores without significantly sacrificing
general-purpose performance. Modern cores are typically
highly over-provisioned [18] to guarantee good
general-purpose performance; if we have the ability to
borrow the specific resources a thread needs, the basic core
need not be over-provisioned in any dimension. Third, the
processor provides true general-purpose performance, not
only adapting to the needs of a variety of applications, but
also to both high thread-level parallelism (enabling many
area-efficient cores) and low thread-level parallelism
(enabling one or a few heavyweight cores).
`With a 3D architecture, we can dynamically pool re-
`sources that are potential performance bottlenecks for pos-
`sible sharing with neighboring cores. The StageNet archi-
`tecture [7] attempts to pool pipeline stage resources for re-
`
`978-1-4673-0826-7/12/$26.00 ©2011 IEEE
`
`
`
`
liability advantages. In that case, the limits of 2D layout
mean that by pooling resources, the pipeline must be laid
out inefficiently, resulting in very large increases in
pipeline depth. Even Core Fusion experiences significant
increases in pipeline depth due to communication delays in
the front of the pipeline. With 3D integration, we can
design the pipeline traditionally in the 2D plane, yet have
poolable resources (registers, instruction queue, reorder
buffer, cache space, load and store queues, etc.) connected
along the third dimension on other layers. In this way, one
core can borrow resources from another core or cores,
possibly also giving up non-bottleneck resources the other
cores need. This paper focuses on the sharing of instruction
window resources.
This architecture raises a number of performance, energy,
thermal, design, and resource allocation issues. This paper
represents a first attempt to understand the various options
and trade-offs.
This paper is organized as follows. Section 2 describes
our 3D architecture assumptions, both for the baseline
multicore and our dynamically heterogeneous architecture.
Section 3 shows that both medium-end and high-end cores
have applications that benefit from increased resources,
motivating the architecture. Section 4 details the specific
circuits that enable resource pooling. Section 5 describes
our runtime hardware reallocation policies. Section 6
describes our experimental methodology, including our 3D
models. Section 7 gives our performance, fairness,
temperature, and energy results. Section 8 describes
related work.
`2. Baseline Architecture
In this section, we discuss the baseline chip multiprocessor
architecture and derive a reasonable floorplan for the 3D
CMP. This floorplan is the basis for our
power/temperature/area and performance modeling of various
on-chip structures and the processor as a whole.
3D technology, and its implications for processor
architecture, is still in the early stages of development. A
number of design approaches are possible and many have been
proposed, from alternating cores and memory/cache [20, 23],
to folding a single pipeline across layers [27].
In this research, we provide a new alternative in the 3D
design space. A principal advantage of the dynamically
heterogeneous 3D architecture is that it does not change the
fundamental pipeline design of 2D architectures, yet still
exploits 3D technology to provide greater energy
proportionality and core customization. In fact, the same
single design could be used in 1-, 2-, and 4-layer
configurations, for example, providing different total core
counts and different levels of customization and resource
pooling. We compare against a commonly proposed approach
which preserves the 2D pipeline design, but where additional
layers provide more extensive cache and memory.
`
Figure 1. CMP configurations: (a) baseline and (b) resource
pooling.
`
`2.1. Processor Model
We study the impact of resource pooling in a quad-core
CMP architecture. This does not reflect the limit of cores
we expect on future multicore architectures, but a
reasonable limit on 3D integration. For example, a design
with eight cores per layer and four layers of cores would
provide 32 cores, but only clusters of four cores would be
tightly integrated vertically. Our focus is only on the
tightly integrated vertical cores.
For the choice of core we study two types of architecture:
a high-end architecture, which is an aggressive superscalar
core with issue width of 4, and a medium-end architecture,
which is an out-of-order core with issue width of 2. For
the high-end architecture we model a core similar to the
Alpha 21264 (similar in functionality to the Intel Nehalem
core, but we have more data available for validation on the
21264). For the medium-end architecture we configure core
resources similar to the IBM PowerPC-750 FX processor [12].
`
`2.2. 3D Floorplans
The high-level floorplan of our 3D quad-core CMP is
shown in Figure 1. For our high-end processor we assume
the same floorplan and same area as the Alpha 21264 [15],
but scaled down to 45nm technology. For the medium-end
architecture we scale down the Alpha 21264 floorplan
(in 45nm) based on smaller components in many dimensions,
with area scaling models similar to those described by
Burns and Gaudiot [3].
Moving from 2D to 3D increases power density due to the
proximity of the active layers. As a result, temperature is
always a concern for 3D designs. Temperature-aware
floorplanning has been an active topic of research in the
literature. A number of temperature-aware 3D CMP floorplans
have been proposed [5, 8, 26]. Early work in 3D
architectures assumed that the best designs sought to
alternate hot active logic layers with cooler cache/memory
layers.
`
`
`
`
More recent work contradicts that assumption: it is more
important to put the active logic layers as close as
possible to the heat sink [39]. Therefore, an architecture
that clusters active processor core layers tightly is
consistent with this approach. Other research has also
exploited this principle. Loh et al. [21] and Intel [1] have
shown how stacking logic on logic in a 3D integration could
improve the area footprint of the chip, while minimizing the
clock network delay and eliminating many pipeline stages.
For the rest of this work we focus on the two types of
floorplan shown in Figure 1(a) and Figure 1(b). Both
preserve the traditional 2D pipeline, but each provides a
different performance, flexibility, and temperature
tradeoff.
The thermal-aware architecture in Figure 1(a) keeps the
pipeline logic closest to the heat sink and does not stack
pipeline logic on top of pipeline logic. Conversely, the 3D
dynamically heterogeneous configuration in Figure 1(b)
stacks pipeline logic on top of pipeline logic, as in other
performance-aware designs, gaining increased processor
flexibility through resource pooling. Notice that this
comparison puts our architecture in the worst possible
light; for example, a many-core architecture that already
had multiple layers of cores would have very similar thermal
characteristics to our architecture, without the benefits of
pooling. By comparing with a single layer of cores, the
baseline has the dual advantages of not having logic on top
of logic and of putting all cores next to the heat sink.
`3. Resource Pooling in the Third Dimension
Dynamically scheduled processors provide various buffering
structures that allow instructions to bypass older
instructions stalled due to operand dependences. These
include the instruction queue, reorder buffer, load-store
queue, and renaming registers. Collectively, these resources
define the instruction scheduling window. Larger windows
allow the processor to search more aggressively for
instruction-level parallelism.
The focus of this work, then, is on resource adaptation in
four major delay- and performance-critical units: the
reorder buffer, register file, load/store queue, and
instruction queue. By pooling just these resources, we
create an architecture where an application's scheduling
window can grow to meet its runtime demands, potentially
benefiting from other applications that do not need large
windows.
While there are a variety of resources that could be pooled
and traded between cores (including execution units, cache
banks, etc.), we focus in this initial study of dynamically
heterogeneous 3D architectures on specific circuit
techniques that enable us to pool these structures, and to
dynamically grow and shrink the allocation to specific
cores.
In this section, we study the impact on performance of
increasing the size of selected resources in a 3D design. We
assume 4 cores are stacked on top of each other. The maximum
gains will be achieved when one, two, or three cores in our
4-core CMP are idle, freeing all of their poolable resources
for possible use by running cores. The one-thread case
represents a limit study for how much can be gained by
pooling, but also represents a very important scenario: the
ability to automatically configure a more powerful core when
thread-level parallelism is low. This is not an unrealistic
case for this architecture. In a 2D architecture, the cost
of quadrupling, say, the register file is high, lengthening
wires significantly and moving other key function blocks
further away from each other. In this architecture, we are
exploiting resources that are already there, the additional
wire lengths are much smaller than in the 2D case, and we do
not perturb the 2D pipeline layout.
We examine two baseline architectures (details given in
Section 6): a 4-issue high-end core and a 2-issue medium-end
core. In Figure 2 we report the speedup for each of these
core types when selected resources are doubled, tripled, and
quadrupled (when 1, 2, and 3 cores are idle). Across most of
the benchmarks a noticeable performance gain is observed
with pooling. Omnetpp shows the largest performance benefit
on medium-end cores; the largest gains on high-end cores are
observed for swim and libquantum.
Performance gains are seen with increased resources, but the
marginal gains do drop off with larger structures. Further
experiments (not shown) indicate that pooling beyond four
cores provides little gain. The more scheduling resources we
provide, the more likely it is that some other resource
(e.g., the functional units, issue rate, cache) that we are
not increasing becomes the bottleneck. In fact, this is true
for some benchmarks right away, such as mcf and perlbench,
where no significant gains are achieved, implying some other
bottleneck (e.g., memory latency) restricts throughput. On
average, 13 to 26% performance improvement can be achieved
for the medium-end processor, and 21 to 45% for the high
end, by increasing selected window resources. Most
importantly, the effect of increased window size varies
dramatically by application. This motivates resource
pooling, where we can hope to achieve high overall speedup
by allocating window resources where they are most
beneficial.
4. Stackable Structures for Resource Pooling
This section describes the circuit and architectural
modifications required to allow resources on vertically
adjacent cores to participate in pooling. Specifically, we
describe the changes required in each of the pipeline
components.

4.1. Reorder Buffer and Register File
The reorder buffer (ROB) and the physical register file (RF)
are multi-ported structures typically designed as SRAM, with
the number of ports scaling with the issue width of the
core. Our goal is to share them across multiple cores with
minimal impact on access latency, the number of
`
`
`
`Figure 2. Speedup from increasing resource size in the 3D stacked CMP with (a) medium-end and (b) high-end cores.
`
ports, and the overall design. We take advantage of a
modular ROB (and register file) design proposed in [25],
which is shown to be effective in reducing the power and
complexity of a multi-ported 2D SRAM structure. Our baseline
multi-ported ROB/RF is implemented as a number of
independent partitions. Each partition is a self-standing
and independently usable unit, with a precharge unit, sense
amps, and input/output drivers. Partitions are combined
together to implement a larger ROB/RF, as shown in Figure
3(a). The connections running across the entries within a
partition (such as the bit-lines) are connected to a common
through line using bypass switches.
To add a partition to the ROB/RF, the bypass switch for that
partition is turned on. Similarly, the partition can be
deallocated by turning off the corresponding bypass switch.
The modular baseline architecture of our register file
allows individual partitions to participate in resource
pooling. To avoid increasing the number of read and write
ports of individual partitions of the ROB/RF, we simply
assume that an entire partition is always exclusively owned
by one core, either the core (layer) it belongs to (the host
core) or another core (a guest core). This significantly
simplifies the design, but restricts the granularity of
sharing.
Note that before a partition participates in resource
pooling (or before it is re-assigned) we need to make sure
that all of its entries are empty. This can be facilitated
by using an additional bit in each row (entry) of the
partition to indicate whether it is full or empty; in most
cases, that bit will already exist.
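The empty-bit check and exclusive-ownership rule above can be summarized in a few lines. The sketch below is purely illustrative; the class and names are ours, not part of the hardware design:

```python
# Behavioral sketch of partition ownership: a ROB/RF partition is
# exclusively owned by one core (host or guest), and may only be
# reassigned once every one of its entries is empty.
# All names here are illustrative, not from the paper.

class Partition:
    def __init__(self, host_core, num_entries):
        self.host_core = host_core          # layer the partition physically belongs to
        self.owner = host_core              # current exclusive owner (host or a guest)
        self.full = [False] * num_entries   # per-entry full/empty bit

    def is_empty(self):
        # A partition can participate in pooling only when all entries are free.
        return not any(self.full)

    def reassign(self, new_owner):
        # Reassignment is refused while any entry is live.
        if not self.is_empty():
            raise RuntimeError("cannot reassign a partition with live entries")
        self.owner = new_owner
```

For example, `Partition(host_core=0, num_entries=8).reassign(2)` hands an empty partition on layer 0 to core 2, while a partition holding live entries keeps its current owner.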
Figure 3(b) shows a logical view of two stacked register
files participating in resource pooling (only one partition
of the RF from each layer is shown in the figure). The
additional multiplexers and decoder shown in Figure 3(b) are
used to route the address and data from/to a partition in
one layer from/to another partition in a different layer.
The decoder shown in the figure enables stacking of the
ROB/RF. To be able to pool up to 4 ROB/RF partitions on four
different layers together, we need a 4-1 decoder and a 4-1
multiplexer. The register operand tag is also extended with
2 additional bits. The overall delay added to the ROB or RF
due to the additional multiplexing and decoding is fairly
small. For the case of stacking four cores, where a 4-input
decoder/multiplexer is needed, the additional delay is found
to be below 20 ps (using SPICE simulation and assuming a
standard-cell 4-input multiplexer). In this design, the
access latency of the original register file is only 280 ps
(using CACTI for an 8 read-port, 4 write-port, 64-entry
register file). The additional 20 ps delay due to the added
decoder/multiplexer and the TSVs (5 ps at most) still keeps
the overall delay below one processor cycle. Thus, the
frequency is not impacted. For the ROB, the baseline delay
is 230 ps and the additional delay can still be tolerated,
given our baseline architectural assumptions.
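A quick arithmetic check of this timing budget, using the delays quoted above and the 2 GHz clock from Table 1 (so a 500 ps cycle), can be written as:

```python
# Timing-budget check for pooled ROB/RF access, using the delays the
# paper reports: the pooled access must still fit in one cycle at 2 GHz.

CYCLE_PS = 1e12 / 2e9        # 2 GHz clock -> 500 ps per cycle

RF_ACCESS_PS  = 280          # CACTI: 8 read-port, 4 write-port, 64-entry RF
ROB_ACCESS_PS = 230          # baseline ROB access delay
MUX_DEC_PS    = 20           # extra 4-input decoder/multiplexer (SPICE)
TSV_PS        = 5            # worst-case tier-to-tier via delay

def pooled_delay(base_ps):
    # Pooled access = structure access + mux/decode + TSV traversal.
    return base_ps + MUX_DEC_PS + TSV_PS

assert pooled_delay(RF_ACCESS_PS) < CYCLE_PS   # 305 ps < 500 ps
assert pooled_delay(ROB_ACCESS_PS) < CYCLE_PS  # 255 ps < 500 ps
```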
`
Due to the circular FIFO nature of the ROB, an additional
design consideration is required to implement resource
sharing, which is not needed for the register file. The ROB
can be logically viewed as a circular FIFO with head and
tail pointers. The tail pointer points to the first free
entry of the ROB, where newly dispatched instructions are
allocated. Instructions are committed from the head
`
`
`
`
`Figure 3. (a) Partitioned ROB and RF design, (b) logical view of two stacked RF(ROB) partitions.
pointer. Resource sharing requires dynamically adjusting the
size of the reorder buffer. To implement such dynamic
resizing we use the technique proposed in [25], where two
additional pointers are added to the ROB to dynamically
adjust its size.

4.2. Instruction Queue and Ld/St Queue
Both the Instruction Queue (IQ) and the Load/Store Queue
(LSQ) are CAM+SRAM structures which hold instructions until
they can be issued. The main complexity of the IQ and LSQ
stems from the associative search during the wakeup process
[24]. Due to large power dissipation and large operation
delay, the size of these units does not scale well in a 2D
design. The number of instruction queue and LSQ entries has
not changed significantly in recent generations of 2D
processors.
Figure 4(a) shows a conventional implementation of the
instruction queue. The taglines run across the queue, and
every cycle the matchline compares the tagline value
broadcast by the functional units with the instruction queue
entry (source operand). We assume our baseline IQ utilizes
the well-studied divided tagline (bitline) technique [14].
As shown in Figure 4(b), two or more IQ entries are combined
together to form a partition and to divide the global
tagline into several sub-taglines. This way the IQ is
divided into multiple partitions. In the non-divided tagline
structure the tagline capacitance is N * (diffusion
capacitance of pass transistors) + wire capacitance (usually
10 to 20% of total diffusion capacitance), where N is the
total number of rows. In the divided tagline scheme the
equivalent tagline capacitance is greatly reduced and is
approximated as M * diffusion capacitance + 2 * wire
capacitance, where M is the number of tagline segments. As
tagline dynamic power dissipation is proportional to CV²,
reducing the effective capacitance will linearly reduce
tagline dynamic power. The overhead of this technique is
adding a set of pass transistors per sub-tagline. As a side
effect, a large number of segments increases the area and
power overhead [14].
To be able to share two or more partitions of the
instruction queue, we include one multiplexer per tagline
and per IQ partition to select between the local tagline and
the global taglines (shown in Figure 4(c)). Similarly to the
RF, to avoid increasing the number of taglines we simply
assume that each partition is always allocated exclusively
to a single core. This way the number of taglines remains
the same, and multiplexing, as shown in Figure 4(c), routes
the data on the tagline to the right partition. For the SRAM
payload of the instruction queue we simply follow the same
modification proposed for our SRAM register file. Bitline
segmentation helps to reduce the number of die-to-die vias
required for communication between two layers.
We also need to modify the instruction selection logic.
Increasing the maximum size of the instruction queue
increases the complexity of the selection logic [24]. In a
typical superscalar processor each instruction queue entry
has a set of bid and grant ports to communicate with the
selection logic. Increasing the size of the IQ increases the
number of input ports of the selection logic, which can
negatively impact the clock frequency. To avoid increasing
the complexity of the selection logic, we simply allow all
partitions participating in resource pooling to share the
same selection logic port along with the partition that
belongs to the guest core (layer). In this case, we OR the
bid signals (from the shared partition and the guest core
partition) to the selection logic. Priority is given to the
older entry (age-based priority decoding).
The overall delay overhead in the selection logic is decided
by the ORing operation and the age-based priority decoding.
Note that the ORing of the bid signals only slightly
increases the selection logic delay, by less than 20 ps
(using SPICE simulation). This delay does not increase the
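The divided-tagline capacitance estimate given earlier in this section can be checked with a short calculation. The formulas follow the text; the capacitance values themselves are illustrative placeholders, not measured numbers:

```python
# Tagline capacitance under the divided-tagline technique [14], as
# described in the text. c_diff is the pass-transistor diffusion
# capacitance per row; all numeric values below are illustrative.

def tagline_cap_undivided(n_rows, c_diff, c_wire):
    # C = N * diffusion capacitance + wire capacitance
    return n_rows * c_diff + c_wire

def tagline_cap_divided(n_segments, c_diff, c_wire):
    # C ~= M * diffusion capacitance + 2 * wire capacitance
    return n_segments * c_diff + 2 * c_wire

def dynamic_power(cap, vdd, freq, activity=1.0):
    # Tagline dynamic power is proportional to C * V^2 (times frequency).
    return activity * cap * vdd ** 2 * freq

# Example: 32-row queue, wire cap ~15% of the total diffusion cap.
c_diff = 1e-15                                    # 1 fF per row (illustrative)
c_wire = 0.15 * 32 * c_diff
c_full = tagline_cap_undivided(32, c_diff, c_wire)
c_seg  = tagline_cap_divided(4, c_diff, c_wire)   # 4 segments
assert c_seg < c_full    # the divided scheme lowers effective capacitance
```

Since power scales linearly with capacitance at fixed voltage and frequency, the same ratio carries directly over to tagline dynamic power.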
`
`
`
`
`Baseline
`
` tagline
`
`
`
`XORXOR
`
`
`
`XORXOR
`
`
`
`XORXOR
`
` tagline
`
`Sub-
`tagline
`
`Layer 1
`(flipped)
`
`
`
`XORXOR
`
`
`
`XORXOR
`
`XOR
`
`XOR
`
`Layer 1
`(flipped)
`
`XOR
`
`XOR
`
`XOR
`
`XOR
`
`tagline
`Layer 1
`
`Layer 0
`
`XOR
`
`XOR
`
`XOR
`
`XOR
`
`tagline
`Layer 2
`
`Local
`tagline
`
`global
`tagline
`
`MUX
`
`MUX
`
`MUX
`
`(d)
`(c)
`(b)
`(a)
`(c)
`(b)
`(a)
`Figure 4. (a) Conventional implementation of the IQ, (b) partitioned IQ using divided tagline, (c) implementation of the stacked
`IQ, (d) logical view of the stacked instruction queue.
`selection logic access delay beyond a single clock period.
`For the age-based priority decoding we propose the follow-
`ing to hide its delay: we perform the age-priority compu-
`tation in parallel with the selection logic (to overlap their
`delays). When the grant signal comes back, we use the now
`pre-computed age information to decide where to route the
`grant.
`Under the given assumptions, this analysis indicates we
`can add the pooling logic without impacting cycle time;
`however, it is possible that under different assumptions, on
`different designs, these overheads could be exposed. We
`will examine the potential impact in the results section.
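The shared-port bidding and age-based grant routing just described can be sketched behaviorally. The function names are hypothetical; this is a simplified model of the logic, not the circuit:

```python
# Behavioral sketch of the shared selection-logic port: bids from the
# pooled partition and the guest core's own partition are ORed into one
# port, and the grant is routed to the older requester using the
# age-priority information precomputed in parallel with selection.
# Names and interfaces are illustrative, not from the paper.

def shared_port_bid(local_bid, pooled_bid):
    # The two partitions share a single bid port via an OR.
    return local_bid or pooled_bid

def route_grant(grant, local_bid, pooled_bid, local_age, pooled_age):
    """Return which partition receives the grant ('local', 'pooled', or
    None). A lower age value means an older instruction, which wins."""
    if not grant:
        return None
    if local_bid and pooled_bid:
        return "local" if local_age <= pooled_age else "pooled"
    if local_bid:
        return "local"
    if pooled_bid:
        return "pooled"
    return None
```

Because the ages are compared while the selection logic is still working, routing the returned grant adds only the final decode step, matching the delay-hiding argument above.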
`5. Adaptive Mechanism for Resource Pooling
In addition to the circuit modifications necessary to allow
resource aggregation across dies, we also need mechanisms
and policies to control the pooling or sharing of resources.
In devising policies to manage the many new shared resources
in this architecture, we would like to maximize flexibility;
however, design considerations limit the granularity (both
in time and space) at which we can partition core resources.
Time is actually the easier issue. Because the aggregated
structures are quite compact (in total 3D distance), we can
reallocate partitions between cores very quickly, within a
cycle or a few cycles. To reduce circuit complexity, we
expect to physically repartition on a coarser-grain boundary
(e.g., four or eight entries rather than single entries).
In the results section, we experiment with a variety of size
granularities for reallocation of pooled resources. Large
partitions both restrict the flexibility of pooling and tend
to lengthen the latency to free resources. We also vary how
aggressively the system is allowed to reallocate resources;
specifically, we explore various static settings for the
minimum (MIN) and maximum (MAX) value for the size of a
partition, which determine the floor and the ceiling for
core resource allocation.
Our baseline allocation strategy exploits two principles.
First, we need to be able to allocate resources quickly.
Thus, we cannot reassign active partitions, which could take
hundreds of cycles or more to clear active state. Instead,
we actively harvest empty partitions into a free list, from
which they can later be assigned quickly. Second, because we
can allocate resources quickly, we need not wait to harvest
empty partitions; we grab them immediately. This works
because even if the same core needs the resource again right
away, it can typically get it back in a few cycles.
We assume a central arbitration point for the (free) pooled
resources. A thread will request additional partitions when
a resource is full. If partitions are available (on the free
list), and the thread is not yet at its MAX value, they are
allocated upon request. As soon as a partition is found to
be empty it is returned to the free list (unless the size of
the resource is at MIN). The architecture could adjust MIN
and MAX at intervals depending on the behavior of a thread,
but this will be the focus of future work; for now we find
static values of MIN and MAX to perform well. If two cores
request resources in the same cycle, we use a simple
round-robin priority scheme to arbitrate.
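The allocation policy above (immediate harvesting to a central free list, static MIN/MAX bounds, round-robin arbitration) can be sketched as a small behavioral model. The class and its interface are our invention for illustration, not the hardware design:

```python
# Behavioral sketch of the central arbiter: empty partitions are
# harvested to a free list immediately, cores request a partition when
# a resource fills, and simultaneous requests are served round-robin.
# MIN/MAX are static per-core bounds, as in the text. Illustrative only.

class PoolArbiter:
    def __init__(self, num_cores, partitions_per_core, min_parts, max_parts):
        self.alloc = [partitions_per_core] * num_cores  # partitions held per core
        self.free_list = 0                              # harvested free partitions
        self.min, self.max = min_parts, max_parts
        self.rr = 0                                     # round-robin pointer

    def harvest(self, core):
        # An empty partition is returned immediately, unless the core
        # is already at its MIN allocation.
        if self.alloc[core] > self.min:
            self.alloc[core] -= 1
            self.free_list += 1

    def request(self, requesting_cores):
        # One cycle of arbitration: grant free partitions to requesters
        # below MAX, scanning round-robin from the current pointer.
        granted = []
        n = len(self.alloc)
        for i in range(n):
            core = (self.rr + i) % n
            if (core in requesting_cores and self.free_list > 0
                    and self.alloc[core] < self.max):
                self.alloc[core] += 1
                self.free_list -= 1
                granted.append(core)
        self.rr = (self.rr + 1) % n
        return granted
```

For example, with `PoolArbiter(4, 4, min_parts=2, max_parts=16)`, harvesting two partitions from an idle core 1 lets a subsequent `request([0, 2])` grant one partition each to cores 0 and 2 in the same cycle.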
`6. Methodology
In order to evaluate different resource adaptation policies,
we add support for dynamic adaptation to the SMTSIM
simulator [34], configured for multicore simulation. Our
power models use a methodology similar to [2]. We capture
the energy per access and leakage power dissipation for
individual SRAM units using CACTI-5.1 [33], targeting
`
`
`
`
`45nm technology. The energy and power consumption for
`each unit is computed by multiplying access counts by the
`per-access SRAM energy. For temperature calculation we
`use Hotspot 5.0 [29].
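The per-unit energy accounting just described reduces to one expression per unit. A minimal sketch, with placeholder numbers rather than the paper's CACTI values:

```python
# Per-unit energy model from the methodology above: dynamic energy is
# the access count times the per-access SRAM energy (from CACTI), plus
# leakage power integrated over the run. Numbers below are placeholders.

def unit_energy_j(accesses, energy_per_access_j, leakage_w, runtime_s):
    return accesses * energy_per_access_j + leakage_w * runtime_s

# Example: 1e9 accesses at 10 pJ each, 50 mW leakage over a 0.5 s run.
total_j = unit_energy_j(1e9, 10e-12, 50e-3, 0.5)
```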
Table 1 gives the characteristics of our baseline core
architectures. Note that for each of the register files, 32
registers are assumed to be unavailable for pooling, as they
are needed for the storage of architectural registers.
`
6.1. Modeling 3D Stacked Interconnect for Resource Pooling
We model the Tier-to-Tier (T2T) connection with Through
Silicon Vias (TSVs). TSVs enable low-latency,
high-bandwidth, and very dense vertical interconnect among
the pooled blocks across multiple layers of active silicon.
We assume four dies are stacked on top of each other. Each
tier has an Alpha processor (high-end core case) with a die
size of 6.4mm × 6.4mm, with 12 layers of metal from M1 to
M12 and the redistribution layer (RDL). The 3D stacked chip
model is flip-chip technology and the tiers are connected
face-to-back. In the face-to-back connection, the RDL of
Tier 1 (T1) is connected to the package via flip-chip bumps,
and the RDL of Tier 2 (T2) is connected to the M1 of T1 via
TSV, forming the T2T connection.
`Each core is placed in a single tier of the stack. TSVs
`connect the Register File (RF), Instruction Queue (IQ), Re-
`order Buffer (ROB), and Load and Store Queue (LSQ) of
`each layer vertically. The connection from bottom tier M1
`to M12 and RDL layer of the top tier is via TSV, and from
`M12 and RDL is with resistive via and local routing to the
`M1 of the sink in RF, IQ, ROB and LSQ.
The resistive through-metal via connects metal layers within
the tiers, e.g., M1 to M2 in each tier. The vertical and
horizontal parasitics of the metals, vias, and TSV
connections have been extracted to build the interconnect
model. A T2T connection includes a through silicon via and a
µbump. The parasitics of the µbumps are small compared with
the TSV [10]. Hence, we only model the parasitics of the
TSVs for T2T connections. The length, diameter, and
dielectric liner thickness of the TSV used for the T2T
connection in our model are, respectively, 50µm, 5µm, and
0.12µm. A TSV is modeled as an RLC element with R and L in
series and C connected to the substrate, i.e., global ground
in our model. The parasitic resistance, inductance, and
capacitance of the T2T connections are modeled as
R_TSV = 47mΩ, L_TSV = 34pH, and C_TSV = 88fF [11].
The power and signal TSVs connect the power/ground mesh from
the package flip-chip bumps to each layer. The TSV pitch for
the tier-to-tier connection is assumed to be uniformly
distributed with a density of 80/mm² [11]. We assume the TSV
structures are via-last, where the TSV is on top of the back
end of the line (BEOL), i.e., the RDL layer and the M1.
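Plugging the quoted parasitics into a first-order RC estimate shows why the TSV itself contributes only negligibly to the path delays reported in Table 2 (a rough check that ignores inductance and driver resistance):

```python
# First-order check on the TSV parasitics quoted above. The intrinsic
# RC time constant of the TSV is in the femtosecond range, so driver
# and local routing, not the via, dominate the ~1-2.5 ps T2T paths.

R_TSV = 47e-3    # ohms
L_TSV = 34e-12   # henries (unused in this first-order RC estimate)
C_TSV = 88e-15   # farads

tau_rc_s = R_TSV * C_TSV     # first-order RC time constant (~4.1 fs)
assert tau_rc_s < 1e-14      # negligible against a 500 ps cycle
```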
`
Parameter                Medium-End Core        High-End Core
Cores                    4                      4
Issue, Commit width      2                      4
INT instruction queue    16 entries             32 entries
FP instruction queue     16 entries             32 entries
Reorder Buffer entries   32 entries             64 entries
INT registers            48                     64
FP registers             48                     64
Functional units         2 int/ldst, 1 fp       4 int/ldst, 2 fp
L1 cache                 16KB, 4-way, 2 cyc     32KB, 4-way, 2 cyc
L2 cache (priv)          256KB, 4-way, 10 cyc   512KB, 4-way, 15 cyc
L3 cache (shared)        4MB, 4-way, 20 cyc     8MB, 8-way, 30 cyc
L3 miss penalty          250 cyc                250 cyc
Frequency                2GHz                   2GHz
Vdd                      1.0V                   1.0V

Table 1. Architectural specification.
`
Tier to Tier Path    Delay (ps)
T1 to T2             1.26
T1 to T3             2.11
T1 to T4             2.53
T2 to T3             1.31
T2 to T4             2.19
T3 to T4             1.35

Table 2. Tier to tier delay via TSV path.
`
In our circuit model we extract the delay path from each
SRAM pin (an SRAM pin is a signal bump on top of the RDL
layer) to the multiplexer of the next SRAM pin. The delay
for each tier-to-tier path is around 1-2.5 ps, as
illustrated in Table 2 for tiers 1 to 4.
The TSV lands on the µbump and landing pad. Surrounding the
TSV and landing pad there is a keep-out area where no block
(i.e., standard cell) is allowed to be placed and routed. We
estimate the total TSVs required for connecting memory pins
of the RF, IQ, ROB, and LSQ vertically for different
stack-up numbers in both medium-end and high-end cores. The
total area for the TSV landing pad and the block-out area is
calculated an