Dynamically Heterogeneous Cores Through 3D Resource Pooling

Houman Homayoun, Vasileios Kontorinis, Amirali Shayan, Ta-Wei Lin, Dean M. Tullsen
University of California San Diego
Abstract

This paper describes an architecture for a dynamically heterogeneous processor leveraging 3D stacking technology. Unlike prior work in the 2D plane, the extra dimension makes it possible to share resources at a fine granularity between vertically stacked cores. As a result, each core can grow or shrink resources as needed by the code running on it.

This architecture therefore enables runtime customization of cores at a fine granularity and enables efficient execution at both high and low levels of thread parallelism. It achieves performance gains of 9-41%, depending on the number of executing threads, and energy-efficiency gains of up to 43%.
1. Introduction

Prior research [17, 19] has shown that heterogeneous multicore architectures provide significant advantages in enabling energy-efficient or area-efficient computing. They allow each thread to run on a core that matches its resource needs more closely than a single one-size-fits-all core. However, that approach still constrains the ability to optimally map executing threads to cores because it relies on static heterogeneity, fixed at design time.

Other research attempts to provide dynamic heterogeneity, but each attempt faces a fundamental problem: either the pipeline is tightly constructed and the resources we might want to share are too far away to be effectively shared, or the shared resources are clustered and the pipeline is inefficient. As a result, most provide resource sharing or aggregation at a very coarse granularity – Core Fusion [13] and TFlex [16] allow architects to double or quadruple the size of cores, for example, but do not allow a core to borrow renaming registers from another core if that is all that is needed to accelerate execution. Thus, the heterogeneity is constrained to narrow cores or wide cores, and does not allow customization to the specific needs of the running thread. The WiDGET architecture [36] can only share execution units, and thus enables only modest pipeline inflation. The conjoined core architecture [18] shares resources between adjacent cores, but sharing is limited by the topology of the core design to only those structures around the periphery of the pipeline.
This work demonstrates that 3D stacked processor architectures eliminate the fundamental barrier to dynamic heterogeneity. Because of the extra design dimension, we can design a tight, optimized pipeline, yet still cluster, or pool, resources we might like to share between multiple cores.

3D die stacking makes it possible to create chip multiprocessors using multiple layers of active silicon bonded with low-latency, high-bandwidth, and very dense vertical interconnects. 3D die stacking technology provides very fast communication, as low as a few picoseconds [21], between processing elements residing on different layers of the chip. Tightly integrating dies in the third dimension has already been shown to have several advantages. First, it enables the integration of heterogeneous components such as logic and DRAM memory [21], or analog and digital circuits [21], fabricated in different technologies (for instance, integration of a 65nm and a 130nm design). Second, it increases routability [28]. Third, it substantially reduces wire length, which translates to lower communication latency and reduced power consumption [21, 23, 28].

The dynamically heterogeneous 3D processors we propose in this paper provide several key benefits. First, they enable software to run on hardware optimized for the execution characteristics of the running code, even for software the original processor designers did not envision. Second, they enable us to design the processor with compact, lightweight cores without significantly sacrificing general-purpose performance. Modern cores are typically highly over-provisioned [18] to guarantee good general-purpose performance – if we have the ability to borrow the specific resources a thread needs, the basic core need not be over-provisioned in any dimension. Third, the processor provides true general-purpose performance, not only adapting to the needs of a variety of applications, but also to both high thread-level parallelism (enabling many area-efficient cores) and low thread-level parallelism (enabling one or a few heavyweight cores).

With a 3D architecture, we can dynamically pool resources that are potential performance bottlenecks for possible sharing with neighboring cores. The StageNet architecture [7] attempts to pool pipeline stage resources for reliability advantages.
In that case, the limits of 2D layout mean that by pooling resources, the pipeline must be laid out inefficiently, resulting in very large increases in pipeline depth. Even Core Fusion experiences significant increases in pipeline depth due to communication delays in the front of the pipeline. With 3D integration, we can design the pipeline traditionally in the 2D plane, yet have poolable resources (registers, instruction queue, reorder buffer, cache space, load and store queues, etc.) connected along the third dimension on other layers. In this way, one core can borrow resources from another core or cores, possibly also giving up non-bottleneck resources the other cores need. This paper focuses on the sharing of instruction window resources.

This architecture raises a number of performance, energy, thermal, design, and resource allocation issues. This paper represents a first attempt to understand the various options and trade-offs.

This paper is organized as follows. Section 2 describes our 3D architecture assumptions, both for the baseline multicore and our dynamically heterogeneous architecture. Section 3 shows that both medium-end and high-end cores have applications that benefit from increased resources, motivating the architecture. Section 4 details the specific circuits that enable resource pooling. Section 5 describes our runtime hardware reallocation policies. Section 6 describes our experimental methodology, including our 3D models. Section 7 gives our performance, fairness, temperature, and energy results. Section 8 describes related work.

2. Baseline Architecture

In this section, we discuss the baseline chip multiprocessor architecture and derive a reasonable floorplan for the 3D CMP. This floorplan is the basis for our power/temperature/area and performance modeling of various on-chip structures and the processor as a whole.

3D technology, and its implications for processor architecture, is still in the early stages of development. A number of design approaches are possible and many have been proposed, from alternating cores and memory/cache [20, 23], to folding a single pipeline across layers [27].

In this research, we provide a new alternative in the 3D design space. A principal advantage of the dynamically heterogeneous 3D architecture is that it does not change the fundamental pipeline design of 2D architectures, yet still exploits 3D technology to provide greater energy proportionality and core customization. In fact, the same single design could be used in 1-, 2-, and 4-layer configurations, for example, providing different total core counts and different levels of customization and resource pooling. For comparison purposes, we compare against a commonly proposed approach which preserves the 2D pipeline design, but where the extra layers enable more extensive cache and memory.
Figure 1. CMP configurations: (a) baseline and (b) resource pooling.
2.1. Processor Model

We study the impact of resource pooling in a quad-core CMP architecture. This does not reflect the limit of cores we expect on future multicore architectures, but a reasonable limit on 3D integration. For example, a design with eight cores per layer and four layers of cores would provide 32 cores, but only clusters of four cores would be tightly integrated vertically. Our focus is only on the tightly integrated vertical cores.

For the choice of core we study two types of architecture: a high-end architecture, which is an aggressive superscalar core with an issue width of 4, and a medium-end architecture, which is an out-of-order core with an issue width of 2. For the high-end architecture we model a core similar to the Alpha 21264 (similar in functionality to the Intel Nehalem core, but we have more data available for validation on the 21264). For the medium-end architecture we configure core resources similar to the IBM PowerPC-750 FX processor [12].

2.2. 3D Floorplans

The high-level floorplan of our 3D quad-core CMP is shown in Figure 1. For our high-end processor we assume the same floorplan and area as the Alpha 21264 [15], but scaled down to 45nm technology. For the medium-end architecture we scale down the Alpha 21264 floorplan (in 45nm) based on smaller components in many dimensions, with area scaling models similar to those described by Burns and Gaudiot [3].

Moving from 2D to 3D increases power density due to the proximity of the active layers. As a result, temperature is always a concern for 3D designs. Temperature-aware floorplanning has been an active topic of research, and a number of temperature-aware 3D CMP floorplans have been proposed [5, 8, 26]. Early work in 3D architectures assumed that the best designs sought to alternate hot active logic layers with cooler cache/memory layers.
More recent work contradicts that assumption – it is more important to put the active logic layers as close as possible to the heat sink [39]. Therefore, an architecture that clusters active processor core layers tightly is consistent with this approach. Other research has also exploited this principle: Loh et al. [21] and Intel [1] have shown how stacking logic on logic in a 3D integration can improve the area footprint of the chip, while minimizing the clock network delay and eliminating many pipeline stages.

For the rest of this work we focus on the two types of floorplan shown in Figure 1(a) and Figure 1(b). Both preserve the traditional 2D pipeline, but each provides a different performance, flexibility, and temperature tradeoff.

The thermal-aware architecture in Figure 1(a) keeps the pipeline logic closest to the heat sink and does not stack pipeline logic on top of pipeline logic. Conversely, the 3D dynamically heterogeneous configuration in Figure 1(b) stacks pipeline logic on top of pipeline logic, as in other performance-aware designs, gaining increased processor flexibility through resource pooling. Notice that this comparison puts our architecture in the worst possible light – for example, a many-core architecture that already had multiple layers of cores would have very similar thermal characteristics to our architecture without the benefits of pooling. By comparing with a single layer of cores, the baseline has the dual advantages of not having logic on top of logic and of putting all cores next to the heat sink.
3. Resource Pooling in the Third Dimension

Dynamically scheduled processors provide various buffering structures that allow instructions to bypass older instructions stalled due to operand dependences. These include the instruction queue, reorder buffer, load-store queue, and renaming registers. Collectively, these resources define the instruction scheduling window. Larger windows allow the processor to more aggressively search for instruction-level parallelism.

The focus of this work, then, is on resource adaptation in four major delay- and performance-critical units – the reorder buffer, register file, load/store queue, and instruction queue. By pooling just these resources, we create an architecture where an application's scheduling window can grow to meet its runtime demands, potentially benefiting from other applications that do not need large windows.

While there are a variety of resources that could be pooled and traded between cores (including execution units, cache banks, etc.), we focus in this initial study of dynamically heterogeneous 3D architectures on specific circuit techniques that enable us to pool these structures and dynamically grow and shrink the allocation to specific cores.

In this section, we study the impact on performance of increasing the size of selected resources in a 3D design. We assume 4 cores are stacked on top of each other. The maximum gains will be achieved when one, two, or three cores in our 4-core CMP are idle, freeing all of their poolable resources for possible use by running cores. The one-thread case represents a limit study for how much can be gained by pooling, but also represents a very important scenario – the ability to automatically configure a more powerful core when thread-level parallelism is low. This is not an unrealistic case for this architecture – in a 2D architecture, the cost of quadrupling, say, the register file is high, lengthening wires significantly and moving other key function blocks further away from each other. In this architecture, we are exploiting resources that are already there, the additional wire lengths are much smaller than in the 2D case, and we do not perturb the 2D pipeline layout.

We examine two baseline architectures (details given in Section 6) – a 4-issue high-end core and a 2-issue medium-end core. In Figure 2 we report the speedup for each of these core types when selected resources are doubled, tripled, and quadrupled (when 1, 2, and 3 cores are idle). Across most of the benchmarks a noticeable performance gain is observed with pooling. Omnetpp shows the largest performance benefit for medium-end cores, while swim and libquantum show the largest gains for high-end cores.

Performance gains are seen with increased resources, but the marginal gains do drop off with larger structures. Further experiments (not shown) indicate that pooling beyond four cores provides little gain. The more scheduling resources we provide, the more likely it is that some other resource (e.g., the functional units, issue rate, cache) that we are not increasing becomes the bottleneck. In fact, this is true for some benchmarks right away, such as mcf and perlbench, where no significant gains are achieved, implying that some other bottleneck (e.g., memory latency) restricts throughput. On average, 13 to 26% performance improvement can be achieved for the medium-end processor, and 21 to 45% for the high end, by increasing selected window resources. Most importantly, the effect of increased window size varies dramatically by application. This motivates resource pooling, where we can hope to achieve high overall speedup by allocating window resources where they are most beneficial.
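To make the scaling concrete, the following is a small sketch (ours, not from the paper) of the window resources visible to one running core when 1, 2, or 3 vertically stacked neighbors are idle and donate their partitions. The baseline sizes follow the medium-end configuration in Table 1; per the note in Section 6, only registers beyond the 32 architectural ones are treated as poolable, and the simple linear scaling is an illustrative assumption.

# Sketch: per-core window resources when idle stacked neighbors donate their
# partitions. Baseline sizes follow the medium-end core in Table 1; the helper
# name and the linear scaling are illustrative assumptions.
MEDIUM_END = {
    "int_iq_entries": 16,
    "fp_iq_entries": 16,
    "rob_entries": 32,
    "int_registers": 48,
    "fp_registers": 48,
}

def pooled_sizes(baseline, idle_neighbors):
    """Window sizes for one running core with 0-3 idle stacked neighbors."""
    sizes = {}
    for name, size in baseline.items():
        if name.endswith("registers"):
            # only registers beyond the 32 architectural ones are poolable (Section 6)
            sizes[name] = size + idle_neighbors * (size - 32)
        else:
            sizes[name] = size * (1 + idle_neighbors)
    return sizes

for idle in range(4):
    print(idle, pooled_sizes(MEDIUM_END, idle))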
4. Stackable Structures for Resource Pooling

This section describes the circuit and architectural modifications required to allow resources on vertically adjacent cores to participate in pooling. Specifically, we describe the changes required in each of the pipeline components.

4.1. Reorder Buffer and Register File

The reorder buffer (ROB) and the physical register file (RF) are multi-ported structures typically designed as SRAM, with the number of ports scaling with the issue width of the core. Our goal is to share them across multiple cores with minimal impact on access latency, the number of ports, and the overall design.
Figure 2. Speedup from increasing resource size in the 3D stacked CMP with (a) medium-end and (b) high-end cores.
We take advantage of a modular ROB (and register file) design proposed in [25], which is shown to be effective in reducing the power and complexity of a multi-ported 2D SRAM structure. Our baseline multi-ported ROB/RF is implemented as a number of independent partitions. Each partition is a self-standing and independently usable unit, with a precharge unit, sense amps, and input/output drivers. Partitions are combined together to implement a larger ROB/RF, as shown in Figure 3(a). The connections running across the entries within a partition (such as the bit-lines) are connected to a common through line using bypass switches.

To add a partition to the ROB/RF, the bypass switch for that partition is turned on. Similarly, the partition can be deallocated by turning off the corresponding bypass switch. The modular baseline architecture of our register file allows individual partitions to participate in resource pooling. To avoid increasing the number of read and write ports of individual partitions of the ROB/RF, we simply assume that an entire partition is always exclusively owned by one core – either the core (layer) it belongs to (host core) or another core (guest core). This significantly simplifies the design, but restricts the granularity of sharing.

Note that before a partition participates in resource pooling (or before it is re-assigned) we need to make sure that all of its entries are empty. This can be facilitated by using an additional bit in each row (entry) of the partition to indicate whether it is full or empty – in most cases, that bit will already exist.

Figure 3(b) shows a logical view of two stacked register files participating in resource pooling (only one partition of the RF from each layer is shown in this figure). The additional multiplexers and decoder shown in Figure 3(b) are used to route the address and data from/to a partition in one layer from/to another partition in a different layer. The decoder shown in the figure enables stacking of the ROB/RF. To be able to pool up to 4 ROB/RF partitions on four different layers together, we need a 4-1 decoder and a 4-1 multiplexer. The register operand tag is also extended with 2 additional bits. The overall delay added to the ROB or RF due to the additional multiplexing and decoding is fairly small. For the case of stacking four cores, where a 4-input decoder/multiplexer is needed, the additional delay is found to be below 20 ps (using SPICE simulation and assuming a standard cell 4-input multiplexer). In this design, the access latency of the original register file is only 280 ps (using CACTI for an 8-read-port, 4-write-port, 64-entry register file). The additional 20 ps delay due to the extra decoder/multiplexer and the TSVs (5 ps at most) still keeps the overall delay below one processor cycle. Thus, the frequency is not impacted. For the ROB, the baseline delay is 230 ps and the additional delay can still be tolerated, given our baseline architectural assumptions.
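As a rough illustration (our own sketch, not the authors' circuit), the extended operand tag can be viewed as a 2-bit layer field concatenated with the local partition address, and the quoted latencies can be checked against the 500 ps cycle of the 2 GHz cores in Table 1:

# Sketch: splitting the layer-extended register tag, plus a sanity check that
# the added decode/mux and TSV delays fit within one cycle. The latency
# numbers come from the text; local_bits = 6 assumes a 64-entry register file.
CYCLE_PS = 1e12 / 2e9      # 2 GHz core -> 500 ps cycle (Table 1)
RF_ACCESS_PS = 280         # CACTI: 8 read ports, 4 write ports, 64 entries
MUX_DECODE_PS = 20         # 4-input decoder/multiplexer (SPICE)
TSV_PS = 5                 # worst-case die-to-die hop assumed in the text

def split_extended_tag(tag, local_bits=6):
    """Return (layer, local index); the top 2 bits select one of 4 layers."""
    return (tag >> local_bits) & 0x3, tag & ((1 << local_bits) - 1)

assert RF_ACCESS_PS + MUX_DECODE_PS + TSV_PS < CYCLE_PS   # 305 ps < 500 ps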
Due to the circular FIFO nature of the ROB, an additional design consideration is required to implement resource sharing, one which is not needed for the register file. The ROB can be logically viewed as a circular FIFO with head and tail pointers. The tail pointer points to the beginning of the free entries of the ROB, where newly dispatched instructions can be allocated. Instructions are committed from the head pointer.
Figure 3. (a) Partitioned ROB and RF design, (b) logical view of two stacked RF (ROB) partitions.
Resource sharing requires dynamically adjusting the size of the reorder buffer. To implement such dynamic resizing we use the technique proposed in [25], where two additional pointers are added to the ROB to dynamically adjust its size.
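A minimal software sketch of this resizable circular-window view of the ROB is shown below; it is our own illustration in the spirit of [25] (the FIFO is modeled with a deque rather than explicit head/tail pointers), not the authors' hardware.

# Sketch (ours, in the spirit of [25]): a ROB whose capacity changes in whole
# partitions; a partition can be returned only when the remaining occupancy
# still fits, i.e., no in-flight instructions would be orphaned.
from collections import deque

class ResizableROB:
    def __init__(self, partitions=4, entries_per_partition=8):
        self.part_size = entries_per_partition
        self.active_parts = partitions           # partitions currently owned
        self.inflight = deque()                  # dispatched, not yet committed

    def capacity(self):
        return self.active_parts * self.part_size

    def dispatch(self, uop):
        if len(self.inflight) == self.capacity():
            return False                         # full: the core may request another partition
        self.inflight.append(uop)
        return True

    def commit(self):
        return self.inflight.popleft() if self.inflight else None

    def grow(self, parts=1):                     # partition borrowed from the pool
        self.active_parts += parts

    def shrink(self, parts=1):                   # partition returned to the free list
        new_capacity = (self.active_parts - parts) * self.part_size
        if len(self.inflight) <= new_capacity:
            self.active_parts -= parts
            return True
        return False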
4.2. Instruction Queue and Ld/St Queue

Both the Instruction Queue (IQ) and the Load/Store Queue (LSQ) are CAM+SRAM structures which hold instructions until they can be issued. The main complexity of the IQ and LSQ stems from the associative search during the wakeup process [24]. Due to large power dissipation and large operation delay, the size of these units does not scale well in a 2D design. The number of instruction queue and LSQ entries has not changed significantly in recent generations of 2D processors.
Figure 4(a) shows a conventional implementation of the instruction queue. The taglines run across the queue, and every cycle the matchline compares the tagline value broadcast by the functional units with the instruction queue entry (source operand). We assume our baseline IQ utilizes the well-studied divided tagline (bitline) technique [14]. As shown in Figure 4(b), two or more IQ entries are combined together to form a partition and to divide the global tagline into several sub-taglines. This way the IQ is divided into multiple partitions. In the non-divided tagline structure the tagline capacitance is N * (diffusion capacitance of the pass transistors) + wire capacitance (usually 10 to 20% of the total diffusion capacitance), where N is the total number of rows. In the divided tagline scheme the equivalent tagline capacitance is greatly reduced and is approximated as M * diffusion capacitance + 2 * wire capacitance, where M is the number of tagline segments. As tagline dynamic power dissipation is proportional to CV^2, reducing the effective capacitance will linearly reduce tagline dynamic power. The overhead of this technique is adding a set of pass transistors per sub-tagline. As a side effect, a large number of segments increases the area and power overhead [14].
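To give a feel for the savings, here is a small numeric sketch of the two capacitance expressions above; the row count, segment count, and the 15% wire-capacitance ratio are illustrative values of ours, not figures from the paper.

# Sketch: undivided vs. divided tagline capacitance, using the expressions in
# the text. Dynamic power scales with C*V^2, so for a fixed supply voltage and
# switching activity the power ratio equals the capacitance ratio.
N_ROWS = 32                       # entries on one tagline (illustrative)
M_SEGMENTS = 4                    # sub-taglines after division (illustrative)
C_DIFF = 1.0                      # diffusion cap of one pass transistor (normalized)
C_WIRE = 0.15 * N_ROWS * C_DIFF   # wire cap, taken as ~15% of total diffusion cap

c_undivided = N_ROWS * C_DIFF + C_WIRE
c_divided = M_SEGMENTS * C_DIFF + 2 * C_WIRE
print("capacitance (and dynamic power) ratio: %.2f" % (c_divided / c_undivided))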
To be able to share two or more partitions of the instruction queue, we include one multiplexer per tagline and per IQ partition to select between the local tagline and the global taglines (shown in Figure 4(c)). Similarly to the RF, to avoid increasing the number of taglines we simply assume that each partition is always allocated exclusively to a single core. This way the number of taglines remains the same, and the multiplexing shown in Figure 4(c) routes the data on the tagline to the right partition. For the SRAM payload of the instruction queue we simply follow the same modification proposed for our SRAM register file. Bitline segmentation helps to reduce the number of die-to-die vias required for communication between two layers.

We also need to modify the instruction selection logic. Increasing the maximum size of the instruction queue increases the complexity of the selection logic [24]. In a typical superscalar processor each instruction queue entry has a set of bid and grant ports to communicate with the selection logic. Increasing the size of the IQ increases the number of input ports of the selection logic, which can negatively impact the clock frequency. To avoid increasing the complexity of the selection logic, we simply allow all partitions participating in resource pooling to share the same selection logic port along with the partition that belongs to the guest core (layer). In this case, we OR the bid signals (from the shared partition and the guest core partition) to the selection logic. Priority is given to the older entry (age-based priority decoding).

The overall delay overhead in the selection logic is decided by the ORing operation and the age-based priority decoding. Note that the ORing of the bid signals only slightly increases the selection logic delay, by less than 20 ps (using SPICE simulation). This delay does not increase the selection logic access delay beyond a single clock period.
Figure 4. (a) Conventional implementation of the IQ, (b) partitioned IQ using the divided tagline, (c) implementation of the stacked IQ, (d) logical view of the stacked instruction queue.
For the age-based priority decoding we propose the following to hide its delay: we perform the age-priority computation in parallel with the selection logic (to overlap their delays). When the grant signal comes back, we use the now pre-computed age information to decide where to route the grant.
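The logical effect of sharing one selection port can be sketched as follows (our own illustration, not the authors' circuit): the bid signals of the two partitions sharing the port are ORed, and when the grant returns, the age information computed in parallel decides which partition receives it.

# Sketch: one shared selection-logic port. Ages are dispatch sequence numbers
# (smaller = older); the names and structure are illustrative.
def shared_port_bid(local_bid, pooled_bid):
    """The selection logic sees a single ORed bid for the shared port."""
    return local_bid or pooled_bid

def route_grant(local_bid, pooled_bid, local_age, pooled_age):
    """Route a grant using pre-computed age-based priority."""
    if local_bid and pooled_bid:
        return "local" if local_age <= pooled_age else "pooled"
    if local_bid:
        return "local"
    return "pooled" if pooled_bid else None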
Under the given assumptions, this analysis indicates we can add the pooling logic without impacting cycle time; however, it is possible that under different assumptions, on different designs, these overheads could be exposed. We will examine the potential impact in the results section.
5. Adaptive Mechanism for Resource Pooling

In addition to the circuit modifications that are necessary to allow resource aggregation across dies, we also need mechanisms and policies to control the pooling or sharing of resources.

In devising policies to manage the many new shared resources in this architecture, we would like to maximize flexibility; however, design considerations limit the granularity (both in time and space) at which we can partition core resources. Time is actually the easier issue. Because the aggregated structures are quite compact (in total 3D distance), we can reallocate partitions between cores very quickly, within a cycle or a few cycles. To reduce circuit complexity, we expect to physically repartition on a coarser-grain boundary (e.g., four or eight entries rather than single entries).

In the results section, we experiment with a variety of size granularities for reallocation of pooled resources. Large partitions both restrict the flexibility of pooling and tend to lengthen the latency to free resources. We also vary how aggressively the system is allowed to reallocate resources; specifically, we explore various static settings for the minimum (MIN) and maximum (MAX) values for the size of a partition, which determine the floor and the ceiling for core resource allocation.

Our baseline allocation strategy exploits two principles. First, we need to be able to allocate resources quickly. Thus, we cannot reassign active partitions, which could take hundreds of cycles or more to clear active state. Instead, we actively harvest empty partitions into a free list, from which they can later be assigned quickly. Second, because we can allocate resources quickly, we need not wait to harvest empty partitions – we grab them immediately. This works because even if the same core needs the resource again right away, it can typically get it back in a few cycles.

We assume a central arbitration point for the (free) pooled resources. A thread requests additional partitions when a resource is full. If partitions are available (on the free list) and the thread is not yet at its MAX value, they are allocated upon request. As soon as a partition is found to be empty it is returned to the free list (unless the size of the resource is at MIN). The architecture could adjust MIN and MAX at intervals depending on the behavior of a thread, but this will be the focus of future work – for now we find static values of MIN and MAX to perform well. If two cores request resources in the same cycle, we use a simple round-robin priority scheme to arbitrate.
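A simplified behavioral sketch of this policy follows; the class and method names are ours, and hardware details (arbitration latency, partition identifiers, per-resource bookkeeping) are abstracted away.

# Sketch: central arbitration for pooled partitions of one resource type
# (e.g., ROB partitions). Empty partitions are harvested immediately into a
# free list; requests are granted up to MAX, releases stop at MIN, and ties
# are broken round-robin.
class PartitionArbiter:
    def __init__(self, num_cores=4, parts_per_core=4, min_parts=1, max_parts=8):
        self.min_parts = min_parts
        self.max_parts = max_parts
        self.owned = [parts_per_core] * num_cores   # partitions each core holds
        self.free = []                              # harvested empty partitions
        self.rr = 0                                 # round-robin pointer

    def release_empty(self, core):
        """A partition on `core` was found empty: harvest it unless the core
        is already at its MIN allocation."""
        if self.owned[core] > self.min_parts:
            self.owned[core] -= 1
            self.free.append(core)                  # remember the donor layer

    def arbitrate(self, requesters):
        """Grant free partitions to requesting cores, round-robin on ties."""
        grants = []
        n = len(self.owned)
        for core in sorted(requesters, key=lambda c: (c - self.rr) % n):
            if self.free and self.owned[core] < self.max_parts:
                donor = self.free.pop()
                self.owned[core] += 1
                grants.append((core, donor))
        if requesters:
            self.rr = (self.rr + 1) % n
        return grants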
6. Methodology

In order to evaluate different resource adaptation policies, we add support for dynamic adaptation to the SMTSIM simulator [34], configured for multicore simulation. Our power models use a methodology similar to [2]. We capture the energy per access and leakage power dissipation for individual SRAM units using CACTI-5.1 [33], targeting 45nm technology.
The energy and power consumption of each unit are computed by multiplying access counts by the per-access SRAM energy. For temperature calculations we use HotSpot 5.0 [29].
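A minimal sketch of this bookkeeping is given below; the function and the numeric values are illustrative placeholders of ours, not the CACTI-5.1 figures used in the paper.

# Sketch: per-unit energy from access counts, per-access energy, and leakage.
def unit_energy_joules(accesses, energy_per_access_j, leakage_w, runtime_s):
    dynamic = accesses * energy_per_access_j    # access counts x per-access energy
    static = leakage_w * runtime_s              # leakage integrated over the run
    return dynamic + static

# Example: 1e8 accesses at 5 pJ each plus 10 mW of leakage over 50 ms ~= 1 mJ.
print(unit_energy_joules(1e8, 5e-12, 10e-3, 50e-3))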
Table 1 gives the characteristics of our baseline core architectures. Note that for each of the register files, 32 registers are assumed to be unavailable for pooling, as they are needed for the storage of architectural registers.
6.1. Modeling 3D Stacked Interconnect for Resource Pooling

We model the tier-to-tier (T2T) connection with Through Silicon Vias (TSVs). TSVs enable low-latency, high-bandwidth, and very dense vertical interconnect among the pooled blocks across multiple layers of active silicon. We assume four dies are stacked on top of each other. Each tier has an Alpha processor (the high-end core case) with a die size of 6.4mm x 6.4mm and 12 layers of metal, from M1 to M12, plus the redistribution layer (RDL). The 3D stacked chip model uses flip-chip technology and the tiers are connected face-to-back. In the face-to-back connection, the RDL of Tier 1 (T1) is connected to the package via flip-chip bumps, and the RDL of Tier 2 (T2) is connected to the M1 of T1 via TSVs, forming the T2T connection.

Each core is placed in a single tier of the stack. TSVs connect the Register File (RF), Instruction Queue (IQ), Reorder Buffer (ROB), and Load and Store Queue (LSQ) of each layer vertically. The connection from the bottom tier's M1 to the M12 and RDL layer of the top tier is via TSV, and from M12 and RDL it proceeds through resistive vias and local routing to the M1 of the sink in the RF, IQ, ROB, and LSQ.

Resistive through-metal vias connect the metal layers within each tier, e.g., M1 to M2. The vertical and horizontal parasitics of the metals, vias, and TSV connections have been extracted to build the interconnect model. A T2T connection includes a through silicon via and a µbump. The parasitics of the µbumps are small compared with the TSV [10]. Hence, we only model the parasitics of the TSVs for T2T connections. The length, diameter, and dielectric liner thickness of the TSV used for the T2T connection in our model are, respectively, 50µm, 5µm, and 0.12µm. A TSV is modeled as an RLC element with R and L in series and C connected to the substrate, i.e., the global ground in our model. The parasitic resistance, inductance, and capacitance of the T2T connections are modeled as R_TSV = 47 mΩ, L_TSV = 34 pH, and C_TSV = 88 fF [11].
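As a rough order-of-magnitude check (ours, not part of the authors' extraction flow), the lumped time scales implied by these parasitics can be computed directly; they sit at or below the 1-2.5 ps tier-to-tier path delays reported in Table 2, which also include local routing and multiplexing.

# Sketch: lumped time scales of a single TSV from the parasitics quoted above.
R_TSV = 47e-3     # ohms
L_TSV = 34e-12    # henries
C_TSV = 88e-15    # farads

rc = R_TSV * C_TSV            # ~4.1e-15 s (a few femtoseconds)
lc = (L_TSV * C_TSV) ** 0.5   # ~1.7e-12 s (on the order of a picosecond)
print(rc, lc)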
The power and signal TSVs connect the power/ground mesh from the package flip-chip bumps to each layer. The TSV pitch for the tier-to-tier connection is assumed to be uniformly distributed with a density of 80/mm^2 [11]. We assume the TSV structures are via-last, where the TSV is on top of the back end of line (BEOL), i.e., the RDL layer and M1.
Parameter                 Medium-End Core          High-End Core
Cores                     4                        4
Issue, Commit width       2                        4
INT instruction queue     16 entries               32 entries
FP instruction queue      16 entries               32 entries
Reorder Buffer entries    32 entries               64 entries
INT registers             48                       64
FP registers              48                       64
Functional units          2 int/ldst, 1 fp         4 int/ldst, 2 fp
L1 cache                  16KB, 4-way, 2 cyc       32KB, 4-way, 2 cyc
L2 cache (private)        256KB, 4-way, 10 cyc     512KB, 4-way, 15 cyc
L3 cache (shared)         4MB, 4-way, 20 cyc       8MB, 8-way, 30 cyc
L3 miss penalty           250 cyc                  250 cyc
Frequency                 2GHz                     2GHz
Vdd                       1.0V                     1.0V

Table 1. Architectural specification.
Tier to Tier Path    Delay (ps)
T1 to T2             1.26
T1 to T3             2.11
T1 to T4             2.53
T2 to T3             1.31
T2 to T4             2.19
T3 to T4             1.35

Table 2. Tier-to-tier delay via TSV path.
In our circuit model we extract the delay path from each SRAM pin (an SRAM pin is a signal bump on top of the RDL layer) to the multiplexer of the next SRAM pin. The delay for each tier-to-tier path is around 1-2.5 ps, as shown in Table 2 for tiers 1 through 4.

The TSV lands on the µbump and landing pad. Surrounding the TSV and landing pad there is a keep-out area where no block, i.e., standard cell, is allowed to be placed and routed. We estimate the total TSVs required for connecting the memory pins of the RF, IQ, ROB, and LSQ vertically for different stack-up numbers in both medium-end and high-end cores. The total area for the TSV landing pad and the block-out area is calculated an
