A Novel Architecture of the 3D Stacked MRAM L2 Cache for CMPs

Guangyu Sun†, Xiangyu Dong†, Yuan Xie†, Jian Li‡, Yiran Chen§
†Pennsylvania State University, ‡IBM Austin Research Lab, §Seagate Technology
†{gsun, xydong, yuanxie}@cse.psu.edu, ‡jianli@us.ibm.com, §yiran.chen@seagate.com
Abstract

Magnetic random access memory (MRAM) is a promising memory technology with fast read access, high density, and non-volatility. Using 3D heterogeneous integration, it becomes feasible and cost-efficient to stack MRAM atop conventional chip multiprocessors (CMPs). However, one disadvantage of MRAM is its long write latency and high write energy. In this paper, we first stack MRAM-based L2 caches directly atop CMPs and compare them against SRAM counterparts in terms of performance and energy. We observe that direct MRAM stacking might harm chip performance due to the aforementioned long write latency and high write energy. To solve this problem, we then propose two architectural techniques: read-preemptive write buffer and SRAM-MRAM hybrid L2 cache. The simulation results show that our optimized MRAM L2 cache improves performance by 4.91% and reduces power by 73.5% compared to a conventional SRAM L2 cache of similar area.1
1 Introduction

The diminishing returns of efforts to increase clock frequency and exploit instruction-level parallelism in a single processor have led to the advent of chip multiprocessors (CMPs) [8]. The integration of multiple cores on a single chip is expected to accentuate the already daunting "memory wall" problem [6], and supplying massive multi-core chips with sufficient memory becomes a major challenge.

The introduction of three-dimensional (3D) integration technology [9, 26] provides the opportunity to stack memories atop compute cores and therefore alleviates the memory bandwidth challenge of CMPs. Recently, active research [4, 13, 22] has targeted stacking SRAM caches or DRAM memories.

Magnetic Random Access Memory (MRAM) is a promising memory technology with attractive features such as fast read access, high density, and non-volatility [14, 27]. However, previous research on leveraging MRAM as on-chip memory is very limited. How to integrate MRAM with compute cores on planar chips is the key obstacle, since MRAM fabrication involves hybrid magnetic-CMOS processes. Fortunately, 3D integration enables the cost-efficient integration of heterogeneous technologies, which is ideal for stacking MRAM atop compute cores. Some recent work [10, 12] has evaluated the benefits of MRAM as a universal memory replacement for L2 caches and main memories in single-core chips.

In this paper, we further evaluate the benefits of stacking MRAM L2 caches atop CMPs. We first develop a cache model for stacked MRAM and then compare the MRAM-based L2 cache against an SRAM counterpart of similar area in terms of performance and energy. The comparison shows that: (1) for applications with moderate write intensities to the L2 cache, the MRAM-based cache can reduce the total cache power significantly because of its zero standby leakage and achieve considerable performance improvement because of its relatively larger capacity; (2) for applications with high write intensities to the L2 cache, the MRAM-based cache can cause performance and power degradation due to the long latency and high energy of MRAM write operations.

These two observations imply that MRAM-based caches might not work efficiently if we introduce them directly into the traditional CMP architecture, because of their disadvantages in write latency and write energy. In light of this concern, we propose two architectural techniques, read-preemptive write buffer and SRAM-MRAM hybrid L2 cache, to mitigate the MRAM write-associated issues. The simulation results show that performance improvement and power reduction can be achieved effectively with our proposed techniques even under write-intensive workloads.

1 This work was supported in part by NSF grants (CAREER 0643902, CCF 0702617, CSR 0720659), a gift grant from Qualcomm, and an IBM Faculty Award.
978-1-4244-2932-5/08/$25.00 ©2008 IEEE
2 Background

This section briefly introduces the background of MRAM and 3D integration technologies.

2.1 MRAM Background
The basic difference between MRAM and conventional RAM technologies (such as SRAM and DRAM) is that the information carrier of MRAM is the Magnetic Tunnel Junction (MTJ) instead of electric charge [27]. As shown in Fig. 1, each MTJ contains a pinned layer and a free layer. The pinned layer has a fixed magnetic direction, while the free layer can change its magnetic direction through spin torque transfer [14]. If the free layer has the same direction as the pinned layer, the MTJ resistance is low and indicates state "0"; otherwise, the MTJ resistance is high and indicates state "1".

The latest MRAM technology (spin torque transfer RAM, STT-RAM) changes the magnetic direction of the free layer by directly passing spin-polarized currents through the MTJ. Compared to the previous generation of MRAM, which used external magnetic fields to reverse the MTJ status, STT-RAM has the advantage of scalability: the threshold current required to reverse the status decreases as the MTJ becomes smaller. In this paper, we use the terms "MRAM" and "STT-RAM" interchangeably.

The most popular MRAM cell structure is composed of one NMOS transistor as the access device and one MTJ as the storage element (the "1T1J" structure) [14]. As illustrated in Fig. 1, the storage element, the MTJ, is connected in series with the NMOS transistor, which is controlled by the word line (WL) signal. The read and write operations of each MRAM cell are described as follows:
• Read Operation: When a read operation happens, the NMOS is turned on and a small voltage difference (-0.1V as demonstrated in [14]) is applied between the bit line (BL) and the source line (SL). This voltage difference causes a current through the MTJ whose value is determined by the MTJ status. A sense amplifier compares this current to a reference current and then decides whether a "0" or a "1" is stored in the selected MRAM cell.
• Write Operation: When a write operation happens, a large positive voltage difference is established between the SL and the BL to write a "0", or a large negative one to write a "1". The current amplitude required to ensure a successful status reversal is called the threshold current. This current depends on the material of the tunnel barrier layer, the writing pulse duration, and the MTJ geometry [11].
In this work, we use a writing pulse duration of 10ns [27], below which the writing threshold current increases exponentially. In addition, we scale the MRAM cell of previous work [14] down to the 65nm technology node. Assuming an MTJ size of 65nm × 90nm, the derived threshold current for magnetic reversal is about 195μA.
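To make the cell behavior concrete, the following is a minimal behavioral sketch (in Python) of a 1T1J cell as described above: the MTJ stores a bit as a low or high resistance, a read applies a small bias and compares the resulting current against a reference, and a write applies a bipolar pulse whose current must exceed the threshold. The 0.1V read bias and the 195μA threshold come from the text; the two resistance values and the class interface are illustrative placeholders, not device data or the authors' model.

```python
# Behavioral sketch of a 1T1J MRAM cell (not a circuit model).
R_LOW = 2e3           # ohms, parallel state   -> logical "0" (placeholder value)
R_HIGH = 4e3          # ohms, antiparallel state -> logical "1" (placeholder value)
READ_BIAS = 0.1       # volts, small BL-SL bias used for reads (from the paper)
I_THRESHOLD = 195e-6  # amps, write threshold current at 65nm (from the paper)

class MramCell:
    def __init__(self):
        self.resistance = R_LOW          # start in the parallel ("0") state

    def read(self):
        """Sense by comparing the read current against a reference current."""
        i_cell = READ_BIAS / self.resistance
        i_ref = READ_BIAS / ((R_LOW + R_HIGH) / 2)    # midpoint reference
        return 0 if i_cell > i_ref else 1             # low R -> larger current -> "0"

    def write(self, bit, drive_current):
        """A bipolar pulse flips the free layer only above the threshold current."""
        if abs(drive_current) < I_THRESHOLD:
            raise ValueError("write current below the 195uA reversal threshold")
        self.resistance = R_HIGH if bit else R_LOW

cell = MramCell()
cell.write(1, 200e-6)
assert cell.read() == 1
```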
2.2 3D Integration Overview

3D integration technology has recently emerged as a promising means to mitigate interconnect-related problems. By using vertical through silicon vias (TSVs), multiple active device layers can be stacked together (through wafer stacking or die stacking) in the third dimension [26].

3D integration offers a number of advantages over traditional two-dimensional (2D) designs [9]: (1) shorter global interconnects, because the vertical distance (the length of a TSV) between two layers is usually in the range of 10μm to 100μm [26], depending on the manufacturing process; (2) higher performance, because the average interconnect length is reduced; (3) lower interconnect power consumption, due to the wire length reduction; (4) a denser form factor and smaller footprint; (5) support for the cost-efficient integration of heterogeneous technologies.

In this paper, we rely on 3D integration technology to stack a large amount of L2 cache (2MB for SRAM caches and 8MB for MRAM caches) on top of a CMP. Furthermore, the heterogeneous technology integration enabled by 3D makes it feasible to fabricate the MRAM caches and the CMP logic as two separate dies and then stack them vertically. Therefore, the magnetic-related fabrication process of MRAM does not affect the normal CMOS logic fabrication, and the integration remains cost-efficient.
3 MRAM and Non-Uniform Cache Access (NUCA) Models

In this section, we describe an MRAM circuit model and a NUCA model implemented with a Network-on-Chip (NoC).
3.1 MRAM Modeling

To model MRAM, we first estimate the area of MRAM cells. As shown in Fig. 1, each MRAM cell is composed of one NMOS transistor and one MTJ. The size of the MTJ is limited only by manufacturing techniques, but the NMOS transistor has to be sized properly so that it can drive a sufficiently large current to change the MTJ status. The current driving ability of the NMOS transistor is proportional to its W/L ratio. Using HSPICE simulation, we find that the minimum W/L ratio of the NMOS transistor at the 65nm technology node is around 10 in order to drive the threshold writing current of 195μA. We further assume the width of the source or drain region of the NMOS transistor is 1.5F, where F is the feature size. Therefore, we estimate the MRAM cell size to be about 10F × 4F = 40F². The parameters of our targeted MRAM cell are tabulated in Table 1.
Table 1. MRAM Cell Specifications
Technology              65nm
Write Pulse Duration    10ns
Threshold Current       195μA
Cell Size               40F²
Aspect Ratio            2.5
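The cell-size estimate above can be reproduced with a short calculation. The W/L ratio of 10 and the 1.5F source/drain width come from the text; the assumption that the 4F cell height decomposes as 1.5F (source) + 1F (gate) + 1.5F (drain) is our reading of the estimate, not an explicit statement in the paper.

```python
# Sketch of the MRAM cell-area estimate at the 65nm node.
F = 65e-9                 # feature size in meters
WL_RATIO = 10             # minimum W/L ratio found via HSPICE (from the paper)
SRC_DRAIN_WIDTH = 1.5     # source/drain width in units of F (from the paper)
GATE_LENGTH = 1.0         # assumed minimum gate length in units of F

cell_width_F = WL_RATIO                             # 10F (transistor width)
cell_height_F = 2 * SRC_DRAIN_WIDTH + GATE_LENGTH   # 1.5F + 1F + 1.5F = 4F
cell_area_F2 = cell_width_F * cell_height_F         # 40 F^2, matching Table 1

cell_area_um2 = cell_area_F2 * (F * 1e6) ** 2
print(f"MRAM cell: {cell_area_F2:.0f} F^2 = {cell_area_um2:.4f} um^2")
# For comparison, the SRAM cell extracted from CACTI is about 146 F^2.
print(f"SRAM/MRAM cell area ratio: {146 / cell_area_F2:.2f}x")
```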
Despite the difference in storage mechanisms, MRAM and SRAM present similar peripheral interfaces from a circuit designer's point of view. Simulating with a modified version of CACTI [2], we find that the area of a 512KB MRAM cache is similar to that of a 128KB SRAM cache, whose cell size is about 146F² (this value is extracted from CACTI).
Figure 1. An illustration of an MRAM cell: the MTJ (free layer over pinned layer) is connected between the bit line and the source line in series with the NMOS access transistor controlled by the word line; a bipolar write pulse / read bias generator drives the cell, and a sense amplifier compares the read current against a reference.

Figure 2. Eight cache ways are distributed in four banks, assuming four cores and accordingly four zones per layer.
Figure 3. (a) An illustration of the proposed 3D NUCA structure, which includes 1 core layer and 2 cache layers. There are 4 processing cores per core layer, 32 cache banks per cache layer, and 4 through-layer buses across layers; (b) Connections among routers, cache banks, and through-layer buses.
Table 2 lists the comparison between a 512KB MRAM cache bank and a 128KB SRAM cache bank, which are used later in this paper, in terms of area, access time, and access energy.
Table 2. Comparison of area, access time, and access energy (65nm technology)
Cache size      128KB SRAM   512KB MRAM
Area            3.62mm²      3.30mm²
Read Latency    2.252ns      2.318ns
Write Latency   2.264ns      11.024ns
Read Energy     0.895nJ      0.858nJ
Write Energy    0.797nJ      4.997nJ
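Table 2 already hints at why write intensity matters: reads are nearly identical across the two designs, while an MRAM write is roughly five times slower and six times more energetic than an MRAM read. A minimal per-access average, with the write fraction as the only free parameter, illustrates the trend; the weighted-average formula is our own illustration, not a model used in the paper.

```python
# Average per-access latency/energy as a function of the write fraction,
# using the bank-level numbers from Table 2 (65nm).
SRAM = {"read_ns": 2.252, "write_ns": 2.264, "read_nj": 0.895, "write_nj": 0.797}
MRAM = {"read_ns": 2.318, "write_ns": 11.024, "read_nj": 0.858, "write_nj": 4.997}

def average_cost(tech, write_fraction):
    """Simple weighted average over reads and writes (dynamic cost only)."""
    latency = (1 - write_fraction) * tech["read_ns"] + write_fraction * tech["write_ns"]
    energy = (1 - write_fraction) * tech["read_nj"] + write_fraction * tech["write_nj"]
    return latency, energy

for wf in (0.1, 0.3, 0.5):
    s_lat, s_en = average_cost(SRAM, wf)
    m_lat, m_en = average_cost(MRAM, wf)
    print(f"write fraction {wf:.1f}: SRAM {s_lat:.2f}ns/{s_en:.2f}nJ, "
          f"MRAM {m_lat:.2f}ns/{m_en:.2f}nJ")
```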
3.2 Modeling 3D NUCA Cache

As cache capacity and area increase, wire delay has made the Non-Uniform Cache Access (NUCA) architecture [18] more attractive than the conventional Uniform Cache Access (UCA) architecture. In NUCA, the cache is divided into multiple banks with different access latencies according to their locations relative to the cores, and these banks can be connected through a mesh-based Network-on-Chip (NoC).

Extending the work of CACTI [2], we develop our NoC-based 3D NUCA model. The key concept is to use NoC routers for communications within planar layers, while using a dedicated through silicon bus (TSB) for communications among different layers. Figure 3(a) illustrates an example of the 3D NUCA structure. There are four cores located in the core layer and 32 cache banks in each cache layer, and all layers are connected by through silicon buses (TSBs) implemented with TSVs. This interconnect style takes advantage of the short vertical connections provided by 3D integration. It has been reported that the vertical latency of traversing a 20-layer stack is only 12ps [23]; thus the latency of a TSB is negligible compared to the latency of the 2D NoC routers. Consequently, it is feasible to have single-hop vertical communications by utilizing TSBs. In addition, hybridizing the 2D NoC routers with TSBs requires only one (instead of two) additional link on each NoC router, because a TSB can move data both upward and downward [20].

As shown in Figure 3(a), cache layers are stacked on top of the core layer, and they can be either SRAM or MRAM caches. Figure 3(b) shows the detailed 2D structure of a cache layer. Every four cache banks are grouped together and routed to other layers via a TSB.

Similar to prior approaches [7, 20], the proposed model supports data migration, which moves data closer to the core that accesses it. For a set-associative cache, the cache ways belonging to a set should be distributed across different banks so that data migration can be implemented. In our 3D NUCA model, each cache layer is equally divided into several zones. The number of zones is equal to the number of cores, and each zone has a TSB located at its center. The cache ways of each set are uniformly distributed across these zones. This arrangement guarantees that, within each cache set, several ways of cache lines are close to the active core. Fig. 2 illustrates the distribution of eight ways into four zones (a small sketch of this mapping follows below). Fig. 3(a) shows an example of data migration after which the core in the upper-left corner can access the data faster. In this paper, this kind of data migration is called inter-migration, to differentiate it from another kind of migration policy introduced later.
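The following sketch makes the way-to-zone distribution and inter-migration concrete. The round-robin way assignment and the swap-based migration step are assumptions for illustration; the paper only states that ways are uniformly distributed across zones and that data migrate toward the accessing core.

```python
# Illustrative mapping of the N ways of a cache set onto Z zones, and an
# "inter-migration" step that moves a line into a way owned by the requesting
# core's zone.
N_WAYS, N_ZONES = 8, 4          # eight ways, four zones (as in Fig. 2)

def zone_of_way(way):
    """Uniform distribution: ways 0-1 -> zone 0, ways 2-3 -> zone 1, and so on."""
    return way // (N_WAYS // N_ZONES)

def inter_migrate(cache_set, hit_way, requester_zone):
    """Swap the hit line with a line residing in one of the requester's local ways."""
    if zone_of_way(hit_way) == requester_zone:
        return hit_way                                   # already local, nothing to do
    local_ways = [w for w in range(N_WAYS) if zone_of_way(w) == requester_zone]
    victim = local_ways[0]                               # placeholder victim choice
    cache_set[victim], cache_set[hit_way] = cache_set[hit_way], cache_set[victim]
    return victim

cache_set = [f"line{w}" for w in range(N_WAYS)]
new_way = inter_migrate(cache_set, hit_way=6, requester_zone=0)
print(new_way, cache_set)   # the accessed line now sits in a zone-0 way
```

Note that each migration is itself a write into the destination bank, which is why migration-heavy policies raise the write intensity discussed later.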
The advantages of this 3D NUCA cache are: (1) placing L2 caches in separate layers makes it possible to integrate MRAM with the traditional CMOS process technology; (2) separating cores from caches simplifies the design of the TSBs and routers, because the TSBs are connected to cache controllers directly and there is no direct connection between routers and cache controllers.
Table 3. Baseline configuration parameters
Processors
  # of cores        8
  Frequency         3GHz
  Power             6W/core
  Issue Width       1 (in order)
Memory Parameters
  L1 cache          private, 16+16KB, 2-way, 64B line, 2-cycle, write-through, 1 read/write port
  SRAM L2           shared, 2MB (16 x 128KB), 32-way, 64B line, read/write per bank: 7-cycle, write-back, 1 read/write port
  MRAM L2           shared, 8MB (16 x 512KB), 32-way, 64B line, read penalty per bank: 7-cycle, write penalty per bank: 33-cycle, write-back, 1 read/write port
  Write buffer      4 entries, retire-at-2
  Main Memory       4GB, 500-cycle latency
Network Parameters
  # of Layers       2
  # of TSBs         8
  Hop latency       TSB 1 cycle, V hop 1 cycle, H hop 1 cycle
  Router Latency    2-cycle

We provide one TSB for each core in the model. Considering that the TSV pitch is reported to be only 4-10μm [23], even a 1024-bit bus (much wider than our proposed TSB) would incur an area overhead of only 0.32mm². In our study, the die area of an 8-core CMP is estimated to be 60mm² (discussed later). Therefore, it is feasible to assign one TSB to each core, and the TSV area overhead is negligible.
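A back-of-the-envelope check of that overhead claim is straightforward. The naive estimate below simply multiplies the bus width by the square of the TSV pitch; the paper's 0.32mm² figure for a 1024-bit bus is larger, presumably because it also accounts for keep-out and wiring spacing, so we treat it as the pessimistic bound. The loop structure and constants other than the die area, pitch range, and bus width are our own illustration.

```python
# Back-of-the-envelope TSB area overhead relative to the estimated die area.
DIE_AREA_MM2 = 60.0            # estimated 8-core CMP die area (from the paper)
BUS_BITS = 1024                # much wider than the proposed TSB

for pitch_um in (4.0, 10.0):
    naive_mm2 = BUS_BITS * (pitch_um * 1e-3) ** 2
    print(f"pitch {pitch_um:4.1f}um: naive area {naive_mm2:.3f}mm^2 "
          f"({100 * naive_mm2 / DIE_AREA_MM2:.2f}% of the die)")

paper_bound_mm2 = 0.32         # the paper's figure for a 1024-bit bus
print(f"paper's bound: {100 * paper_bound_mm2 / DIE_AREA_MM2:.2f}% of the die")
```

Either way, the overhead stays well below one percent of the 60mm² die, which is the sense in which it is negligible.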
3.3 Configurations and Assumptions

Our baseline configuration is an 8-core in-order processor using the UltraSPARC III ISA. In order to predict the chip area, we investigated die photos of designs such as the Cell processor [16] and the Sun UltraSPARC T1 [19], and estimate the area of an 8-core CMP without caches to be 60mm². Using our modified version of CACTI [2], we further find that one cache layer fits either a 2MB SRAM or an 8MB MRAM L2 cache, assuming each cache layer has an area similar to that of the core layer (60mm²). The configurations are detailed in Table 3. Note that the power of the processors is estimated based on the data sheets of real designs [16, 19].

We use the Simics toolset [24] for performance simulations. Our 3D NUCA architecture is implemented as an extended module in Simics. We use several multi-threaded benchmarks from the OpenMP2001 [3] and PARSEC [1] suites. Since the performance and power of MRAM caches are closely related to transaction intensity, we select simulation workloads, listed in Table 4, that cover a wide range of transaction intensities to the L2 cache. The average numbers of total transactions (TPKI) and write transactions (WPKI) of the L2 cache are listed in Table 4, where TPKI and WPKI denote the numbers of total and write transactions per 1K instructions, respectively. For each simulation, we fast-forward to warm up the caches and then run 3 billion cycles. We use the total IPC of all the cores as the performance metric.
Table 4. L2 transaction intensities
Name            TPKI    WPKI
galgel          1.01    0.31
apsi            4.15    1.85
equake          7.94    3.84
fma3d           8.43    4.00
swim            19.29   9.76
streamcluster   55.12   23.326
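The metrics in Table 4 and the performance metric used throughout the paper are simple to compute from raw simulation counts; a minimal helper sketch is shown below. The function and variable names are ours, and the example counts are made up.

```python
# Helpers for the metrics used in this paper: TPKI/WPKI (Table 4) and the
# aggregate IPC across all cores used as the performance metric.
def tpki(total_l2_transactions, instructions):
    """L2 transactions per 1K instructions."""
    return 1000.0 * total_l2_transactions / instructions

def wpki(l2_write_transactions, instructions):
    """L2 write transactions per 1K instructions."""
    return 1000.0 * l2_write_transactions / instructions

def total_ipc(per_core_instructions, cycles):
    """Sum of per-core IPC over a fixed simulated cycle count."""
    return sum(instr / cycles for instr in per_core_instructions)

# Example with made-up counts: 8 cores, 3 billion simulated cycles.
instrs = [1.2e9] * 8
print(total_ipc(instrs, 3e9), tpki(5.5e7, sum(instrs)), wpki(2.3e7, sum(instrs)))
```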
3.4 SNUCA and DNUCA

Static NUCA (SNUCA) and Dynamic NUCA (DNUCA) are two different implementations of the NUCA architecture proposed by Kim et al. [18]. SNUCA statically partitions the address space across cache banks, which are connected via the NoC; DNUCA dynamically migrates frequently accessed blocks to the closest banks. These two NUCA implementations result in different access patterns and different write intensities. In our later simulations, we use both SNUCA-SRAM and DNUCA-SRAM L2 caches as our baselines when evaluating the performance and power benefits of MRAM caches.
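The contrast between the two flavors can be sketched in a few lines. The modulo-based address interleaving and the access-count migration trigger below are illustrative choices; the paper only specifies "static partitioning" versus "migrate frequently accessed blocks toward the closest bank".

```python
# Sketch of the two NUCA baselines used in this paper.
N_BANKS, LINE_BYTES = 32, 64

def snuca_bank(block_addr):
    """SNUCA: the home bank is a fixed function of the address."""
    return (block_addr // LINE_BYTES) % N_BANKS

class Dnuca:
    """DNUCA: blocks drift toward the requesting core's closest bank."""
    def __init__(self, migrate_after=4):
        self.location = {}       # block -> current bank
        self.hits = {}           # block -> accesses since the last migration
        self.migrate_after = migrate_after

    def access(self, block_addr, closest_bank):
        bank = self.location.setdefault(block_addr, snuca_bank(block_addr))
        self.hits[block_addr] = self.hits.get(block_addr, 0) + 1
        if bank != closest_bank and self.hits[block_addr] >= self.migrate_after:
            # Migration writes the block into the destination bank; this is the
            # extra write traffic that hurts MRAM under DNUCA in Section 4.
            self.location[block_addr] = closest_bank
            self.hits[block_addr] = 0
        return self.location[block_addr]
```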
4 Directly Replacing SRAM with MRAM as L2 Caches

In this section, we directly replace the SRAM L2 cache with an MRAM cache of comparable area, and show that, without any optimization, a naive MRAM replacement will harm both performance and power when the workload write intensity is high.

4.1 Same Area Replacement

As shown in Table 2, a 128KB SRAM bank has a similar area to a 512KB MRAM bank. Therefore, in order to keep the area of the cache layers unchanged, it is reasonable to replace the SRAM L2 cache with an MRAM cache of four times the capacity. We call this replacement strategy "same area replacement". Using this strategy, we integrate as much cache into the cache layers as possible. Since our baseline SRAM L2 cache has 16 banks, each with a capacity of 128KB, we keep the number of banks unchanged but replace each 128KB SRAM L2 cache bank with a 512KB MRAM cache bank. The read/write access times and read/write energy consumptions of both SRAM and MRAM banks are tabulated in Table 2.
Figure 4. The comparison of L2 cache access miss rates for SRAM and MRAM L2 caches of similar area (normalized miss rates, SNUCA and DNUCA). The larger capacity of the MRAM cache results in lower cache miss rates.
Figure 6. Power comparison of SRAM and MRAM L2 caches (normalized by the 2MB SNUCA SRAM cache).
4.2 Performance Analysis

Because the number of banks remains the same and our modified CACTI shows that a 128KB SRAM bank and a 512KB MRAM bank have similar read latencies (2.252ns versus 2.318ns in Table 2), the read latencies of the 2MB SRAM cache and the 8MB MRAM cache are similar as well. Since the MRAM cache capacity is four times as large, the access miss rate of the L2 cache decreases, as shown in Fig. 4. On average, the miss rates are reduced by 19.0% and 12.5% for the SNUCA MRAM cache and the DNUCA MRAM cache, respectively.

The IPC comparison is illustrated in Fig. 5. Owing to the large MRAM cache capacity, the reduced L2 cache miss rate improves the performance of the first two workloads ("galgel" and "apsi"); however, the performance of the remaining four workloads is not improved as expected. On average, the performance degradation of SNUCA MRAM and DNUCA MRAM is 3.09% and 7.52% compared to their SRAM counterparts, respectively.

Figure 5. IPC comparison of SRAM and MRAM L2 caches (normalized by the 2MB SNUCA SRAM cache).

This performance degradation of direct MRAM replacement can be explained by Table 4, where we can observe that the write intensity (represented by WPKI) of "equake", "fma3d", "swim", and "streamcluster" is much higher than that of "galgel" and "apsi". Due to the long latency of MRAM write operations, the high write intensity is reflected in performance loss. When the write intensity is sufficiently high, the resulting performance loss overwhelms the performance gain achieved by the reduced L2 cache miss rate. This observation is further supported by the comparison between SNUCA and DNUCA. From Fig. 5, one can observe that the performance degradation is more significant with DNUCA MRAM caches, because data migrations in DNUCA initiate more write operations than SNUCA does and thus cause higher write intensities.

To summarize, we state our first observation on using MRAM caches:

Observation 1. Replacing SRAM L2 caches directly with MRAM caches of similar area but larger capacity can reduce the access miss rate of the L2 cache. However, the long latency associated with write operations to the MRAM cache has a negative impact on performance. When the write intensity is high, the benefit of the miss rate reduction can be offset by the long latency of MRAM write operations and eventually result in performance degradation.

4.3 Power Analysis

The major contributors to the total power consumption of caches are leakage power and dynamic power:
• Leakage Power: When the process technology scales down to sub-90nm, leakage power becomes dominant in CMOS technology. Since MRAM is a non-volatile memory technology, no supply power is needed to retain the state of an MRAM cell, so MRAM cells do not consume any standby leakage power. Therefore, we only consider the peripheral circuit leakage power for MRAM caches; the leakage power comparison of the 2MB SRAM cache and the 8MB MRAM cache is listed in Table 5.
Table 5. Leakage power of SRAM and MRAM caches at 80°C
Cache configuration             Leakage power
2MB (16 x 128KB) SRAM cache     2.089W
8MB (16 x 512KB) MRAM cache     0.255W
• Dynamic Power: The dynamic power of the NUCA cache is estimated as follows. For each transaction, the total dynamic energy is composed of the memory cell access energy, the router access energy, and the energy consumed by the wire connections (a simple sketch of this per-transaction model is given after this list). In this paper, these values are either simulated with HSPICE or obtained from our modified version of CACTI. The number of router accesses and the length of the wire connections vary with the locations of the requesting core and the requested cache lines.
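The sketch below summarizes the power model just described: total power is leakage plus the sum of per-transaction dynamic energies (cell access, routers traversed, wires), divided by the elapsed time. The cell energies come from Table 2 and the leakage from Table 5; the router and per-mm wire energies are placeholders, since the paper obtains those values from HSPICE and CACTI without listing them.

```python
# Minimal sketch of the NUCA cache power model (leakage + dynamic).
MRAM_LEAKAGE_W = 0.255          # 8MB MRAM cache, Table 5
E_CELL_NJ = {"read": 0.858, "write": 4.997}   # per MRAM bank access, Table 2
E_ROUTER_NJ = 0.05              # per router hop (placeholder)
E_WIRE_NJ_PER_MM = 0.02         # per mm of interconnect traversed (placeholder)

def transaction_energy_nj(kind, router_hops, wire_mm):
    return E_CELL_NJ[kind] + router_hops * E_ROUTER_NJ + wire_mm * E_WIRE_NJ_PER_MM

def total_power_w(transactions, seconds, leakage_w=MRAM_LEAKAGE_W):
    """transactions: iterable of (kind, router_hops, wire_mm) tuples."""
    dynamic_j = sum(transaction_energy_nj(*t) for t in transactions) * 1e-9
    return leakage_w + dynamic_j / seconds

# Example: a read and a write that each traverse 3 routers and 4mm of wire,
# observed over 1 microsecond of execution.
txns = [("read", 3, 4.0), ("write", 3, 4.0)]
print(f"{total_power_w(txns, 1e-6):.3f} W")
```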
Fig. 6 shows the power comparison of SRAM and MRAM L2 caches. One can observe that:
• For SRAM L2 caches, since the leakage power dominates, the total power of SNUCA SRAM and DNUCA SRAM is very close. On the contrary, dynamic power dominates the MRAM cache power.
• For all the workloads, MRAM caches consume less power than SRAM caches do. The average power savings across all the workloads are about 78% and 68% for SNUCA and DNUCA, respectively. The power saving for DNUCA MRAM is smaller because of the higher write intensity caused by data migrations. The low-leakage-power feature clearly makes MRAM attractive as a large on-chip memory, especially as SRAM leakage power worsens with technology scaling.
• The average power savings for the first four workloads are more than 80%. However, for the workload "streamcluster", the total power saving is only 63% and 30% for SNUCA and DNUCA, respectively, due to its much higher L2 cache write intensity (see Table 4).
To summarize, our second conclusion on direct MRAM cache replacement is:

Observation 2. Directly replacing the SRAM L2 cache with an MRAM cache of similar area but larger capacity can greatly reduce the leakage power. However, when the write intensity is high, the dynamic power increases significantly because of the high energy of MRAM write operations, and the total power saving is reduced.
These two conclusions show that, if we directly replace SRAM caches with MRAM caches using the "same area" strategy, the long latency and high energy consumption of MRAM write operations can offset the performance and power benefits brought by the MRAM cache when the cache write intensity is high.
5 Novel 3D-stacked Cache Architecture

In this section we propose two techniques to mitigate the write operation problem of MRAM caches: a read-preemptive write buffer is employed to reduce the stall time caused by the long MRAM write latency, and an SRAM-MRAM hybrid L2 cache is proposed to reduce the number of MRAM write operations and thereby improve both performance and power. Finally, we combine these two techniques into an optimized MRAM cache architecture.
5.1 Read-preemptive Write Buffer

The first observation in Section 4 shows that the long MRAM write latency has a serious impact on performance. In the scenario where a write operation is followed by several read operations, the ongoing write may block the upcoming reads and cause performance degradation. Although the write buffer design in modern processors works well for SRAM caches, our experimental results in Subsection 4.2 show that this write buffer is not sufficient for MRAM caches, due to the large gap between the MRAM read latency and write latency. In order to make MRAM caches work efficiently, we explore the proper write buffer size and propose a "read-preemptive" management policy for it.
5.1.1 The Exploration of the Buffer Size

The choice of the buffer size is important. The larger the buffer is, the more write operations can be hidden, and the number of stall cycles decreases. On the other hand, the larger the buffer is, the longer it takes to check whether there is a "hit" in the buffer and then to access it. Furthermore, the design complexity and the area overhead also increase with the buffer size. Fig. 7 shows the relative IPC improvement of different buffer sizes for the workloads "streamcluster" and "swim". Based on the simulation results, we choose 20 entries as the MRAM write buffer size. Compared to the SRAM write buffer, which has only 4 entries (as listed in Table 3), the MRAM write buffer is much larger; we use the 20-entry write buffer for MRAM caches in the later simulations.
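The trade-off can be illustrated with a toy queue model: a bounded write buffer absorbs writes, drains them in the background when the bank is idle, and an access stalls only when it finds the buffer full or the bank busy retiring a write. The synthetic trace, the drain model, and all structure below are simplifications for illustration only; the paper's buffer-size results come from full-system simulation in Simics.

```python
# Toy model of the write-buffer-size trade-off (not the paper's simulator).
import random

WRITE_LAT, READ_LAT = 33, 7           # per-bank latencies in cycles (Table 3)

def stall_cycles(trace, buffer_size):
    """trace: list of (kind, gap) with gap = idle cycles before the access."""
    pending = 0          # writes waiting in the buffer
    retire_left = 0      # cycles left on the retirement currently using the bank
    stalls = 0
    for kind, gap in trace:
        # Idle cycles: the bank drains the write buffer in the background.
        idle = gap
        while idle > 0 and (retire_left > 0 or pending > 0):
            if retire_left == 0:
                pending -= 1
                retire_left = WRITE_LAT
            step = min(idle, retire_left)
            retire_left -= step
            idle -= step
        if kind == "w":
            if pending == buffer_size:          # full: must wait for one retirement
                stalls += retire_left or WRITE_LAT
                retire_left = 0
                pending -= 1
            pending += 1
        else:                                   # read: waits only for an ongoing retirement
            stalls += retire_left
            retire_left = 0
    return stalls

random.seed(0)
trace = [(random.choice("rrrw"), random.randint(0, 20)) for _ in range(20000)]
for size in (4, 8, 20, 40):
    print(f"buffer size {size:2d}: {stall_cycles(trace, size)} stall cycles")
```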
Figure 7. The impact of buffer size. The IPC improvement is normalized by that of the 8MB MRAM cache without a write buffer.

5.1.2 Read-preemptive Policy

Since the L2 cache can receive requests from both the upper-level memory (the L1 cache) and the write buffer, a priority policy is necessary to resolve the conflict when a read request and a write request compete for the right to execute. For MRAM caches, write latencies are much larger than read latencies, so our objective is to prevent write operations from blocking read operations. As a result, we have our first rule:

Rule 1: The read operation always has the higher priority in a competition for the right to execute.

Additionally, consider a read request blocked by a write operation that is already in progress: the MRAM write latency is so large that this retirement may block one or more read requests for a long period and further cause performance degradation. In order to mitigate this problem, we propose another read-preemptive rule:

Rule 2: When a read request is blocked by a write retirement and the write buffer is not full, the read request can trap and stall the write retirement if the preemption condition (discussed later) is satisfied. The read operation then obtains the right to access the cache, and the stalled write retirement retries later.

Our proposed read-preemptive policy tries to execute MRAM read requests as early as possible, but the drawback is that some write retirements need to be re-executed and the probability of a full buffer increases. The pivot is to find a proper preemption condition. One extreme is to stall the write retirement whenever there is a read request, which means that read requests are always executed immediately. Theoretically, if the write buffer were large enough, no read request would ever be blocked. However, since the buffer size is limited, the increased probability of a full buffer could also harm performance. In other cases, stalling write retirements for read requests is not always beneficial; for example, if a write retirement is almost finished, no read request should stall the retirement process. Consequently, we propose to use a threshold on the retirement accomplishment degree, denoted α, as the preemption condition. The retirement accomplishment degree is the completed percentage of the ongoing write retirement; once a retirement has progressed beyond α, it is not preempted.

Fig. 8 compares the IPC obtained with different values of α in our read-preemptive policy. Note that α = 100% represents the unconditional preemption policy and α = 0% represents the traditional write buffer. We find that, for workloads with low write intensities, such as "galgel" and "apsi", the performance improves as α increases and the unconditional preemption policy works best. However, for benchmarks with high write intensities, like "streamcluster", the performance improves at the beginning but then degrades as α increases. Therefore, in this paper, we set α = 50% to make our read-preemptive policy effective for all the workloads.

Figure 8. The impact of α on the performance. The IPC values are normalized by that of the traditional policy.

A counter is required to make the accomplishment degree visible to the cache controller. The counter resets to zero and begins to count cycles when a retirement begins. The cache controller checks the counter and decides whether to stall the retirement for the read request. The area of the 20 buffer entries can be evaluated as that of a cache of size 20 × 64B (less than 2KB), and we use a 7-bit counter to record the retirement accomplishment degree. Since the area of each 3D-stacked layer is around 60mm², the area overhead of our proposed read-preemptive write buffer is less than 1%. Similarly, the leakage power increase caused by this buffer is also negligible.

Fig. 9 and Fig. 10 illustrate the performance and power improvements gained by our proposed read-preemptive write buffer. Compared to the IPC of the SRAM baseline configurations, the average performance improvements are 9.93%
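The decision logic of Rules 1 and 2 with the α threshold is compact, and the sketch below summarizes it. The 33-cycle write retirement (Table 3), the 20-entry buffer, the cycle counter, and α = 50% come from the paper; the class structure, names, and driver at the bottom are our illustration, not the authors' implementation.

```python
# Sketch of the read-preemptive write buffer decision logic (Rules 1 and 2).
from collections import deque

WRITE_RETIRE_CYCLES = 33      # MRAM write penalty per bank (Table 3)
BUFFER_ENTRIES = 20           # buffer size chosen in Section 5.1.1
ALPHA = 0.5                   # preemption threshold (Section 5.1.2)

class ReadPreemptiveWriteBuffer:
    def __init__(self):
        self.entries = deque()        # pending write retirements
        self.progress_counter = 0     # the paper's 7-bit counter: cycles since retirement start
        self.retiring = None          # write currently occupying the cache bank

    def start_retirement(self):
        if self.retiring is None and self.entries:
            self.retiring = self.entries.popleft()
            self.progress_counter = 0

    def tick(self):
        if self.retiring is not None:
            self.progress_counter += 1
            if self.progress_counter >= WRITE_RETIRE_CYCLES:
                self.retiring = None                  # retirement finished

    def on_read_request(self):
        """Rule 1: reads win the competition. Rule 2: preempt an ongoing
        retirement only if the buffer is not full and the retirement is less
        than ALPHA complete; otherwise the read waits for it to finish."""
        if self.retiring is None:
            return "read proceeds"
        accomplishment = self.progress_counter / WRITE_RETIRE_CYCLES
        if len(self.entries) < BUFFER_ENTRIES and accomplishment < ALPHA:
            self.entries.appendleft(self.retiring)    # this write retries later
            self.retiring = None
            return "read preempts the write retirement"
        return "read waits for the ongoing retirement"

buf = ReadPreemptiveWriteBuffer()
buf.entries.append("dirty line A")
buf.start_retirement()
for _ in range(10):          # 10 of 33 cycles done, about 30% complete, so preempt
    buf.tick()
print(buf.on_read_request())
```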
Figure 9. The comparison of IPC among the 2MB SRAM cache, the 8MB MRAM cache with a traditional write buffer, and the 8MB MRAM cache with the read-preemptive write buffer (normalized by that of SRAM).

Figure 10. Normalized power for the same SRAM/MRAM SNUCA and DNUCA configurations and workloads (galgel, apsi, equake, fma3d, swim, streamcluster).