`A Novel Architecture of the 3D Stacked MRAM L2 Cache for CMPs
`
`Guangyu Sun†, Xiangyu Dong†, Yuan Xie†, Jian Li‡, Yiran Chen§
`†Pennsylvania State University, ‡IBM Austin Research Lab, §Seagate Technology
`†{gsun, xydong, yuanxie}@cse.psu.edu, ‡jianli@us.ibm.com, §yiran.chen@seagate.com
`
`Abstract
`
Magnetic random access memory (MRAM) is a promising memory technology with fast read access, high density, and non-volatility. With 3D heterogeneous integration, it becomes feasible and cost-efficient to stack MRAM atop conventional chip multiprocessors (CMPs). However, one disadvantage of MRAM is its long write latency and high write energy. In this paper, we first stack MRAM-based L2 caches directly atop CMPs and compare them against their SRAM counterparts in terms of performance and energy. We observe that direct MRAM stacking might harm chip performance due to the aforementioned long write latency and high write energy. To solve this problem, we propose two architectural techniques: a read-preemptive write buffer and an SRAM-MRAM hybrid L2 cache. Simulation results show that our optimized MRAM L2 cache improves performance by 4.91% and reduces power by 73.5% compared to a conventional SRAM L2 cache of similar area.1
`
`1 Introduction
`
The diminishing returns of efforts to increase clock frequency and exploit instruction-level parallelism in a single processor have led to the advent of chip multiprocessors (CMPs) [8]. The integration of multiple cores on a single chip is expected to accentuate the already daunting "memory wall" problem [6], and supplying massive multi-core chips with sufficient memory becomes a major challenge.
The introduction of three-dimensional (3D) integration technology [9, 26] provides the opportunity to stack memories atop compute cores and thereby alleviate the memory bandwidth challenge of CMPs. Recently, active research [4, 13, 22] has targeted stacking SRAM caches or DRAM memories.
Magnetic Random Access Memory (MRAM) is a promising memory technology with attractive features such as fast read access, high density, and non-volatility [14, 27]. However, previous research on leveraging MRAM as on-chip memory is very limited. How to integrate MRAM into compute cores on planar chips is the key obstacle, since MRAM fabrication involves hybrid magnetic-CMOS processes. Fortunately, 3D integration enables the cost-efficient integration of heterogeneous technologies, which is ideal for stacking MRAM atop compute cores.

1 This work was supported in part by NSF grants (CAREER 0643902, CCF 0702617, CSR 0720659), a gift grant from Qualcomm, and an IBM Faculty Award.
`Some recent work [10, 12] has evaluated the benefits of
`MRAM as a universal memory replacement for L2 caches
`and main memories in single-core chips.
In this paper, we further evaluate the benefits of stacking MRAM L2 caches atop CMPs. We first develop a cache model for stacked MRAM and then compare the MRAM-based L2 cache against an SRAM counterpart of similar area in terms of performance and energy. The comparison shows that: (1) for applications with moderate write intensity to the L2 cache, the MRAM-based cache can reduce total cache power significantly because of its zero standby leakage, and it achieves considerable performance improvement because of its larger capacity; (2) for applications with high write intensity to the L2 cache, the MRAM-based cache can cause performance and power degradation due to the long latency and high energy of MRAM write operations.
These two observations imply that MRAM-based caches might not work efficiently if we introduce them directly into the traditional CMP architecture, because of their disadvantages in write latency and write energy. In light of this concern, we propose two architectural techniques, a read-preemptive write buffer and an SRAM-MRAM hybrid L2 cache, to mitigate the MRAM write-associated issues. Simulation results show that our proposed techniques achieve effective performance improvement and power reduction even under write-intensive workloads.
`
`2 Background
`
`This section briefly introduces the background of
`MRAM and 3D integration technologies.
`
`2.1 MRAM Background
`
The basic difference between MRAM and conventional RAM technologies (such as SRAM/DRAM) is that the information carrier of MRAM is the Magnetic Tunnel Junction (MTJ) instead of electric charge [27]. As shown in Fig. 1, each MTJ contains a pinned layer and a free layer. The pinned layer has a fixed magnetic direction,
`while the free layer can change its magnetic direction by
`spin torque transfers [14]. If the free layer has the same di-
`rection as the pinned layer, the MTJ resistance is low and
`indicates state “0”; otherwise, the MTJ resistance is high
`and indicates state “1”.
The latest MRAM technology (Spin-Torque Transfer RAM, STT-RAM) changes the magnetic direction of the free layer by directly passing spin-polarized currents through MTJs. Compared to the previous generation of MRAM, which uses external magnetic fields to reverse the MTJ status, STT-RAM has the advantage of scalability: the threshold current needed to reverse the MTJ status decreases as the MTJ becomes smaller. In this paper, we use the terms "MRAM" and "STT-RAM" interchangeably.
The most popular MRAM cell structure is composed of one NMOS transistor as the access device and one MTJ as the storage element (the "1T1J" structure) [14]. As illustrated in Fig. 1, the MTJ is connected in series with the NMOS transistor, and the NMOS transistor is controlled by the word line (WL) signal. The detailed read and write operations for each MRAM cell are described as follows:
• Read Operation: When a read operation happens, the NMOS transistor is turned on and a small voltage difference (-0.1V as demonstrated in [14]) is applied between the bit line (BL) and the source line (SL). This voltage difference drives a current through the MTJ whose value is determined by the MTJ status. A sense amplifier compares this current to a reference current and then decides whether a "0" or a "1" is stored in the selected MRAM cell.
• Write Operation: When a write operation happens, a large positive voltage difference is established between the SL and BL for writing "0"s, or a large negative one for writing "1"s. The current amplitude required to ensure a successful status reversal is called the threshold current, which depends on the material of the tunnel barrier layer, the writing pulse duration, and the MTJ geometry [11].
In this work, we use a writing pulse duration of 10ns [27], below which the writing threshold current increases exponentially. In addition, we scale the MRAM cell of previous work [14] down to the 65nm technology node. Assuming an MTJ size of 65nm × 90nm, the derived threshold current for magnetic reversal is about 195μA.
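To make the read and write rules above concrete, here is a minimal behavioral sketch in Python; the MTJCell class, its method names, and the sensing currents are our own illustrative placeholders, while the 10ns pulse and the 195μA threshold come from the text.

```python
# Minimal behavioral sketch of a 1T1J STT-RAM cell. The sensing currents are
# hypothetical placeholders; the pulse and threshold values are from the text.

THRESHOLD_CURRENT_UA = 195.0  # magnetic-reversal threshold at 65nm
MIN_PULSE_NS = 10.0           # writing pulse duration used in this work

class MTJCell:
    def __init__(self):
        # Free layer parallel to pinned layer: low resistance, logical "0".
        self.parallel = True

    def read(self, ref_current_ua=20.0):
        """Sense amp compares the cell current under a small bias to a reference."""
        cell_current_ua = 30.0 if self.parallel else 15.0  # hypothetical values
        return 0 if cell_current_ua > ref_current_ua else 1

    def write(self, bit, current_ua, pulse_ns):
        """A spin-polarized current reverses the free layer; polarity selects the bit."""
        if current_ua >= THRESHOLD_CURRENT_UA and pulse_ns >= MIN_PULSE_NS:
            self.parallel = (bit == 0)
        # Otherwise the reversal is not guaranteed and the state is unchanged.

cell = MTJCell()
cell.write(1, current_ua=195.0, pulse_ns=10.0)
assert cell.read() == 1
```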
`
`2.2 3D Integration Overview
`
The 3D integration technology has recently emerged as a promising means to mitigate interconnect-related problems. By using vertical through-silicon vias (TSVs), multiple active device layers can be stacked together (through wafer stacking or die stacking) in the third dimension [26].
3D integration offers a number of advantages over traditional two-dimensional (2D) designs [9]: (1) shorter global interconnects, because the vertical distance (the length of the TSVs) between two layers is usually in the range of 10μm to 100μm [26], depending on the manufacturing process; (2) higher performance because of the reduced average interconnect length; (3) lower interconnect power consumption due to the wire-length reduction; (4) a denser form factor and smaller footprint; (5) support for the cost-efficient integration of heterogeneous technologies.
In this paper, we rely on 3D integration technology to stack a large amount of L2 cache (2MB for SRAM caches and 8MB for MRAM caches) on top of CMPs. Furthermore, the heterogeneous technology integration enabled by 3D makes it feasible to fabricate MRAM caches and CMP logic as two separate dies and then stack them vertically. Therefore, the magnetic-related fabrication process of MRAM does not affect the normal CMOS logic fabrication, keeping the integration cost-efficient.
`
`3 MRAM and Non-Uniform Cache Access
`(NUCA) Models
`
`In this section, we describe an MRAM circuit model and
`a NUCA model which is implemented with Network-on-
`Chip (NoC).
`
`3.1 MRAM Modeling
`
To model MRAM, we first estimate the area of the MRAM cell. As shown in Fig. 1, each MRAM cell is composed of one NMOS transistor and one MTJ. The size of the MTJ is limited only by manufacturing techniques, but the NMOS transistor has to be sized properly so that it can drive a sufficiently large current to change the MTJ status. The current driving ability of an NMOS transistor is proportional to its W/L ratio. Using HSPICE simulation, we find that the minimum W/L ratio of the NMOS transistor at the 65nm technology node is around 10 to drive the threshold writing current of 195μA. We further assume the width of the source or drain region of an NMOS transistor to be 1.5F, where F is the feature size. Therefore, we estimate the MRAM cell size to be about 10F × 4F = 40F². The parameters of our targeted MRAM cell are tabulated in Table 1.
Table 1. MRAM Cell Specifications
Technology              65nm
Write Pulse Duration    10ns
Threshold Current       195μA
Cell Size               40F²
Aspect Ratio            2.5
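As a sanity check on these numbers, the following back-of-envelope Python sketch (our own arithmetic, not part of the paper's CACTI flow) converts the 40F² cell into absolute area and shows that the raw cell array accounts for only a fraction of the 3.30mm² bank area reported in Table 2, with the remainder being peripheral circuitry.

```python
# Back-of-envelope check of the 40F^2 cell-size estimate; peripheral
# circuitry (decoders, sense amps, routing) is NOT included here.
F_um = 0.065                      # 65nm feature size in micrometers
cell_area_um2 = 40 * F_um ** 2    # 40F^2 -> ~0.169 um^2 per cell
bits = 512 * 1024 * 8             # bits in a 512KB bank
array_mm2 = bits * cell_area_um2 / 1e6
print(round(array_mm2, 2))        # ~0.71 mm^2 of cell array; Table 2's 3.30mm^2
                                  # bank area also includes peripheral circuits
```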
`
Despite the difference in storage mechanisms, MRAM and SRAM have similar peripheral interfaces from the circuit designer's point of view. Simulating with a modified version of CACTI [2], our results show that the area of a 512KB MRAM cache is similar to that of a 128KB SRAM cache, whose cell size is about 146F² (a value extracted from CACTI). Table 2 lists the comparison between a 512KB MRAM cache bank and a 128KB SRAM cache bank, which are used later in this paper, in terms of area, access time, and access energy.

Figure 1. An illustration of an MRAM cell.

Figure 2. Eight cache ways are distributed in four banks. Assume four cores and, accordingly, four zones per layer.

Figure 3. (a) An illustration of the proposed 3D NUCA structure, which includes 1 core layer and 2 cache layers. There are 4 processing cores in the core layer, 32 cache banks per cache layer, and 4 through-layer buses across layers; (b) Connections among routers, cache banks, and through-layer buses.
`
Table 2. Comparison of area, access time, and access energy (65nm technology)
                  128KB SRAM    512KB MRAM
Area              3.62mm²       3.30mm²
Read Latency      2.252ns       2.318ns
Write Latency     2.264ns       11.024ns
Read Energy       0.895nJ       0.858nJ
Write Energy      0.797nJ       4.997nJ
`
`3.2 Modeling 3D NUCA Cache
`
As cache capacity and area increase, wire delay has made the Non-Uniform Cache Access (NUCA) architecture [18] more attractive than the conventional Uniform Cache Access (UCA) one. In NUCA, the cache is divided into multiple banks with different access latencies according to their locations relative to the cores, and these banks are connected through a mesh-based Network-on-Chip (NoC).
Extending the work of CACTI [2], we develop our NoC-based 3D NUCA model. The key concept is to use NoC routers for communication within planar layers, while using a dedicated through silicon bus (TSB) for communication between layers. Figure 3(a) illustrates an example of the 3D NUCA structure: four cores are located in the core layer, each cache layer contains 32 cache banks, and all layers are connected by TSBs implemented with TSVs. This interconnect style takes advantage of the short vertical connections provided by 3D integration. It has been reported that the vertical latency of traversing a 20-layer stack is only 12ps [23]; thus the TSB latency is negligible compared to the latency of 2D NoC routers. Consequently, single-hop vertical communication via TSBs is feasible. In addition, hybridizing 2D NoC routers with TSBs requires one (instead of two) additional link on each NoC router, because a TSB can move data both upward and downward [20].
As shown in Figure 3(a), the cache layers sit on top of the core layer, and they can be either SRAM or MRAM caches. Figure 3(b) shows a detailed 2D structure of a cache layer. Every four cache banks are grouped together and routed to other layers via a TSB.
Similar to prior approaches [7, 20], the proposed model supports data migration, which moves data closer to their accessing core. For a set-associative cache, the ways belonging to each set should be distributed across different banks so that data migration can be implemented. In our 3D NUCA model, each cache layer is equally divided into several zones. The number of zones equals the number of cores, and each zone has a TSB located at its center. The cache ways of each set are uniformly distributed across these zones. This architecture guarantees that, within each cache set, several ways of cache lines are close to the active core. Fig. 2 illustrates the distribution of eight ways into four zones. Fig. 3(a) shows an example of data migration, after which the core in the upper-left corner can access the data faster. In this paper, this kind of data migration is called inter-migration, to differentiate it from another migration policy introduced later.
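As a concrete illustration of this way-to-zone distribution (our own sketch, not the authors' code), the following Python snippet reproduces the Fig. 2 example of eight ways spread uniformly over four zones.

```python
# Distribute the ways of each cache set uniformly across per-core zones,
# as in Fig. 2 (8-way sets, 4 zones -> 2 ways of every set in every zone).

def ways_in_zone(zone_id, num_zones, associativity):
    """Return the way indices of a set that are placed in a given zone."""
    ways_per_zone = associativity // num_zones
    start = zone_id * ways_per_zone
    return list(range(start, start + ways_per_zone))

for z in range(4):
    print(z, ways_in_zone(z, num_zones=4, associativity=8))
# 0 [0, 1]
# 1 [2, 3]
# 2 [4, 5]
# 3 [6, 7]
```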
The advantages of this 3D NUCA cache are: (1) placing L2 caches in separate layers makes it possible to integrate MRAM with the traditional CMOS process technology; (2) separating cores from caches simplifies the design of TSBs and routers, because TSBs are connected to cache controllers directly and there is no direct connection between routers and cache controllers.

Table 3. Baseline configuration parameters
Processors
    # of cores        8
    Frequency         3GHz
    Power             6W/core
    Issue Width       1 (in order)
Memory Parameters
    L1 cache          private, 16+16KB, 2-way, 64B line, 2-cycle,
                      write-through, 1 read/write port
    SRAM L2           shared, 2MB (16 x 128KB), 32-way, 64B line,
                      read/write per bank: 7-cycle,
                      write-back, 1 read/write port
    MRAM L2           shared, 8MB (16 x 512KB), 32-way, 64B line,
                      read penalty per bank: 7-cycle,
                      write penalty per bank: 33-cycle,
                      write-back, 1 read/write port
    Write buffer      4-entry, retire-at-2
    Main Memory       4GB, 500-cycle latency
Network Parameters
    # of Layers       2
    # of TSBs         8
    Hop latency       TSB hop: 1 cycle; V hop: 1 cycle; H hop: 1 cycle
    Router latency    2-cycle
We provide one TSB for each core in the model. Since the TSV pitch is reported to be only 4-10μm [23], even a 1024-bit bus (much wider than our proposed TSB) would incur an area overhead of only 0.32mm². In our study, the die area of an 8-core CMP is estimated to be 60mm² (discussed later). Therefore, it is feasible to assign one TSB to each core, and the TSV area overhead is negligible.
`
`3.3 Configurations and Assumptions
`
Our baseline configuration is an 8-core in-order processor using the UltraSPARC III ISA. To predict the chip area, we investigated die photos of several designs, such as the Cell processor [16] and the Sun UltraSPARC T1 [19], and estimate the area of an 8-core CMP without caches to be 60mm². Using our modified version of CACTI [2], we further find that one cache layer fits either a 2MB SRAM or an 8MB MRAM L2 cache, assuming each cache layer has an area similar to that of the core layer (60mm²). The configurations are detailed in Table 3. Note that the power of the processors is estimated from the data sheets of real designs [16, 19].
We use the Simics toolset [24] for performance simulations; our 3D NUCA architecture is implemented as an extended module in Simics. We use several multi-threaded benchmarks from the OpenMP2001 [3] and PARSEC [1] suites. Since the performance and power of MRAM caches are closely related to transaction intensity, we select the workloads listed in Table 4 so as to cover a wide range of L2 transaction intensities. The average numbers of total transactions (TPKI)2 and write transactions (WPKI) of the L2 caches are listed in Table 4. For each simulation, we fast-forward to warm up the caches and then run 3 billion cycles. We use the total IPC of all the cores as the performance metric.
Table 4. L2 transaction intensities
Name             TPKI     WPKI
galgel           1.01     0.31
apsi             4.15     1.85
equake           7.94     3.84
fma3d            8.43     4.00
swim             19.29    9.76
streamcluster    55.12    23.326
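For reference, TPKI and WPKI follow the standard per-kilo-instruction definition; the sketch below (with hypothetical counts) shows the arithmetic.

```python
# Transactions per kilo-instruction; the event and instruction counts here
# are hypothetical, chosen only to reproduce one Table 4 entry.
def per_kilo_instruction(events, instructions):
    return 1000.0 * events / instructions

# e.g., 4.15 TPKI means 4.15 L2 transactions per 1000 retired instructions.
print(per_kilo_instruction(events=4_150_000, instructions=1_000_000_000))  # 4.15
```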
`
`3.4 SNUCA and DNUCA
`
Static NUCA (SNUCA) and Dynamic NUCA (DNUCA) are two different implementations of the NUCA architecture proposed by Kim et al. [18]. SNUCA statically partitions the address space across cache banks, which are connected via the NoC; DNUCA dynamically migrates frequently accessed blocks to the closest banks. These two NUCA implementations result in different access patterns and different write intensities. In our later simulations, we use both SNUCA-SRAM and DNUCA-SRAM L2 caches as baselines when evaluating the performance and power benefits of MRAM caches.
`
4 Directly Replacing SRAM with MRAM as L2 Caches
`
In this section, we directly replace the SRAM L2 cache with an MRAM one of comparable area and show that, without any optimization, a naive MRAM replacement can harm both performance and power when the workload write intensity is high.
`
`4.1 Same Area Replacement
`
As shown in Table 2, a 128KB SRAM bank has a similar area to a 512KB MRAM bank. Thereby, to keep the area of the cache layers unchanged, it is reasonable to replace the SRAM L2 caches with MRAM ones whose capacity is four times as large. We call this replacement strategy "same-area replacement".
Using this strategy, we integrate as many caches into the cache layers as possible. Considering that our baseline SRAM L2 cache has 16 banks and each bank has a capacity of 128KB, we keep the number of banks unchanged but replace each 128KB SRAM bank with a 512KB MRAM bank. The read/write access times and energies of both SRAM and MRAM banks are tabulated in Table 2.

2 TPKI is the number of total transactions per 1K instructions; WPKI is the number of write transactions per 1K instructions.
4.2 Performance Analysis

Because the number of banks remains the same and our modified CACTI shows that a 128KB SRAM bank and a 512KB MRAM bank have similar read latencies (2.252ns versus 2.318ns in Table 2), the read latencies of the 2MB SRAM cache and the 8MB MRAM cache are similar as well. Since the MRAM cache capacity is four times as large, the access miss rate of the L2 cache decreases, as shown in Fig. 4. On average, the miss rates are reduced by 19.0% and 12.5% for the SNUCA MRAM cache and the DNUCA MRAM cache, respectively.

Figure 4. The comparison of L2 cache miss rates for SRAM and MRAM L2 caches of similar area. The larger capacity of the MRAM cache results in smaller cache miss rates.

The IPC comparison is illustrated in Fig. 5. Owing to the larger MRAM cache capacity, the reduced L2 miss rate improves the performance of the first two workloads ("galgel" and "apsi"); however, the performance of the remaining four workloads is not improved as expected. On average, the performance degradations of SNUCA MRAM and DNUCA MRAM are 3.09% and 7.52% compared to their SRAM counterparts, respectively.

Figure 5. IPC comparison of SRAM and MRAM L2 caches (normalized by the 2MB SNUCA SRAM cache).

This performance degradation of direct MRAM replacement can be explained by Table 4, where the write intensity (represented by WPKI) of "equake", "fma3d", "swim", and "streamcluster" is much higher than that of "galgel" and "apsi". Due to the long latency of MRAM write operations, the high write intensity translates into performance loss. When the write intensity is sufficiently high, the resulting performance loss overwhelms the performance gain achieved by the reduced L2 cache miss rate. This observation is further supported by the comparison between SNUCA and DNUCA: Fig. 5 shows that the performance degradation is more significant with DNUCA MRAM caches, because data migrations in DNUCA initiate more write operations than in SNUCA and thus cause higher write intensity.
To summarize, we state our first observation on using MRAM caches:

Observation 1 Replacing SRAM L2 caches directly with MRAM caches of similar area but larger capacity can reduce the L2 cache miss rate. However, the long latency of write operations to the MRAM cache has a negative impact on performance. When the write intensity is high, the benefit of the miss rate reduction can be offset by the long latency of MRAM write operations, eventually resulting in performance degradation.

4.3 Power Analysis

The major contributors to the total power consumption of caches are leakage power and dynamic power:
• Leakage Power: When process technology scales down to sub-90nm, leakage power becomes dominant in CMOS technology. Since MRAM is a non-volatile memory technology, no power supply is needed to retain the state of each MRAM cell, so MRAM cells do not consume any standby leakage power. Therefore, we only consider the peripheral circuit leakage power for MRAM caches; the leakage power comparison of the 2MB SRAM and 8MB MRAM caches is listed in Table 5.
`
Table 5. Leakage power of SRAM and MRAM caches at 80°C
Cache configuration              Leakage power
2MB (16 x 128KB) SRAM cache      2.089W
8MB (16 x 512KB) MRAM cache      0.255W

• Dynamic Power: The dynamic power estimation for the NUCA cache is as follows. For each transaction, the total dynamic power is composed of the memory cell access power, the router access power, and the power consumed by the wire connections. In this paper, these values are either simulated with HSPICE or obtained from our modified version of CACTI. The number of router accesses and the length of the wire connections vary with the locations of the requesting core and the requested cache lines.
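The per-transaction energy model just described can be summarized with a small sketch; the MRAM bank energies come from Table 2, while the router and wire energies are hypothetical placeholders for the HSPICE/CACTI values.

```python
# Back-of-envelope per-transaction dynamic energy (our own illustration, not
# the paper's simulator): bank access energy plus router and wire energy
# along the NoC path taken by the transaction.

def transaction_energy_nj(is_write, n_router_hops, wire_mm,
                          e_read_nj=0.858, e_write_nj=4.997,   # MRAM bank, Table 2
                          e_router_nj=0.05, e_wire_nj_per_mm=0.02):  # placeholders
    """Energy of one L2 transaction in nanojoules."""
    e_cell = e_write_nj if is_write else e_read_nj
    return e_cell + n_router_hops * e_router_nj + wire_mm * e_wire_nj_per_mm

# A write crossing 3 routers and 6mm of wire costs far more than a read:
print(transaction_energy_nj(True,  3, 6.0))   # ~5.27 nJ
print(transaction_energy_nj(False, 3, 6.0))   # ~1.13 nJ
```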
Fig. 6 shows the power comparison of SRAM and MRAM L2 caches. One can observe that:
• For SRAM L2 caches, since the leakage power dominates, the total power of SNUCA SRAM and DNUCA SRAM is very close. On the contrary, the dynamic power dominates the MRAM cache power.
• For all the workloads, MRAM caches consume less power than SRAM caches do. The average power savings across all the workloads are about 78% and 68% for SNUCA and DNUCA, respectively. The power saving for DNUCA MRAM is smaller because of the high write intensity caused by data migrations. Clearly, the low leakage power makes MRAM attractive as a large on-chip memory, especially as SRAM leakage power worsens with technology scaling.
• The average power savings for the first four workloads are more than 80%. However, for the workload "streamcluster", the total power saving is only 63% and 30% for SNUCA and DNUCA, respectively, due to its much higher L2 cache write intensity (see Table 4).

Figure 6. Power comparison of SRAM and MRAM L2 caches (normalized by the 2MB SNUCA SRAM cache).

To summarize, our second observation on direct MRAM cache replacement is:

Observation 2 Directly replacing the SRAM L2 cache with an MRAM cache of similar area but larger capacity can greatly reduce the leakage power. However, when the write intensity is high, the dynamic power increases significantly because of the high energy of MRAM write operations, and the total power saving is reduced.

These two observations show that, if we directly replace SRAM caches with MRAM caches using the same-area strategy, the long latency and high energy consumption of MRAM write operations can offset the performance and power benefits brought by the MRAM cache when the cache write intensity is high.
`
5 Novel 3D-Stacked Cache Architecture

In this section, we propose two techniques to mitigate the write operation problem of MRAM caches: a read-preemptive write buffer is employed to reduce the stall time caused by the long MRAM write latency, and an SRAM-MRAM hybrid L2 cache is proposed to reduce the number of MRAM write operations and thereby improve both performance and power. Finally, we combine these two techniques into an optimized MRAM cache architecture.
`
`5.1 Read-preemptive Write Buffer
`
The first observation in Section 4 shows that the long MRAM write latency has a serious impact on performance. In a scenario where a write operation is followed by several read operations, the ongoing write may block the upcoming reads and cause performance degradation. Although the write buffer design in modern processors works well for SRAM caches, our experimental results in Subsection 4.2 show that such a write buffer does not fit MRAM caches, due to the large gap between the MRAM read latency and write latency. To make MRAM caches work efficiently, we explore the proper write buffer size and propose a "read-preemptive" management policy for it.
`
`5.1.1 The Exploration of the Buffer Size
`
The choice of the buffer size is important. The larger the buffer is, the more write operations can be hidden, and thus the number of stall cycles decreases. On the other hand, the larger the buffer is, the longer it takes to check whether there is a "hit" in the buffer and then to access it. Furthermore, the design complexity and the area overhead also grow with the buffer size. Fig. 7 shows the relative IPC improvement for different buffer sizes on the workloads "streamcluster" and "swim". Based on the simulation results, we choose 20 entries as the MRAM write buffer size. Compared to the SRAM write buffer, which has only 4 entries (as listed in Table 3), the MRAM write buffer is much larger; we use a 20-entry write buffer for MRAM caches in the later simulations.

Figure 7. The impact of buffer size. The IPC improvement is normalized by that of the 8MB MRAM cache without a write buffer.
`
`5.1.2 Read-preemptive Policy
`
Since the L2 cache receives requests from both the upper-level memory (the L1 cache) and the write buffer, a priority policy is necessary to resolve the conflict when a read request and a write request compete for the execution right. For MRAM caches, write latencies are much larger than read latencies, so our objective is to prevent write operations from blocking read operations. As a result, we have our first rule:

Rule 1: The read operation always has the higher priority in a competition for the execution right.

Additionally, when a read request is blocked by a write operation that is already in process, the MRAM write latency is so large that the write retirement may block one or more read requests for a long period and cause further performance degradation. To mitigate this problem, we propose another read-preemptive rule:

Rule 2: When a read request is blocked by a write retirement and the write buffer is not full, the read request can trap and stall the write retirement if the preemption condition (discussed below) is satisfied. The read operation then obtains the right of execution to the cache, and the stalled write retirement retries later.

Our read-preemptive policy tries to execute MRAM read requests as early as possible; the drawback is that some write retirements need to be re-executed, and the possibility of a full buffer increases. The pivot is to find a proper preemption condition. One extreme is to stall the write retirement whenever there is a read request, so that read requests are always executed immediately. Theoretically, if the write buffer were large enough, no read request would ever be blocked; however, since the buffer size is limited, the increased possibility of a full buffer could also harm performance. Moreover, stalling write retirements for read requests is not always beneficial: if a write retirement has almost finished, no read request should stall it. Consequently, we use the retirement accomplishment degree, denoted as α, as the preemption condition. The retirement accomplishment degree is a threshold on the completed percentage of the ongoing write retirement: a read preempts only while the retirement is less than α complete, and no preemption occurs beyond that point.
Fig. 8 compares the IPC for different values of α under our read-preemptive policy. Note that α = 100% represents the non-conditional preemption policy and α = 0% represents the traditional write buffer. We find that, for workloads with low write intensities, such as "galgel" and "apsi", the performance improves as α increases, and the non-conditional preemption policy works best. However, for benchmarks with high write intensities, such as "streamcluster", the performance improves at the beginning but then degrades as α increases. In this paper, we set α = 50% to make our read-preemptive policy effective for all the workloads.

Figure 8. The impact of α on performance. The IPC values are normalized by that of the traditional policy.

A counter is required to make the accomplishment degree visible to the cache controller. The counter resets to zero and begins to count cycles when a retirement begins; the cache controller checks the counter and decides whether to stall the retirement for the read request. The area of the 20 buffer entries is equivalent to a cache of 20 × 64 bytes (less than 2KB), and we use a 7-bit counter to record the retirement accomplishment degree. Since the area of each 3D-stacked layer is around 60mm², the area overhead of our proposed read-preemptive write buffer is less than 1%. Similarly, the leakage power increase caused by this buffer is negligible.
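Putting Rule 1, Rule 2, and the α condition together, the cache controller's preemption decision can be sketched as follows (our own illustration, not the authors' implementation; the 33-cycle write latency is from Table 3 and α = 50% from the discussion above).

```python
# Sketch of the read-preemptive decision: a read may stall an ongoing write
# retirement only if the buffer has a free slot (Rule 2) and the retirement
# is less than ALPHA complete, as tracked by the cycle counter.

ALPHA = 0.50          # retirement accomplishment degree threshold
WRITE_LATENCY = 33    # MRAM write cycles per bank (Table 3)

def may_preempt(read_pending, buffer_full, cycles_into_retirement):
    """Decide whether an arriving read may stall the ongoing write retirement."""
    if not read_pending or buffer_full:
        return False                   # Rule 2 requires a free buffer slot
    accomplishment = cycles_into_retirement / WRITE_LATENCY
    return accomplishment < ALPHA      # nearly finished writes are not stalled

# A write 10 cycles in (~30% done) is preempted; one 20 cycles in is not.
print(may_preempt(True, False, 10))   # True
print(may_preempt(True, False, 20))   # False
```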
Fig. 9 and Fig. 10 illustrate the performance and power improvements gained by our proposed read-preemptive write buffer. Compared to the IPC of the SRAM baseline configurations, the average performance improvements are 9.93%

Figure 9. The comparison of IPC among the 2MB SRAM cache, the 8MB MRAM cache with a traditional write buffer, and the 8MB MRAM cache with the read-preemptive write buffer (normalized by that of SRAM).
`
Figure 10. Power comparison of SRAM and MRAM L2 caches with and without the read-preemptive write buffer (normalized power).