`
`709
`
Performance Improvement of the Memory Hierarchy of RISC-Systems by Application of 3-D Technology

Michael B. Kleiner, Stefan A. Kühn, Peter Ramm, and Werner Weber
`
`Abstract- In this paper, the performance of the memory
`hierarchy of RISC-systems for implementations employing three-
`dimensional (3-D) technology is investigated. Relating to RISC-
`systems, 3-D technology enables the integration of multiple chip-
`layers of memory together with the processor in one 3-D IC.
`In a first step, the second-level cache can be realized in one
`3-D IC with processor and first-level cache. This results in a
`considerable reduction of the hit time of the second-level cache
`due to a decreased access time and a larger allowable bus-width
`to the second-level cache. In a further step, the main memory
`can be integrated which relieves restrictions with respect to the
`bus-width to main memory. The use of 3-D technology for system
`implementation is observed to have a significant impact on the
`optimum design and performance of the memory hierarchy.
`Based on an analytical model, performance improvements on the
`order of 20% to 25% in terms of the average time per instruction
`are evaluated for implementations employing 3-D technology over
`conventional ones. It is concluded that 3-D technology is very
`attractive for future RISC-system generations.
`Index Terms- RISC, memory hierarchy, cache, 3-D IC, 3-D
`technology, 3-D packaging, vertically integrated circuits, perfor-
`mance, modeling.
`
`I. INTRODUCTION
PRESENTLY, great efforts are being made to develop technologies for realizing three-dimensional integrated circuits (3-D IC's) [1]-[8]. However, little effort has so far been spent on investigating applications where such 3-D structures may be advantageously employed. The intent of this work is to study the memory hierarchy of RISC-systems as a potential application of this novel technology.
`In recent years, processor speeds have increased at a higher
`rate than memory speeds, which has led to a speed disparity
between them. In high-performance computers, cache memories
`are used to reduce the effective memory access time. Since
`system performance is tightly coupled to that of the memory
`hierarchy, the design of the latter is one of the most important
`parts of the architecture. Several researchers have investigated
`various aspects of memory hierarchy design [9]-[15]. In all of
`these studies, the design space is subject to several constraints,
`
Manuscript received October 23, 1995; revised May 5, 1996. This work was supported by the German Minister of Education and Research under Grant 01 M 2926. This paper was presented at the 45th Electronic Components and Technology Conference, Las Vegas, NV, May 21-24, 1995.
M. B. Kleiner, S. A. Kühn, and W. Weber are with Siemens AG, Corporate R&D, München, Germany.
P. Ramm is with Fraunhofer Institute for Solid State Technology, München, Germany.
Publisher Item Identifier S 1070-9894(96)08097-8.
`
`
such as the limited number of transistors available for caching on
`the processor-chip, or restricted bus-width between processor-
`chip and external memory. The imposed restrictions have
`a significant impact on the achievable performance of the
`memory hierarchy and consequently on system performance.
`Employing 3-D technology, the design of the memory hier-
archy proceeds under decisively relieved constraints. Functions
`which are currently implemented external to the processor,
`such as second-level cache or main memory, may then be
`realized on one 3-D IC together with the processor. The
`goal of the present work is to demonstrate the basic im-
`pact the application of 3-D technology has on the design,
`optimization, and performance of the memory hierarchy of
`RISC-systems.
`
II. IMPLEMENTATION OF THE MEMORY HIERARCHY
`In Fig. 1, a common implementation of a RISC-system is
`shown schematically. The processor-IC incorporates the pro-
`cessing unit and the first-level cache. The second-level cache
`consists of an array of static random access memory (SRAM)
`components which are located next to the processor. The main
`memory is composed of dynamic random access memory
`(DRAM) components. From an architectural point of view,
`such an implementation inherently has several bottlenecks
`which significantly impact the performance of the memory
hierarchy (see also Fig. 2).

Fig. 1. Conventional implementation of a RISC-system.

Fig. 2. Major architectural constraints of conventional implementations.
`Firstly, only a limited amount of area is available on the
`processor-chip for caching. In most cases, the available area
`
`1070-9894/96$05.00 0 1996 IEEE
`
`MICRON ET AL. EXHIBIT 1071
`Page 1 of 10
`
`
`
`710
`
`IEEE TRANSACTIONS ON COMPONENTS, PACKAGING, AND
`
`MANUFACTURING TECHNOLOGY-PART B, VOL. 19, NO. 4, NOVEMBER 1996
`
`
`is sufficient only for the realization of a small first-level cache.
Other functions, such as the second-level cache, are forced to be realized external to the processor-IC. To date, only one microprocessor has been presented that features a small (96 kB) second-level cache on the processor-IC [16].
`Secondly, a significant portion of the time required for
`a second-level cache access is due to signal transmission
`between processor-chip and SRAM’s.
`Thirdly, the bus-width between the second-level cache and
`the processor is limited due to pin constraints. As an example,
`for the MIPS R4400 this bus is 128-b wide (only data). If
`the block size of the first-level cache is 32 bytes or larger,
`two or more second-level cache accesses are required to load
`a cache line into the first-level cache. From an architectural
`point of view it may, therefore, be advantageous to increase the
`width of this interface. This, however, inevitably necessitates a
`corresponding increase in the number of pins of the processor
`package and is associated with higher cost.
`Fourthly, the bus-width to main memory is usually far
`smaller than the block size of the second-level cache. Load-
`ing a block from main memory into the second-level cache
`therefore requires several successive main memory accesses.
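For instance (with a hypothetical 64-B second-level cache block and the 128-b main memory bus assumed later in this work), loading one block would take $64 \cdot 8 / 128 = 4$ successive main memory accesses.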
`The present work is based on a 3-D technology that enables
`the realization of multiple chip-layers on top of each other to
`yield 3-D IC’s. Signal transfer between adjacent chip-layers is
enabled by very small (2-5 μm) vertical vias which allow interconnect-densities of more than 50,000/cm² to be achieved.
`In the approaches which have recently been proposed for
`the realization of 3-D IC’s [1]-[8], the individual chip-layers
`are manufactured independently of each other. The upper
`chip-layers are thinned from the rear side to a thickness of
approximately 10 μm prior to stacking to keep the length of the
`vertical interconnects short and to facilitate their fabrication.
`As an example, Fig. 3 shows a cross-sectional view of a
`sample 3-D IC comprising two chip-layers.
`
Fig. 3. SEM-photomicrograph of a 3-D IC that consists of two chip-layers. The remaining silicon of the upper layer is approximately 2-μm thick and the polyimide in between the chip-layers 1 μm. (Labeled features: interconnect, silicon, BOX, bulk-silicon.)
`
`The fundamental advantage of 3-D IC’s over conventional
`IC’s is the substantial increase in integration density. A 3-D
IC with n layers may feature (n-1)-times more transistors than a conventional IC. Therefore, when 3-D technology is available, the number of transistors on a single device will increase dramatically.
`From here on, we will refer to any functional unit as on-
`chip if it is integrated in one IC or 3-D IC with the processor
`and as off-chip if this is not the case.
`Using 3-D technology, a RISC-system may be implemented
`as shown in Fig. 4(a) where processor, first-level cache, and
`second-level cache are integrated on one 3-D IC. The signal
`transmission delay for second-level cache accesses is substan-
tially reduced and the limitations concerning the bus-width
`to the second-level cache are removed. In a further step, the
`main memory may in addition be integrated as illustrated in
`Fig. 4(b). In this case, the limitations concerning the bus-width
to main memory are also removed.

Fig. 4. Implementations of a RISC-system using 3-D technology: (a) integration of processor, first-level cache, and second-level cache on one 3-D IC and (b) integration of processor, first-level cache, second-level cache, and main memory on one 3-D IC.
`Recently, other 3-D packaging approaches besides those
mentioned above have been presented. In [17]-[21], schemes
`are investigated where chip-layers are first stacked and then
`electrically interconnected by sidewall-metallizations. Using
`such technologies, “cubes” with as many as 70 chip-layers
`have successfully been manufactured [19]. Schemes of this
`sort may also be employed for realizing system implemen-
`tations similar to those shown in Fig. 4. For example, the
`main memory DRAM components may be stacked using
`such technologies and mounted on the processor-IC using a
`high density interconnect technology (for example flip-chip).
`Depending on parameters such as the electrical performance of
`the interchip vias and the realizable interconnect densities, the
`investigations performed below may in principle also apply to
`systems fabricated using such technologies.
`Since thermal management of 3-D IC’s is frequently be-
`lieved to be a major problem, we will briefly address this issue
`here. Fig. 5 shows the modeled temperature variation along
`the vertical axis of a three-dimensionally integrated RISC-
`system comprising eleven chip-layers. The lowest layer of
the 3-D IC represents the microprocessor which is assumed to generate a power density of 15 W/cm², the second and third contain cache memory with 0.75 W/cm² each, and the remaining eight are DRAM's with 0.15 W/cm². The values
`shown in the figure are temperature increases relative to the
`bottom surface of the 3-D IC and were computed based on
`thermal parameters which were measured on 3-D IC's of the
`type shown in Fig. 3 using specialized test structures. In the
`example, the overall temperature difference inside the 3-D IC
is less than 1.3 °C. The temperature drop is therefore very
`small which is primarily due to the fact that the glue layers are
`extremely thin and that the power densities in the upper chip-
`layers are relatively small. Also, since the processor dominates
`the total power consumption, the power density that has to be
`removed from the 3-D IC is only 18% larger than that of
`the processor itself. Consequently, we do not expect thermal
`management of 3-D IC's to become a major point of concern
`for applications such as the one investigated in the present
`work.
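The 18% figure follows directly from the assumed power densities:

$$\frac{15 + 2 \cdot 0.75 + 8 \cdot 0.15}{15} = \frac{17.7}{15} \approx 1.18,$$

that is, the complete 3-D IC dissipates 17.7 W/cm² compared with 15 W/cm² for the processor alone.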
`
III. PERFORMANCE MODEL
`A model is used to evaluate the performance of the various
`types of implementations. The average time per instruction
`(ATPI) is used as performance measure. To compute the ATPI
the following factors are required:

1) miss rates as a function of cache size, associativity, and block size;
2) access times for on-chip caches as a function of cache size, cache organization, associativity, block size, and technology, as well as access times for off-chip caches and main memory accesses; and
3) a performance model that yields the average time per instruction as a function of miss rates and access times of the individual levels of the hierarchy.

Fig. 5. Modeled temperature variation along the vertical axis of a 3-D integrated RISC-system comprising a processor in the lowest chip-layer, cache in the subsequent two, and DRAM's in the upper eight layers. The values shown in the figure represent temperature increases relative to the bottom surface of the 3-D IC. (Temperature increase versus location along z-axis [μm].)
`
`A. Miss Rates
`Miss rates measured by Gee et al. [22] for the SPEC92
`benchmark suite are used throughout this study. Gee et al.
`evaluated miss rates for all six integer-intensive and all 14
floating-point-intensive programs of the suite. Instruction, data,
`and unified caches ranging from 1 kB to 1 MB with block
`sizes of 16, 32, 64, 128, and 256 Bytes, and associativities of
`1, 2, 4, and 8 were investigated. The replacement algorithm is
least recently used (LRU). All miss rates are given in terms of misses per reference.
For most of this work, the average miss rates for the complete SPEC92 benchmark suite are used. The rule given by Higbie [23], namely that doubling the cache size reduces the miss rate by 25%, was found to apply well to these measured values. Since we are also interested in performance
`evaluations for caches larger than 1 MB, Higbie's rule is used
`to extrapolate miss rates for 2 MB and 4 MB large caches.
`Fig. 6 shows some average miss rates for a direct-mapped and
`a two-way set-associative unified cache as a function of cache
`size.
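A minimal sketch of this extrapolation in Python (the 1-MB miss rate in the example is a placeholder, not one of the measured values of [22]):

```python
def extrapolate_miss_rate(mr_at_1mb, target_mb):
    """Extrapolate a miss rate beyond 1 MB using Higbie's rule [23]:
    each doubling of the cache size reduces the miss rate by 25%."""
    mr, size_mb = mr_at_1mb, 1
    while size_mb < target_mb:
        mr *= 0.75        # 25% fewer misses per doubling
        size_mb *= 2
    return mr

# Hypothetical 1% miss rate at 1 MB:
print(extrapolate_miss_rate(0.01, 2))  # 0.0075   (2 MB)
print(extrapolate_miss_rate(0.01, 4))  # 0.005625 (4 MB)
```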
`To investigate the dependence of memory hierarchy perfor-
`mance on miss rates, simulations are also performed for pro-
`grams exhibiting significantly higher miss rates. The floating-
`point-intensive programs swm256, a shallow water equation
`solver, and the tomcatv work load, a mesh-generator, were
`selected for this purpose. In Fig. 7, some of the measured
miss rates of these programs are shown. As can be seen, miss rates are higher than those in Fig. 6. At a cache size of 1 MB
`miss rates are still above or in the range of 1%. The humped
`behavior of the curves is due to the nature of the programs.
`
`
Fig. 6. Average miss rates for the complete SPEC92 benchmark suite for unified caches with a block size of 32 B. (Miss rate versus cache size [kB]; curves for direct-mapped and two-way set-associative organization.)
`
`100
`
`- 10
`
`3
`
`I
`
`OJ-
`
`P l z 2 3
`N d CO Z $ = l = f " "
`" 2
`Cache S u e [kB]
`Miss rates for the swm256 and the tomcatv program for unified
`Fig. 7.
`caches with a block size of 32 Bytes [22].
`
`B. Access Times
`Access times of on-chip caches are computed on the basis
of the analytical equations derived by Wilton and Jouppi [24] which are an extension of the model proposed by Wada et al. [25]. Wilton and Jouppi compute the access time of a cache from the delays of data/tag decoder, data/tag word line, data/tag bit line, data/tag sense amplifier, comparator, valid signal driver, data output driver, precharge, and multiplexor
`in the case of a set-associative cache. (We will use the
`term access time throughout this work even though strictly
`speaking the cycle time, which is the minimum time between
`the start of two successive accesses, is being referred to).
`Equations are expressed as a function of cache size, block
`size, associativity, data output width, process parameters,
`and array configuration parameters such as the number of
`sections per word line or bit line. The array parameters
`are always optimized to yield minimum access time. We
`modified the model of Wilton and Jouppi by inserting cascaded
`drivers at several locations and by implementing a different
`decoding scheme to also suit it for the evaluation of large
`caches and caches with a large number of output bits. In
addition, we scaled parameters to match a high performance 0.5-μm complementary metal-oxide-semiconductor (CMOS) technology. Fig. 8 shows calculated access times for a direct-mapped and a two-way set-associative cache as a function of cache size. Access times increase with cache size and are larger for the two-way set-associative cache due to the additional circuitry.

Fig. 8. Access time of a direct-mapped and a two-way set-associative cache as a function of its size. The block size is 16 B and the number of output bits is 64.
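The structure of this access-time computation can be sketched as follows. This is a schematic under the description above, not the analytical expressions of [24] themselves; `geometry`, `delays`, and `array_configs` are illustrative placeholders supplied by the caller.

```python
def cache_access_time(geometry, delays, array_configs):
    """Schematic of the access-time computation: sum the component
    delays for a given array configuration and take the minimum over
    all configurations, mirroring the statement that the array
    parameters are always optimized to yield minimum access time.

    `geometry` carries cache size, block size, associativity, and
    output width; `delays` maps component names to placeholder delay
    functions standing in for the analytical expressions of [24]."""
    components = ["decoder", "wordline", "bitline", "sense_amp",
                  "comparator", "valid_driver", "output_driver",
                  "precharge"]
    if geometry["associativity"] > 1:
        components.append("multiplexor")   # set-associative caches only

    def total(config):
        return sum(delays[c](geometry, config) for c in components)

    return min(total(cfg) for cfg in array_configs)
```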
`Now let us consider off-chip cache accesses. It is assumed
`that these are pipelined into three stages. In the first stage,
`the address is sent from the processor to the SRAM's, in the
`second stage the actual SRAM access is performed, and in the
`third stage the data are sent from the SRAM's to the processor.
`The minimum time required for a pipeline stage is determined
`by the stage that exhibits the longest delay. Assuming an
`
SRAM access time ($T_{ACC,SRAM}$) of 7 ns, investigations for a specific PCB implementation [26] show that $T_{ACC,SRAM}$ is larger than the time required for sending the address and that for sending the data. Therefore, the minimum length of the pipeline stage is set by the SRAM access time $T_{ACC,SRAM}$. When the off-chip second-level cache is accessed,
`the first data is available after three pipeline intervals and
`successive data are available at the end of each consecutive
`pipeline interval.
Main memory accesses are characterized by two time constants: $T_{MMLAT}$ and $T_{FPCYC}$. $T_{MMLAT}$ is a latency that is followed by a period of accesses. The time required between successive accesses is $T_{FPCYC}$ and is primarily determined by the timing of the DRAM's.
`
`C. Average Time Per Instruction
The memory hierarchies investigated in this work contain two levels of cache. Fig. 9 shows a block diagram of the basic memory hierarchy organization. Both caches are inserted in series between processor and main memory. The first-level cache is split into data- and instruction-caches of equal size $S_{LC1}/2$. The access time $T_{LC1}$ is evaluated using the access time model for on-chip caches as a function of cache size, block size, and the number of output bits. Throughout this work it is assumed that the first-level cache access is pipelined into two stages and that the processor cycle time $T_{CYC}$ is determined by
`
$$T_{CYC} = \frac{T_{LC1}}{2}. \qquad (1)$$

Therefore, a variation in size or block size of the first-level cache impacts the processor cycle time. The data bus-width between first- and second-level cache is $BW_{LC2}$.
`
`
Fig. 9. Block diagram of the investigated memory hierarchy. (Bus-widths refer to data bits only. First-level cache: on-chip, split instruction/data caches of size $S_{LC1}/2$ each, block size $B_{LC1}$, direct-mapped, two pipeline stages. Main memory: off-chip or on-chip.)

A write buffer with a sufficient depth that it essentially never fills up is located between first- and second-level cache. The second-level cache is a unified cache and has a size $S_{LC2}$, a block size $B_{LC2}$, and an associativity $A_{LC2}$. It is either implemented on-chip or off-chip. Second-level cache and main memory are connected by a $BW_{MM}$ bit wide data bus.

The system is assumed to issue one instruction per clock cycle, neglecting the effects of cache misses. Therefore, the base CPI (cycles per instruction) is one. The ATPI is expressed as

$$\mathrm{ATPI} = T_{CYC} + \mathrm{MRPI} \cdot MR_{LC1} \cdot HT_{LC2} + \mathrm{MRPI} \cdot MR_{LC1} \cdot MR_{LC2} \cdot MP_{LC2} \qquad (2)$$

where HT stands for hit time, MR for miss rate, MP for miss penalty, and MRPI for memory references per instruction. Since the first-level cache consists of split instruction- and data caches, the miss rate $MR_{LC1}$ is given by

$$MR_{LC1} = \frac{MR_{I\text{-}LC1} + \mathrm{DMRPI} \cdot MR_{D\text{-}LC1}}{1 + \mathrm{DMRPI}} \qquad (3)$$

where $MR_{I\text{-}LC1}$ and $MR_{D\text{-}LC1}$ represent the miss rates of instruction and data cache, respectively. DMRPI stands for data memory references per instruction.

On a second-level cache hit, a $B_{LC1}$ large block of data is read from the second-level cache and transferred to the first-level cache. The hit time $HT_{LC2}$ is calculated from

$$HT_{LC2} = \begin{cases} \left(2 + \dfrac{B_{LC1}}{BW_{OFF,LC2}}\right) \left\lceil \dfrac{T_{ACC,SRAM}}{T_{CYC}} \right\rceil T_{CYC} + T_{LC1}, & \text{off-chip second-level cache} \\[2ex] \left\lceil \dfrac{T_{LC2}}{T_{CYC}} \right\rceil T_{CYC} + T_{LC1}, & \text{on-chip second-level cache.} \end{cases} \qquad (4)$$

This equation requires some elaboration. The SRAM access time $T_{ACC,SRAM}$ and the access time of the on-chip second-level cache $T_{LC2}$ are rounded to the next higher integer multiples of the processor cycle time. For an off-chip second-level cache, the width of the data bus to the second-level cache, $BW_{LC2}$, is equal to $BW_{OFF,LC2}$, which is fixed to 128 bits in our study. Therefore, the larger the block size $B_{LC1}$ of the first-level cache, the larger the hit time $HT_{LC2}$ due to the successive accesses required to load a block. For an on-chip second-level cache the bus-width constraints are removed and it is assumed that the data bus-width $BW_{LC2}$ is selected to match the block size of the first-level cache. Consequently, only one second-level cache access is required to load a block into the first-level cache independent of the block size $B_{LC1}$. This results in a substantial reduction of the hit time $HT_{LC2}$ over that of an off-chip cache. We note that some of this decrease could also be achieved by pipelining the on-chip cache; however, widening the interface to the block size of the first-level cache yields the lowest hit time and is assumed throughout this study to demonstrate the potential of 3-D technology. The term $T_{LC1}$ in (4) is due to the time that is required to write the last portion of data in the first-level cache. (All other writes are performed in parallel with read accesses of the second-level cache.)

The penalty for a second-level cache miss is evaluated in a similar fashion using

$$MP_{LC2} = \left\lceil \frac{T_{MMLAT}}{T_{CYC}} \right\rceil T_{CYC} + \left( \frac{B_{LC2}}{BW_{MM}} - 1 \right) \left\lceil \frac{T_{FPCYC}}{T_{CYC}} \right\rceil T_{CYC} + TWR_{LC2}. \qquad (5)$$

Again, timing parameters are rounded to the next higher integer multiples of the processor cycle time. For off-chip main memory the data bus-width to main memory, $BW_{MM}$, is selected to $BW_{OFF,MM}$, which is 128 b. $B_{LC2}/BW_{OFF,MM}$ successive accesses are required to load a block into the second-level cache. In the case of on-chip main memory, the data bus-width $BW_{MM}$ is selected to match the block size of the second-level cache. Then only a single main memory access is required to load a block from main memory independent of $B_{LC2}$. In (5), $TWR_{LC2}$ is the time to write the last $BW_{MM}$ data bits in the second-level cache. $TWR_{LC2}$ is equal to $\lceil T_{LC2}/T_{CYC} \rceil\, T_{CYC}$ for an on-chip second-level cache and equal to $3\, \lceil T_{ACC,SRAM}/T_{CYC} \rceil\, T_{CYC}$ for an off-chip second-level cache.

The model parameters used throughout this work are summarized in Table I.
`
`
TABLE I
MODEL PARAMETERS

$T_{ACC,SRAM}$: 7 ns
$T_{MMLAT}$: 50 ns
$T_{FPCYC}$: 30 ns
DMRPI: 0.34
(values apply unless specified otherwise)
`
`We would like to note that in the presented memory
`hierarchy performance model, the impact of more complicated
`fetch strategies such as early continuation, load forwarding, or
`streaming, which allow the processor to continue before the
entire block has been fetched, is neglected. Przybylski [27]
`found that the simplest fetch strategy, as used in the above
`model, works well and that the performance improvement
`of the more intricate ones is less than indicated by the
accompanying reduction in miss ratios because of limited
`memory resources and temporal clustering.
`
`IV. CASE STUDIES
`
`A. Conventional Implementation
`Our investigations start with a conventionally implemented
`system. The first-level cache is direct-mapped and imple-
`mented on-chip, the second-level is also direct-mapped and
`realized off-chip. Using the performance model, the average
`time per instruction is evaluated as a function of first- and
`second-level cache sizes. Simulation results are shown in
`Fig. 10. The ATPI generally decreases with increasing size
`of the second-level cache due to the associated reduction in
`miss rates. The absolute improvement, however, diminishes
`with increasing second-level cache size, since the reduction of
`miss rate becomes smaller.
`The dependence of ATPI on first-level cache size is a little
more complex. For small first-level caches the processor cycle time is short, but due to the resulting high miss rate and the large miss penalties implied by the off-chip second-level cache access, the ATPI is relatively large at any second-level cache
`size. As the first-level cache size is increased, the lower miss
`rates more than compensate for the increase in processor cycle
`time, such that the ATPI decreases. As the first-level cache size
`is further increased a point is reached, where the reduction in
`miss rate is no longer able to compensate for the increase
`in first-level cache access time and the associated increase
`in processor cycle time. The level of the ATPI then starts
`to rise again. In Fig. 10 this point is reached at a first-level
cache size of 128 kB. Therefore, increasing the size of the
`first-level cache beyond this point degrades performance and
`
`consequently the use of 3-D technology to realize large on-chip
`first-level caches does not appear to offer any advantage. Note
`that this conclusion has been derived under the assumption of
`a first-level cache with two pipeline stages. The dependence
`of processor cycle time on cache access time can basically
`be reduced by further pipelining the cache access. Optimum
`first-level cache sizes are then larger. However, this also has
`negative effects such as the degradation of CPI due to load
delays and branch delays and is beyond the scope of this work.

Fig. 10. Performance of a conventionally implemented system with on-chip first-level cache and off-chip second-level cache. (ATPI versus size of second-level cache [kB].)
`
`B. Moving Second-Level Cache and Main Memory On-Chip
`If the second-level cache is moved on-chip its access time is
`reduced. In addition, the bus-width between first- and second-
`level cache may be selected equal to the block size of the
`first-level cache. Therefore, on a second-level cache hit only
`one second-level cache access is required to fill a first-level
`cache line instead of several successive accesses. Fig. 11
`shows performance results for an implementation with an on-
`chip direct-mapped second-level cache. The deviations from
`smooth behavior that can be seen in the figure are due to
`the rounding of timing parameters to integer multiples of the
`processor cycle time. In comparison with the conventional
`implementation investigated above, the ATPI is much less
`dependent on the size of the first-level cache. This is a result of
`the substantial reduction of the penalty associated with second-
`level cache accesses. Also, the individual curves are lower, that
`is, the ATPI is generally reduced. Furthermore, the optimum
size of the first-level cache is 32 kB and therefore smaller than
`for the conventional implementation.
`Fig. 12 allows a direct quantitative comparison of the per-
`formance of four different implementations (two of which have
`so far not been discussed). In the figure, for every second-
`level cache size of each implementation, the first-level cache
`size was chosen to yield minimum (optimum) ATPI and only
`the corresponding ATPI value is shown. The performance
`improvement achieved when moving the second-level cache
`on-chip can be seen by comparing the two upper curves. At
`a second-level cache size of 1 MB the ATPI is reduced from
`4 ns to 3.5 ns (13%).
Fig. 11. Performance of an implementation with a direct-mapped on-chip second-level cache realized using 3-D technology. (LC2 = second-level cache, DM = direct-mapped, Ass. 2 = two-way set-associative, and MM = main memory. ATPI versus size of second-level cache [kB]; one curve per total first-level cache size, from 2 kB to 2048 kB.)

Fig. 12. Performance comparison of four implementation types: LC2 off-chip DM (conv.); LC2 on-chip DM (3-D IC); LC2 on-chip Ass. 2 (3-D IC); and LC2 on-chip Ass. 2, MM on-chip (3-D IC). (ATPI versus size of second-level cache [kB].)

Set-associative caches generally exhibit lower miss rates than direct-mapped caches of the same size. A set-associative organization of an off-chip cache is usually associated with
`
`a prohibitively large penalty in terms of increase in hit
`time in comparison with a direct-mapped cache. For an on-
`chip cache, the increase in hit time can be small enough to
`justify the choice of a set-associative cache. The performance
`improvement achievable by organizing the on-chip second-
`level cache as two-way set-associative can be seen from
`Fig. 12. Especially for the smaller second-level cache sizes,
`the two-way set-associative organization is clearly superior
`to the direct-mapped one. This results from the fact that the
`improvement in miss rate when moving from a direct-mapped
`to a set-associative cache organization is more pronounced for
`small caches than for large ones.
`Simulations were also performed for the case that the
`main memory is additionally integrated on-chip. The timing
parameters of the main memory, $T_{MMLAT}$ and $T_{FPCYC}$, were assumed to be unchanged by this modification. The bus-width to main memory, $BW_{MM}$, was selected equal to the block size
`of the second-level cache and the main memory was presumed
`to be organized with the same width. Therefore, in contrast to
`the above implementations, only one main memory access is
`required to load a complete cache line from main memory
`into the second-level cache. Results for an implementation,
`where the second-level cache is organized as two-way set-
`associative are also shown in Fig. 12. The reduced penalty for
`main memory accesses leads to a further reduction of the ATPI
`
particularly for the smaller second-level cache sizes since these are associated with more frequent main memory accesses.

Fig. 13. Dependence of performance on the block size of the first-level cache for two implementations with a 1-MB large second-level cache. (ATPI versus block size of first-level cache [B].)
`
C. Dependence of Performance on Block Size

It is a known fact that miss rates decrease as block sizes
`increase until the so-called "pollution point" is reached [9]. In
`conventional implementations where bus-widths are limited,
`large block sizes are associated with large miss penalties
`due to the successive accesses required for loading a cache
line. Since block sizes are chosen to minimize ATPI and not misses, optimum block sizes are smaller than desirable
`from a miss rate point of view. However, if the second-level
`cache is realized on-chip the block size of the first-level cache
`may be selected larger without an increase in miss penalty.
`Consequently, optimum block sizes are expected to be larger.
`Fig. 13 shows the ATPI as a function of the block size of
the first-level cache for an implementation with an off-chip second-level cache as well as one with an on-chip second-level
`cache. In the former case the optimum block size is 32 B and
`the ATPI is degraded if the block size is increased beyond
this point. For the implementation with on-chip second-level
`cache, the strong deterioration of ATPI at the larger block
`sizes does not occur and optimum performance is achieved at
`a block size of 128 B.
`
D. Dependence of Performance Improvement on Miss Rates
`In Fig. 14, the performance of the four types of implemen-
tations discussed above is shown as a function of second-level cache size for the floating-point-intensive programs swm256 and tomcatv. The respective performances for the average miss rates of the complete SPEC92 benchmark suite were shown in Fig. 12. Especially for large cache sizes these two programs exhibit miss rates substantially larger than those of the complete benchmark suite (see Figs. 6 and 7).
`Let us first consider the swm256 work load. Moving the
`direct-mapped second-level cache on-chip yields a reduction
`of ATPI in the range of 10-15% for most cache sizes.
`Organizing this cache two-way set-associative does not lead to
any considerable further improvement because miss rates are
`very similar for the two types of organizations (cf. Fig. 7).
`Howe