IEEE TRANSACTIONS ON COMPONENTS, PACKAGING, AND MANUFACTURING TECHNOLOGY-PART B, VOL. 19, NO. 4, NOVEMBER 1996

709

Performance Improvement of the Memory Hierarchy of RISC-Systems by Application of 3-D Technology

Michael B. Kleiner, Stefan A. Kühn, Peter Ramm, and Werner Weber
Abstract- In this paper, the performance of the memory hierarchy of RISC-systems for implementations employing three-dimensional (3-D) technology is investigated. Relating to RISC-systems, 3-D technology enables the integration of multiple chip-layers of memory together with the processor in one 3-D IC. In a first step, the second-level cache can be realized in one 3-D IC with processor and first-level cache. This results in a considerable reduction of the hit time of the second-level cache due to a decreased access time and a larger allowable bus-width to the second-level cache. In a further step, the main memory can be integrated, which relieves restrictions with respect to the bus-width to main memory. The use of 3-D technology for system implementation is observed to have a significant impact on the optimum design and performance of the memory hierarchy. Based on an analytical model, performance improvements on the order of 20% to 25% in terms of the average time per instruction are evaluated for implementations employing 3-D technology over conventional ones. It is concluded that 3-D technology is very attractive for future RISC-system generations.

Index Terms- RISC, memory hierarchy, cache, 3-D IC, 3-D technology, 3-D packaging, vertically integrated circuits, performance, modeling.
I. INTRODUCTION

PRESENTLY, great efforts are being taken to develop technologies for realizing three-dimensional integrated circuits (3-D IC) [1]-[8]. However, little effort has so far been spent on investigating applications where such 3-D structures may be advantageously employed. The intent of this work is to study the memory hierarchy of RISC-systems as a potential application of this novel technology.
In recent years, processor speeds have increased at a higher rate than memory speeds, which has led to a speed disparity between them. In high performance computers, cache memories are used to reduce the effective memory access time. Since system performance is tightly coupled to that of the memory hierarchy, the design of the latter is one of the most important parts of the architecture. Several researchers have investigated various aspects of memory hierarchy design [9]-[15]. In all of these studies, the design space is subject to several constraints,
Manuscript received October 23, 1995; revised May 5, 1996. This work was supported by the German Minister of Education and Research under Grant 01 M 2926. This paper was presented at the 45th Electronic Components and Technology Conference, Las Vegas, NV, May 21-24, 1995.
M. B. Kleiner, S. A. Kühn, and W. Weber are with Siemens AG, Corporate R&D, München, Germany.
P. Ramm is with Fraunhofer Institute for Solid State Technology, München, Germany.
Publisher Item Identifier S 1070-9894(96)08097-8.
Fig. 1. Conventional implementation of a RISC-system.
such as the limited amount of transistors available for caching on the processor-chip, or the restricted bus-width between processor-chip and external memory. The imposed restrictions have a significant impact on the achievable performance of the memory hierarchy and consequently on system performance.

Employing 3-D technology, the design of the memory hierarchy proceeds under decisively relieved constraints. Functions which are currently implemented external to the processor, such as second-level cache or main memory, may then be realized on one 3-D IC together with the processor. The goal of the present work is to demonstrate the basic impact the application of 3-D technology has on the design, optimization, and performance of the memory hierarchy of RISC-systems.

II. IMPLEMENTATION OF THE MEMORY HIERARCHY

In Fig. 1, a common implementation of a RISC-system is shown schematically. The processor-IC incorporates the processing unit and the first-level cache. The second-level cache consists of an array of static random access memory (SRAM) components which are located next to the processor. The main memory is composed of dynamic random access memory (DRAM) components. From an architectural point of view, such an implementation inherently has several bottlenecks which significantly impact the performance of the memory hierarchy (see also Fig. 2).

Firstly, only a limited amount of area is available on the processor-chip for caching. In most cases, the available area
1070-9894/96$05.00 © 1996 IEEE

MICRON ET AL. EXHIBIT 1071, Page 1 of 10
`

Fig. 2. Major architectural constraints of conventional implementations. (The figure highlights the interfaces to the second-level cache and to the main memory.)
is sufficient only for the realization of a small first-level cache. Other functions, such as the second-level cache, are forced to be realized external to the processor-IC. To date, only one microprocessor has been presented that features a small (96 kB) second-level cache on the processor-IC [16].

Secondly, a significant portion of the time required for a second-level cache access is due to signal transmission between processor-chip and SRAM's.

Thirdly, the bus-width between the second-level cache and the processor is limited due to pin constraints. As an example, for the MIPS R4400 this bus is 128-b wide (only data). If the block size of the first-level cache is 32 bytes or larger, two or more second-level cache accesses are required to load a cache line into the first-level cache. From an architectural point of view it may, therefore, be advantageous to increase the width of this interface. This, however, inevitably necessitates a corresponding increase in the number of pins of the processor package and is associated with higher cost.

Fourthly, the bus-width to main memory is usually far smaller than the block size of the second-level cache. Loading a block from main memory into the second-level cache therefore requires several successive main memory accesses.

The present work is based on a 3-D technology that enables the realization of multiple chip-layers on top of each other to yield 3-D IC's. Signal transfer between adjacent chip-layers is enabled by very small (2-5 µm) vertical vias which allow interconnect-densities of more than 50,000/cm² to be achieved. In the approaches which have recently been proposed for the realization of 3-D IC's [1]-[8], the individual chip-layers are manufactured independently of each other. The upper chip-layers are thinned from the rear side to a thickness of approximately 10 µm prior to stacking to keep the length of the vertical interconnects short and to facilitate their fabrication. As an example, Fig. 3 shows a cross-sectional view of a sample 3-D IC comprising two chip-layers.
Fig. 3. SEM-photomicrograph of a 3-D IC that consists of two chip-layers. (Labels: interconnect, silicon, BOX, bulk-silicon.) The remaining silicon of the upper layer is approximately 2-µm thick and the polyimide in between the chip-layers 1 µm.
The fundamental advantage of 3-D IC's over conventional IC's is the substantial increase in integration density. A 3-D IC with n layers may feature (n - 1)-times more transistors than a conventional IC. Therefore, when 3-D technology is available, the amount of transistors on a single device will increase dramatically.

From here on, we will refer to any functional unit as on-chip if it is integrated in one IC or 3-D IC with the processor, and as off-chip if this is not the case.

Using 3-D technology, a RISC-system may be implemented as shown in Fig. 4(a), where processor, first-level cache, and second-level cache are integrated on one 3-D IC. The signal transmission delay for second-level cache accesses is substantially reduced and the limitations concerning the bus-width to the second-level cache are removed. In a further step, the main memory may in addition be integrated as illustrated in Fig. 4(b). In this case, the limitations concerning the bus-width to main memory are also removed.

Recently, other 3-D packaging approaches besides those mentioned above have been presented. In [17]-[21], schemes are investigated where chip-layers are first stacked and then electrically interconnected by sidewall-metallizations. Using such technologies, "cubes" with as many as 70 chip-layers have successfully been manufactured [19]. Schemes of this sort may also be employed for realizing system implementations similar to those shown in Fig. 4. For example, the main memory DRAM components may be stacked using such technologies and mounted on the processor-IC using a high density interconnect technology (for example flip-chip). Depending on parameters such as the electrical performance of the interchip vias and the realizable interconnect densities, the investigations performed below may in principle also apply to systems fabricated using such technologies.

Since thermal management of 3-D IC's is frequently believed to be a major problem, we will briefly address this issue here. Fig. 5 shows the modeled temperature variation along the vertical axis of a three-dimensionally integrated RISC-system comprising eleven chip-layers. The lowest layer of the 3-D IC represents the microprocessor, which is assumed to generate a power density of 15 W/cm²; the second and third contain cache memory with 0.75 W/cm² each, and the remaining eight are DRAM's with 0.15 W/cm². The values
`

Fig. 4. Implementations of a RISC-system using 3-D technology: (a) integration of processor, first-level cache, and second-level cache on one 3-D IC and (b) integration of processor, first-level cache, second-level cache, and main memory on one 3-D IC.
shown in the figure are temperature increases relative to the bottom surface of the 3-D IC and were computed based on thermal parameters which were measured on 3-D IC's of the type shown in Fig. 3 using specialized test structures. In the example, the overall temperature difference inside the 3-D IC is less than 1.3 °C. The temperature drop is therefore very small, which is primarily due to the fact that the glue layers are extremely thin and that the power densities in the upper chip-layers are relatively small. Also, since the processor dominates the total power consumption, the power density that has to be removed from the 3-D IC is only 18% larger than that of the processor itself. Consequently, we do not expect thermal management of 3-D IC's to become a major point of concern for applications such as the one investigated in the present work.
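The scale of these numbers can be checked with a simple one-dimensional series-resistance estimate. The layer powers (15, 2 × 0.75, and 8 × 0.15 W/cm²) are taken from the text; the film thicknesses and thermal conductivities below are assumed round values for thinned silicon and polyimide, not the measured parameters behind Fig. 5, so only the order of magnitude should be read into the result.

```python
# 1-D thermal sketch: heat generated in the upper chip-layers flows down to the
# bottom surface, crossing one thinned-silicon film and one glue film per
# interface. Assumed films (not the paper's measured values): 10 um silicon
# (k = 1.5 W/cm.K) and 1 um polyimide (k = 0.0015 W/cm.K).

def top_layer_temperature_rise(powers_w_per_cm2, r_interface_cm2k_per_w):
    """Temperature of the top layer relative to the bottom surface, assuming
    all heat exits through the bottom. Interface i carries the heat generated
    in every layer above it."""
    rise = 0.0
    n = len(powers_w_per_cm2)
    for i in range(n - 1):
        flux = sum(powers_w_per_cm2[i + 1:])  # W/cm^2 crossing interface i downward
        rise += flux * r_interface_cm2k_per_w
    return rise

powers = [15.0, 0.75, 0.75] + [0.15] * 8            # processor, 2x cache, 8x DRAM
r_if = 1e-4 / 0.0015 + 1e-3 / 1.5                   # glue + thinned silicon, cm^2.K/W
delta_t = top_layer_temperature_rise(powers, r_if)  # sub-kelvin for these films

# The 18% figure from the text: total power density relative to the processor's.
overhead = sum(powers) / powers[0] - 1.0
```

With these assumed films the rise comes out well under a kelvin, the same order as the 1.3 °C quoted above, and the 18% power-density overhead follows directly from the layer powers.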
III. PERFORMANCE MODEL

A model is used to evaluate the performance of the various types of implementations. The average time per instruction (ATPI) is used as performance measure. To compute the ATPI, the following factors are required:

Fig. 5. Modeled temperature variation along the vertical axis of a 3-D integrated RISC-system comprising a processor in the lowest chip-layer, cache in the subsequent two, and DRAM's in the upper eight layers. The values shown in the figure represent temperature increases relative to the bottom surface of the 3-D IC. (Horizontal axis: location along z-axis [µm].)

1) miss rates as a function of cache size, associativity, and block size;
2) access times for on-chip caches as a function of cache size, cache organization, associativity, block size, and technology, as well as access times for off-chip caches and main memory accesses; and
3) a performance model that yields the average time per instruction as a function of miss rates and access times of the individual levels of the hierarchy.
A. Miss Rates

Miss rates measured by Gee et al. [22] for the SPEC92 benchmark suite are used throughout this study. Gee et al. evaluated miss rates for all six integer-intensive and all 14 floating-point-intensive programs of the suite. Instruction, data, and unified caches ranging from 1 kB to 1 MB with block sizes of 16, 32, 64, 128, and 256 bytes, and associativities of 1, 2, 4, and 8 were investigated. The replacement algorithm is least recently used (LRU). All miss rates are given in terms of misses per reference.

For the most part of this work, the average miss rates for the complete SPEC92 benchmark suite are used. The rule given by Higbie [23], namely that doubling cache size reduces miss rate by 25%, was found to apply well to these measured values. Since we are also interested in performance evaluations for caches larger than 1 MB, Higbie's rule is used to extrapolate miss rates for 2 MB and 4 MB large caches. Fig. 6 shows some average miss rates for a direct-mapped and a two-way set-associative unified cache as a function of cache size.
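The extrapolation rule can be written down directly. The starting miss rate in the example below is a placeholder, not one of Gee et al.'s measured values.

```python
import math

def extrapolate_miss_rate(mr_at_base, target_size_kb, base_size_kb=1024):
    """Higbie's rule: each doubling of cache size cuts the miss rate by 25%."""
    doublings = math.log2(target_size_kb / base_size_kb)
    return mr_at_base * 0.75 ** doublings

# Placeholder 1% miss rate at 1 MB -> 0.75% at 2 MB, 0.5625% at 4 MB.
mr_2mb = extrapolate_miss_rate(0.01, 2048)
mr_4mb = extrapolate_miss_rate(0.01, 4096)
```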
To investigate the dependence of memory hierarchy performance on miss rates, simulations are also performed for programs exhibiting significantly higher miss rates. The floating-point-intensive programs swm256, a shallow water equation solver, and the tomcatv work load, a mesh-generator, were selected for this purpose. In Fig. 7, some of the measured miss rates of these programs are shown. As can be seen, miss rates are higher than those in Fig. 6. At a cache size of 1 MB, miss rates are still above or in the range of 1%. The humped behavior of the curves is due to the nature of the programs.
`

Fig. 6. Average miss rates for the complete SPEC92 benchmark suite for unified caches with a block size of 32 B. (Curves: direct-mapped and two-way set-associative; horizontal axis: cache size [kB]; vertical axis: miss rate [%].)
Fig. 7. Miss rates for the swm256 and the tomcatv program for unified caches with a block size of 32 bytes [22]. (Horizontal axis: cache size [kB]; vertical axis: miss rate [%].)
B. Access Times

Access times of on-chip caches are computed on the basis of the analytical equations derived by Wilton and Jouppi [24], which are an extension of the model proposed by Wada et al. [25]. Wilton and Jouppi compute the access time of a cache from the delays of data/tag decoder, data/tag word line, data/tag bit line, data/tag sense amplifier, comparator, valid signal driver, data output driver, precharge, and multiplexor in the case of a set-associative cache. (We will use the term access time throughout this work even though, strictly speaking, the cycle time, which is the minimum time between the start of two successive accesses, is being referred to.) Equations are expressed as a function of cache size, block size, associativity, data output width, process parameters, and array configuration parameters such as the number of sections per word line or bit line. The array parameters are always optimized to yield minimum access time. We modified the model of Wilton and Jouppi by inserting cascaded drivers at several locations and by implementing a different decoding scheme to make it suitable also for the evaluation of large caches and caches with a large number of output bits. In addition, we scaled parameters to match a high performance 0.5-µm complementary metal-oxide-semiconductor (CMOS) technology. Fig. 8 shows calculated access times for a direct-mapped and a two-way set-associative cache as a function of cache size. Access times increase with cache size and are larger for the two-way set-associative cache due to the additional circuitry.

Fig. 8. Access time of a direct-mapped and a two-way set-associative cache as a function of its size. The block size is 16 B and the number of output bits is 64.
Now let us consider off-chip cache accesses. It is assumed that these are pipelined into three stages. In the first stage, the address is sent from the processor to the SRAM's; in the second stage, the actual SRAM access is performed; and in the third stage, the data are sent from the SRAM's to the processor. The minimum time required for a pipeline stage is determined by the stage that exhibits the longest delay. Assuming an SRAM access time (TACCSRAM) of 7 ns, investigations for a specific PCB implementation [26] show that TACCSRAM is larger than the time required for sending the address and than that for sending the data. Therefore, the minimum length of the pipeline stage is set by the access time of the SRAM, TACCSRAM. When the off-chip second-level cache is accessed, the first data is available after three pipeline intervals and successive data are available at the end of each consecutive pipeline interval.

Main memory accesses are characterized by two time constants: TMMLAT and TFPCYC. TMMLAT is a latency that is followed by a period of accesses. The time required between successive accesses is TFPCYC and is primarily determined by the timing of the DRAM's.
C. Average Time Per Instruction

The memory hierarchies investigated in this work contain two levels of cache. Fig. 9 shows a block diagram of the basic memory hierarchy organization. Both caches are inserted in series between processor and main memory. The first-level cache is split into data- and instruction-caches of equal size: SLC1/2. The access time TLC1 is evaluated using the access time model for on-chip caches as a function of cache size, block size, and the number of output bits. Throughout this work it is assumed that the first-level cache access is pipelined into two stages and that the processor cycle time TCYC is determined by

    TCYC = TLC1/2.    (1)

Therefore, a variation in size or block size of the first-level cache impacts the processor cycle time. The data bus-width between first- and second-level cache is BWLC2. A write buffer with a sufficient depth that it essentially never fills up is located between first- and second-level cache. The second-level cache is a unified cache and has a size SLC2, a block size BLC2, and an associativity ALC2. It is either implemented on-chip or off-chip. Second-level cache and main memory are connected by a BWMM bit wide data bus.

Fig. 9. Block diagram of the investigated memory hierarchy. (Bus-widths refer to data bits only.) The first-level cache is split into instruction- and data-caches of SLC1/2 each, on-chip, direct-mapped, with two pipeline stages; the main memory is implemented off-chip or on-chip.

The system is assumed to issue one instruction per clock cycle, neglecting the effects of cache misses. Therefore, the base CPI (cycles per instruction) is one. The ATPI is expressed as

    ATPI = TLC1 + MRPI · MRLC1 · HTLC2 + MRPI · MRLC1 · MRLC2 · MPLC2    (2)

where HT stands for hit time, MR for miss rate, MP for miss penalty, and MRPI for memory references per instruction. Since the first-level cache consists of split instruction- and data caches, the miss rate MRLC1 is given by

    MRLC1 = (MRI-LC1 + DMRPI · MRD-LC1) / (1 + DMRPI)    (3)

where MRI-LC1 and MRD-LC1 represent the miss rates of instruction and data cache, respectively. DMRPI stands for data memory references per instruction.

On a second-level cache hit, a BLC1 large block of data is read from the second-level cache and transferred to the first-level cache. The hit time HTLC2 is calculated from (4):

    HTLC2 = (2 + BLC1/BWOFFLC2) · ⌈TACCSRAM/TCYC⌉ · TCYC + TLC1,   off-chip second-level cache
    HTLC2 = ⌈TLC2/TCYC⌉ · TCYC + TLC1,                              on-chip second-level cache.    (4)

This equation requires some elaboration. The SRAM access time TACCSRAM and the access time of the on-chip second-level cache TLC2 are rounded to the next higher integer multiples of the processor cycle time. For an off-chip second-level cache, the width of the data bus to the second-level cache, BWLC2, is equal to BWOFFLC2, which is fixed to 128 bits in our study. Therefore, the larger the block size BLC1 of the first-level cache, the larger the hit time HTLC2 due to the successive accesses required to load a block. For an on-chip second-level cache, the bus-width constraints are removed and it is assumed that the data bus-width BWLC2 is selected to match the block size of the first-level cache. Consequently, only one second-level cache access is required to load a block into the first-level cache, independent of the block size BLC1. This results in a substantial reduction of the hit time HTLC2 over that of an off-chip cache. We note that some of this decrease could also be achieved by pipelining the on-chip cache; however, widening the interface to the block size of the first-level cache yields the lowest hit time and is assumed throughout this study to demonstrate the potential of 3-D technology. The term TLC1 in (4) is due to the time that is required to write the last portion of data in the first-level cache. (All other writes are performed in parallel with read accesses of the second-level cache.)

The penalty for a second-level cache miss is evaluated in a similar fashion using (5). Again, timing parameters are rounded to the next higher integer multiples of the processor cycle time. For off-chip main memory, the data bus-width to main memory, BWMM, is selected to BWOFFMM, which is 128 b. BLC2/BWOFFMM successive accesses are required to load a block into the second-level cache. In the case of on-chip main memory, the data bus-width BWMM is selected to match the block size of the second-level cache. Then only a single main memory access is required to load a block from main memory, independent of BLC2. In (5), TWRLC2 is the time to write the last BWMM data bits in the second-level cache. TWRLC2 is equal to ⌈TLC2/TCYC⌉ · TCYC for an on-chip second-level cache and equal to 3 · ⌈TACCSRAM/TCYC⌉ · TCYC for an off-chip second-level cache.

The model parameters used throughout this work are summarized in Table I.
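The two branches of the hit-time expression (4) can be sketched as follows. The 7-ns SRAM access time and the 128-b off-chip bus-width are from the text; the cycle time and cache access times in the example are placeholders, not values from the paper.

```python
import math

def ceil_to_cycle(t, t_cyc):
    """Round a delay up to the next integer multiple of the processor cycle time."""
    return math.ceil(t / t_cyc) * t_cyc

def ht_lc2_off_chip(b_lc1_bits, t_lc1, t_cyc, t_acc_sram=7.0, bw_offlc2=128):
    """Off-chip branch of (4): (2 + B_LC1/BW_OFFLC2) pipeline intervals, each as
    long as the rounded SRAM access time, plus T_LC1 to write the last portion
    of data into the first-level cache."""
    interval = ceil_to_cycle(t_acc_sram, t_cyc)
    return (2 + b_lc1_bits / bw_offlc2) * interval + t_lc1

def ht_lc2_on_chip(t_lc2, t_lc1, t_cyc):
    """On-chip branch of (4): a single wide access, since the bus matches the
    first-level block size, plus T_LC1 for the final write."""
    return ceil_to_cycle(t_lc2, t_cyc) + t_lc1

# Placeholder example: 32-B (256-b) first-level block, T_CYC = 2 ns, T_LC1 = 4 ns.
ht_off = ht_lc2_off_chip(256, t_lc1=4.0, t_cyc=2.0)      # (2 + 2) * 8 + 4 = 36 ns
ht_on = ht_lc2_on_chip(t_lc2=6.0, t_lc1=4.0, t_cyc=2.0)  # 6 + 4 = 10 ns
```

The gap between the two branches is exactly the bus-width effect described in the text: the off-chip hit time grows with the first-level block size, the on-chip one does not.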
`

TABLE I
MODEL PARAMETERS

TACCSRAM: 7 ns
TMMLAT: 50 ns
TFPCYC: 30 ns
DMRPI (unless specified otherwise): 0.34
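Equation (2) with the Table I parameters can then be evaluated as below. DMRPI = 0.34 is from Table I; treating MRPI as 1 + DMRPI (one instruction fetch plus DMRPI data references per instruction) is an assumption of this sketch, and the miss rates, hit time, and miss penalty in the example are placeholders.

```python
def atpi(t_lc1, mr_lc1, ht_lc2, mr_lc2, mp_lc2, dmrpi=0.34):
    """Average time per instruction per (2):
    ATPI = T_LC1 + MRPI*MR_LC1*HT_LC2 + MRPI*MR_LC1*MR_LC2*MP_LC2,
    assuming MRPI = 1 + DMRPI (instruction fetch plus data references)."""
    mrpi = 1.0 + dmrpi
    return (t_lc1
            + mrpi * mr_lc1 * ht_lc2
            + mrpi * mr_lc1 * mr_lc2 * mp_lc2)

# Placeholder operating point (all times in ns):
example = atpi(t_lc1=4.0, mr_lc1=0.05, ht_lc2=36.0, mr_lc2=0.02, mp_lc2=100.0)
```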
We would like to note that in the presented memory hierarchy performance model, the impact of more complicated fetch strategies such as early continuation, load forwarding, or streaming, which allow the processor to continue before the entire block has been fetched, is neglected. Przybylski [27] found that the simplest fetch strategy, as used in the above model, works well and that the performance improvement of the more intricate ones is less than indicated by the accompanying reduction in miss ratios because of limited memory resources and temporal clustering.
IV. CASE STUDIES

A. Conventional Implementation

Our investigations start with a conventionally implemented system. The first-level cache is direct-mapped and implemented on-chip; the second-level cache is also direct-mapped and realized off-chip. Using the performance model, the average time per instruction is evaluated as a function of first- and second-level cache sizes. Simulation results are shown in Fig. 10. The ATPI generally decreases with increasing size of the second-level cache due to the associated reduction in miss rates. The absolute improvement, however, diminishes with increasing second-level cache size, since the reduction of miss rate becomes smaller.

Fig. 10. Performance of a conventionally implemented system with on-chip first-level cache and off-chip second-level cache. (Horizontal axis: size of second-level cache [kB].)

The dependence of ATPI on first-level cache size is a little more complex. For small first-level caches the processor cycle time is short, but due to the resulting high miss rate and the large miss penalties implied by the off-chip second-level cache access, the ATPI is relatively large at any second-level cache size. As the first-level cache size is increased, the lower miss rates more than compensate for the increase in processor cycle time, such that the ATPI decreases. As the first-level cache size is further increased, a point is reached where the reduction in miss rate is no longer able to compensate for the increase in first-level cache access time and the associated increase in processor cycle time. The level of the ATPI then starts to rise again. In Fig. 10 this point is reached at a first-level cache size of 128 kB. Therefore, increasing the size of the first-level cache beyond this point degrades performance, and consequently the use of 3-D technology to realize large on-chip first-level caches does not appear to offer any advantage. Note that this conclusion has been derived under the assumption of a first-level cache with two pipeline stages. The dependence of processor cycle time on cache access time can basically be reduced by further pipelining the cache access. Optimum first-level cache sizes are then larger. However, this also has negative effects, such as the degradation of CPI due to load delays and branch delays, and is beyond the scope of this work.
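The U-shaped tradeoff just described can be reproduced with a toy sweep: the miss rate falls 25% per doubling (Higbie's rule) while the access time, and via (1) the cycle time, grows with cache size. The access-time growth law, starting miss rate, and miss penalty below are invented for illustration; only the existence of an interior optimum, not any particular cache size, should be read into the result.

```python
def sweep_first_level_size(miss_penalty_ns=40.0):
    """Toy one-level sweep: ATPI ~ T_CYC + miss_rate * penalty, with
    T_CYC = T_LC1 / 2 per (1). Access-time and miss-rate laws are invented."""
    results = []
    size_kb, miss_rate, t_lc1 = 4, 0.10, 3.0
    while size_kb <= 1024:
        t_cyc = t_lc1 / 2.0
        results.append((size_kb, t_cyc + miss_rate * miss_penalty_ns))
        size_kb *= 2
        miss_rate *= 0.75      # Higbie's rule per doubling
        t_lc1 += 0.6           # assumed access-time growth per doubling
    return results

curve = sweep_first_level_size()
best_size, best_atpi = min(curve, key=lambda p: p[1])
# ATPI first falls, then rises: the optimum lies strictly inside the swept range.
```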
B. Moving Second-Level Cache and Main Memory On-Chip

If the second-level cache is moved on-chip, its access time is reduced. In addition, the bus-width between first- and second-level cache may be selected equal to the block size of the first-level cache. Therefore, on a second-level cache hit, only one second-level cache access is required to fill a first-level cache line instead of several successive accesses. Fig. 11 shows performance results for an implementation with an on-chip direct-mapped second-level cache. The deviations from smooth behavior that can be seen in the figure are due to the rounding of timing parameters to integer multiples of the processor cycle time. In comparison with the conventional implementation investigated above, the ATPI is much less dependent on the size of the first-level cache. This is a result of the substantial reduction of the penalty associated with second-level cache accesses. Also, the individual curves are lower; that is, the ATPI is generally reduced. Furthermore, the optimum size of the first-level cache is 32 kB and therefore smaller than for the conventional implementation.

Fig. 12 allows a direct quantitative comparison of the performance of four different implementations (two of which have so far not been discussed). In the figure, for every second-level cache size of each implementation, the first-level cache size was chosen to yield minimum (optimum) ATPI, and only the corresponding ATPI value is shown. The performance improvement achieved when moving the second-level cache on-chip can be seen by comparing the two upper curves. At a second-level cache size of 1 MB, the ATPI is reduced from 4 ns to 3.5 ns (13%).
Set-associative caches generally exhibit lower miss rates than direct-mapped caches of the same size. A set-associative organization of an off-chip cache is usually associated with
Fig. 11. Performance of an implementation with a direct-mapped on-chip second-level cache realized using 3-D technology. (Curves: total size of first-level cache from 2 kB to 2048 kB; horizontal axis: size of second-level cache [kB].) (LC2 = second-level cache, DM = direct-mapped, Ass. 2 = two-way set-associative, and MM = main memory.)

Fig. 12. Performance comparison of four implementations. (Implementation types: LC2 off-chip DM (conv.); LC2 on-chip DM (3-D IC); LC2 on-chip Ass. 2 (3-D IC); LC2 on-chip Ass. 2, MM on-chip (3-D IC). Horizontal axis: size of second-level cache [kB].)
a prohibitively large penalty in terms of increase in hit time in comparison with a direct-mapped cache. For an on-chip cache, the increase in hit time can be small enough to justify the choice of a set-associative cache. The performance improvement achievable by organizing the on-chip second-level cache as two-way set-associative can be seen from Fig. 12. Especially for the smaller second-level cache sizes, the two-way set-associative organization is clearly superior to the direct-mapped one. This results from the fact that the improvement in miss rate when moving from a direct-mapped to a set-associative cache organization is more pronounced for small caches than for large ones.

Simulations were also performed for the case that the main memory is additionally integrated on-chip. The timing parameters of the main memory, TMMLAT and TFPCYC, were assumed to be unchanged by this modification. The bus-width to main memory, BWMM, was selected equal to the block size of the second-level cache, and the main memory was presumed to be organized with the same width. Therefore, in contrast to the above implementations, only one main memory access is required to load a complete cache line from main memory into the second-level cache. Results for an implementation where the second-level cache is organized as two-way set-associative are also shown in Fig. 12. The reduced penalty for main memory accesses leads to a further reduction of the ATPI, particularly for the smaller second-level cache sizes, since these are associated with more frequent main memory accesses.

Fig. 13. Dependence of performance on the block size of the first-level cache for two implementations with a 1 MB large second-level cache. (Horizontal axis: block size of first-level cache [B].)
C. Dependence of Performance on Block Size

It is a known fact that miss rates decrease as block sizes increase until the so-called "pollution point" is reached [9]. In conventional implementations where bus-widths are limited, large block sizes are associated with large miss penalties due to the successive accesses required for loading a cache line. Since block sizes are chosen to minimize ATPI and not misses, optimum block sizes are smaller than desirable from a miss rate point of view. However, if the second-level cache is realized on-chip, the block size of the first-level cache may be selected larger without an increase in miss penalty. Consequently, optimum block sizes are expected to be larger. Fig. 13 shows the ATPI as a function of the block size of the first-level cache for an implementation with an off-chip second-level cache as well as one with an on-chip second-level cache. In the former case, the optimum block size is 32 B and the ATPI is degraded if the block size is increased beyond this point. For the implementation with on-chip second-level cache, the strong deterioration of ATPI at the larger block sizes does not occur and optimum performance is achieved at a block size of 128 B.
D. Dependence of Performance Improvement on Miss Rates

In Fig. 14, the performance of the four types of implementations discussed above is shown as a function of second-level cache size for the floating-point-intensive programs swm256 and tomcatv. The respective performances for the average miss rates of the complete SPEC92 benchmark suite were shown in Fig. 12. Especially for large cache sizes, these two programs exhibit miss rates substantially larger than those of the complete benchmark suite (see Figs. 6 and 7).

Let us first consider the swm256 work load. Moving the direct-mapped second-level cache on-chip yields a reduction of ATPI in the range of 10-15% for most cache sizes. Organizing this cache two-way set-associative does not lead to any considerable further improvement because miss rates are very similar for the two types of organizations (cf. Fig. 7). Howe
