The 16th Annual
International Symposium on
COMPUTER ARCHITECTURE

Jerusalem, Israel

The Institute of Electrical and Electronics Engineers, Inc.
Published by

IEEE Computer Society Press
1730 Massachusetts Avenue, N.W.
Washington, D.C. 20036-1903
Cover design by Jack I. Ballestero

Copyright © 1989 by Association for Computing Machinery, Inc.

Copying without fee is permitted provided that the copies are not made or distributed for direct
commercial advantage and credit to the source is given. Abstracting with credit is permitted.
For other copying of articles that carry a code at the bottom of the first page, copying is
permitted provided that the per-copy fee indicated in the code is paid through the Copyright
Clearance Center, 29 Congress Street, Salem, MA 01970. For permission to republish, write to:
Director of Publications, Association for Computing Machinery, 11 West 42nd Street, New York,
NY 10036.

IEEE Computer Society Order Number 1948
Library of Congress Number 85-642899
IEEE Catalog Number 89CH2705-2
ISBN 0-8186-1948-1 (paper)
ISBN 0-8186-5948-3 (microfiche)
ISBN 0-8186-8948-X (case)
ACM Order Number 415890
ISBN 0-89791-319-1
ISSN 0884-7495
Additional copies of the 1989 Proceedings may be ordered from:

IEEE Computer Society
Order Department
10662 Los Vaqueros Circle
Los Alamitos, CA 90720-2578

IEEE Service Center
445 Hoes Lane
P.O. Box 1331
Piscataway, NJ 08855-1331

ACM Order Department
P.O. Box 64145
Baltimore, MD 21264

IEEE Computer Society
13, Avenue de l'Aquilon
B-1200 Brussels
BELGIUM

IEEE Computer Society
Ooshima Building
2-19-1 Minami-Aoyama
Minato-ku, Tokyo 107, JAPAN

THE INSTITUTE OF ELECTRICAL AND ELECTRONICS ENGINEERS, INC.
The 16th Annual International Symposium on Computer Architecture

Table of Contents

Cache Coherence and Synchronization I
Chair: M. Dubois, USC

Evaluating the Performance of Four Snooping Cache Coherency Protocols ................ 2
S.J. Eggers and R.H. Katz
Multi-level Shared Caching Techniques for Scalability in VMP-MC ................... 16
D.R. Cheriton, H.A. Goosen, and P.D. Boyle
Design and Performance of a Coherent Cache for Parallel Logic Programming
Architectures .................................................. 25
A. Goto, A. Matsumoto, and E. Tick

Dataflow
Chair: I. Koren, UMASS Amherst

The Epsilon Dataflow Processor ....................................... 36
V.G. Grafe, G.S. Davidson, J.E. Hoch, and V.P. Holmes
An Architecture of a Dataflow Single Chip Processor ........................... 46
S. Sakai, Y. Yamaguchi, K. Hiraki, Y. Kodama, and T. Yuba
Exploiting Data Parallelism in Signal Processing on a Dataflow Machine ................ 54
P. Nitezki

Pipeline Architectures
Chair: A. Gottlieb, NYU

Architectural Mechanisms to Support Sparse Vector Processing ..................... 64
R.N. Ibbett, T.M. Hopkins, and K.I.M. McKinnon
A Dynamic Storage Scheme for Conflict-Free Vector Access ...................... 72
D.T. Harper III and D.A. Linebarger
SIMP (Single Instruction stream/Multiple instruction Pipelining): A Novel High-Speed
Single-Processor Architecture ........................................ 78
K. Murakami, N. Irie, M. Kuga, and S. Tomita

Mapping Algorithms
Chair: A.L. Davis, HP

2-D SIMD Algorithms in the Perfect Shuffle Networks .......................... 88
Y. Ben-Asher, D. Egozi, and A. Schuster
Systematic Hardware Adaptation of Systolic Algorithms ......................... 96
M. Valero-García, J.J. Navarro, J.M. Llabería, and M. Valero
Task Migration in Hypercube Multiprocessors .............................. 105
M.-S. Chen and K.G. Shin

Uniprocessor Caches
Chair: D. Alpert, INTEL

Characteristics of Performance-Optimal Multi-Level Cache Hierarchies ............... 114
S. Przybylski, M. Horowitz, and J. Hennessy
Supporting Reference and Dirty Bits in SPUR's Virtual Address Cache ............... 122
D.A. Wood and R.H. Katz
Inexpensive Implementations of Set-Associativity ............................ 131
R.E. Kessler, R. Jooss, A. Lebeck, and M.D. Hill
Organization and Performance of a Two-Level Virtual-Real Cache Hierarchy ............ 140
W.-H. Wang, J.-L. Baer, and H.M. Levy
Networks
Chair: H.J. Siegel, Purdue University

High Performance Communications in Processor Networks ....................... 150
C.R. Jesshope, P.R. Miller, and J.T. Yantchev
Introducing Memory into the Switch Elements of Multiprocessor Interconnection Networks ... 158
H.E. Mizrahi, J.-L. Baer, E.D. Lazowska, and J. Zahorjan
Using Feedback to Control Tree Saturation in Multistage Interconnection Networks ........ 167
S.L. Scott and G.S. Sohi
Constructing Replicated Systems with Point-to-Point Communication Links ............. 177
P.D. Ezhilchelvan, S.K. Shrivastava, and A. Tully

Prolog Architectures
Chair: R. Ginosar, Technion

KCM: A Knowledge Crunching Machine ................................... 186
H. Benker, J.M. Beacco, S. Bescos, M. Dorochevsky, Th. Jeffré, A. Pöhlmann, J. Noyé,
B. Poterie, A. Sexton, J.C. Syre, O. Thibault, and G. Watzlawik
A High Performance Prolog Processor with Multiple Function Units ................. 195
A. Singhal and Y.N. Patt
Evaluation of Memory System for Integrated Prolog Processor IPP .................. 203
M. Morioka, S. Yamaguchi, and T. Bandoh
A Type Driven Hardware Engine for Prolog Clause Retrieval over a Large Knowledge Base ... 211
K.-F. Wong and M.H. Williams
Instruction Fetching
Chair: I. Spillinger, Technion

Comparing Software and Hardware Schemes for Reducing the Cost of Branches .......... 224
W.W. Hwu, T.M. Conte, and P.P. Chang
Improving Performance of Small On-Chip Instruction Caches ..................... 234
M.K. Farrens and A.R. Pleszkun
Achieving High Instruction Cache Performance with an Optimizing Compiler ............ 242
W.W. Hwu and P.P. Chang
The Impact of Code Density on Instruction Cache Performance .................... 252
P. Steenkiste

Parallel Architectures
Chair: H. Mühlenbein, GMD

Can Dataflow Subsume von Neumann Computing? ........................... 262
R.S. Nikhil and Arvind
Exploring the Benefits of Multiple Hardware Contexts in a Multiprocessor Architecture:
Preliminary Results ............................................. 273
W.-D. Weber and A. Gupta
Architectural and Organizational Tradeoffs in the Design of the MultiTitan CPU .......... 281
N.P. Jouppi
Run-Time Checking in Lisp by Integrating Memory Addressing and Range Checking ........ 290
M. Sato, S. Ichikawa, and E. Goto
Performance Evaluation
Chair: D. Siewiorek, CMU

Multiple vs. Wide Shared Bus Multiprocessors .............................. 300
A. Hopper, A. Jones, and D. Lioupis
Performance Measurements on a Commercial Multiprocessor Running Parallel Code ....... 307
M. Annaratone and R. Rühl
Interprocessor Communication Speed and Performance in Distributed-Memory Parallel
Processors ...................................................
Analysis of Computation-Communication Issues in Dynamic Dataflow Architectures ........ 325
D. Ghosal, S.K. Tripathi, L.N. Bhuyan, and H. Jiang

Logic Simulation Systems
Chair: E. Kronstadt, IBM

Logic Simulation on Massively Parallel Architectures .......................... 336
S. Kravitz, R.E. Bryant, and R. Rutenbar
R256: A Research Parallel Processor for Scientific Computation ................... 344
T. Fukazawa, T. Kimura, M. Tomizawa, K. Takeda, and Y. Itoh

Special Purpose Architectures
Chair: S. Ruhman, Weizmann Institute

A Three-Port/Three-Access Register File for Concurrent Processing and I/O
Communication in a RISC-Like Graphics Engine ............................ 354
M.L. Anido, D.J. Allerton, and E.J. Zaluska
An Architecture Framework for Application-Specific and Scalable Architectures .......... 362
J.M. Mulder, R.J. Farrier, A. Srivastava, and R. in 't Velt

Memory Systems
Chair: G. Frieder, Syracuse University

Perfect Latin Squares and Parallel Array Access ............................. 372
K. Kim and V.K. Prasanna Kumar
An Aperiodic Storage Scheme to Reduce Memory Conflicts in Vector Processors ......... 380
S. Weiss
Analysis of Vector Access Performance on Skewed Interleaved Memory ............... 387
C.-L. Chen and C.-K. Liao

Cache Coherence and Synchronization II
Chair: J. Goodman, University of Wisconsin, Madison

Adaptive Backoff Synchronization Techniques .............................. 396
A. Agarwal and M. Cherian
A Cache Consistency Protocol for Multiprocessors with Multistage Networks ............ 407
P. Stenström
On Data Synchronizations for Multiprocessors .............................. 416
H.-M. Su and P.-C. Yew

Author Index ................................................. 425
Inexpensive Implementations of Set-Associativity

R.E. Kessler†, Richard Jooss, Alvin Lebeck, and Mark D. Hill*

University of Wisconsin
Computer Sciences Department
Madison, Wisconsin 53706

ABSTRACT

The traditional approach to implementing wide set-associativity is expensive, requiring a wide tag memory (directory) and many comparators. Here we examine alternative implementations of associativity that use hardware similar to that used to implement a direct-mapped cache. One approach scans tags serially from most-recently used to least-recently used. Another uses a partial compare of a few bits from each tag to reduce the number of tags that must be examined serially. The drawback of both approaches is that they increase cache access time by a factor of two or more over the traditional implementation of set-associativity, making them inappropriate for cache designs in which a fast access time is crucial (e.g., level one caches, caches directly servicing processor requests).

These schemes are useful, however, if (1) the low miss ratio of wide set-associative caches is desired, (2) the low cost of a direct-mapped implementation is preferred, and (3) the slower access time of these approaches can be tolerated. We expect these conditions to be true for caches in multiprocessors designed to reduce memory interconnection traffic, caches implemented with large, narrow memory chips, and level two (or higher) caches in a cache hierarchy.

1. Introduction

The selection of associativity has significant impact on cache performance and cost [Smit86] [Smit82] [Hill87] [Przy88a]. The associativity (degree of associativity, set size) of a cache is the number of places (block frames) in the cache where a block may reside. Increasing associativity reduces the probability that a block is not found in the cache (the miss ratio) by decreasing the chance that recently referenced blocks map to the same place [Smit78]. However, increased associativity may nonetheless result in longer effective access times since it can increase the latency to retrieve data on a cache hit [Hill88, Przy88a]. When it is important to minimize hit times, direct-mapped (associativity of one) caches

† This work has been supported by graduate fellowships from the National Science Foundation and the University of Wisconsin-Madison.
* This work was sponsored in part by research initiation grants from the graduate school of the University of Wisconsin-Madison.

Permission to copy without fee all or part of this material is granted provided that the copies are not made or distributed for direct commercial advantage, the ACM copyright notice and the title of the publication and its date appear, and notice is given that copying is by permission of the Association for Computing Machinery. To copy otherwise, or to republish, requires a fee and/or specific permission.

© 1989 ACM 0884-7495/89/0000/0131$01.50

may be preferred over caches with higher associativity.

Wide associativity is important when: (1) miss times are very long or (2) memory and memory interconnect contention delay is significant or sensitive to cache miss ratio. These points are likely to be true for shared memory multiprocessors. Multiprocessor caches typically service misses via a multistage interconnect or bus. When a multi-stage interconnect is used the miss latency can be large whether or not contention exists. Bus miss times with low utilizations may be small, but delays due to contention among processors can become large and are sensitive to cache miss ratio. As the cost of a miss increases, the reduced miss ratio of wider associativity will result in better performance when compared to direct-mapped caches.

Associativity is even more useful for level two caches in a two-level multiprocessor cache hierarchy. While the level one cache must service references from the processor at the speed of the processor, the level two cache can be slower since it services only processor references that miss in the level one cache. The additional hit time delay incurred by associativity in the level two cache is not as important [Przy88b]. Reducing memory and memory interconnect traffic is a larger concern. Wide associativity also simplifies the maintenance of multi-level inclusion [Baer88]. This is the property that all data contained in lower level caches is contained in their corresponding higher level caches. Multi-level inclusion is useful for reducing coherency invalidations to level one caches. Finally, preliminary models indicate that increasing associativity reduces the average number of empty cache block frames when coherency invalidations are frequent¹. This implies that wider associativity will result in better utilization of the cache.

Unfortunately, increasing associativity is likely to increase the board area and cost of the cache relative to a direct-mapped cache. Traditional implementations of a-way set-associative caches read and compare all a tags of a set in parallel to determine where (and whether) a given block resides in the cache. With t-bit tags, this requires a tag memory that can provide a×t bits in parallel. A direct-mapped cache can use fewer, narrower, deeper chips since it requires only a t-bit wide tag memory. Traditional implementations of associativity also use a comparators (each t bits wide) rather than one, a wider data memory, more buffers, and more multiplexors as compared to a direct-mapped cache. This adds to the board area needed for wider associativity. As the size of memory chips increases, it becomes more expensive to consume board area with multiplexors and other logic since the same area could hold more cache memory.

While numerous papers have examined associativity [Lipt68] [Kapl73] [Bell74] [Stre76] [Smit78] [Smit82] [Clar83] [Agar88], most have assumed the traditional implementation. One of the few papers describing a cache with a non-traditional implementation of

¹ A miss to a set-associative cache can fill any empty block frame in the set, whereas a miss to a direct-mapped cache can fill only a single frame. Increasing associativity increases the chance that an invalidated block frame will be quickly used again by making more empty frames available for reuse on a miss.

[Figure 1: (a) Traditional (parallel) implementation; (b) Serial implementation (using the naive approach).]

Figure 1. Implementing Set-Associativity.
Part (a) of this figure (top) shows the traditional implementation of the logic to determine hit/miss in an a-way set-associative cache. This logic uses the "SET" field of the reference to select one t-bit tag from each of a banks. Each stored tag is compared to the incoming tag ("TAG"). A hit is declared if a stored tag matches the incoming tag, a miss otherwise.
Part (b) (bottom) shows a serial implementation of the same cache architecture. Here the a stored tags in a set are read from one bank and compared serially (the tags are addressed with "SET" concatenated with 0 through a-1).

associativity is [Chan87]. It discusses a cache implemented for a System/370 CPU that has a one-cycle hit time to the most-recently-used (MRU) block in each set and a longer access time for other blocks in the set, similar to the Cray-1 instruction buffers [Cray76] and the biased set-associative translation buffer described in [Alex86].

This paper is about lower cost implementations of associativity, implementations other than the traditional. We introduce cache designs which combine the lower miss ratio of associativity and the lower cost of direct-mapped caches. In the new implementations the width of the comparison circuitry and tag memory is t, the width of one tag, instead of the a×t required by the traditional implementation. Implementations using tag widths of b×t (1 < b < a) are possible and can result in intermediate costs and performance, but are not considered here. This paper is not about level two caches per se, but we expect these low cost schemes to be applicable to level two caches. We organize this paper as follows. Section 2 introduces the new approaches to implementing associativity, shows how they cost less than the traditional implementation of associativity, and predicts how they will perform. Section 3 analyzes the approaches of Section 2 with trace-driven simulation.

2. Alternative Approaches to Implementing Set-Associativity

Let a, a power of two, be a cache's associativity and let t be the number of bits in each address tag. During a cache reference, an implementation must determine whether any of the a stored tags in the set of a reference match the incoming tag. Since at most one stored tag can match, the search can be terminated when a match is found (a cache hit). All a stored tags, however, must be examined on a cache miss.

Figure 1a illustrates the traditional implementation of the tag memory and comparators for an a-way set-associative cache, which reads and probes all tags in parallel. We define a probe as a comparison of the incoming tag and the tag memory. If any one of the stored tags match, a hit is declared. We concentrate only on cache tag memory and comparators, because they are what we propose to implement differently. Additional memory (not shown) is required by any implementation of associativity with a cache replacement policy other than random. A direct-mapped cache does not require this memory. The memory for the cache data (also not shown) is traditionally accessed in parallel with the tag memory.

Figure 1b shows a naive way to do an inexpensive set-associative lookup. It uses hardware similar to a direct-mapped cache, but serially accesses the stored tags of a set until a match is found (a hit) or the tags of the set are exhausted (a miss). Note how it requires only a single comparator and a t-bit wide tag memory, whereas the traditional implementation requires a comparators and an a×t wide tag memory.

Unfortunately, the naive approach is slow in comparison to the traditional implementation. For hits, each stored tag is equally likely to hold the data. Half the non-matching tags are examined before finding the tag that matches, making the average number of probes (a-1)/2 + 1. For a miss, all a stored tags must be examined in series, resulting in a probes. The traditional implementation requires only a single probe in both cases.
2.1. The MRU Approach

The average number of probes needed for a hit may be reduced from that needed by the naive approach by ordering the stored tags so that the tags most likely to match are examined first. One proposed order [SO88] [Matt70] is from most-recently-used (MRU) to least-recently-used (LRU). This order is effective for level one caches because of the temporal locality of processor reference streams [SO88] [Chan87]. We find (in Section 3) that it is also effective for level two caches due to the temporal locality in streams of level one cache misses.

One way to enforce an MRU comparison order is to swap blocks to keep the most-recently-used block in block frame 0, the second most-recently-used block in block frame 1, etc. Since tags (and data) would have to be swapped between consecutive cache accesses in order to maintain the MRU order, this is not a viable implementation option for most set-associative caches.² A better way to manage an MRU comparison order, illustrated in Figure 2a, is to store information for each set indicating its ordering. Fortunately, information similar to an MRU list per set is likely to be maintained anyway in a set-associative cache implementing a true LRU replacement policy. In this case there is no extra memory requirement to store the MRU information. We will also analyze (in Section 3) reducing the length of the MRU list, using approximate rather than full MRU searches, to further decrease memory requirements. Unfortunately, the lookup of MRU information must precede the probes of the tags³. This will lead to longer cache lookup times than would the swapping scheme.

If we assume that the initial MRU list lookup takes about the same time as one probe, the average number of probes required on a cache lookup resulting in a hit using the MRU approach is 1 + Σᵢ₌₁ᵃ i·fᵢ, where fᵢ is the probability the i-th MRU tag matches, given that one of them will match⁴. The MRU scheme performs particularly poorly on cache misses, requiring 1 + a probes. This is one more than the naive implementation on misses since the MRU list is uselessly consulted.

² While maintaining MRU order using swapping may be feasible for a 2-way set-associative cache, Agarwal's hash-rehash cache [Agar87] can be superior to MRU in this 2-way case.
³ While it is possible to look up the MRU information in parallel with the level-one-cache access, it is also possible to start level-two-cache accesses early for any of the other implementation approaches [Bren84].
⁴ Each fᵢ is equal to the probability of a reference to LRU distance i divided by the hit ratio, for a given number of sets [Smit78].
[Figure 2: (a) Using MRU Order; (b) Using Partial Compares.]

Figure 2. Improved Implementations of Serial Set-Associativity.
Part (a) of this figure (top) shows an implementation of serial set-associativity using ordering information. This approach first reads MRU ordering information (left) and then probes the stored tags from the one most-likely to match to the one least-likely to match (right). Note "+" represents concatenate.
Part (b) (bottom) shows an implementation of serial set-associativity using partial compares. This approach first reads k (k = ⌊t/a⌋) bits from each stored tag and compares them with the corresponding bits of the incoming tag. The second step of this approach serially compares all stored tags that partially matched ("PM") with the incoming tag until a match is found or the tags are exhausted (right).

2.2. The Partial Compare Approach

We have carefully defined a probe to be the comparison of the incoming tag and the tag memory, without requiring that all bits of the tag memory come from the same stored tag. We now introduce the partial compare approach that uses a two step process to often avoid reading all t bits of each stored tag. In step one, the partial compare approach reads t/a bits from each of the a stored tags and compares them with the corresponding bits of the incoming tag. Tags that fail this partial comparison cannot hit and need not be examined further on a cache lookup. In step two, all stored tags that passed step one are examined serially with t-bit (full) compares.
The implementation of partial compares is not costly, as it can use the same memory and comparators as the naive approach, assuming k, the partial compare width (k = ⌊t/a⌋), is a multiple of memory chip and comparator width. Partial compares are done with the help of a few tricks. The first trick, illustrated in Figure 2b, is to provide slightly different addresses to each k-bit wide collection of memory chips, addressing the i-th collection with the address of the set concatenated with i. The second trick is to divide the t-bit comparator into a separate k-bit comparators⁵.

⁵ If k×a does not equal t, then ⌊t/a⌋×a bits of the tag can be used for partial compares, with another comparator for the extra bits.
This is straight-forward, since wide comparators are often implemented by logically AND-ing the results of narrow comparators. Note how step two of this partial compare approach uses the same tag memory and comparators as step one, but does full tag compares rather than partial compares.

The performance of this approach depends on how well the partial compares eliminate stored tags from further consideration. For independent tags, the average number of probes will be minimized if each of the values [0, 2ᵏ - 1] is equally likely for each of the k-bit patterns on which partial compares are done. While this condition may be true for physical address tags, it is unlikely to be true for the high order tag bits of virtual addresses. Nevertheless, we can use the randomness of the lower bits of the virtual address tag to make the distribution of the higher ones more uniform and independent. For example, one can transform a tag to a unique other tag by exclusive-oring the low-order k bits of the tag with each of the other k-bit pieces of the tag before it is stored in the tag memory. Incoming tags will go through the same transformation so that the incoming tag and the stored tag will match if the untransformed tags are the same. The original tags can be retrieved from the tag memory for writing back blocks on replacement via the same transformation in which they were stored (i.e., the transformation is its own inverse). This method is used throughout this paper to produce stored tags with better probabilistic characteristics. We will also analyze using no transformation, and using a more sophisticated one, in Section 3. We make the assumption in our analysis to follow that each of the values [0, 2ᵏ - 1] is equally likely and independent for each partial compare. Our trace-driven simulation (in Section 3) tests this assumption.
The probability that an incoming tag partially-matches a stored tag is 1/2ᵏ. A false match is a partial tag match which will not lead to a match of the full tag. Given a hit, the expected number of false matches in step one is (a-1)/2ᵏ, of which half will be examined in step two before a hit is determined. Thus, the expected number of probes on a hit is 1 + (a-1)/2ᵏ⁺¹ + 1, where the terms of the expression are: the probe for the partial comparison (step one), the full tag comparisons (in step two) due to false matches, and the full tag match which produces the hit, respectively. On a miss, the expected number of probes is simply 1 + a/2ᵏ, the probe for the partial comparison and the number of false matches, respectively.
The partial compare scheme can lead to poor performance if many false matches are encountered in step two. Wider partial compares could eliminate some of these false matches. The partial compare width can be increased by partitioning the a stored tags of a set into s proper subsets (each containing a/s tags) and examining the subsets in series⁶. The step one and step two partial compare sequence is performed for each of the subsets to determine if there is a cache hit. The order in which the subsets are examined is arbitrary throughout this paper. Increasing the number of subsets will increase the partial compare width since fewer partial compares are done concurrently. For example, 2 subsets could be used in an 8-way set-associative cache, with 4 entries in each. A lookup in this cache would proceed as two 4-way (single subset) lookups, one after the other. With a 16-bit wide tag memory in this cache, partitioning into 2 subsets would result in 4-bit partial compares. This will result in fewer false matches than with the 2-bit partial compares without subsets. The number of probes per access decreases when using proper subsets if the expected number of false matches is reduced (due to wider partial compares) by more than the number of probes added due to the additional subsets. Subsets may be desirable for implementation considerations in addition to performance considerations if the memory chip or comparator width dictate that the partial compares be wider.

At one extreme (where s = a), partial compares with subsets would be implemented as the naive approach, while the other extreme (s = 1) can lead to many false matches. An important question to ask is: what number of subsets leads to the best performance (i.e., the fewest probes per cache lookup)? The next three answers to this question vary from the most accurate to the most succinct. (1) One can compute the expected number of probes for each of s = 1, 2, 4, ..., a/2 and a using the equations for a hit and miss (from Table 1) weighted to reflect your expected miss ratio, and choose the minimum (a sketch of this calculation appears below). (2) One can ignore misses (which are less common and never require more than twice the probes of hits), assume variables are continuous, and find the optimum partial compare width, kₒₚₜ = log₂t - 1/2, for hits only. The optimum number of subsets for hits and misses together is likely to be the value for s resulting from a partial compare width of ⌊kₒₚₜ⌋ or ⌈kₒₚₜ⌉. (3) Finally, one can observe that many tags in current caches are between 16 and 32 bits wide, implying the number of subsets that gives at least four-bit partial compares will work well.

⁶ Note that subsets are not useful with the naive and MRU approaches.
Table 1 summarizes our analysis of the expected number of probes required for the traditional, naive, MRU and partial compare approaches to implementing set-associativity. Note that this table, as well as most of the trace-driven simulation, assumes 16-bit tags are used. We will examine the positive effect of increasing the tag width on the partial compare approach in Section 3.
Table 2 summarizes paper implementations of tag memory and comparison logic for a direct-mapped cache, a traditional implementation of set-associativity, and an implementation of set-associativity using MRU and partial compares. We found that the MRU and partial compare implementations have a slower access time than the traditional implementation of associativity but include no implementation surprises. Most notably, the control logic was found to be of reasonable complexity. The MRU and partial compare implementations use hardware similar to a direct-mapped cache and can make effective use of page-mode dynamic RAMs, as would other serial implementations of set-associativity. Page-mode dynamic RAMs are those in which the access time of multiple probes to the same set can be significantly less than if the probes were to other sets. Subsequent probes take less than half the time of the first probe to the set. Cache cost is reduced in two ways when using one of the alternative implementations of associativity. First, tag memory cost is directly reduced, by 1/3 to 1/2 in our design. Second, cache data memory cost is reduced since only one, rather than a, words need to be read at a time.
Method                  s   Tag Memory Width   Probes per Hit              Probes per Miss
Traditional             -   a×t                1                           1
Naive                   -   t                  (a-1)/2 + 1                 a
MRU                     -   t                  1 + Σᵢ₌₁ᵃ i·fᵢ              1 + a
Partial compares        1   t                  2 + (a-1)/2ᵏ⁺¹              1 + a/2ᵏ
  a = 4 (k = 4 bits)    1   16                 2.09                        1.25
  a = 8 (k = 2 bits)    1   16                 2.88                        3.00
  w/ subsets            s   t                  (s+1)/2 + (a-1)/2ᵏ⁺¹ + 1    s + a/2ᵏ
  a = 8 (k = 4 bits)    2   16                 2.72                        2.50

Table 1. Performance of Set-Associativity Implementations.
For various methods and associativities this table gives the number of subsets, the tag memory width, the number of probes for a hit, and the number for a miss. The table assumes t-bit tags (t = 16), k-bit partial compares, and that the i-th most-recently-used tag matches with probability fᵢ on a hit. Note how an increase from 1 to 2 subsets improved the predicted performance of the partial compare approach at an associativity of 8.

3. Trace-Driven Performance Comparison

This section analyzes the performance of the low-cost schemes to implement set-associativity in level two caches using simulation with relatively short multiprogramming traces. We analyze associativity in the level two cache since the low cost implementations of associativity are more appropriate for level two (or higher) caches than for level one caches. We concentrate on presenting and characterizing the relative performance of
