`
The 16th Annual
International Symposium on

COMPUTER
ARCHITECTURE

Jerusalem, Israel

The Institute of Electrical and Electronics Engineers, Inc.
`
`
Published by

IEEE Computer Society Press
1730 Massachusetts Avenue, N.W.
Washington, DC 20036-1903
`
`
`Cover design by Jack I. Ballestero
`
`Copyright © 1989 by Association for Computing Machinery, Inc.
`
`Copying without fee is permitted provided that the copies are not made or distributed for direct
`commercial advantage and credit to the source is given. Abstracting with credit is permitted.
`For other copying of articles that carry a code at the bottom of the first page, copying is per-
`mitted provided that the per-copy fee indicated in the code is paid through the Copyright
`Clearance Center, 29 Congress Street, Salem, MA 01970. For permission to republish, write to:
`Director of Publications, Association for Computing Machinery, 11 West 42nd Street, New York,
`NY 10036.
`
IEEE Computer Society Order Number 1948
Library of Congress Number 85-642899
IEEE Catalog Number 89CH2705-2
ISBN 0-8186-1948-1 (paper)
ISBN 0-8186-5948-3 (microfiche)
ISBN 0-8186-8948-X (case)
ACM Order Number 415890
ISBN 0-89791-319-1
ISSN 0884-7495
`
`Additional copies of the 1989 Proceedings may be ordered from:
`
IEEE Computer Society
Order Department
10662 Los Vaqueros Circle
Los Alamitos, CA 90720-2578

IEEE Service Center
445 Hoes Lane
P.O. Box 1331
Piscataway, NJ 08855-1331

Order Department
Baltimore, MD 21264
`
IEEE Computer Society
13, Avenue de l'Aquilon
B-1200 Brussels
BELGIUM

IEEE Computer Society
Ooshima Building
2-19-1 Minami-Aoyama
Minato-ku, Tokyo 107, JAPAN
`
THE INSTITUTE OF ELECTRICAL AND ELECTRONICS ENGINEERS, INC.
`
`
`
`The 16th Annual International Symposium on Computer Architecture
`
`Table of Contents
`
`Cache Coherence and Synchronization I
`Chair: M. Dubois, USC
`
Evaluating the Performance of Four Snooping Cache Coherency Protocols ................ 2
S.J. Eggers and R.H. Katz
Multi-level Shared Caching Techniques for Scalability in VMP-MC ................... 16
D.R. Cheriton, H.A. Goosen, and P.D. Boyle
Design and Performance of a Coherent Cache for Parallel Logic Programming
Architectures .................................................. 25
A. Goto, A. Matsumoto, and E. Tick
`
`Dataflow
`
Chair: I. Koren, UMASS Amherst

The Epsilon Dataflow Processor ....................................... 36
V.G. Grafe, G.S. Davidson, J.E. Hoch, and V.P. Holmes
An Architecture of a Dataflow Single Chip Processor ........................... 46
S. Sakai, Y. Yamaguchi, K. Hiraki, Y. Kodama, and T. Yuba
Exploiting Data Parallelism in Signal Processing on a Dataflow Machine ................ 54
P. Nitezki
`
`Pipeline Architectures
`Chair: A. Gottlieb, NYU
`
Architectural Mechanisms to Support Sparse Vector Processing ..................... 64
R.N. Ibbett, T.M. Hopkins, and K.I.M. McKinnon
A Dynamic Storage Scheme for Conflict-Free Vector Access ...................... 72
D.T. Harper III and D.A. Linebarger
SIMP (Single Instruction Stream/Multiple Instruction Pipelining): A Novel High-Speed
Single-Processor Architecture ........................................ 78
K. Murakami, N. Irie, M. Kuga, and S. Tomita
`
`Mapping Algorithms
`Chair: A.L. Davis, HP
`
2-D SIMD Algorithms in the Perfect Shuffle Networks .......................... 88
Y. Ben-Asher, D. Egozi, and A. Schuster
Systematic Hardware Adaptation of Systolic Algorithms ......................... 96
M. Valero-Garcia, J.J. Navarro, J.M. Llaberia, and M. Valero
Task Migration in Hypercube Multiprocessors .............................. 105
M.-S. Chen and K.G. Shin
`
`Uniprocessor Caches
`Chair: D. Alpert, INTEL
`
Characteristics of Performance-Optimal Multi-Level Cache Hierarchies ............... 114
S. Przybylski, M. Horowitz, and J. Hennessy
Supporting Reference and Dirty Bits in SPUR's Virtual Address Cache ............... 122
D.A. Wood and R.H. Katz
Inexpensive Implementations of Set-Associativity ............................ 131
R.E. Kessler, R. Jooss, A. Lebeck, and M.D. Hill
Organization and Performance of a Two-Level Virtual-Real Cache Hierarchy ............ 140
W.-H. Wang, J.-L. Baer, and H.M. Levy
`
Networks
Chair: H.J. Siegel, Purdue University

High Performance Communications in Processor Networks ...................... 150
C.R. Jesshope, P.R. Miller, and J.T. Yantchev
Introducing Memory into the Switch Elements of Multiprocessor Interconnection Networks .... 158
H.E. Mizrahi, J.-L. Baer, E.D. Lazowska, and J. Zahorjan
Using Feedback to Control Tree Saturation in Multistage Interconnection Networks ........ 167
S.L. Scott and G.S. Sohi
Constructing Replicated Systems with Point-to-Point Communication Links ............. 177
P.D. Ezhilchelvan, S.K. Shrivastava, and A. Tully
`
Prolog Architectures
Chair: R. Ginosar, Technion

KCM: A Knowledge Crunching Machine ................................... 186
H. Benker, J.M. Beacco, S. Bescos, M. Dorochevsky, Th. Jeffré, A. Pöhlmann, J. Noyé,
B. Potérie, A. Sexton, J.C. Syre, O. Thibault, and G. Watzlawik
A High Performance Prolog Processor with Multiple Function Units ................. 195
A. Singhal and Y.N. Patt
Evaluation of Memory System for Integrated Prolog Processor IPP .................. 203
M. Morioka, S. Yamaguchi, and T. Bandoh
A Type Driven Hardware Engine for Prolog Clause Retrieval over a Large Knowledge Base .... 211
K.-F. Wong and M.H. Williams
`
`Instruction Fetching
`Chair: 1. Spillinger, Technion
`
Comparing Software and Hardware Schemes for Reducing the Cost of Branches .......... 224
W.W. Hwu, T.M. Conte, and P.P. Chang
Improving Performance of Small On-Chip Instruction Caches ..................... 234
M.K. Farrens and A.R. Pleszkun
Achieving High Instruction Cache Performance with an Optimizing Compiler ............ 242
W.W. Hwu and P.P. Chang
The Impact of Code Density on Instruction Cache Performance .................... 252
P. Steenkiste
`
`Parallel Architectures
`
Chair: H. Mühlenbein, GMD

Can Dataflow Subsume von Neumann Computing? ........................... 262
R.S. Nikhil and Arvind
Exploring the Benefits of Multiple Hardware Contexts in a Multiprocessor Architecture:
Preliminary Results ............................................. 273
W.-D. Weber and A. Gupta
Architectural and Organizational Tradeoffs in the Design of the MultiTitan CPU .......... 281
N.P. Jouppi
Run-Time Checking in Lisp by Integrating Memory Addressing and Range Checking ........ 290
M. Sato, S. Ichikawa, and E. Goto
`
`
`
Performance Evaluation
Chair: D. Siewiorek, CMU

Multiple vs. Wide Shared Bus Multiprocessors ............................. 300
A. Hopper, A. Jones, and D. Lioupis
Performance Measurements on a Commercial Multiprocessor Running Parallel Code ....... 307
M. Annaratone and R. Rühl
Interprocessor Communication Speed and Performance in Distributed-Memory Parallel
Processors ..................................................
`
`
Analysis of Computation-Communication Issues in Dynamic Dataflow Architectures ........ 325
D. Ghosal, S.K. Tripathi, L.N. Bhuyan, and H. Jiang
`
`Logic Simulation Systems
`Chair: E. Kronstadt, IBM
`
Logic Simulation on Massively Parallel Architectures .......................... 336
S. Kravitz, R.E. Bryant, and R. Rutenbar
R256: A Research Parallel Processor for Scientific Computation ................... 344
T. Fukazawa, T. Kimura, M. Tomizawa, K. Takeda, and Y. Itoh
`
`Special Purpose Architectures
`Chair: S. Ruhman, Weizmann Institute
`
A Three-Port/Three-Access Register File for Concurrent Processing and I/O
Communication in a RISC-Like Graphics Engine ............................ 354
M.L. Anido, D.J. Allerton, and E.J. Zaluska
An Architecture Framework for Application-Specific and Scalable Architectures .......... 362
J.M. Mulder, R.J. Farrier, A. Srivastava, and R. in 't Velt
`
`Memory Systems
`Chair: G. Frieder, Syracuse University
`
Perfect Latin Squares and Parallel Array Access ............................. 372
K. Kim and V.K. Prasanna Kumar
An Aperiodic Storage Scheme to Reduce Memory Conflicts in Vector Processors ......... 380
S. Weiss
Analysis of Vector Access Performance on Skewed Interleaved Memory ............... 387
C.-L. Chen and C.-K. Liao
`
`Cache Coherence and Synchronization II
`Chair: J. Goodman, University of Wisconsin, Madison
`
Adaptive Backoff Synchronization Techniques .............................. 396
A. Agarwal and M. Cherian
A Cache Consistency Protocol for Multiprocessors with Multistage Networks ............ 407
P. Stenström
On Data Synchronizations for Multiprocessors .............................. 416
H.-M. Su and P.-C. Yew
`
`Author Index ................................................. 425
`
`
`Inexpensive Implementations of Set-Associativity
`
R. E. Kessler, Richard Jooss, Alvin Lebeck and Mark D. Hill
`
`University of Wisconsin
`Computer Sciences Department
`Madison, Wisconsin 53706
`
`ABSTRACT
`
The traditional approach to implementing wide set-associativity is expensive, requiring a wide
tag memory (directory) and many comparators. Here we examine alternative implementations
of associativity that use hardware similar to that used to implement a direct-mapped cache.
One approach scans tags serially from most-recently used to least-recently used. Another uses
a partial compare of a few bits from each tag to reduce the number of tags that must be
examined serially. The drawback of both approaches is that they increase cache access time by
a factor of two or more over the traditional implementation of set-associativity, making them
inappropriate for cache designs in which a fast access time is crucial (e.g., level one caches,
caches directly servicing processor requests).

These schemes are useful, however, if (1) the low miss ratio of wide set-associative caches is
desired, (2) the low cost of a direct-mapped implementation is preferred, and (3) the slower
access time of these approaches can be tolerated. We expect these conditions to be true for
caches in multiprocessors designed to reduce memory interconnection traffic, caches
implemented with large, narrow memory chips, and level two (or higher) caches in a cache
hierarchy.
`
`1. Introduction
`
The selection of associativity has significant impact on cache performance and cost [Smit86]
[Smit82] [Hill87] [Przy88a]. The associativity (degree of associativity, set size) of a cache is
the number of places (block frames) in the cache where a block may reside. Increasing
associativity reduces the probability that a block is not found in the cache (the miss ratio) by
decreasing the chance that recently referenced blocks map to the same place [Smit78].
However, increased associativity may nonetheless result in longer effective access times since
it can increase the latency to retrieve data on a cache hit [Hill88, Przy88a]. When it is
important to minimize hit times, direct-mapped (associativity of one) caches

† This work has been supported by graduate fellowships from the National Science Foundation
and the University of Wisconsin-Madison.
* This work was sponsored in part by research initiation grants from the graduate school of the
University of Wisconsin-Madison.
`
Permission to copy without fee all or part of this material is granted provided that the copies
are not made or distributed for direct commercial advantage, the ACM copyright notice and the
title of the publication and its date appear, and notice is given that copying is by permission of
the Association for Computing Machinery. To copy otherwise, or to republish, requires a fee
and/or specific permission.

© 1989 ACM 0884-7495/89/0000/0131$01.50
`
may be preferred over caches with higher associativity.

Wide associativity is important when: (1) miss times are very long or (2) memory and memory
interconnect contention delay is significant or sensitive to cache miss ratio. These points are
likely to be true for shared memory multiprocessors. Multiprocessor caches typically service
misses via a multistage interconnect or bus. When a multi-stage interconnect is used the miss
latency can be large whether or not contention exists. Bus miss times with low utilizations may
be small, but delays due to contention among processors can become large and are sensitive to
cache miss ratio. As the cost of a miss increases, the reduced miss ratio of wider associativity
will result in better performance when compared to direct-mapped caches.

Associativity is even more useful for level two caches in a two-level multiprocessor cache
hierarchy. While the level one cache must service references from the processor at the speed of
the processor, the level two cache can be slower since it services only processor references that
miss in the level one cache. The additional hit time delay incurred by associativity in the level
two cache is not as important [Przy88b]. Reducing memory and memory interconnect traffic is
a larger concern. Wide associativity also simplifies the maintenance of multi-level inclusion
[Baer88]. This is the property that all data contained in lower level caches is contained in their
corresponding higher level caches. Multi-level inclusion is useful for reducing coherency
invalidations to level one caches. Finally, preliminary models indicate that increasing
associativity reduces the average number of empty cache block frames when coherency
invalidations are frequent1. This implies that wider associativity will result in better utilization
of the cache.
Unfortunately, increasing associativity is likely to increase the board area and cost of the cache
relative to a direct-mapped cache. Traditional implementations of a-way set-associative caches
read and compare all a tags of a set in parallel to determine where (and whether) a given block
resides in the cache. With t-bit tags, this requires a tag memory that can provide a × t bits in
parallel. A direct-mapped cache can use fewer, narrower, deeper chips since it requires only a
t-bit wide tag memory. Traditional implementations of associativity also use a comparators
(each t bits wide) rather than one, wider data memory, more buffers, and more multiplexors as
compared to a direct-mapped cache. This adds to the board area needed for wider associativity.
As the size of memory chips increases, it becomes more expensive to consume board area with
multiplexors and other logic since the same area could hold more cache memory.

While numerous papers have examined associativity [Lipt68] [Kapl73] [Bell74] [Stre76]
[Smit78] [Smit82] [Clar83] [Agar88], most have assumed the traditional implementation. One
of the few papers describing a cache with a non-traditional implementation of

1 A miss to a set-associative cache can fill any empty block frame in the set, whereas, a miss to
a direct-mapped cache can fill only a single frame. Increasing associativity increases the
chance that an invalidated block frame will be quickly used again by making more empty
frames available for reuse on a miss.
`
`
`
`
`
`8
`
`
`
`
`
` a-l
`
`
`
`(b) Serial (Using Naive Approach)
`
`Figure 1. Implementing Set-Associativity.
`
`Part (a) of this figure (top) shows the traditional implementation of
`the logic to determine hit/miss in an a-way set—associative cache.
`This logic uses the “SET” field of the reference to select one t-bit
`tag from each of a banks. Each stored tag is compared to the incom-
`ing tag (“TAG"). A hit is declared if a stored tag matches the in-
`coming tag, a miss otherwise.
`Part (b) (bottom) shows a serial implementation of the same cache ar-
`chitecture. Here the a stored tags in a set are read from one bank and
`compared serially (the tags are addressed with "SET" concatenated
`with 0 through a — 1).
`
`
associativity is [Chan87]. It discusses a cache implemented for a System/370 CPU that has a
one-cycle hit time to the most-recently-used (MRU) block in each set and a longer access time
for other blocks in the set, similar to the Cray-1 instruction buffers [Cray76] and the biased
set-associative translation buffer described in [Alex86].

This paper is about lower cost implementations of associativity, implementations other than
the traditional. We introduce cache designs which combine the lower miss ratio of associativity
and the lower cost of direct-mapped caches. In the new implementations the width of the
comparison circuitry and tag memory is t, the width of one tag, instead of the a × t required by
the traditional implementation. Implementations using tag widths of b × t (1 < b < a) are
possible and can result in intermediate costs and performance, but are not considered here.
This paper is not about level two caches per se, but we expect these low cost schemes to be
applicable to level two caches. We organize this paper as follows. Section 2 introduces the new
approaches to implementing associativity, shows how they cost less than the traditional
implementation of associativity, and predicts how they will perform. Section 3 analyzes the
approaches of Section 2 with trace-driven simulation.
`
`2. Alternative Approaches to Implementing Set-Associativity
Let a, a power of two, be a cache's associativity and let t be the number of bits in each address
tag. During a cache reference, an implementation must determine whether any of the a stored
tags in the set of a reference match the incoming tag. Since at most one stored tag can match,
the search can be terminated when a match is found (a cache hit). All a stored tags, however,
must be examined on a cache miss.

Figure 1a illustrates the traditional implementation of the tag memory and comparators for an
a-way set-associative cache, which reads and probes all tags in parallel. We define a probe as a
comparison of the incoming tag and the tag memory. If any one of the stored tags match, a hit
is declared. We concentrate only on cache tag memory and comparators, because they are what
we propose to implement differently. Additional memory (not shown) is required by any
implementation of associativity with a cache replacement policy other than random. A
direct-mapped cache does not require this memory. The memory for the cache data (also not
shown) is traditionally accessed in parallel with the tag memory.
`
Figure 1b shows a naive way to do an inexpensive set-associative lookup. It uses hardware
similar to a direct-mapped cache, but serially accesses the stored tags of a set until a match is
found (a hit) or the tags of the set are exhausted (a miss). Note how it requires only a single
comparator and a t-bit wide tag memory, whereas, the traditional implementation requires a
comparators and an a × t wide tag memory.

Unfortunately, the naive approach is slow in comparison to the traditional implementation. For
hits, each stored tag is equally likely to hold the data. Half the non-matching tags are examined
before finding the tag that matches, making the average number of probes (a-1)/2 + 1. For a
miss, all a stored tags must be examined in series, resulting in a probes. The traditional
implementation requires only a single probe in both cases.
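These probe counts can be checked with a few lines of simulation. The sketch below is our own illustrative model (the function name and tag values are not from the paper):

```python
# Model of the naive serial set-associative lookup: probe the stored
# tags one at a time until a match is found or the set is exhausted.

def serial_lookup(stored_tags, incoming_tag):
    """Return (hit, probes): tags are probed in index order."""
    probes = 0
    for tag in stored_tags:
        probes += 1
        if tag == incoming_tag:
            return True, probes
    return False, probes

a = 8                                  # associativity
tags = list(range(100, 100 + a))       # one distinct tag per block frame

# Hit: with each frame equally likely to hold the data, the average
# probe count is (a-1)/2 + 1, as derived above.
avg_hit = sum(serial_lookup(tags, tag)[1] for tag in tags) / a
assert avg_hit == (a - 1) / 2 + 1      # 4.5 for a = 8

# Miss: all a stored tags must be examined.
hit, probes = serial_lookup(tags, 999)
assert not hit and probes == a
```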
`
`2.1. The MRU Approach
The average number of probes needed for a hit may be reduced from that needed by the naive
approach by ordering the stored tags so that the tags most likely to match are examined first.
One proposed order [SO88] [Matt70] is from most-recently-used (MRU) to least-recently-used
(LRU). This order is effective for level one caches because of the temporal locality of
processor reference streams [SO88] [Chan87]. We find (in Section 3) that it is also effective
for level two caches due to the temporal locality in streams of level one cache misses.

One way to enforce an MRU comparison order is to swap blocks to keep the
most-recently-used block in block frame 0, the second most-recently-used block in block
frame 1, etc. Since tags (and data) would have to be swapped between consecutive cache
accesses in order to maintain the MRU order, this is not a viable implementation option for
most set-associative caches.2 A better way to manage an MRU comparison order, illustrated in
Figure 2a, is to store information for each set indicating its ordering. Fortunately, information
similar to an MRU list per set is likely to be maintained anyway in a set-associative cache
implementing a true LRU replacement policy. In this case there is no extra memory
requirement to store the MRU information. We will also analyze (in Section 3) reducing the
length of the MRU list, using approximate rather than full MRU searches, to further decrease
memory requirements. Unfortunately, the lookup of MRU information must precede the
probes of the tags3. This will lead to longer cache lookup times than would the swapping
scheme.

If we assume that the initial MRU list lookup takes about the same time as one probe, the
average number of probes required on a cache lookup resulting in a hit using the MRU
approach is 1 + Σ(i=1 to a) i·f_i, where f_i is the probability the i-th MRU tag matches, given
that one of them will match4. The MRU scheme performs particularly poorly on cache misses,
requiring 1 + a probes. This

2 While maintaining MRU order using swapping may be feasible for a 2-way set-associative
cache, Agarwal's hash-rehash cache [Agar87] can be superior to MRU in this 2-way case.
3 While it is possible to lookup the MRU information in parallel with the level-one-cache
access, it is also possible to start level-two-cache accesses early for any of the other
implementation approaches [Bren84].
4 Each f_i is equal to the probability of a reference to LRU distance i divided by the hit ratio,
for a given number of sets [Smit78].
`
`
`
`
(a) Using MRU Order    (b) Using Partial Compares
`
Figure 2. Improved Implementations of Serial Set-Associativity.

Part (a) of this figure (top) shows an implementation of serial set-associativity using ordering
information. This approach first reads MRU ordering information (left) and then probes the
stored tags from the one most-likely to match to the one least-likely to match (right). Note "+"
represents concatenate.

Part (b) (bottom) shows an implementation of serial set-associativity using partial compares.
This approach first reads k (k = floor(t/a)) bits from each stored tag and compares them with
the corresponding bits of the incoming tag. The second step of this approach serially compares
all stored tags that partially matched ("PM") with the incoming tag until a match is found or
the tags are exhausted (right).
`
`
`is one more than the naive implementation on misses since the
`MRU list is uselessly consulted.
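The MRU-ordered lookup can be sketched as follows. This is our own illustrative model, not the paper's hardware; reading the per-set MRU list is charged as one probe, as assumed above:

```python
# Model of a serial lookup that probes stored tags from most- to
# least-recently used, guided by a per-set MRU list of frame indices.

def mru_lookup(stored_tags, mru_list, incoming_tag):
    """Return (hit, probes). The MRU-list read counts as one probe."""
    probes = 1                        # the initial MRU-list lookup
    for frame in mru_list:
        probes += 1
        if stored_tags[frame] == incoming_tag:
            # On a hit, move the frame to the front of the MRU list.
            mru_list.remove(frame)
            mru_list.insert(0, frame)
            return True, probes
    return False, probes

a = 4
tags = [10, 20, 30, 40]
mru = [2, 0, 3, 1]                    # frame 2 is most recently used

hit, probes = mru_lookup(tags, mru, 30)   # hits the MRU frame
assert hit and probes == 2                # list read + first tag probe

hit, probes = mru_lookup(tags, mru, 99)   # miss
assert not hit and probes == 1 + a        # the list is uselessly consulted
```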
`
`2.2. The Partial Compare Approach
`
We have carefully defined a probe to be the comparison of the incoming tag and the tag
memory, without requiring that all bits of the tag memory come from the same stored tag. We
now introduce the partial compare approach that uses a two step process to often avoid reading
all t bits of each stored tag. In step one, the partial compare approach reads t/a bits from each
of a stored tags and compares them with the corresponding bits of the incoming tag. Tags that
fail this partial comparison cannot hit and need not be examined further on a cache lookup. In
step two, all stored tags that passed step one are examined serially with t-bit (full) compares.

The implementation of partial compares is not costly, as it can use the same memory and
comparators as the naive approach, assuming k, the partial compare width (k = floor(t/a)), is a
multiple of memory chip and comparator width. Partial compares are done with the help of a
few tricks. The first trick, illustrated in Figure 2b, is to provide slightly different addresses to
each k-bit wide collection of memory chips, addressing the i-th collection with the address of
the set concatenated with log2 i. The second trick is to divide the t-bit comparator into a
separate k-bit comparators5.

5 If k×a does not equal t, then floor(t/a)×a bits of the tag can be used for partial compares,
with another comparator for the extra bits.
`
`
This is straight-forward, since wide comparators are often implemented by logically AND-ing
the results of narrow comparators. Note how step two of this partial compare approach uses
the same tag memory and comparators as step one, but does full tag compares rather than
partial compares.

The performance of this approach depends on how well the partial compares eliminate stored
tags from further consideration. For independent tags, the average number of probes will be
minimized if each of the values [0, 2^k - 1] is equally likely for each of the k-bit patterns on
which partial compares are done. While this condition may be true for physical address tags, it
is unlikely to be true for the high order tag bits of virtual addresses. Nevertheless, we can use
the randomness of the lower bits of the virtual address tag to make the distribution of the
higher ones more uniform and independent. For example, one can transform a tag to a unique
other tag by exclusive-oring the low-order k bits of the tag with each of the other k-bit pieces
of the tag before it is stored in the tag memory. Incoming tags will go through the same
transformation so that the incoming tag and the stored tag will match if the untransformed tags
are the same. The original tags can be retrieved from the tag memory for writing back blocks
on replacement via the same transformation in which they were stored (i.e., the transformation
is its own inverse). This method is used throughout this paper to produce stored tags with
better probabilistic characteristics. We will also analyze using no transformation, and using a
more sophisticated one, in Section 3. We make the assumption in our analysis to follow that
each of the values [0, 2^k - 1] is equally likely and independent for each partial compare. Our
trace-driven simulation (in Section 3) tests this assumption.
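The transformation just described can be sketched in a few lines. The widths below (t = 16, a = 8, so k = 2) are example values, not fixed by the scheme:

```python
# XOR the low-order k bits of a tag into each of the other k-bit pieces
# before storing it, as described above. The low-order piece itself is
# left unchanged, which is what makes the transform its own inverse.

t, k = 16, 2                             # tag width, partial compare width

def transform(tag):
    low = tag & ((1 << k) - 1)           # low-order k bits
    out = tag
    for shift in range(k, t, k):         # every other k-bit piece
        out ^= low << shift
    return out

# Applying the transform twice XORs the same masks back out, so stored
# tags can be recovered unchanged (e.g., for write-backs on replacement).
for tag in (0x0000, 0x1234, 0xBEEF, 0xFFFF):
    assert transform(transform(tag)) == tag
```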
The probability that an incoming tag partially-matches a stored tag is 1/2^k. A false match is a
partial tag match which will not lead to a match of the full tag. Given a hit, the expected
number of false matches in step one is (a-1)/2^k, of which half will be examined in step two
before a hit is determined. Thus, the expected number of probes on a hit is
1 + (a-1)/2^(k+1) + 1, where the terms of the expression are: the probe for the partial
comparison (step one), the full tag comparisons (in step two) due to false matches, and the full
tag match which produces the hit, respectively. On a miss, the expected number of probes is
simply 1 + a/2^k, the probe for the partial comparison and the number of false matches,
respectively.
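The miss-case count can be spot-checked with a small Monte Carlo experiment, sketched below under the uniform-random-tag assumption made above (the model and names are ours):

```python
# Estimate the expected probes per miss for the partial compare
# approach and compare against the 1 + a/2^k formula derived above.
import random

t, a = 16, 8
k = t // a                              # 2-bit partial compares

def probes_on_miss(stored, incoming):
    """Step one costs one probe; each partial match costs one full compare."""
    mask = (1 << k) - 1
    partial_matches = sum((s & mask) == (incoming & mask) for s in stored)
    return 1 + partial_matches

random.seed(1)
trials = 20000
total = 0
for _ in range(trials):
    stored = [random.getrandbits(t) for _ in range(a)]
    incoming = random.getrandbits(t)    # almost surely a miss
    total += probes_on_miss(stored, incoming)

expected = 1 + a / 2 ** k               # 1 + 8/4 = 3 probes per miss
assert abs(total / trials - expected) < 0.1
```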
The partial compare scheme can lead to poor performance if many false matches are
encountered in step two. Wider partial compares could eliminate some of these false matches.
The partial compare width can be increased by partitioning the a stored tags of a set into s
proper subsets (each containing a/s tags) and examining the subsets in series6. The step one
and step two partial compare sequence is performed for each of the subsets to determine if
there is a cache hit. The order in which the subsets are examined is arbitrary throughout this
paper. Increasing the number of subsets will increase the partial compare width since fewer
partial compares are done concurrently. For example, 2 subsets could be used in an 8-way
set-associative cache, with 4 entries in each. A lookup in this cache would proceed as two
4-way (single subset) lookups, one after the other. With a 16-bit wide tag memory in this
cache, partitioning into 2 subsets would result in 4-bit partial compares. This will result in
fewer false matches than with the 2-bit partial compares without subsets. The number of
probes per access decreases when using proper subsets if the expected number of false matches
is reduced (due to wider partial compares) by more than the number of probes added due to the
additional subsets. Subsets may be desirable for implementation considerations in addition to
performance considerations if the memory chip or comparator width dictate that the partial
compares be wider.
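The 8-way, 2-subset example above can be sketched as two 4-way partial-compare lookups performed back to back (our own illustrative model; the tag values are arbitrary):

```python
# Partial compares with subsets: an 8-way set split into s = 2 subsets
# of 4 tags each, giving 4-bit partial compares with a 16-bit tag memory.

t, a, s = 16, 8, 2
k = t * s // a                           # 4-bit partial compares
mask = (1 << k) - 1

def lookup(stored, incoming):
    """Examine the subsets in series; each does step one then step two."""
    probes, size = 0, a // s
    for i in range(s):
        subset = stored[i * size:(i + 1) * size]
        probes += 1                      # step one: one partial-compare probe
        candidates = [x for x in subset if (x & mask) == (incoming & mask)]
        for tag in candidates:           # step two: serial full compares
            probes += 1
            if tag == incoming:
                return True, probes
    return False, probes

stored = [0x1111, 0x2222, 0x3333, 0x4444, 0x5555, 0x6666, 0x7777, 0x8888]

hit, probes = lookup(stored, 0x5555)     # resides in the second subset
assert hit and probes == 3               # 2 partial compares + 1 full compare

hit, probes = lookup(stored, 0x9999)     # miss with no false matches
assert not hit and probes == s
```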
At one extreme (where s = a), partial compares with subsets would be implemented as the
naive approach, while the other (s = 1) can lead to many false matches. An important question
to ask is: what number of subsets leads to the best performance (i.e., fewest number of probes
per cache lookup)? The next three answers to this question vary from the most accurate to the
most succinct. (1) One can compute the expected number of probes for each of s = 1, 2, 4, ...,
a/2 and a using the equations for a hit and miss (from Table 1), weighted to reflect your
expected miss ratio, and choose the minimum. (2) One can ignore misses (which are less
common and never require more than twice the probes of hits), assume variables are
continuous, and find the optimum partial compare width, k_opt = log2(t) - 1/2, for hits only.
The optimum number of subsets for hits and misses together is likely to be the value for s
resulting from a partial compare width of floor(k_opt) or ceil(k_opt). (3) Finally, one can
observe that many tags in current caches are between 16 and 32 bits wide, implying the
number of subsets that gives at least four-bit partial compares will work well.

6 Note that subsets are not useful with the naive and MRU approaches.

Table 1 summarizes our analysis of the expected number of probes required for the traditional,
naive, MRU and partial compare approaches to implementing set-associativity. Note that this
table as well as most of the trace-driven simulation assumes 16 bit tags are used. We will
examine the positive effect of increasing the tag width on the partial compare approach in
Section 3.

Method            Subsets  Tag Memory Width   Probes (Hit)             Probes (Miss)
Traditional       1        a × t              1                        1
Naive             1        t                  (a-1)/2 + 1              a
MRU               1        t                  1 + Σ i·f_i              1 + a
Partial compares  1        max(t, a×k)        2 + (a-1)/2^(k+1)        1 + a/2^k
  with subsets    s        max(t, (a/s)×k)    ... + (a-1)/2^(k+1)      s + a/2^k

Table 1. Performance of Set-Associativity Implementations.

For various methods and associativities this table gives the number of subsets, the tag memory
width, the number of probes for a hit, and the number for a miss. The table assumes t-bit tags
(t = 16), k-bit partial compares, and that the i-th most-recently used tag matches with
probability f_i on a hit. Note how an increase from 1 to 2 subsets improved the predicted
performance of the partial compare approach at an associativity of 8.

Table 2 summarizes paper implementations of tag memory and comparison logic for a
direct-mapped cache, a traditional implementation of set-associativity, and an implementation
of set-associativity using MRU and partial compares. We found that the MRU and partial
compare implementations have a slower access time than the traditional implementation of
associativity but include no implementation surprises. Most notably, the control logic was
found to be of reasonable complexity. The MRU and partial compare implementations use
hardware similar to a direct-mapped cache and can make effective use of page-mode dynamic
RAMs, as would other serial implementations of set-associativity. Page-mode dynamic RAMs
are those in which the access time of multiple probes to the same set can be significantly less
than if the probes were to other sets. Subsequent probes take less than half the time of the first
probe to the set. Cache cost is reduced in two ways when using one of the alternative
implementations of associativity. First, tag memory cost is directly reduced, by 1/3 to 1/2 in
our design. Second, cache data memory cost is reduced since only 1, rather than a, words need
to be read at a time.
`3. Trace-Driven Performance Comparison
This section analyzes the performance of the low-cost schemes to implement set-associativity
in level two caches using simulation with relatively short multiprogramming traces. We
analyze associativity in the level two cache since the low cost implementations of associativity
are more appropriate for level two (or higher) caches than for level one caches. We concentrate
on presenting and characterizing the relative performance of