`Architecture
`News
`
`A Publication of the
`~
`Association for Computing Machinery
`Special Interest Group on Computer Architecture
`
Vol. 24, No. 2, May 1996
`
`.. ,.. - ~ - - ....... -
`
`SPECIAL ISSUE
`
ISCA '96 PROCEEDINGS
`
`PROCEEDINGS OF THE 23RD
`ANNUAL INTERNATIONAL
`SYMPOSIUM ON COMPUTER
`ARCHITECTURE
`
May 22-24, 1996
`
`Philadelphia, Pennsylvania
`
`001
`
`
`
`Proceedings
`
`The 23rd Annual International
`Symposium on
`
COMPUTER ARCHITECTURE
`
`May 22-24, 1996
`
`Philadelphia, Pennsylvania
`
`Sponsored by
`
`ACM SIGARCH
`IEEE Computer Society, TCCA
`
`
`002
`
`
`
`The Association for Computing Machinery
`1515 Broadway
`New York, N.Y. 10036
`
Copyright © 1996 by Association for Computing Machinery, Inc. (ACM). Permission to make
digital or hard copies of part or all of this work for personal or classroom use is granted
without fee provided that the copies are not made or distributed for profit or commercial
advantage and that copies bear this notice and the full citation on the first page. Copyrights
for components of this work owned by others than ACM must be honored. Abstracting with
credit is permitted. To copy otherwise, to republish, to post on servers, or to redistribute to
lists, requires prior specific permission and/or a fee. Request permission to republish from:
Publications Dept. ACM, Inc. Fax +1 (212) 869-0481 or <permissions@acm.org>.
For other copying of articles that carry a code at the bottom of the first or last page or screen
display, copying is permitted provided that the per-copy fee indicated in the code is paid
through the Copyright Clearance Center, 222 Rosewood Drive, Danvers, MA 01923.
`
`ACM ISBN: 0-89791-786-3
`
`Additional copies may be ordered prepaid from:
`
`ACM Order Department
`P.O. Box 12114
`Church Street Station
`New York, N.Y. 10257
`
`Phone: 1-800-342-6626
`(U.S.A. and Canada)
`1-212-626-0500
`(All other countries)
`Fax: 1-212-944-1318
`E-mail: acmhelp@acm.org
acm_europe@acm.org (in Europe)
`
`ACM Order Number: 415960
`
`Printed in the U.S.A.
`
`003
`
`
`
`Table of Contents
`
`General Chair's Message ................................................................................. viii
`Program Chair's Message .................................................................................. ix
`Organizing Committees ..................................................................................... x
`Reviewers ................................................................................................... xi
`
`Session I : Keynote Address .......................................................................... 1
`Forest Baskett
`
`Session IIA : Branch Prediction
`Joel Emer, Chair ................................................ 2
`Using Hybrid Branch Predictors to Improve Branch Prediction Accuracy in the Presence of Context Switches ... 3
Marius Evers, Po-Yung Chang, and Yale N. Patt
`
`An Analysis of Dynamic Branch Prediction Schemes on System Workloads ..................................... 12
`Nicolas Gloy, Cliff Young, J. Bradley Chen, and Michael D. Smith
`
`Correlation and Aliasing in Dynamic Branch Predictors ....................................................... 22
`Stuart Sechrest, Chih-Chieh Lee, and Trevor N. Mudge
`
Session IIB : Shared Memory
`Per Stenstrom, Chair .............................................. 33
`Decoupled Hardware Support for Distributed Shared Memory ................................................. 34
`Steven K. Reinhardt, Robert W. Pfile, and David A. Wood
`
MGS: A Multigrain Shared Memory System .................................................................. 44
`Donald Yeung, John Kubiatowicz, and Anant Agarwal
`
COMA: an Opportunity for Building Fault-Tolerant Scalable Shared Memory Multiprocessors .................. 56
Christine Morin, Alain Gefflaut, Michel Banatre, and Anne-Marie Kermarrec
`
Session III : Processor/Memory Tradeoffs David Patterson, Chair ............................. 66
`Evaluation of Design Alternatives for a Multiprocessor Microprocessor ......................................... 67
Basem A. Nayfeh, Lance Hammond, and Kunle Olukotun
`
`Memory Bandwidth Limitations of Future Microprocessors .................................................... 78
`Doug Burger, James R. Goodman, and Alain Kagi
`
Missing the Memory Wall: The Case for Processor/Memory Integration ........................................ 90
`Ashley Saulsbury, Fong Pong, and Andreas Nowatzyk
`
Panel: Research Opportunities and Critiques: An Industrial Perspective ............. 102
`David Patterson, organizer
`
`
`004
`
`
`
`Session VA : Cache Organization
`Corinna Lee, Chair .......................................... 103
Don't use the page number, but a pointer to it ............................................................. 104
`Andre Seznec
`
`The Difference-bit Cache ..................................................................................... 114
`Toni Juan, Tomas Lang, and Juan J. Navarro
`
`Session VB : Application Implications for MP Systems
`Thomas Gross, Chair ........... 121
Understanding Application Performance on Shared Virtual Memory Systems ................................ 122
Liviu Iftode, Jaswinder Pal Singh, and Kai Li
`
`Application and Architectural Bottlenecks in Large Scale Distributed Shared Memory Machines ............... 134
`Chris Holt, Jaswinder Pal Singh, and John Hennessy
`
`Session VIA : Superscalar Memory Systems Matt Farrens, Chair ........................... 146
Increasing Cache Port Efficiency for Dynamic Superscalar Microprocessors ................................... 147
`Kenneth M. Wilson, Kunle Olukotun, and Mendel Rosenblum
`
High-Bandwidth Address Translation for Multiple-Issue Processors ........................................... 158
`Todd M. Austin and Gurindar S. Sohi
`
Session VIB : I/O and Interrupts
`Jai Menon, Chair ............................................ 168
DCD-Disk Caching Disk: A New Approach for Boosting I/O Performance ..................................... 169
Yiming Hu and Qing Yang
`
`Polling Watchdog: Combining Polling and Interrupts for Efficient Message Handling .......................... 179
`Olivier Maquelin, Guang R. Gao, Herbert H. J. Hum, Kevin Theobald, and Xinmin Tian
`
Invited Address ....................................................................................... 189
`50 years ago: Presper Eckert, John Mauchly, and the Origins of Computer Architecture
`Maurice Wilkes
`
Eckert-Mauchly Award ............................................................................. 189
`Yale Patt, recipient
`
`\
`Session VIlA : Processor Microarchit.ecture
`James E. Smith, Chair ......................... 190
Exploiting Choice: Instruction Fetch and Issue on an Implementable Simultaneous Multithreading Processor . 191
`Dean M. Tullsen, Susan J. Eggers, Joel S. Emer, Henry M. Levy, Jack L. Lo, and Rebecca L. Stamm
`
Evaluation of Multithreaded Uniprocessors for Commercial Application Environments ........................ 203
Rick J. Eickmeyer, Ross E. Johnson, Steve R. Kunkel, Shiafun Liu, and Mark S. Squillante
`
Performance Comparison of ILP Machines with Cycle Time Evaluation ....................................... 213
`Tetsuya Hara, Hideki Ando, Chikako Nakanishi, Masao Nakaya
`
`
`005
`
`
`
Session VIIB : Networks Kai Li, Chair ........................................................... 225
`Rotating Combined Queueing (RCQ):
`Bandwidth and Latency Guarantees in Low-Cost, High-Performance Networks ................................ 226
`Jae H. Kim and Andrew A. Chien
`
`A Router Architecture for Real-Time Point-to-Point Networks ................................................ 237
`Jennifer Rexford, John Hall, and Kang G. Shin
`
`Coherent Network Interfaces for Fine-Grain Communication .................................................. 247
`Shubhendu S. Mukherjee, Babak Falsafi, Mark D. Hill, and David A. Wood
`
`Session VIII : Performance Evaluation and Optimization
`Patricia Teller, Chair ......... 259
`Informing Memory Operations: Providing Memory Performance Feedback in Modern Processors ............... 260
`Mark Horowitz, Margaret Martonosi, Todd C. Mowry, and Michael D. Smith
`
`\
`Instruction Prefetching of Systems Codes With Layout·Dptimized for Reduced Cache Misses .................. 271
`Chun Xia and Josep Torrellas
`
`Compiler and Hardware Support for Cache Coherence in Large-Scale Multiprocessors: Design Considerations and
`Performance Evaluation ...................................................................................... 283
`Lynn Choi and Pen-Chung Yew
`
`Session IX : Systems
`Steve Scott, Chair ............................................................ 295
`Early Experience with Message-Passing on the SHRIMP Multicomputer ...................................... 296
`Edward W. Felten, Richard D. Alpert, Angelos Bilas, Matthias A. Blumrich, Douglas W. Clark,
Stefanos N. Damianakis, Cezary Dubnicki, Liviu Iftode, and Kai Li
`
STiNG: A CC-NUMA Computer System for the Commercial Marketplace .................................... 308
`Tom Lovett and Russell Clapp
`
Author Index .............................................................................................. 318
`
`
`006
`
`
`
`Author Index
`
Anant Agarwal ..................... 44
`Richard D. Alpert ................ 296
`Hideki Ando ...................... 213
`Todd M. Austin ................... 158
`Michel Banatre .................... 56
`Forest Baskett ...................... 1
`Angelos Bilas ..................... 296
`Matthias A. Blumrich ............. 296
`Doug Burger ....................... 78
`Po-Yung Chang ..................... 3
`J. Bradley Chen ................... 12
`Andrew A. Chien ................. 226
Lynn Choi ........................ 283
Russell Clapp ..................... 308
`Douglas W. Clark ................. 296
`Stefanos N. Damianakis ........... 296
`Cezary Dubnicki .................. 296
`Susan J. Eggers ................... 191
`Rick J. Eickmeyer ................. 203
`Joel S. Emer ...................... 191
`Marius Evers ........................ 3
`Babak Falsafi ..................... 247
`Edward W. Felten ................ 296
`Guang R. Gao .................... 179
Alain Gefflaut ...................... 56
`Nicolas Gloy ....................... 12
`James R. Goodman ................ 78
`John Hall ......................... 237
`Lance Hammond ................... 67
`Tetsuya Hara ..................... 213
`John Hennessy .................... 134
`Mark D. Hill ...................... 247
`Chris Holt ........................ 134
`Mark Horowitz .................... 260
`Yiming Hu ........................ 169
`Herbert H. J. Hum ................ 179
`Liviu Iftode .................. 122, 296
`Ross E. Johnson .................. 203
Toni Juan ........................ 114
Alain Kagi ........................ 78
`Anne-Marie Kermarrec ............. 56
`Jae H. Kim ....................... 226
`John Kubiatowicz .................. 44
`Steve R. Kunkel .................. 203
`Tomas Lang ...................... 114
`
`Chih-Chieh Lee .................... 22
`Henry M. Levy ................... 191
`Kai Li ....................... 122, 296
`Shiafun Liu ....................... 203
`Jack L. Lo ........................ 191
`Tom Lovett ....................... 308
`Olivier Maquelin .................. 179
`Margaret Martonosi ............... 260
`Christine Morin .................... 56
`Todd C. Mowry ................... 260
`Trevor N. Mudge ................... 22
`Shubhendu S. Mukherjee .......... 247
`Chikako Nakanishi ................ 213
`Masao Nakaya .................... 213
`Juan J. Navarro ................... 114
Basem A. Nayfeh .................. 67
`Andreas Nowatzyk ................. 90
`Kunle Olukotun ............... 67, 147
`Yale N. Patt ......................... 3
`Robert W. Pfile .................... 34
`Fong Pong ......................... 90
`Steven K. Reinhardt ............... 34
`Jennifer Rexford .................. 237
`Mendel Rosenblum ................ 147
Ashley Saulsbury .................. 90
Stuart Sechrest .................... 22
`Andre Seznec ..................... 104
`Kang G. Shin ..................... 237
Jaswinder Pal Singh .......... 122, 134
`Michael D. Smith ............. 12, 260
`Gurindar S. Sohi .................. 158
Mark S. Squillante ................ 203
`Rebecca L. Stamm ................ 191
`Kevin Theobald ................... 179
`Xinmin Tian ...................... 179
Josep Torrellas .................... 271
`Dean M. Tullsen .................. 191
Kenneth M. Wilson ............... 147
`David A. Wood ............... 34, 247
`Chun Xia ......................... 271
`Qing Yang ........................ 169
`Donald Yeung ...................... 44
`Pen-Chung Yew ................... 283
`Cliff Young ........................ 12
`
`318
`
`007
`
`
`
`Polling Watchdog: Combining Polling and Interrupts
`for Efficient Message Handling
`
Olivier Maquelin, Guang R. Gao, Herbert H. J. Hum†,
Kevin B. Theobald, Xin-Min Tian
`
`School of Computer Science
`McGill University
`3480 University St.
Montreal, Canada, H3A 2A7
`
† Dept. of Elec. and Comp. Eng.
`Concordia University
`1455 de Maisonneuve W.
`Montreal, Canada, H3G 1M8
`
`Abstract
`
Parallel systems supporting multithreading, or message passing in general, have typically used either polling or interrupts to handle incoming messages. Neither approach is ideal; either may lead to excessive overheads or message-handling latencies, depending on the application. This paper investigates a combined approach, the Polling Watchdog, where both are used depending on the circumstances. The Polling Watchdog is a simple hardware extension that limits the generation of interrupts to the cases where explicit polling fails to handle the message quickly. As an added benefit, this mechanism also has the potential to simplify the interaction between interrupts and the network accesses performed by the program.

We present the resulting performance for the EARTH-MANNA-S system, an implementation of the EARTH (Efficient Architecture for Running THreads) execution model on the MANNA multiprocessor. In contrast to the original EARTH-MANNA system, this system does not use a dedicated communication processor. Rather, synchronization and communication tasks are performed on the same processor as the regular computations. Therefore, an efficient message-handling mechanism is essential to good performance. Simulation results and performance measurements show that the Polling Watchdog indeed performs better than either polling or interrupts alone. In fact, this mechanism allows the EARTH-MANNA-S system to achieve the same level of performance as the original EARTH-MANNA multithreaded system.
`
`1
`
`Introduction
`
Today, fast microprocessors are commonly used as building blocks in large multiprocessor systems, due to their excellent price/performance ratios. Unfortunately, since these processors are driven by mass markets, they are mostly optimized for use in personal computers and workstations, and offer only limited support for interprocessor communication. Nevertheless, the high development costs of current hardware make it virtually impossible for custom hardware to compete with "killer micros" [3], especially given the relatively small size of the market for supercomputing. If parallel architects have to "give in" and use off-the-shelf microprocessors, then it is important to study how best to take advantage of their capabilities.
Considerable gains in communication performance can be achieved by making the network interface accessible by user-level code [8] and by using low-latency message passing mechanisms such as Active Messages [20]. As designers try to improve communication performance further, especially for short messages, ways must be found for the system to react to network events quickly and cost-effectively.
We believe that the multithreaded program execution model has great potential for future parallel systems due to its ability to tolerate the communication and synchronization latencies inherent in parallel architectures [1, 5, 7, 14, 17]. As long as there is enough parallelism in an application, a multithreaded architecture can hide these latencies by switching to another thread. However, this necessitates the smooth integration of asynchronous events into the computation [17]. Since multithreaded systems face the same competitive pressures as other multiprocessors, it may be necessary to implement this integration using stock hardware.
In this paper, we present a message-handling mechanism designed for the EARTH-MANNA-S system, an implementation of the EARTH execution model [9, 10, 11] on the MANNA multiprocessor [4]. This system, which is described in more detail in section 2.1, implements a multithreading environment on a machine based on Intel i860 XP RISC processors. In contrast to the original EARTH-MANNA implementation [9, 10] which served as an emulation platform for our investigations of EARTH concepts, EARTH-MANNA-S does not use a dedicated communication co-processor. Instead, in each node a single CPU handles both computations and communications. Thus, the experiences of our EARTH-MANNA-S system are generally applicable to multithreaded runtime systems implemented on conventional multiprocessors consisting of single-CPU nodes.
On most machines in which a single CPU performs both functions, one of two mechanisms is used to handle incoming messages. Either there are explicit polls in the application code (added by the user or compiler), or incoming messages generate interrupts in the CPU. However, using either polling or interrupts exclusively can be inefficient, especially when message traffic is high and its irregular pattern is not statically analyzable. If polling is used, but the message frequency is much lower than the polling frequency, the polling overhead will be incurred many times for each message. If, on the other hand, interrupts are used and the message frequency is high, the interrupt costs can be significant, even if the interrupt handler is able to service all outstanding messages when it is invoked. There is a definite need for a better solution, as the frequency of network events can vary dramatically from program to program and even during one program's execution.
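This tradeoff can be made concrete with a back-of-the-envelope cost model. The cycle counts below are illustrative assumptions for this sketch, not measurements from the paper; the point is only that the cheaper mechanism flips as the message rate changes relative to the polling rate:

```python
# Illustrative per-message cost model for polling vs. interrupts.
# The cycle counts are made-up round numbers, NOT EARTH-MANNA-S figures.
POLL_COST = 10        # cycles per executed poll instruction (assumed)
INTERRUPT_COST = 400  # cycles per interrupt taken (assumed)

def polling_cost_per_message(polls_per_message):
    """Every poll is paid for, whether or not a message is waiting."""
    return POLL_COST * polls_per_message

def interrupt_cost_per_message(messages_per_interrupt=1):
    """One interrupt handler invocation may drain several queued messages."""
    return INTERRUPT_COST / messages_per_interrupt

# Low traffic: 100 polls elapse per message, so polling overhead dominates.
assert polling_cost_per_message(100) > interrupt_cost_per_message(1)
# High traffic: a message arrives at nearly every poll, so polling wins.
assert polling_cost_per_message(1) < interrupt_cost_per_message(4)
```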
The approach described in this paper seeks to combine the advantages of both polling and interrupts. A simple extension to the hardware, called a Polling Watchdog, generates interrupts only when explicit polling fails to handle incoming messages in a timely manner. In a conventional OS, a watchdog timer may be used by the kernel to regulate interrupt generation in order to prevent a process from running too long [19]. Similar mechanisms have also been used in real-time systems to improve response time. The Polling Watchdog proposed in this paper is targeted toward general-purpose multiprocessor architectures with multithreaded program execution models. It is to be used in conjunction with normal user-level polling to improve the performance of message handling (data communication and synchronization) between multiple threads within one user application, and no OS intervention is required.

While the hardware for the Polling Watchdog has not been implemented, we have modeled its behavior in two ways. In the first method, the second CPU of each MANNA node is used to approximate the Polling Watchdog's function. The second method uses a detailed cycle-by-cycle simulation of EARTH-MANNA-S with a Polling Watchdog added (see section 3.3).
`In this paper, we make the following contributions:
`
• We propose a combined approach for efficient message handling. The Polling Watchdog uses interrupts, but limits them to the cases where explicit polling fails to handle messages quickly.

• In the presence of interrupts, some kind of locking mechanism is needed to guarantee atomic access to shared resources. We propose a simple solution for that problem under our scheme.

• We compare experimentally the per-message overheads and latencies of several message-handling mechanisms. Our experiments cover both the simulation of a hardware implementation of the Polling Watchdog and the current implementation in the EARTH-MANNA-S system. Among our findings:

- In terms of latency, the Polling Watchdog clearly achieves better performance than a naive polling-only approach, and the same performance as an optimal polling-only solution.

- The overhead per message of our scheme is considerably lower than for an interrupt-based implementation, especially in the presence of fine-grained computations. Polling-only approaches can yield even lower overheads, but only at the cost of programmer intervention and with a higher sensitivity to variations of the message-arrival rate.
`
• We also demonstrate the performance gains achieved for a representative set of benchmark programs. These results are very encouraging. Although the EARTH-MANNA-S system does not have a dedicated communication processor, it is able to achieve the same level of performance as the original EARTH-MANNA system.

• We give some preliminary guidelines on how to choose a timeout value for the Polling Watchdog. The effect of the timeout value on execution time is shown for the same set of benchmark programs.
`
The rest of the paper is organized as follows: the next section briefly describes the EARTH model and the EARTH-MANNA-S system, in particular the challenges facing designers of such a system. Section 3 describes the Polling Watchdog mechanism, as well as the implications for guaranteeing atomicity. Section 4 discusses the performance of several message handling mechanisms, based both on experiments with the EARTH-MANNA-S system and on simulation results. Then, section 5 shows the performance gains achieved for a representative set of benchmark programs and compares these results with the performance of the original EARTH-MANNA system. Finally, related work is mentioned and some conclusions are drawn.
`
2 The EARTH-MANNA-S System
`
The issues discussed in this paper were encountered while mapping the EARTH model to a machine with only a single CPU per node. Both the original EARTH-MANNA system and the EARTH-MANNA-S system run on the MANNA multiprocessor, which has two CPUs per node. However, while the EARTH-MANNA implementation dedicates one of them to communication and synchronization, the EARTH-MANNA-S system uses only a single CPU. This second implementation was made in order to assess the usefulness of a dedicated communication processor and to study the problems involved in porting the EARTH model to single-CPU node architectures.
`
2.1 The EARTH Programming Model
EARTH (Efficient Architecture for Running THreads) [9] supports a multithreaded program execution model in which code is divided into threads that are scheduled atomically using dataflow-like synchronization operations [9, 10, 11]. These "EARTH operations" comprise a rich set of primitives, including remote loads and stores, synchronization operations, block data transfers, remote function calls and dynamic load balancing. EARTH operations are initiated by the threads themselves. Once a thread is started, it runs to completion, and instructions within it are executed in sequential order.1 Therefore, a conventional processor can execute a thread efficiently, even when the thread is purely sequential. For this reason, it is possible to obtain single-node performance close to that of a purely sequential implementation, as shown in our earlier work and recapitulated in section 5.
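As a rough illustration of this style of execution, consider the following minimal sketch. The names and data structures are our own, not EARTH's actual primitives: a thread carries a synchronization count, becomes ready when the count reaches zero, and then runs to completion without preemption.

```python
from collections import deque

# Hypothetical sketch of dataflow-like thread scheduling in the spirit of
# EARTH; SyncSlot, sync, and run_scheduler are illustrative names only.
class SyncSlot:
    def __init__(self, count, thread):
        self.count = count    # number of sync signals still awaited
        self.thread = thread  # callable run to completion once ready

ready = deque()  # threads whose sync count has reached zero

def sync(slot):
    """A dataflow-like synchronization signal (e.g. a remote store done)."""
    slot.count -= 1
    if slot.count == 0:
        ready.append(slot.thread)

def run_scheduler():
    results = []
    while ready:
        thread = ready.popleft()
        thread(results)  # runs to completion; no preemption mid-thread
    return results

slot = SyncSlot(2, lambda out: out.append("thread fired"))
sync(slot)                 # first signal: thread not yet ready
assert not ready
sync(slot)                 # second signal: count hits zero, thread scheduled
final = run_scheduler()
assert final == ["thread fired"]
```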
Conceptually, each EARTH node consists of an Execution Unit (EU), which executes the threads, and a Synchronization Unit (SU), which performs the EARTH operations requested by the threads (see figure 1). The current hardware designs for EARTH use an off-the-shelf high-end RISC processor for the EU and custom hardware for the SU. However, other implementations are also possible.
The EARTH model was implemented on the MANNA [4] architecture. Each MANNA node contains two Intel i860 XP RISC CPUs, 32 MB of local memory and a 50 MB/s bidirectional network interface. The interconnection network is based on 16 x 16 crossbars, which support the full link bandwidth. Because two CPUs are available per node, one of them is used as the EU, executing code compiled from the user program, while the other CPU runs a fixed set of routines that emulate the functions of the SU. The resulting system is called EARTH-MANNA [9, 13].

1 Instructions may be executed out of order, as on a superscalar machine, as long as the semantics of the sequential ordering are obeyed.

[Figure 1: The EARTH Architecture]

2.2 Mapping EARTH to a Single CPU

Since most multiprocessor systems only have a single CPU per node, an obvious question was whether the EARTH model can be implemented with competitive performance and efficiency on such machines, and what key issues and challenges must be addressed. To answer these questions, we did another implementation of the EARTH model on MANNA called EARTH-MANNA-S, which uses only one of the two CPUs to perform the functions of both the EU and SU.

With only a single CPU to execute both the program code and the multithreading support code, it is necessary to find an efficient way to switch from one to the other. For initiating external requests and even for context switching this is not too much of a problem, as this occurs at specific locations in the code. The operations can therefore be expanded inline or implemented as function calls. Efficient user-level access to the network can dramatically reduce the overheads for message sends [20] and the costs of context switching can also be kept quite low with appropriate compiler support [13]. However, asynchronous reception of messages is more difficult, as it has to be merged into the flow of control dynamically, at run-time. This problem is not unique to the EARTH model; others have encountered similar difficulties [17].

Two main factors determine the efficiency of a message handling mechanism: the overhead per message and the average handling latency (time to respond to an incoming message). Low per-message overheads are crucial when large numbers of short messages are exchanged. This can easily be the case with the EARTH-MANNA-S system, as the efficient support for multithreading makes it possible to exploit parallelism at a fine level of granularity, down to thread lengths of just a few hundred instructions. Low response times are important when there is little parallelism. Multithreading systems are able to tolerate long response times by running other threads while the communication is in progress, but only if enough threads are available to hide all the delays.

The questions mentioned above can therefore be narrowed down to the following: Is it possible to keep both the per-message overhead and the latency of message handling to sufficiently low values on machines with a single off-the-shelf microprocessor per node? Low latencies can be achieved with interrupts, but on today's RISC CPUs this mechanism is quite expensive. The lowest overheads are achieved with explicit polling, but how do we then guarantee short latencies without compromising sequential execution performance? The rest of this paper will address that question by describing the message handling mechanism implemented in the EARTH-MANNA-S system.

3 The Polling Watchdog
`
Explicit polling and interrupts both have their strengths and weaknesses. Polling has been used for many message-passing implementations designed to support small messages efficiently, such as the Active Messages implementation on the CM-5 [17]. However, interrupts have the distinct advantage of being generated only when a message arrives. Poll instructions are always executed, whether or not a message is there to be processed. Therefore, polling becomes wasteful when the frequency of messages is much lower than the polling frequency. Addressing this problem simply by reducing the polling frequency does not always lead to better performance, because even when the number of messages is low, it might be necessary to respond to these messages quickly in order to achieve good overall efficiency.

This may not be a problem for systems such as the TAM implementation on the CM-5 [6] (we call this TAM/CM-5), where polling once per thread is sufficient due to the fine granularity of that model. In that particular case, polling can be integrated with the other multithreading support operations. However, for other models such as EARTH, where larger threads are common, polling once per thread is not always sufficient. Additional polling instructions could be added at appropriate locations in the code, either by the programmer or by the compiler. However, this can be difficult to implement in practice. If at all possible, such machine-specific issues should be hidden from the user.
`
3.1 Mixing Polling and Interrupts
If polling and interrupts both have advantages and disadvantages, would it be possible to let the machine switch from one mode to the other at run-time, depending on circumstances? To answer this question, we developed the Polling Watchdog mechanism. The EARTH-MANNA-S system polls the network during context switches, as is done, for instance, in TAM/CM-5. When a message arrives, a timer starts counting. If the message is not removed from the network through polling within a given amount of time (the watchdog timeout Twdog), the watchdog interrupts the CPU.2 This is shown in the state diagram in figure 2, in which transitions occur every clock cycle (Twdog is measured in clock cycles).
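A minimal cycle-level sketch of this state machine follows. The timeout value and the event traces are illustrative assumptions, and the sketch simplifies by letting the interrupt handler drain the pending message immediately, whereas the text notes the timer keeps counting until the message is actually removed:

```python
# Cycle-level sketch of the Polling Watchdog state machine described above.
T_WDOG = 5  # watchdog timeout in clock cycles (assumed value)

def simulate(events, t_wdog=T_WDOG):
    """events[t] is 'arrive', 'poll', or None at each clock cycle.
    Returns the cycles at which the watchdog raises an interrupt."""
    pending = False  # a message is waiting in the network
    timer = 0        # cycles the oldest unhandled message has been pending
    interrupts = []
    for t, ev in enumerate(events):
        if ev == "arrive" and not pending:
            pending, timer = True, 0
        elif ev == "poll":
            pending, timer = False, 0      # polling drained the message
        if pending:
            timer += 1
            if timer >= t_wdog:
                interrupts.append(t)       # explicit polling was too slow
                pending, timer = False, 0  # handler drains it (simplified)
    return interrupts

# Message at cycle 0 is polled at cycle 2: no interrupt needed.
assert simulate(["arrive", None, "poll", None, None, None, None]) == []
# Message at cycle 0 never polled: the watchdog interrupts after T_WDOG cycles.
assert simulate(["arrive"] + [None] * 10) == [4]
```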
Figure 3 illustrates the Polling Watchdog in action. In this example, a message arrives at time t1 and is handled through polling at time t2. At time t3 a second message arrives, which is handled through an interrupt at time t4 because the network was not polled quickly enough. Note that the watchdog timer is reset, but continues to count until the message is actually removed from the network. If the interrupt code fails to empty the network buffers, another
`
2 A similar mechanism can be seen in conventional operating systems, in which a watchdog timer is used to prevent a process from running too long [19].
`
`181
`
`010
`
`
`
limit for Twdog, but what about the upper limit? That limit depends on the amount of independent computations that can be performed after a message has been sent. If enough parallelism is available, a multithreaded system can hide substantial delays without noticeable performance degradation. On the other hand, if no parallelism is available it might be necessary to set Twdog to a very low value to get optimal performance. The optimal timeout value will in general depend on the program. However, our e