Computer Architecture News

A Publication of the Association for Computing Machinery
Special Interest Group on Computer Architecture

Vol. 24, No. 2 - May 1996

SPECIAL ISSUE

ISCA '96 PROCEEDINGS

PROCEEDINGS OF THE 23RD ANNUAL INTERNATIONAL
SYMPOSIUM ON COMPUTER ARCHITECTURE

May 22-24, 1996

Philadelphia, Pennsylvania

Proceedings

The 23rd Annual International
Symposium on

COMPUTER ARCHITECTURE

May 22-24, 1996

Philadelphia, Pennsylvania

Sponsored by

ACM SIGARCH
IEEE Computer Society, TCCA

The Association for Computing Machinery
1515 Broadway
New York, N.Y. 10036

Copyright © 1996 by Association for Computing Machinery, Inc. (ACM). Permission to make
digital or hard copies of part or all of this work for personal or classroom use is granted
without fee provided that the copies are not made or distributed for profit or commercial
advantage and that copies bear this notice and the full citation on the first page. Copyrights
for components of this work owned by others than ACM must be honored. Abstracting with
credit is permitted. To copy otherwise, to republish, to post on servers, or to redistribute to
lists, requires prior specific permission and/or a fee. Request permission to republish from:
Publications Dept. ACM, Inc. Fax +1 (212) 869-0481 or <permissions@acm.org>
For other copying of articles that carry a code at the bottom of the first or last page or screen
display, copying is permitted provided that the per-copy fee indicated in the code is paid
through the Copyright Clearance Center, 222 Rosewood Drive, Danvers, MA 01923.

ACM ISBN: 0-89791-786-3

Additional copies may be ordered prepaid from:

ACM Order Department
P.O. Box 12114
Church Street Station
New York, N.Y. 10257

Phone: 1-800-342-6626
(U.S.A. and Canada)
1-212-626-0500
(All other countries)
Fax: 1-212-944-1318
E-mail: acmhelp@acm.org
acm_europe@acm.org (in Europe)

ACM Order Number: 415960

Printed in the U.S.A.

Table of Contents

General Chair's Message ................................................................ viii
Program Chair's Message .................................................................. ix
Organizing Committees ..................................................................... x
Reviewers ................................................................................ xi

Session I: Keynote Address ................................................................ 1
Forest Baskett

Session IIA: Branch Prediction
Joel Emer, Chair .......................................................................... 2
Using Hybrid Branch Predictors to Improve Branch Prediction Accuracy in the Presence of Context Switches ... 3
Marius Evers, Po-Yung Chang, and Yale N. Patt

An Analysis of Dynamic Branch Prediction Schemes on System Workloads .................... 12
Nicolas Gloy, Cliff Young, J. Bradley Chen, and Michael D. Smith

Correlation and Aliasing in Dynamic Branch Predictors .................................... 22
Stuart Sechrest, Chih-Chieh Lee, and Trevor N. Mudge

Session IIB: Shared Memory
Per Stenstrom, Chair ..................................................................... 33
Decoupled Hardware Support for Distributed Shared Memory ................................. 34
Steven K. Reinhardt, Robert W. Pfile, and David A. Wood

MGS: A Multigrain Shared Memory System ................................................... 44
Donald Yeung, John Kubiatowicz, and Anant Agarwal

COMA: An Opportunity for Building Fault-Tolerant Scalable Shared Memory Multiprocessors .. 56
Christine Morin, Alain Gefflaut, Michel Banatre, and Anne-Marie Kermarrec

Session III: Processor/Memory Tradeoffs
David Patterson, Chair ................................................................... 66
Evaluation of Design Alternatives for a Multiprocessor Microprocessor .................... 67
Basem A. Nayfeh, Lance Hammond, and Kunle Olukotun

Memory Bandwidth Limitations of Future Microprocessors ................................... 78
Doug Burger, James R. Goodman, and Alain Kagi

Missing the Memory Wall: The Case for Processor/Memory Integration ....................... 90
Ashley Saulsbury, Fong Pong, and Andreas Nowatzyk

Panel: Research Opportunities and Critiques: An Industrial Perspective .................. 102
David Patterson, organizer

Session VA: Cache Organization
Corinna Lee, Chair ...................................................................... 103
Don't use the page number, but a pointer to it .......................................... 104
Andre Seznec

The Difference-bit Cache ................................................................ 114
Toni Juan, Tomas Lang, and Juan J. Navarro

Session VB: Application Implications for MP Systems
Thomas Gross, Chair ..................................................................... 121
Understanding Application Performance on Shared Virtual Memory Systems .................. 122
Liviu Iftode, Jaswinder Pal Singh, and Kai Li

Application and Architectural Bottlenecks in Large Scale Distributed Shared Memory Machines ... 134
Chris Holt, Jaswinder Pal Singh, and John Hennessy

Session VIA: Superscalar Memory Systems
Matt Farrens, Chair ..................................................................... 146
Increasing Cache Port Efficiency for Dynamic Superscalar Microprocessors ................ 147
Kenneth M. Wilson, Kunle Olukotun, and Mendel Rosenblum

High-Bandwidth Address Translation for Multiple-Issue Processors ........................ 158
Todd M. Austin and Gurindar S. Sohi

Session VIB: I/O and Interrupts
Jai Menon, Chair ........................................................................ 168
DCD - Disk Caching Disk: A New Approach for Boosting I/O Performance .................... 169
Yiming Hu and Qing Yang

Polling Watchdog: Combining Polling and Interrupts for Efficient Message Handling ....... 179
Olivier Maquelin, Guang R. Gao, Herbert H. J. Hum, Kevin Theobald, and Xinmin Tian

Invited Address ......................................................................... 189
50 years ago: Presper Eckert, John Mauchly, and the Origins of Computer Architecture
Maurice Wilkes

Eckert-Mauchly Award .................................................................... 189
Yale Patt, recipient

Session VIIA: Processor Microarchitecture
James E. Smith, Chair ................................................................... 190
Exploiting Choice: Instruction Fetch and Issue on an Implementable Simultaneous Multithreading Processor ... 191
Dean M. Tullsen, Susan J. Eggers, Joel S. Emer, Henry M. Levy, Jack L. Lo, and Rebecca L. Stamm

Evaluation of Multithreaded Uniprocessors for Commercial Application Environments ....... 203
Rick J. Eickmeyer, Ross E. Johnson, Steve R. Kunkel, Shiafun Liu, and Mark S. Squillante

Performance Comparison of ILP Machines with Cycle Time Evaluation ....................... 213
Tetsuya Hara, Hideki Ando, Chikako Nakanishi, Masao Nakaya

Session VIIB: Networks
Kai Li, Chair ........................................................................... 225
Rotating Combined Queueing (RCQ):
Bandwidth and Latency Guarantees in Low-Cost, High-Performance Networks ................. 226
Jae H. Kim and Andrew A. Chien

A Router Architecture for Real-Time Point-to-Point Networks ............................. 237
Jennifer Rexford, John Hall, and Kang G. Shin

Coherent Network Interfaces for Fine-Grain Communication ................................ 247
Shubhendu S. Mukherjee, Babak Falsafi, Mark D. Hill, and David A. Wood

Session VIII: Performance Evaluation and Optimization
Patricia Teller, Chair .................................................................. 259
Informing Memory Operations: Providing Memory Performance Feedback in Modern Processors ... 260
Mark Horowitz, Margaret Martonosi, Todd C. Mowry, and Michael D. Smith

Instruction Prefetching of Systems Codes With Layout Optimized for Reduced Cache Misses ... 271
Chun Xia and Josep Torrellas

Compiler and Hardware Support for Cache Coherence in Large-Scale Multiprocessors: Design Considerations and
Performance Evaluation .................................................................. 283
Lynn Choi and Pen-Chung Yew

Session IX: Systems
Steve Scott, Chair ...................................................................... 295
Early Experience with Message-Passing on the SHRIMP Multicomputer ....................... 296
Edward W. Felten, Richard D. Alpert, Angelos Bilas, Matthias A. Blumrich, Douglas W. Clark,
Stefanos N. Damianakis, Cezary Dubnicki, Liviu Iftode, and Kai Li

STiNG: A CC-NUMA Computer System for the Commercial Marketplace ......................... 308
Tom Lovett and Russell Clapp

Author Index ............................................................................ 318

Author Index

Anant Agarwal ..................... 44
Richard D. Alpert ................ 296
Hideki Ando ...................... 213
Todd M. Austin ................... 158
Michel Banatre .................... 56
Forest Baskett ..................... 1
Angelos Bilas .................... 296
Matthias A. Blumrich ............. 296
Doug Burger ....................... 78
Po-Yung Chang ...................... 3
J. Bradley Chen ................... 12
Andrew A. Chien .................. 226
Lynn Choi ........................ 283
Russell Clapp .................... 308
Douglas W. Clark ................. 296
Stefanos N. Damianakis ........... 296
Cezary Dubnicki .................. 296
Susan J. Eggers .................. 191
Rick J. Eickmeyer ................ 203
Joel S. Emer ..................... 191
Marius Evers ....................... 3
Babak Falsafi .................... 247
Edward W. Felten ................. 296
Guang R. Gao ..................... 179
Alain Gefflaut .................... 56
Nicolas Gloy ...................... 12
James R. Goodman .................. 78
John Hall ........................ 237
Lance Hammond ..................... 67
Tetsuya Hara ..................... 213
John Hennessy .................... 134
Mark D. Hill ..................... 247
Chris Holt ....................... 134
Mark Horowitz .................... 260
Yiming Hu ........................ 169
Herbert H. J. Hum ................ 179
Liviu Iftode ................ 122, 296
Ross E. Johnson .................. 203
Toni Juan ........................ 114
Alain Kagi ........................ 78
Anne-Marie Kermarrec .............. 56
Jae H. Kim ....................... 226
John Kubiatowicz .................. 44
Steve R. Kunkel .................. 203
Tomas Lang ....................... 114
Chih-Chieh Lee .................... 22
Henry M. Levy .................... 191
Kai Li ...................... 122, 296
Shiafun Liu ...................... 203
Jack L. Lo ....................... 191
Tom Lovett ....................... 308
Olivier Maquelin ................. 179
Margaret Martonosi ............... 260
Christine Morin ................... 56
Todd C. Mowry .................... 260
Trevor N. Mudge ................... 22
Shubhendu S. Mukherjee ........... 247
Chikako Nakanishi ................ 213
Masao Nakaya ..................... 213
Juan J. Navarro .................. 114
Basem A. Nayfeh ................... 67
Andreas Nowatzyk .................. 90
Kunle Olukotun ............... 67, 147
Yale N. Patt ....................... 3
Robert W. Pfile ................... 34
Fong Pong ......................... 90
Steven K. Reinhardt ............... 34
Jennifer Rexford ................. 237
Mendel Rosenblum ................. 147
Ashley Saulsbury .................. 90
Stuart Sechrest ................... 22
Andre Seznec ..................... 104
Kang G. Shin ..................... 237
Jaswinder Pal Singh ......... 122, 134
Michael D. Smith ............. 12, 260
Gurindar S. Sohi ................. 158
Mark S. Squillante ............... 203
Rebecca L. Stamm ................. 191
Kevin Theobald ................... 179
Xinmin Tian ...................... 179
Josep Torrellas .................. 271
Dean M. Tullsen .................. 191
Kenneth M. Wilson ................ 147
David A. Wood ................ 34, 247
Chun Xia ......................... 271
Qing Yang ........................ 169
Donald Yeung ...................... 44
Pen-Chung Yew .................... 283
Cliff Young ....................... 12

Polling Watchdog: Combining Polling and Interrupts
for Efficient Message Handling

Olivier Maquelin, Guang R. Gao, Herbert H. J. Hum†,
Kevin B. Theobald, Xin-Min Tian

School of Computer Science
McGill University
3480 University St.
Montreal, Canada, H3A 2A7

† Dept. of Elec. and Comp. Eng.
Concordia University
1455 de Maisonneuve W.
Montreal, Canada, H3G 1M8

Abstract

Parallel systems supporting multithreading, or message passing in general, have typically used either polling or interrupts to handle incoming messages. Neither approach is ideal; either may lead to excessive overheads or message-handling latencies, depending on the application. This paper investigates a combined approach, the Polling Watchdog, in which both are used depending on the circumstances. The Polling Watchdog is a simple hardware extension that limits the generation of interrupts to the cases where explicit polling fails to handle the message quickly. As an added benefit, this mechanism also has the potential to simplify the interaction between interrupts and the network accesses performed by the program.

We present the resulting performance for the EARTH-MANNA-S system, an implementation of the EARTH (Efficient Architecture for Running THreads) execution model on the MANNA multiprocessor. In contrast to the original EARTH-MANNA system, this system does not use a dedicated communication processor. Rather, synchronization and communication tasks are performed on the same processor as the regular computations. Therefore, an efficient message-handling mechanism is essential to good performance. Simulation results and performance measurements show that the Polling Watchdog indeed performs better than either polling or interrupts alone. In fact, this mechanism allows the EARTH-MANNA-S system to achieve the same level of performance as the original EARTH-MANNA multithreaded system.

1 Introduction

Today, fast microprocessors are commonly used as building blocks in large multiprocessor systems, due to their excellent price/performance ratios. Unfortunately, since these processors are driven by mass markets, they are mostly optimized for use in personal computers and workstations, and offer only limited support for interprocessor communication. Nevertheless, the high development costs of current hardware make it virtually impossible for custom hardware to
compete with "killer micros" [3], especially given the relatively small size of the market for supercomputing. If parallel architects have to "give in" and use off-the-shelf microprocessors, then it is important to study how best to take advantage of their capabilities.

Considerable gains in communication performance can be achieved by making the network interface accessible by user-level code [8] and by using low-latency message passing mechanisms such as Active Messages [20]. As designers try to improve communication performance further, especially for short messages, ways must be found for the system to react to network events quickly and cost-effectively.

We believe that the multithreaded program execution model has great potential for future parallel systems due to its ability to tolerate the communication and synchronization latencies inherent in parallel architectures [1, 5, 7, 14, 17]. As long as there is enough parallelism in an application, a multithreaded architecture can hide these latencies by switching to another thread. However, this necessitates the smooth integration of asynchronous events into the computation [17]. Since multithreaded systems face the same competitive pressures as other multiprocessors, it may be necessary to implement this integration using stock hardware.
In this paper, we present a message-handling mechanism designed for the EARTH-MANNA-S system, an implementation of the EARTH execution model [9, 10, 11] on the MANNA multiprocessor [4]. This system, which is described in more detail in section 2.1, implements a multithreading environment on a machine based on Intel i860 XP RISC processors. In contrast to the original EARTH-MANNA implementation [9, 10] which served as an emulation platform for our investigations of EARTH concepts, EARTH-MANNA-S does not use a dedicated communication co-processor. Instead, in each node a single CPU handles both computations and communications. Thus, the experiences of our EARTH-MANNA-S system are generally applicable to multithreaded runtime systems implemented on conventional multiprocessors consisting of single-CPU nodes.
On most machines in which a single CPU performs both functions, one of two mechanisms is used to handle incoming messages. Either there are explicit polls in the application code (added by the user or compiler), or incoming messages generate interrupts in the CPU. However, using either polling or interrupts exclusively can be inefficient, especially when message traffic is high and its irregular pattern is not statically analyzable. If polling is used, but the message frequency is much lower than the polling frequency, the polling overhead will be incurred many times for each message. If, on the other hand, interrupts are used and the message frequency is high, the interrupt costs can be significant, even if the interrupt handler is able to service all outstanding messages when it is invoked. There is a definite need for a better solution, as the frequency of network events can vary dramatically from program to program and even during one program's execution.
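
The tradeoff just described can be made concrete with a back-of-the-envelope model. The sketch below is ours, not the paper's; POLL_COST, INTERRUPT_COST, and the fixed polling rate are assumed values chosen only to illustrate how the amortized per-message overhead behaves at the two extremes.

    /*
     * Rough cost model of polling-only vs. interrupt-only message handling.
     * All constants are hypothetical cycle counts chosen for illustration;
     * they are not measurements from the EARTH-MANNA-S system.
     */
    #include <stdio.h>

    #define POLL_COST        10.0   /* cycles per executed poll (assumed)   */
    #define INTERRUPT_COST  500.0   /* cycles per taken interrupt (assumed) */

    /* Amortized overhead per message when polling at a fixed frequency:
     * every message pays for all the polls issued between two arrivals. */
    static double polling_overhead(double polls_per_sec, double msgs_per_sec)
    {
        return POLL_COST * (polls_per_sec / msgs_per_sec);
    }

    /* Overhead per message when every arrival raises an interrupt. */
    static double interrupt_overhead(void)
    {
        return INTERRUPT_COST;
    }

    int main(void)
    {
        const double polls_per_sec = 1.0e6;             /* fixed poll rate */
        const double msg_rates[] = { 1.0e3, 1.0e4, 1.0e5, 1.0e6 };

        for (int i = 0; i < 4; i++)
            printf("msgs/s %8.0f: polling %9.1f cycles/msg, interrupts %5.1f cycles/msg\n",
                   msg_rates[i],
                   polling_overhead(polls_per_sec, msg_rates[i]),
                   interrupt_overhead());
        return 0;
    }

With these assumed numbers, a polling-only scheme is cheap when messages arrive nearly as often as polls but costs thousands of cycles per message when traffic is sparse, while an interrupt-only scheme pays its large fixed cost regardless; since real programs move between these regimes, neither fixed choice is satisfactory.
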
The approach described in this paper seeks to combine the advantages of both polling and interrupts. A simple extension to the hardware, called a Polling Watchdog, generates interrupts only when explicit polling fails to handle incoming messages in a timely manner. In a conventional OS, a watchdog timer may be used by the kernel to regulate interrupt generation in order to prevent a process from running too long [19]. Similar mechanisms have also been used in real-time systems to improve response time. The Polling Watchdog proposed in this paper is targeted toward general-purpose multiprocessor architectures with multithreaded program execution models. It is to be used in conjunction with normal user-level polling to improve the performance of message handling (data communication and synchronization) between multiple threads within one user application, and no OS intervention is required.

While the hardware for the Polling Watchdog has not been implemented, we have modeled its behavior in two ways. In the first method, the second CPU of each MANNA node is used to approximate the Polling Watchdog's function. The second method uses a detailed cycle-by-cycle simulation of EARTH-MANNA-S with a Polling Watchdog added (see section 3.3).
In this paper, we make the following contributions:

• We propose a combined approach for efficient message handling. The Polling Watchdog uses interrupts, but limits them to the cases where explicit polling fails to handle messages quickly.

• In the presence of interrupts, some kind of locking mechanism is needed to guarantee atomic access to shared resources. We propose a simple solution for that problem under our scheme.

• We compare experimentally the per-message overheads and latencies of several message-handling mechanisms. Our experiments cover both the simulation of a hardware implementation of the Polling Watchdog and the current implementation in the EARTH-MANNA-S system. Among our findings:

  - In terms of latency, the Polling Watchdog clearly achieves better performance than a naive polling-only approach, and the same performance as an optimal polling-only solution.

  - The overhead per message of our scheme is considerably lower than for an interrupt-based implementation, especially in the presence of fine-grained computations. Polling-only approaches can yield even lower overheads, but only at the cost of programmer intervention and with a higher sensitivity to variations of the message-arrival rate.

• We also demonstrate the performance gains achieved for a representative set of benchmark programs. These results are very encouraging. Although the EARTH-MANNA-S system does not have a dedicated communication processor, it is able to achieve the same level of performance as the original EARTH-MANNA system.

• We give some preliminary guidelines on how to choose a timeout value for the Polling Watchdog. The effect of the timeout value on execution time is shown for the same set of benchmark programs.
The rest of the paper is organized as follows: the next section briefly describes the EARTH model and the EARTH-MANNA-S system, in particular the challenges facing designers of such a system. Section 3 describes the Polling Watchdog mechanism, as well as the implications for guaranteeing atomicity. Section 4 discusses the performance of several message handling mechanisms, based both on experiments with the EARTH-MANNA-S system and on simulation results. Then, section 5 shows the performance gains achieved for a representative set of benchmark programs and compares these results with the performance of the original EARTH-MANNA system. Finally, related work is mentioned and some conclusions are drawn.

2 The EARTH-MANNA-S System

The issues discussed in this paper were encountered while mapping the EARTH model to a machine with only a single CPU per node. Both the original EARTH-MANNA system and the EARTH-MANNA-S system run on the MANNA multiprocessor, which has two CPUs per node. However, while the EARTH-MANNA implementation dedicates one of them to communication and synchronization, the EARTH-MANNA-S system uses only a single CPU. This second implementation was made in order to assess the usefulness of a dedicated communication processor and to study the problems involved in porting the EARTH model to single-CPU node architectures.

2.1 The EARTH Programming Model

EARTH (Efficient Architecture for Running THreads) [9] supports a multithreaded program execution model in which code is divided into threads that are scheduled atomically using dataflow-like synchronization operations [9, 10, 11]. These "EARTH operations" comprise a rich set of primitives, including remote loads and stores, synchronization operations, block data transfers, remote function calls and dynamic load balancing. EARTH operations are initiated by the threads themselves. Once a thread is started, it runs to completion, and instructions within it are executed in sequential order.¹ Therefore, a conventional processor can execute a thread efficiently, even when the thread is purely sequential. For this reason, it is possible to obtain single-node performance close to that of a purely sequential implementation, as shown in our earlier work and recapitulated in section 5.

Conceptually, each EARTH node consists of an Execution Unit (EU), which executes the threads, and a Synchronization Unit (SU), which performs the EARTH operations requested by the threads (see figure 1). The current hardware designs for EARTH use an off-the-shelf high-end RISC processor for the EU and custom hardware for the SU. However, other implementations are also possible.
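
The conceptual split between the EU and the SU can be pictured as two small loops exchanging work through queues. The toy program below is only an illustration of that division of labor; all names (ready_q, op_q, spawn, earth_op) are invented for this sketch and do not correspond to the EARTH runtime interface, and a real SU would of course talk to the network rather than to a local array.

    /*
     * Conceptual sketch of an EARTH node: an Execution Unit (EU) running
     * threads to completion, and a Synchronization Unit (SU) servicing the
     * EARTH operations those threads issue.  All names are hypothetical.
     */
    #include <stdio.h>

    #define QSIZE 16

    typedef void (*thread_fn)(void);   /* a thread runs to completion */

    static thread_fn ready_q[QSIZE]; static int rq_head, rq_tail;  /* SU -> EU */
    static thread_fn op_q[QSIZE];    static int oq_head, oq_tail;  /* EU -> SU */

    /* Make a thread runnable (what a completed synchronization would do). */
    static void spawn(thread_fn t) { ready_q[rq_tail++ % QSIZE] = t; }

    /* Issue an EARTH operation; its completion will enable `continuation`. */
    static void earth_op(thread_fn continuation) { op_q[oq_tail++ % QSIZE] = continuation; }

    static void thread_b(void) { puts("thread B runs once the EARTH operation completes"); }
    static void thread_a(void) { puts("thread A issues an EARTH operation"); earth_op(thread_b); }

    int main(void)
    {
        spawn(thread_a);
        while (rq_head != rq_tail || oq_head != oq_tail) {
            /* EU: run one ready thread to completion (no preemption). */
            if (rq_head != rq_tail)
                ready_q[rq_head++ % QSIZE]();
            /* SU: perform one pending EARTH operation; here its only visible
             * effect is the dataflow-like synchronization that readies the
             * continuation thread.  A real SU would also access the network. */
            if (oq_head != oq_tail)
                spawn(op_q[oq_head++ % QSIZE]);
        }
        return 0;
    }

The property the sketch preserves is that the EU never preempts a running thread; progress across threads comes only from EARTH operations completing and enqueuing their continuations.
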
The EARTH model was implemented on the MANNA [4] architecture. Each MANNA node contains two Intel i860 XP RISC CPUs, 32 MB of local memory and a 50 MB/s bidirectional network interface. The interconnection network is based on 16 x 16 crossbars, which support the full link bandwidth. Because two CPUs are available per node, one of them is used as the EU, executing code compiled from the user program, while the other CPU runs a fixed set of routines that emulate the functions of the SU. The resulting system is called EARTH-MANNA [9, 13].

[Figure 1: The EARTH Architecture]

¹ Instructions may be executed out of order, as on a superscalar machine, as long as the semantics of the sequential ordering are obeyed.

2.2 Mapping EARTH to a Single CPU

Since most multiprocessor systems only have a single CPU per node, an obvious question was whether the EARTH model can be implemented with competitive performance and efficiency on such machines, and what key issues and challenges must be addressed. To answer these questions, we did another implementation of the EARTH model on MANNA called EARTH-MANNA-S, which uses only one of the two CPUs to perform the functions of both the EU and SU.

With only a single CPU to execute both the program code and the multithreading support code, it is necessary to find an efficient way to switch from one to the other. For initiating external requests and even for context switching this is not too much of a problem, as this occurs at specific locations in the code. The operations can therefore be expanded inline or implemented as function calls. Efficient user-level access to the network can dramatically reduce the overheads for message sends [20] and the costs of context switching can also be kept quite low with appropriate compiler support [13]. However, asynchronous reception of messages is more difficult, as it has to be merged into the flow of control dynamically, at run-time. This problem is not unique to the EARTH model; others have encountered similar difficulties [17].

Two main factors determine the efficiency of a message handling mechanism: the overhead per message and the average handling latency (time to respond to an incoming message). Low per-message overheads are crucial when large numbers of short messages are exchanged. This can easily be the case with the EARTH-MANNA-S system, as the efficient support for multithreading makes it possible to exploit parallelism at a fine level of granularity, down to thread lengths of just a few hundreds of instructions. Low response times are important when there is little parallelism. Multithreading systems are able to tolerate long response times by running other threads while the communication is in progress, but only if enough threads are available to hide all the delays.
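
How much of that latency multithreading can actually hide follows from a simple observation: a remote operation stalls the node only for the part of its latency that is not covered by other ready work. The helper below is a toy model of ours, not a formula from the paper, and its cycle counts are purely illustrative.

    #include <stdio.h>

    /* Toy model (ours, not the paper's): a remote operation of latency L
     * stalls the node only for the part of L that is not overlapped with
     * work from other ready threads. */
    static double stall_cycles(double latency, double ready_work)
    {
        double uncovered = latency - ready_work;
        return (uncovered > 0.0) ? uncovered : 0.0;
    }

    int main(void)
    {
        /* Hypothetical numbers: a 2000-cycle remote operation. */
        printf("ample parallelism:  stall = %.0f cycles\n", stall_cycles(2000.0, 3000.0));
        printf("little parallelism: stall = %.0f cycles\n", stall_cycles(2000.0,  500.0));
        return 0;
    }

In the low-parallelism case the handling latency lands directly on the critical path, which is why short response times matter most precisely when few threads are ready.
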
The questions mentioned above can therefore be narrowed down to the following: Is it possible to keep both the per-message overhead and the latency of message handling to sufficiently low values on machines with a single off-the-shelf microprocessor per node? Low latencies can be achieved with interrupts, but on today's RISC CPUs this mechanism is quite expensive. The lowest overheads are achieved with explicit polling, but how do we then guarantee short latencies without compromising sequential execution performance? The rest of this paper will address that question by describing the message handling mechanism implemented in the EARTH-MANNA-S system.

3 The Polling Watchdog

Explicit polling and interrupts both have their strengths and weaknesses. Polling has been used for many message-passing implementations designed to support small messages efficiently, such as the Active Messages implementation on the CM-5 [17]. However, interrupts have the distinct advantage of being generated only when a message arrives. Poll instructions are always executed, whether or not a message is there to be processed. Therefore, polling becomes wasteful when the frequency of messages is much lower than the polling frequency. Addressing this problem simply by reducing the polling frequency does not always lead to better performance, because even when the number of messages is low, it might be necessary to respond to these messages quickly in order to achieve good overall efficiency.

This may not be a problem for systems such as the TAM implementation on the CM-5 [6] (we call this TAM/CM-5), where polling once per thread is sufficient due to the fine granularity of that model. In that particular case, polling can be integrated with the other multithreading support operations. However, for other models such as EARTH, where larger threads are common, polling once per thread is not always sufficient. Additional polling instructions could be added at appropriate locations in the code, either by the programmer or by the compiler. However, this can be difficult to implement in practice. If at all possible, such machine-specific issues should be hidden from the user.
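
To make the two placement options concrete, the fragment below sketches a poll folded into the normal thread switch (as in TAM/CM-5) and an extra compiler-inserted poll inside a long-running thread. The names poll_network and next_thread, and the stubs around them, are hypothetical; this is not the EARTH-MANNA-S runtime interface.

    #include <stdio.h>

    /* Hypothetical stand-ins for the runtime's poll and scheduling hooks. */
    static void poll_network(void) { /* would drain any pending network messages */ }
    static void demo_thread(void)  { puts("next thread runs to completion"); }
    static void (*next_thread(void))(void) { return demo_thread; }

    /* Poll folded into the thread switch: sufficient when threads are short. */
    static void thread_switch(void)
    {
        poll_network();
        next_thread()();                    /* run the next thread to completion */
    }

    /* With larger threads, additional polls have to be inserted by the
     * programmer or the compiler, e.g. once every 1024 loop iterations. */
    static void long_running_thread(double *a, long n)
    {
        for (long i = 0; i < n; i++) {
            a[i] = 2.0 * a[i] + 1.0;        /* the actual computation       */
            if ((i & 0x3FF) == 0)           /* compiler-inserted poll point */
                poll_network();
        }
    }

    int main(void)
    {
        double buf[4] = { 0.0, 1.0, 2.0, 3.0 };
        long_running_thread(buf, 4);
        thread_switch();
        return 0;
    }

The second form is exactly the kind of machine-specific detail that, as argued above, should not be left to the user.
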
3.1 Mixing Polling and Interrupts

If polling and interrupts both have advantages and disadvantages, would it be possible to let the machine switch from one mode to the other at run-time, depending on circumstances? To answer this question, we developed the Polling Watchdog mechanism. The EARTH-MANNA-S system polls the network during context switches, as is done, for instance, in TAM/CM-5. When a message arrives, a timer starts counting. If the message is not removed from the network through polling within a given amount of time (the watchdog timeout Twdog), the watchdog interrupts the CPU.² This is shown in the state diagram in figure 2, in which transitions occur every clock cycle (Twdog is measured in clock cycles).
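
The state machine just described can be sketched in a few lines of C. This is our interpretation of the text (and of the behavior illustrated by figure 3 below), not the MANNA hardware; watchdog_tick, raise_cpu_interrupt, and the chosen timeout are invented for illustration.

    #include <stdio.h>

    /* Stand-in for the hardware interrupt line (hypothetical). */
    static void raise_cpu_interrupt(void) { puts("watchdog interrupt raised"); }

    enum wd_state { WD_IDLE, WD_COUNTING };

    struct watchdog {
        enum wd_state state;
        unsigned      count;     /* cycles since the oldest unhandled arrival */
        unsigned      twdog;     /* watchdog timeout Twdog, in clock cycles   */
    };

    /* Evaluated conceptually once per clock cycle, as in the state diagram. */
    static void watchdog_tick(struct watchdog *wd, int message_pending)
    {
        if (!message_pending) {             /* message removed by a poll or handler */
            wd->state = WD_IDLE;
            wd->count = 0;
            return;
        }
        if (wd->state == WD_IDLE) {         /* a message just arrived: start counting */
            wd->state = WD_COUNTING;
            wd->count = 0;
            return;
        }
        if (++wd->count >= wd->twdog) {     /* polling was too slow: interrupt the CPU */
            raise_cpu_interrupt();
            wd->count = 0;                  /* reset, but keep counting while the      */
        }                                   /* message is still in the network         */
    }

    int main(void)
    {
        struct watchdog wd = { WD_IDLE, 0, 100 };   /* Twdog = 100 cycles (arbitrary) */

        /* A message sits unhandled in the network for 250 cycles: the watchdog
         * fires after 100 cycles, and again 100 cycles later because the
         * (simulated) interrupt code never empties the network. */
        for (int cycle = 0; cycle < 250; cycle++)
            watchdog_tick(&wd, /* message_pending = */ 1);
        return 0;
    }

Run as written, the sketch raises a second interrupt 100 cycles after the first if the message is still in the network, matching the note below that the timer is reset but keeps counting until the message is actually removed.
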
Figure 3 illustrates the Polling Watchdog in action. In this example, a message arrives at time t1 and is handled through polling at time t2. At time t3 a second message arrives, which is handled through an interrupt at time t4 because the network was not polled quickly enough. Note that the watchdog timer is reset, but continues to count until the message is actually removed from the network. If the interrupt code fails to empty the network buffers, another

² A similar mechanism can be seen in conventional operating systems, in which a watchdog timer is used to prevent a process from running too long [19].

limit for Twdog, but what about the upper limit? That limit depends on the amount of independent computations that can be performed after a message has been sent. If enough parallelism is available, a multithreaded system can hide substantial delays without noticeable performance degradation. On the other hand, if no parallelism is available it might be necessary to set Twdog to a very low value to get optimal performance. The optimal timeout value will in general depend on the program. However, our e
