`
`IEE. Proceedings
`
`COMPUTERS AND DIGITAL TECHNIQUES
`
`The Institution of Electrical Engineers, Savoy Place, London WC2R OBL, United Kingdom
`Publishing Department: Michael Faraday House, Six Hills Way, Stevenage, Herts. SG1 2AY,
`United Kingdom
`
`November 1995
`
`Volume 142
`
`Number 6
`
`UK ISSN 1350-2387
`
`ICDTEA 142 (6) 377-440
`
`HONORARY EDITORS
`Prof. E.L. Dag less (University of Bristol)
`Prof. C.R. Jesshope (University of Surrey)
`
`PAPERS
`Papers should be addressed to the Managing Editor, IEE Proceedings-Computers and Digital Techniques, Publishing Depart(cid:173)
`ment, Institution of Electrical Engineers, Michael Faraday House, Six Hills Way, Stevenage, Herts. SG1 '2.AY, United Kingdom;
`telephone 01438 313311; telex 825578 IEESTV G; facsimile 01438 742849 . Notes for the guidance of authors may be found
`on the inside back cover.
`
`Papers of outstanding merit are elig ible for IEE premiums.
`
`SUBSCRIPTIONS
`There are eleven IEE Proceedings Titles. Si x issues of each Title are published each year.
`
`Annual subscription prices for 1995 (paper or microfiche , including delivery):
`
`1 Title
`2 Titles
`3 Titles
`4 Titles
`5 Titles
`6 Titles
`7 Titles
`8 Titles
`9 Titles
`10 Titles
`11 Titles
`
`[340.00
`[520.00
`[630.00
`[756 .00
`£835.00
`[900.00
`[938.00
`[968.00
`[990.00
`[1010.00
`[1034.00
`
`Each additional Title subscription in excess of eleven Titles : £94.00.
`
`Single copy: £58.00. Airmail supplement: [20.00 per Title.
`Orders should be addressed to IEE Publications Sales Department, Michael Faraday House, PO Box 96, Stevenage, Herts. SG1
`2SD, United Kingdom.
`Te!ephone: 01438 313311 ; telex 825578 IEESTVG; facsimile: 01438 742792.
`
`ADVERTISING
`Inquiries should be sent to the IEE Advertising Department at the Stevenage address.
`
`COPYRIGHT AND COPYING
`This publication is copyright under the Berne Convention and the Universal Copyright Convention. All rights reserved. Apart
`from any copying under the UK Copyright, Designs and Patents Act 1988, Part 1, Section 38, whereby a single copy of this
`article may be supplied, under certain conditions, for the purposes of research or private study, by a library of a class prescribed
`by The Copyright (Librarians and Archivists) (Copying of Copyright Material) Regulations 1989: SI 1989/ 1212, no part of this
`publication may be reproduced, stored in a retrieval system or transmitted in any form or by any means without the prior
`permission of the copyright owners. Permission is, however, not required to copy abstracts of individual contributions on
`condition that a full reference to the source is shown.
`
`Single copies may be made for the purpose of resea rch o r private study .
`
`Authorisation to photocopy items for internal or personal use, or for the internal or personal use of specific clients, is granted by
`the Institution of Electrical Engineers for libraries and other users registered with the Copyright Clearance Center Transactional
`Reporting Service, provided that the base fee of $10.00 per copy is paid directly to the Copyright Clearance Centre Inc., 21
`Congress Street, Salem, MA 01970, USA (1350-2387 /95 $10.00). This consent does not extend to other kinds of copying,
`such as copying for general distribution, for advertising or promotional purposes, for creating new collective works, or for
`resale.
`
`Multiple copying of the content s of t his pu blicat ion w it hout per mission is always illegal.
`
`The IEE is not as a body responsible for the opinions expressed by individual authors in IEE Proceedings-Computers and
`Digital Techniques.
`The IEE is a member of the Association of Learned & Professional Society Publishers.
`© 1995: The Institution of Electrical Engineers.
`
`Secretary of the IEE: J.C. Williams, PhD, FEng, Fl EE
`Managing Editor: Gill Wheeler; Executive Editor: Julia Hubble
`
`Typeset by: Pindar pie, York, United Kingdom
`Printed by: Black Bear Press Limited, Cambridge, United Kingdom
`
`
`
`~
`IEE
`IEE PROCEEDINGS
`COMPUTERS AND
`DIGITAL TECHNIQUES
`
`November 1995
`
`Volume 142
`
`Number 6
`
`UK ISSN 1350-2387
`
`Contents
`
`ICDTEA 142 (6) 377- 440
`
`Anatomy of a simulation backplane. M. Zwolinski,
`C. Garagate, Z. Mrcarica, T.J. Kazmierski and A.D. Brown
`
`Adaptive wormhole routing in tori with faults.
`S. Chalasani and R.V. Boppana
`
`Deadlock-free wormhole routing algorithms for star
`graph topology. C.P. Ravikumar and A.M . Goel
`
`Technique to minimise area overhead for delay-driven
`clustering. C. Yeh and Y.-Y. Gu
`
`Effects of technology mapping on fault-detection coverage
`in reprogrammable FPGAs. K. Kwiat, W. Debany
`and S. Hariri
`
`Design and hardware architectures for dynamic Huffman coding.
`L.-Y. Liu, J.-F. Wang, R.-J . Wang and J.-Y. Lee
`
`Novel design of arithmetic coding for data compression.
`J. Jiang
`
`Concurrent error detection in array multipliers by BJl)O.
`T.-H. Chen, Y.-P. Lee and L.-G . Chen
`" ··
`
`Spatial bounding of complex CSG objects.
`A. Sanna and P. Montusch
`
`.. :.::::~-
`
`page
`
`377
`
`386
`
`395
`
`401.
`
`4g7
`
`411
`
`419
`
`425
`
`431
`
`
`
`Adaptive wormhole routing in tori with faults
`
`S.Chalasani
`R.V.Boppana
`
`Indexing terms:_ Adaptive routing, Deadlocks, Fault-tolerant routing, Multicomputer networks, Message routing, Performance evaluation, Torus networks,
`Wormhole routing
`
`to
`Abstract: The authors present a method
`for
`enhance wormhole
`routing
`algorithms
`deadlock-free fault-tolerant routing in tori. They
`consider arbitrarily-located faulty blocks and
`assume only local knowledge of faults. Messages
`are routed via shortest paths when there are no
`faults, and this constraint is only slightly relaxed
`to facilitate routing in the presence of faults. The
`key concept used is that, for each fault region, a
`fault ring consisting of fault free nodes and
`physical channels can be formed around it. These
`fault rings can be used to route messages around
`fault regions. We prove that, at most, four
`additional virtual channels are sufficient to make
`any fully-adaptive algorithm tolerant to multiple
`faulty blocks
`in
`torus networks. Simulation
`results
`are presented
`for
`a
`fully-adaptive
`algorithm showing that good performance can be
`obtained with as many as 10% links faulty.
`
`1
`
`Introduction
`
`Point-to-point torus and related networks are being
`used in many recent experimental and commercial mul(cid:173)
`ticomputers and multiprocessors [1-3]. A (k,n)-torus
`network has an n-dimensional grid structure with k
`nodes (processors) in each dimension such that every
`node is connected to two other nodes in each dimen(cid:173)
`sion by direct communication links. The wormhole
`(WH) switching technique by Dally and_ Seitz [8] has
`been used in many recent multicomputers [1-4]. In the
`WH technique, a packet is divided into a sequence of
`fixed-size units of data, called flits. If a communication
`channel transmits the first flit of a message, it must
`transmit all the remaining flits of the same message
`before transmitting flits of another message. To avoid
`deadlocks among messages, multiple virtual channels
`are simulated on each physical channel and a prede(cid:173)
`fined order is enforced on the allocation of virtual
`channels to messages.
`the
`in
`For fault-free networks, important issues
`design of a routing algoritlim are high throughput,
`low-latency message delivery, avoidance of deadlocks,
`
`© IEE, 1995
`IEE Proceedings online no. 19952079
`Paper first received 5th January 1995 and in revised form I 0th May 1995
`S. Chalasani is with the Electrical & Computer Engineering Department,
`University of Wisconsin-Madison, Madison, WI 53706-1691, USA
`R.V. Boppana is with the Computer Science Division, The University of
`Texas at San Antonio, San Antonio, TX 78249-0664, USA
`
`livelocks and starvation and ability to work well under
`various traffic patterns. Given a network with faults,
`our approach is to use the existing network rather than
`recreate the original network using spare nodes and
`links. Therefore, for networks with faults, a routing
`algorithm should exhibit the following additional fea(cid:173)
`tures: graceful degradation of performance and ability
`to handle faults with only a small increase in routing
`complexity and local knowledge of faults.
`The well-known e-cube or dimension-order routing
`algorithm is an example of nonadaptive routing algo(cid:173)
`rithms, since always a particular path is used in routing
`messages between a pair of nodes even when multiple
`shortest paths are available. With the e-cube, even a
`single fault disrupts communication between multiple
`pairs of nodes. With increase in adaptivity, a message
`is more likely to find a less congested path or fault-free
`path. Therefore, the issue of adaptivity, the extent of
`choice in selecting a path between a pair of nodes in
`routing a message, plays an important role in designing
`fault-tolerant routing algorithms.
`
`1. 1 Description of the problem and results
`We present a technique to enhance minimal, fully(cid:173)
`adaptive routing algorithms for fault-tolerant routing
`in tori. A minimal fully-adaptive algorithm routes mes(cid:173)
`sages along any of the shortest paths available. We
`consider routing methods that use only local knowl(cid:173)
`edge of faults. We assume that faulty processors are
`confined to one or more rectangular blocks.
`For each faiJlt region, there exist one or more paths
`that pass through fault-free nodes and links and encir(cid:173)
`cle the fault. For a fault in a 2D torus, there is an undi(cid:173)
`rected ring of fault-free nodes and links; we refer to
`this ring as fault-ring. In this paper, we show that fault
`rings can be used to route messages around the fault
`regions using only local knowledge of faults and with(cid:173)
`out introducing deadlocks and livelocks.
`
`1.2 Related results
`Adaptive, fault-tolerant routing algorithms for WH
`and virtual cut-through switching techniques has been
`the subject of extensive research in recent years [5-10].
`Reddy and Freitas [11] use global knowledge of faults,
`spare nodes, and routing tables to investigate the per(cid:173)
`formance 'limitations caused by faults. Gaughan and
`Yalamanchili [12] use a pipelined circuit-switching
`mechanism with backtracking for fault-tolerant rout(cid:173)
`ing. These two results are applicable to networks with
`arbitrarily-shaped faults. Our interest in this paper, is
`to design fault-tolerant wormhole routing algorithms
`that can be applied with local knowledge of faults. One
`important criterion is that the fault-free performance
`should not be sacrificed for fault-tolerant routing.
`
`386
`
`IEE Proc.-Comput. Digit. Tech., Vol. 142, No. 6, November 1995
`
`
`
`There are no previous results speciftcally on fault-tol(cid:173)
`erant wormhole routing in tori. Often, the results devel(cid:173)
`oped for meshes [5,8,13] can be extended to tori with
`suitable modiftcations, since meshes and tori are closely
`related. The wraparound links in tori lead to extra
`deadlock possibilities, however. Therefore, if the results
`developed for meshes are applied with few changes, the
`number of virtual channels required to avoid deadlocks
`may be doubled [5]. Furthermore, meshes have edges,
`for example, the top row in a 2D mesh, and faults on
`edges are complicated to handle [13]. But this case
`never arises in tori, since they are node symmetric.
`Hence, extending efficient mesh routing techniques to
`tori in a straight forward manner may not necessarily
`yield efficient routing algorithms for the latter net(cid:173)
`works.
`In terms of adaptivity and performance comparisons,
`the results by Dally and Aoki [8] are the most relevant
`to ours. With the dimension-reversal schemes of Dally
`and Aoki, a message may lose its adaptivity, if its
`number of dimension reversals equals the number of
`virtual channel classes. A message that has lost adap(cid:173)
`tivity is routed by the e-cube algorithm and is not guar(cid:173)
`anteed to be delivered to its destination if there are
`faults in the network. Thus the number of virtual chan(cid:173)
`nels needed and the number of faults tolerated is highly
`dependent on the number of virtual channels and the
`location of faults. In contrast, our algorithms can toler(cid:173)
`ate any number and combination of rectangular faulty
`blocks with simple logic, and require only four virtual
`channels more than that required for the original adap(cid:173)
`tive algorithm. (Throughout this paper we indicate the
`number of virtual channels on per physical channel
`basis). This result compares well with our earlier result
`that four extra virtual channels are sufficient for rout(cid:173)
`ing in meshes with faults [13].
`
`2 Preliminaries
`
`A (k,n)-torus (also called k-ary n-cube) has n dimen(cid:173)
`sions, numbered from O to (n-1), and N = k" nodes.
`Each node is uniquely indexed by an 11-tuple in radix k.
`Each node is connected via communication links to
`two other nodes in each dimension. The neighbours of
`the node x = (x11.1, •.• , x0) in dimension i are (x,,.i, ... ,
`xi+I, xi± I, xi-I··· , x0), where addition and subtraction
`are modulo k. A link is said to be a wraparound link if
`it connects two neighbours whose addresses differ by
`k- 1 in dimension i, 0 :s; i < n. A (k, n)-mesh is a (k, 11)(cid:173)
`torus with the wraparound connections missing. We
`assume that each communication link represents two
`unidirectional physical communication channels. The
`link .between nodes x and y is denoted by <x, y>. To
`simplify presentation, we discuss the concept of fault(cid:173)
`rings for two dimensional (2D) tori. We label the sides
`of a 2D torus as North, South, East and West.
`
`2. 1 Fault model
`We consider both node and link faults . All the links
`incident on a faulty node are considered faulty. We
`assume that faults are permanent and nonmalicious
`faults. Therefore, messages are generated by and for
`nonfaulty processors only. We develop fault-tolerant
`algorithms, for which it is sufficient if each nonfaulty
`processor knows the status of the links incident on it.
`A fault set is delined as the set F of faulty nodes and
`links. For example, the fault-set F = {(3,3), (3,4), (4,3),
`( 4,4), <(0,0),(0, I)>, <(1,2),(2,2)>} represents four node
`
`faults and two link faults in the two-dimensional net(cid:173)
`work shown in Fig. I. We assume that faults in a 2D
`torus have rectangular shapes. A set F of faulty nodes
`and links in a 2D torus is said to have a rectangular
`shape if there is a rectangle in the torus such that: (a)
`there are no faulty components on the boundary of the
`rectangle, (b) the interior of the rectangJet includes all
`faulty components in F and (c) the interior of the rec(cid:173)
`tangle contains no component that is not present in F.
`
`(0,0)
`
`.... - .................................... ···r·······--······· ...................... (}··············· ···-;-·
`!
`i
`•
`•
`I
`'
`I
`!
`!
`........................ ..................... ··-j········· .. ······ ·-·: ................. ····t···-··········· .... ; ..
`
`(0,1)
`
`(1,0)
`
`(1 ,1)
`
`(1,3)
`!
`l
`!
`!
`,
`I
`........ ················ .... ······················l ·······-······· ····'···-············ ···+·······-···-· .... : ..
`
`(1,2)
`
`(2,1)
`
`(2,,Z)
`
`(2t3)
`
`,,. •••• •• •••••••• • •••••• • • •· •••••••••••••••• ••••L •••••• • •••• •• •• • •• ••• L••••••• • •••••••••••••J.• ••• •• ••••••u•• ,, . l,,
`
`I
`
`I
`
`(2,,5)
`.
`I
`I
`!
`················· .... ················ ····f······················l··········· .. ·········i'················ ···!-·
`• I
`• i
`'-'"'>-4---0-+--__,..., i
`•
`i
`(3,2)
`(3,3)i
`(3,4)i
`I
`!
`!
`• \
`-.........>--t---o-+----o j
`j
`•
`j
`(4,4)1
`(4,2)
`(4,3)j
`I
`:
`J
`i
`i
`i
`:
`I
`:
`....... ..................... ········-······ ····r··-···········-·····-:---············ao••••·i-················ •••:••
`I
`I
`!
`i
`'
`
`(5,0)
`
`(5,1)
`
`(5,2)
`
`Fig. 1 Three fault regions and their associated f ault rings in a 6 x 6 torus
`
`For example, the set { (3,3),(3,4),( 4,3),( 4,4)} of faulty
`nodes shown in Fig. 1 is rectangular, since the interior
`of the rectangle - with corners (2,2),(2,5),(5,2), and
`(5,5) - includes all faulty components in F and no non(cid:173)
`faulty component (recall that a processor fault implies
`that all links incident on it are faulty). However, the set
`of faulty links {<(I, I ),(1,2)>, <(1,2),(2,2)>, <(2,2),
`(2,1)>, <(2,1),(1,1)>} in a 6 x 6 torus is not rectangu(cid:173)
`lar, since any rectangle with nonfaulty elements on its
`boundary contain~ at least one-element not in F. The
`faulty link <(1,2), (2,2)> is an example of a rectangular
`fault region, since the interior of the rectangle with cor(cid:173)
`ners (1 ,1), (1,3),(2,1), and (2,3) contains only the faulty
`link. The faulty link <(0,0),(0, 1 )> in Fig. 1 is consid(cid:173)
`ered rectangular; the rectangle that covers the faulty
`link has processors ( 1,0),(1 , 1 ),( 5, l) and ( 5,0) as its cor(cid:173)
`ners.
`An ;:region is the fault region of the torus given by a
`block-fault. Under the block-fault model, the fault-set
`in a 2D torus can be written as a union of disjoint
`smaller fault sets, each of which denotes an /-region.
`For example, the fault set F in Fig. I is in fact the
`union of three disjoint /-regions { (3,3),(3 ,4),( 4,3),( 4,4)},
`{<(0,0),(0,1)>} and {<(1 ,2),(2,2)>}. We assume that
`faults do not disconnect the network.
`There are many reasons to consider block faults.
`First, they model several common fault scenarios such
`as faults of isolated nodes and links and consecutive
`nodes in a row or column. Second, an arbitrarily(cid:173)
`shaped fault can be modelled as a block-fault, albeit by
`labelling some nonfaulty processors and/or links as
`
`t Interior of a rectangle is defined as the set of processors and links that
`are not on the boundary of the rectangle.
`
`IEE Proc.-Comput. Digit. Tech. . Vol. 142, No. 6, No vember 1995
`
`387
`
`
`
`faulty [5]. Finally, the block fault model accurately
`models faults at the chip, multichip module, and board
`levels.
`
`2.2 Fault rings
`Conceptually, fault regions may be considered as
`islands of faults in a sea of communication channels
`and nodes. In the same manner as a ship is navigated
`around an island, it should be feasible to route a mes(cid:173)
`sage around fault regions. For this purpose, we use the
`concept of fault rings, denoted /-rings.
`For each f-region in a network with faults, it is feasi(cid:173)
`ble to connect the fault-free components around the
`region to form a ring or chain. This is the fault ring for
`that region and consists of the fault-free nodes and
`channels that are adjacent (row-wise, column-wise or
`diagonally) to one or more components of the fault
`region. The /-ring of a block-fault has rectangular
`shape. For example, the f-ring of the node fault region
`{(3,3),(3,4),(4,3),(4,4)} in Fig. 1 passes through the
`fault-free nodes
`(2,2),(2,3),(2,4),(2,5),(3,5),( 4,5),( 5,5),
`(5,4),(5,3),(5,2),(4,2),(3,2) as shown in Fig. 1. The/-ring
`associated with the link fault region { <(1,2),(2,2)>} has
`nodes {(iJ) I 1 :;; i :;; 2, 1 :;; j :;; 3} in its perimeter. The/(cid:173)
`ring for the faulty link <(0,0),(0,l)> has nodes (1,0),
`(0,0),(5,0),(5,1),(0,l) and (1,1) on its perimeter.
`A fault-free node is in the /-ring only if it is at most
`two hops away from a faulty node or is adjacent to a
`node with a faulty-link incident on it. There can be sev(cid:173)
`eral fault rings, one for each /-region, in a faulty net(cid:173)
`work with multiple faults. Up to two f-rings in a 2D
`torus may have a common link, and up to four f-rings
`may have a common node. For example, nodes
`(2,2),(2,3) and the link between them are common to
`two !-rings in Fig. 1. A set of fault rings are said to
`overlap if they share one or more links.
`An /-ring represents a two-lane path to a message
`that needs to go through the /-region contained by the
`f-ring. Thus, an f-ring simulates four paths to route
`messages in two dimensions. Depending on the size of
`thef-region, physical channels in anj:ring may need to
`handle a large amount of traffic compared to the other
`fault-free physical channels. Further, routing messages
`around one or more fault-rings creates additional pos(cid:173)
`sibilities for deadlocks. Hence, wormhole routing algo(cid:173)
`rithms must be designed
`to handle additional
`congestion and deadlocks caused by faults.
`
`When a fault occurs, the f-ring around it can be
`formed in a distributed manner using a two-step proc(cid:173)
`ess. In the first step, each processor that detected a
`faulty link sends this message to its neighbours in other
`dimensions. Based on the set of messages received, each
`node that is to be on the f-ring determines its neigh(cid:173)
`bours in the j:ring. For more details, the reader is
`referred to [13].
`
`3
`
`Fault-tolerant routing algorithms for 2D tori
`
`In this Section, we present techniques using which any
`fully-adaptive routing algorithm can tolerate multiple
`rectangular fault regions in a 2D torus. Our approach
`is to use a known adaptive algorithm as much as possi(cid:173)
`ble to route messages. When a message is blocked+ by a
`fault and the adaptive algorithm cannot handle it, the
`additional routing logic described here will be used to
`route the message around the fault.
`The fully-adaptive algorithms we consider have the
`following property: if a message, upon arriving at an
`intermediate node, cannot find an idle channel for its
`next hop, then it can choose any one of the channels
`allowed by the routing logic and wait for that channel
`indefinitely without causing deadlocks. The adaptive
`algorithms based on Duato's theory [6] do not satisfy
`this, since they require the message in the above exam(cid:173)
`ple to rechoose a channel, to avoid deadlocks, after
`waiting for a finite amount of time.
`The following lemma forms the basis for the result
`presented in this Section. For the most part of the Sec(cid:173)
`tion, we assume that the faults in a torus are such that
`the resulting f rings are nonoverlapping. However, at
`the end of the Section, we indicate how fault-tolerant
`routing can be achieved with overlapping/-rings.
`Lemma I: Consider a 2D torus with ~ultiple rectangu(cid:173)
`lar fault blocks. Suppose that a message with destina(cid:173)
`tion dis being routed in the torus using a fully-adaptive
`routing algorithm. If the message is blocked at a node,
`say x, then the addresses of x and d differ in exactly
`one dimension.
`Proof We prove this by contradiction. Assume that the
`message is blocked at node x and that x and d differ in
`both dimensions. It can be easily verified that under the
`
`+ A message is said to be blocked by faults at node x if there is no fault(cid:173)
`free link <x, y> such that the hop from x to y is along the shortest path
`from x to d.
`
`Table 1: Virtual channels and F-ring orientations used by affected
`messages
`
`conditions satisfied
`
`virtual
`channel
`
`F-ring orientation
`
`clockwise
`
`counter-clockwise
`
`clockwise
`
`counter-clockwise
`
`clockwise
`
`counter-clockwise
`
`clockwise
`
`message
`type
`o+M
`o+w
`o-M
`o-w
`1+M
`
`1+w
`
`1-M
`
`Co
`
`Co
`c,
`c,
`
`C2
`
`C2
`
`C3
`
`d1 = x 1 and d0 > x0 and (d0 - x0 ):;; Lk/2J
`d1 = x 1 and d0 > x0 and (d0 - x0 ) > Lk/2J
`d1 = x 1 and x0 > d0 and (x0 - d0 ) :;; Lk/2J
`d1 = x 1 and x0 > d0 and (x0 - d0 ) > Lk/2J
`(d1 - x 1) dk/2J and d 1 > x 1 and d0 = x0
`(d1 - x 1) > Lk/2J and d 1 > x 1 and d0 = x0
`(x1 - d1):;; Lk/2J and d 1 < x 1 and d0 = x0
`(x1 - d1) > Lk/2J and d 1 < x 1 and d0 = x0
`counter-clockwise
`rw
`C3
`x = (x1, x0 ) is the node at which the message is affected and d = (d1, d0 ) is its destination
`
`388
`
`IEE Proc.-Comput. Digit. Tech .. Vol. 142, No. 6, November 1995
`
`
`
`/* Uses a generic fully-adaptive algorithm ;J*/
`Procedure Fully-Adaptive-2D(M}
`/* Uses four additional virtual channels c0, c1, c2, c3 * /
`Rule 1: If Mis unaffected, reserve vitual chanels and links according to r.
`Rule 2: If Mis a o•-message, route it using virtual channel c0 for the rest of its journey. Virtual
`channel c1 is used exclusively for 0--messages, c2 for 1 •-messages and c3 for 1 ·-messages.
`Let d be the dimension in which the message was blocked when it was affected. Let d' be the
`other dimension. Let a legal hop be defined as a hop in dimension d that takes the message
`closer to its destination.
`Case 1: If the current host and destination match in dimension d', and the legal hop is
`available, it is taken (see Fig. 3).
`Case 2: If Mis a message on an f-ring, it is routed using the virtual channel and orientation
`shown in Table 1 until it reaches the other parallel side of the f-ring such that current host and
`destination match ind' (see Fig. 3).
`
`Fig. 2 Fault-tolerant routing logic for 2D tori
`
`block-fault model, a nonfaulty node can have faulty
`links incident in at most one dimension. If x has no
`faulty links incident on it, then a hop in either dimen(cid:173)
`sion will take the message closer to its destination. If x
`has one or both links in a dimension faulty, then one
`of the links in the other dimension takes the message
`closer to its destination. In all cases, the message can
`move closer to its destination and cannot be blocked.
`This contradicts the assumption that the message is
`blocked at x.
`To distinguish blocked messages from others, we use
`the concept of message status, which could be unaf(cid:173)
`fected or affected. A message is injected into the net(cid:173)
`work with the unaffected status. When an unaffected
`message is blocked by a fault, its status is changed to
`affected, and it retains this status for the remainder of
`its journey. In our method, when a message becomes
`affected, it starts using a special class of virtual chan(cid:173)
`nels. The class of virtual channels used by an affected
`message is based on the dimension and direction it
`needs to travel to reach its destination.
`Consider a message with destination d = (d1, d0). Let
`it become affected at node x = (x 1, x 0). From lemma 1,
`it is clear that the message needs to travel in only one
`dimension; that is, either x 1 = d1 or x0 = d0• In each
`dimension, there are two possible directions. Thus,
`there can be four different types of affected messages.
`o+ ,o-, 1 +, and 1- . The message is termed a o+ -message if
`d1 = x 1 and d0 > x0• Furthermore, it is a o+ M message
`if it will not use a wraparound link in dimension O; oth(cid:173)
`erwise it is a o+ W-message. There are eight types of
`affected messages for a 2D torus and they are given in
`Table 1. When a message becomes affected, its type is
`determined and assigned. This type is used to deter(cid:173)
`mine the virtual channel class to be used for the
`remainder of the message's journey, and the orientation
`(direction of travel) to be used when routed on an/(cid:173)
`ring. Table 1 gives this information for each of the
`eight possible message types.
`
`3. 1 Routing affected messages
`The fault-tolerant version of a generic fully-adaptive
`algorithm F, denoted F1, uses four virtual channels, -
`in addition to those used by F. Chan(cid:173)
`c0,c 1,c2 and c3 -
`nel c0 is used exclusively by o+ affected messages (both
`o+ Mand o+ W); similarly, virtual channel classes c1,c2,c3
`are used exclusively by o-, 1 +, 1- messages, respectively.
`Rules to route various messages are specified in proce(cid:173)
`dure 'Fully-Adaptive-2D' (Fig. 2).
`
`Example: In Fig. 4, three faulty nodes-(1,0),(4,1) and
`(5,4)-are present. There are three f-rings corresponding
`to these three faulty nodes. Several messages in this
`Figure and their routes are indicated in Table 2. For
`example, the message from (5,2) to (3,1) is first routed
`from (5,2) to (5,1). It becomes affected by fault (4,1) at
`(5,1). Since it needs to travel from (5,1) to (3,1), it is
`labelled as a 1-M message. From the rules for 1-M
`messages (see Table 1), it is routed in the clockwise ori(cid:173)
`entation using virtual channel c3.
`
`dimens;oo d +
`
`d'
`
`Fig. 3 Routing of the message from ( Xi, Xo) to ( Y1, Yo) is done using
`'Fully-Adaptive-2D '; Case 1 is used
`Case 2 of Rule 2 of procedure
`thereafter
`
`---------- +~
`r-
`-~---
`~~~~=:iiQ-:t--:----<H; __ ~--i---:---<;>:-::--:-:----<~~J
`C2
`(0,4)
`C0;5)
`1
`I i
`I i
`l l
`(1,5)
`
`Fig.4 There are three f-rings corresponding to the three faulty nodes
`(1,0). (4.1) and (5,4). Various messages in this network and how they are
`routed is indicated in Table 2
`
`IEE Proc.-Comput. Digit. Tech., Vol. 142, No. 6, November 1995
`
`389
`
`
`
`Table 2: A few messages in the faulty network of Fig. 4
`
`source
`
`destination affected at
`
`affected by message
`faulty node
`type
`
`virtual
`channel
`
`orientation
`
`path indicated by
`
`(0,0)
`
`(1, 1)
`
`(0,4)
`
`(5,3)
`
`(5,2)
`
`(4,2)
`
`(2,0)
`
`(1,5)
`
`(4,4)
`
`(5,5)
`
`(3, 1)
`
`(4,0)
`
`(0,0)
`
`( 1, 1)
`
`(0,4)
`
`(5,3)
`
`(5, 1)
`
`(4,2)
`
`(1,0)
`
`(1,0)
`
`(5,4)
`
`(5,4)
`
`(4,1)
`
`(4,1)
`
`1•M
`o•w
`1•w
`o+M
`
`rM
`o-M
`
`G2
`
`Co
`
`C2
`
`Co
`
`C3
`
`c,
`
`clockwise
`
`solid thick lines
`
`co u nte r-c I ockwise
`
`dashed thick lines
`
`counter-clockwise
`
`solid thick lines
`
`clockwise
`
`clockwise
`
`clockwise
`
`dashed thick lines
`
`dashed thick lines
`
`solid thick lines
`
`Theorem 1: Assume that the fully-adaptive routing
`algorithm F correctly routes messages and is deadlock
`free. The fault-tolerant fully-adaptive routing algorithm
`F1 described by Rules 1 and 2 (in Fig. 2) is deadlock(cid:173)
`free in the presence of multiple rectangular fault blocks
`and delivers messages correctly between any pair of
`nonfaulty nodes.
`Proof Algorithm F1 correctly routes unaffected mes(cid:173)
`sages between any pair of nonfaulty nodes, since F cor(cid:173)
`rectly routes messages. Since the virtual channels used
`for affected messages are different from those used for
`unaffected messages, the statement is true for all unaf(cid:173)
`fected messages. To complete the proof, we need to
`show that affected messages are also routed correctly
`without deadlocks and livelocks.
`To see that the procedure Fully-Adaptive-2D cor(cid:173)
`rectly delivers messages, observe that (i) an affected
`message is misrouted only around an !-ring, (ii) a mes(cid:173)
`sage, once it leaves anj-ring will never revisit it, (iii) an
`affected message takes only a finite number of hops on
`each f ring and (iv) there are a finite number off rings
`in the torus. These four observations show that a mes(cid:173)
`sage is delivered to its destination in a finite number of
`hops and that there are no livelocks in the system.
`We next prove the deadlock-freedom of the proce(cid:173)
`dure Fully-Adaptive-2D. Unaffected messages cannot
`be involved in a deadlock, since Fis deadlock-free and
`since they do not require or wait for the virtual chan(cid:173)
`nels c0 , c1, c2 and c3 . Among affected messages, o+-mes(cid:173)
`sages use only virtiial channel c0; similarly, a distinct
`virtual channel is used for messages of each type. Thus,
`it is enough if we show that there are no deadlocks
`among o+ -messages.
`There are two types of o+ messages: o+ M and o+ W.
`The part of the network used by o+ M messages consists
`of row channels in the East direction and column chan(cid:173)
`nels in the North direction in the West columns and
`South channels in the East columns of f-rings. Its
`underlying graph is acyclic. Similarly, o+ W uses an acy(cid:173)
`clic network of c0 channels. Therefore, a deadlock
`among o+ -messages involves both o+ M and o+ W mes(cid:173)
`sages. But o+ M and o+ W messages use disjoint sets of
`physical channels. This is obvious for the physical
`channels that are not part of the f-ring, since o+ Mmes(cid:173)
`sages travel from West to East, and o+ W messages
`from East to West when not misrouted. From Table 1,
`it is clear that o+ M and o+ W messages reserve virtual
`channels on physical channels in clockwise and coun(cid:173)
`ter-clockwise directions, respectively, on an f-ring.
`Therefore, there is no dependency among the o+ mes(cid:173)
`sages. Similarly, we can prove the deadlock freedom
`for other types of messages.
`
`3.2 Fault-tolerant routing with overlapping
`f-rings
`Thus far, we have assumed that faults are such that the
`!-rings do not overlap. However, this is not a serious
`restriction. If !-rings overlap, then deadlock-free fault(cid:173)
`tolerant routing may be provided using either one of
`the following methods.
`(a) When two !-rings overlap, deadlocks may occur
`when, for example, o+ M and o+ W messages use the
`same physical channels in the column shared by the
`overlapping !-rings. This can be avoided by using two
`virtual channels (instead of one) for o+ messages. Simi(cid:173)
`larly, two virtual channels for each of o-, 1 +, 1- message
`types are needed. Therefore, this technique requires
`eight extra virtual channel