`
`INTE 111111111111111111~11ij'~ij~ 11\l~iiii lllllillf i[II iii! 1111111111 11111111 TIDN NElWDRKS
`
`~) .. :r parallel and distributed processing
`Chuan-lin Wu and Tse-yun Feng
`
`"(::'.ifl:1l
`
`.'_°;, rf\~ -
`
`IEEE CATALOG NUMBER EHO217-0
`LIBRARY OF CONGRESS NUMBER 84-81319
`IEEE COMPUTER SOCIETY ORDER NUMBER 574
`ISBN 0-8188-0574-X
`
`1ffil-.98'i THE INSTITUTE OF ELECTRICAL AND ELE~TRONICS ENGINEERS INC
`
`· -- - - -
`
`1
`
`•
`
`•
`
`IEEE COMPUTER SOCIETY
`
`I
`
`COMPUTER
`SOCIETY~
`PRESS
`~
`4'
`
`INTEL - 1008
`
`
`
`QA
`70,9
`. b5
`wg
`e.., I
`
`TU10RIAL:
`
`INTERCONNECTION NETWORKS
`for parallel and distributed processing
`Chuan-lin Wu and Tse-yun Feng
`
`IEEE CATALOG NUMBER EHO217-0
`LIBRARY OF CONGRESS NUMBER 84-81319
`IEEE COMPUTER SOCIETY ORDER NUMBER 574
`ISBN 0-8186-0574-X
`
`fffil;.~BY THE INSTITUTE OF ELECTRICAL AND ELECTRONICS ENGINEERS, INC.
`
`COMPUTER
`SOCIETY~
`PRESS ~®
`
`ffi, IEEE COMPUTER SOCIETY
`
`INFORMATION CENTER
`LOGICON S&ISO SAN PEDRO
`
`INTEL - 1008
`
`
`
`Published by IEEE Computer Society Press
`1109 Spring Street
`Suite 300
`Silver Spring, MD 20910
`
`Copyright and Reprint Permissions: Abstracting is permitted with credit to the source.
`Libraries are permitted to photocopy beyond the limits of U.S. copyright law for
`private use of patrons those articles in this volume that carry a code at the bottom
`of the first page, provided the per-copy fee indicated in the code is paid through the
`Copyright Clearance Center, 29 Congress Street, Salem, MA 01970. Instructors are
`permitted to photocopy isolated articles for noncommercial classroom use without
`fee. For other copying, reprint or republication permission, write to Director, Publish(cid:173)
`ing Services, IEEE, 345 E. 47 St., New York, NY 10017. All rights reserved. Copy(cid:173)
`right © 1984 by The Institute of Electrical and Electronics Engineers, Inc.
`
`IEEE Catalog Number EHO217-0
`Library of Congress Number 84-81319
`IEEE Computer Society Order Number 574
`ISBN 0-8186-0574-X (Paper)
`ISBN 0-8186-4574-1 (Microfiche)
`
`Order from: IEEE Computer Society
`Post Office Box 80452
`Worldway Postal Center
`Los Angeles, CA 90080
`
`IEEE Service Center
`445 Hoes Lane
`Piscataway, NJ 08854
`
`•-W---
`1ffif-.Bl.f
`
`The Institute of Electrical and Electronics Engineers, Inc.
`
`ii
`
`INTEL - 1008
`
`
`
`2 1 SEP 1987
`
`Preface
`
`This tutorial presents fundamentals in interconnection networks, a crucial topic in the field of parallel/distributed
`processing. The interconnectio~ network consists of software and hardware entities that are designed to facilitate efficient
`interprocess and interprocessor communication in a parallel processing system. Although conventional computer systems
`using the von Neumann model do not have many system units to be connected and the processes are sequentially executed,
`an interconnection network such as the time-shared bus is still a critical system component. Today, system units such as
`processors and memory modules have increased rapidly in new computer systems, and processes are executed in parallel
`to meet real-time requirements. The design of interconnection networks is emerging as the most basic issue in exploiting
`parallelism. The interconnection network not only has a profound impact on algorithm design, but also greatly affects
`system level control.
`In spite of the importance of the interconnection network in future computer system design, the issues of designing
`interconnection networks have never been fairly and completely explored in a single publication. Even worse, researchers
`often ignore its impact and stubbornly think that the interconn~ction network is nothing but a communication idiot. It is a
`fallacy for people to think that there is enough work in interconnection networks . In view of this, the authors believe that a
`tutorial is urgently needed to promote more fruitful research efforts in the field of parallel/distributed processing.
`The text of this tutorial is designed for system designers, programmers, educators, and those who are involved in the
`research , development, and application of various special-purpose and general-purpose computer systems . It is tutorial in
`nature and provides a state-of-the-art survey. It is expected that the tutorial will serve as a guide for beginners and as a
`major reference for all computer professionals. It is hoped that, after going through the text, readers will be able to design
`interconnection networks that fit their computer architecture needs, design better algorithms by taking advantage of what
`the interconnection network can offer, write better programs by knowing the limitations of the interconnection network,
`and trigger a revolution on the system control concept.
`The text is organized into 12 chapters. A tutorial guide, provided at the beginning of each chapter, explains the idea and
`concept and also annotates an up-to-date reference list. Each chapter includes some reprint articles which explain
`underlying theory and developments.
`Chapter 1 provides a comprehensive overview of interconnection networks, with a discussion of design trends and
`design issues.
`Chapter 2 surveys underlying network topologies used in parallel/distributed system architecture. The topologies are
`divided into two categories: static and dynamic .
`Chapter 3 focuses on control strategies for routing data through interconnection networks . Various routing algorithms
`are covered.
`Chapter 4 discusses how to realize data permutations in an array processing mode, with an exploration of permutation
`capability of various kinds of interconnection networks.
`Chapter 5 explores the performance evaluation of interconnection networks . Evaluation parameters considered include
`bandwidth, message delay, diameter, and effectiveness in simulating other networks .
`Chapter 6 is concerned with fault diagnosis for detecting and locating faults in interconnection networks. Fault models
`and test procedures are considered.
`Chapter 7 investigates fault-tolerance characteristics of interconnection networks, as well as ways to achieve fault
`tolerance.
`Chapter 8 discusses issues in using very-large-scale-integrated (VLSI) circuits to design interconnection networks.
`Area and delay models of layouts are also formulated .
`Chapter 9 covers reconfiguration techniques that are concerned with allocating network resources to achieve variable
`topology capability. Different approaches are discussed .
`Chapter 10, on mapping, is concerned with how to partition and assign program modules to resources such as processors
`and memories, subject to minimizing the total number of routing steps. The mapping is done under the assumption that
`either (or both) the algorithm or interconnection network is fixed.
`Chapter 11 explores network synthesis, which derives the optimum network topology for a transformed algorithm. Both
`algorithms and networks are treated as variables in the procedure.
`Finally, Chapter 12 provides some examples on the construction of multistage networks. It should be noted that every
`system needs an interconnection network. The examples provided are used to point out the trends.
`
`111
`
`INTEL - 1008
`
`
`
`We would like to take this opportunity to thank all the authors who have contributed to this tutorial text. We apologize to
`those authors whose valuable papers could not be included due to page limitations. We tried to be as complete as possible
`in compiling the reference lists, which were based on contribution and accessibility of the work. We deeply regret any
`omissions and hope that they will be made known to us so that they can be included in later updates. Finally, we would like
`to thank the reviewers for their helpful criticisms and suggestions, and Margaret Brown and her staff of the IEEE Computer
`Society Press for editing the text.
`
`Chuan-lin Wu, University of Texas at Austin
`Tse-yun Feng, Pennsylvania State University
`
`IV
`
`INTEL - 1008
`
`
`
`Published by IEEE Computer Society Press
`1109 Spring Street
`Suite 300
`Silver Spring, MD 20910
`
`Copyright and Reprint Permissions: Abstracting is permitted with credit to the source.
`Libraries are permitted to photocopy beyond the limits of U.S. copyright law for
`private use of patrons those articles in this volume that carry a code at the bottom
`of the first page, provided the per-copy fee indicated in the code is paid through the
`Copyright Clearance Center, 29 Congress Street, Salem, MA 01970. Instructors are
`permitted to photocopy isolated articles for noncommercial classroom use without
`fee. For other copying, reprint or republication permission, write to Director, Publish(cid:173)
`ing Services, IEEE, 345 E. 47 St., New York, NY 10017. All rights reserved. Copy(cid:173)
`right © 1984 by The Institute of Electrical and Electronics Engineers, Inc.
`
`IEEE Catalog Number EHO217-0
`Library of Congress Number 84-81319
`IEEE Computer Society Order Number 574
`ISBN 0-8186-057 4-X (Paper)
`ISBN 0-8186-457 4-i (Microfiche)
`
`Order from: IEEE Computer Society
`Post Office Box 80452
`Worldway Postal Center
`Los Angeles, CA 90080
`
`IEEE Service Center
`445 Hoes Lane
`Piscataway, NJ 088~4
`
`1$~8'-I
`·--,...,__ The Institute of Electrical and Electronics Engineers, Inc.
`
`ll
`
`INTEL - 1008
`
`
`
`Operation mode. Two types of communication can
`be identified: synchronous and asynchronous. Synchro(cid:173)
`nous communication is needed for processing in which
`communication paths are established synchronously for
`either a data manipulating function 11 or a data/instruc(cid:173)
`tion broadcast. Asynchronous communication is needed
`for multiprocessing in which connection requests are
`issued dynamically. A system may also be designed to
`facilitate both synchronous and asynchronous process(cid:173)
`ing. Therefore, typical operation modes of interconnec(cid:173)
`tion networks can be classified into three categories: syn(cid:173)
`chronous, asnychronous, and combined.
`
`two processors are passive and dedicated buses cannot be
`reconfigured for direct connections to other processors.
`On the other hand, links in the dynamic category can be
`reconfigured by setting the network's active switching
`elements.
`The cross product of the set of categories in each de(cid:173)
`sign decision-[operation mode) x [control strategy] x
`[switching methodology] x [network topology ]-repre(cid:173)
`sents a space of interconnection networks. Obviously, the
`cross product contains some uninteresting cases, but a
`network designer can obtain a meaningful subspace by
`exercising a practical view of engineering technology.
`
`Control strategy. A typical interconnection network
`consists of a number of switching elements and intercon(cid:173)
`necting links. Interconnection functions are realized by
`properly setting control of the switching elements. The
`control-setting function can be managed by a centralized
`controller or by the individual switching element. The lat(cid:173)
`ter strategy is called distributed control; the first strategy
`is called centralized control..
`
`Switching methodology. The two major switching
`methodologies are circuit switching and packet switching.
`In circuit switching, a physical path is actually established
`between a source and a destination. In packet switching,
`data is put in a packet and routed through the intercon(cid:173)
`nection network without establishing a physical connec(cid:173)
`tion path. In general, circuit switching is much more suit(cid:173)
`able for bulk data transmission, and packet switching is
`more efficient for short data messages. Another option,
`integrated switching, includes capabilities of both circuit
`switching and packet switching. Therefore, three switch(cid:173)
`ing methodologies can be identified: circuit switching,
`packet switching, and integrated switching.
`
`Network topology. A network can be depicted by a
`graph in which nodes represent switching points and
`edges represent communication links. The topologies
`tend to be regular and can be grouped into two categories:
`static and dynamic. In a static topology, links between
`
`Topologies
`
`Network topology is a key factor in determining a
`suitable architectural structure, and many topologies
`have been. considered for telephone switching connec(cid:173)
`tions. 12 Here, we review those proposed or used for con(cid:173)
`nections in tightly coupled multiple-processor systems
`(see Figure 3).
`
`Figure 2. Hardware model of com::urrent processing
`systems.
`
`Figure 3. Topologies of interconnection networks.
`
`December 1981
`
`6
`
`INTEL - 1008
`
`
`
`Figure 4. Examples of static network toplogies: (a) one dimensional; (b-f) two dimensional; and (g•D three dimensional.
`
`7
`
`COMPUTER
`
`INTEL - 1008
`
`
`
`iEEE TRANSACTIONS ON COMPUTERS. VOL. C-30. NO. 2. FEBRUARY 1981
`
`I/0
`~
`
`I
`
`UNIT
`
`'.1E.'10RY
`
`' '
`·- CONTROL - PROGRAM
`,n ELEMENT
`·- ELEMENT
`• • •
`
`I IPROCESSING
`
`0
`
`,..__
`
`DATA
`MEMORY
`
`PROCESSING
`
`l ~
`DATA
`}!EMORY
`
`-
`'.___ ELEMENT ----
`
`PROCESSING
`
`N-1
`
`DATA
`MEMORY
`
`I
`I
`
`INTER-
`CONNECTION
`NETWORK
`
`-
`
`plexity of Thompson's algorithm is determined by the com(cid:173)
`plexity of the GCN set-up algorithm. The best known parallel
`GCN set-up algorithms (Nassimi and Sahni [5]) have a
`complexity of 0(n) on an n X n MCC and 0(log4 N) on both
`CCC's and PSC's with N PE's.
`In this paper we present an algorithm for the RAR problem,
`which runs in 0(q2n) time on a q-dimensional n X n X · · · X
`n MCC and in 0(log2 N) time on an N PE PSC or CCC. Thus,
`the algorithm of this paper is asymptotically faster than the
`Thompson-Nassimi-Sahni algorithm for CCC's and PSC's.
`For MCC's we expect our algorithm to be significantly faster
`than the Thompson-Nassimi-Sahni algorithm, as the algo(cid:173)
`rithm of this paper is significantly simpler and has much less
`overhead.
`The RAW problem can be solved using the subalgorithms
`developed for the RAR problem. The complexity of the RAW
`algorithm is the same as that of the RAR algorithm when each
`PE is to receive at most one piece of data. A more general form
`of a RAW occurs when, at most, d data items are written into
`any one PE. In this case the complexity of the RAW becomes
`O(q 2n + dqn) on a q-dimensional n X n X · · · X n MCC and
`0(Iog2 N + d log N) on an N PE CCC or PSC.
`Section II describes how RAR's and RA W's are to be per(cid:173)
`formed. In Section III we present the necessary subalgo(cid:173)
`rithms.
`
`II. ALGORITHMS FOR RAR's AND RA W's
`
`I/
`0
`I
`I
`I
`I
`I
`I
`I
`I
`I
`I
`I
`I
`I
`I
`I
`I
`I
`I
`I
`I
`
`' I
`
`I
`I
`I
`I
`I
`I
`I
`I
`I
`I
`I
`I
`
`Fig. l. Block diagram of an SIMD computer.
`
`RAR's and RA W's are performed using certain well-de(cid:173)
`fined steps. These are described below.
`l) SORT: In a sort records are rearranged so as to be in
`nondecreasing order of a specified key. Let G(i) denote the
`record in PE(i), 0 ~ i < N. Let H(i) be the key field of record
`G(i). H(i) is also in PE(i). Following a sort, the records will
`have been rearranged such that H(i) :s; H(i + 1), 0 .:Si< N
`!oxes represent ?Es
`-- l.
`(b)
`(c)
`(a)
`2) RANK: The rank of a selected record is the number of
`Fig. 2. The SIMD models. (a) 4 X 4 MCC. (b) 8 PE CCC. (c) 8 PE
`selected records in PE's with a smaller index. For example,
`PSC.
`of records. The initial configuration is record G(i) in ·PE(i),
`assume we have eight PE's each containing one record. Let the
`0 5 i ~ j < N. Each record has a field H (high). The H values
`key values for these eight records be (6, 4, 2, 2*, 6, 6*, 3*, 4*),
`are such that O .:S H(O) < H(l) <···<HU) ~ N - 1, and
`where an asterisk over a key value denotes a flag or selected
`H(i) = cc for j < i < N. Generalize copies record G(i) into
`record. The ranks of the flagged records are ( - , - , - , 0, - , 1,
`l) + l through H(i), 0 5 i ~ j ( we assume, for
`PE's H(i -
`2, 3).
`convenience, H(- 1) = 0). Let G(0:7) = (A, B, C, -, -, -,
`3) CONCENTRATE: Let G(i,) 0 ::£ r ::£ j < N be a set of
`-, -,) and let H(0:7) = (1, 5, 6, cc, cc, cc, cc, cc). Following a
`records with GUr) initially in PE(i, ). Assume that the records
`have been ranked so that HUr) = r. A concentrate results in
`generalize, G(0:7) = (A, A, B, B, B, B, C, - ).
`record GUr) being moved to PE(r), 0 ~ r ~}.Assume that
`Our RAR algorithm is best described by considering an
`G(0:7) =(A,-.-, B, -, C, -, D) and i0 = 0, i1 = 3, i2 = 5,
`example (Fig. 3). In this example we have N = 8 PE's and
`S(0:7) = (2, 6, 2, oo, 5, 6, 00 , 6). (Recall that S(i) specifies the
`and i 3 = 7. Following a concentrate, G(0:7) = (A, B, C, D, -,
`-. -. -).
`PE from which the data for PE(i) is to be fetched, and S(i) =
`4) DISTRIBUTE: Let G(i), 0 ~ i ~ j < N be a set of
`00 iff PE(i) is to receive no data.) Let T(i) = i and FLAG(i)
`= 1, 0 :s; i < N. Our RAR algorithm begins by sorting the
`records with G(i) initially in PE(i). Let H(i), 0 :5 i 5 j be a
`set of destinations such that H(i) < H(i + 1 ), 0 5 i <}.The
`records G(i) = (S(i), T(i), FLAG (i)). Records are sorted
`on S; T is used to resolve ties (i.e., records with the same S
`purpose of a distribute is to route G(i) to PE(H(i)), 0 ~ i :5
`value are ordered by their T value). The sorting algorithm we
`j. It is easy to see that a distribute is the inverse of a concen(cid:173)
`trate. As an example, suppose that G(0:7) = (A, B, C, -, -,
`shall use ( see Section III) will be a comparison sort. We require
`-, -, -) and that H(O) = I, H(l) = 5, and H(2) = 6. Fol(cid:173)
`that during the sort whenever a comparison between G(i) and
`GU) is made, if S(i) = SU) and T(i) < TU), thell FLAG(i)
`lowing a distribute, G(0:7) =(-,A,-,-, -, B, C, -).
`5) GENERALIZE: A generalize makes multiple copies
`is set to zero. As a result of this, following the sort, FLAG(i)
`283
`
`INTEL - 1008
`
`