`
`Architectural Tradeoffs in
`Field-Programmable-Device-Based Computing Systems
`Pak K. Chau'and Martiue D.F. Schlagt
`Computer Engineering
`University of California, Santa Cruz
`Santa Cruz, California 95064 USA
`
`Reprogrammable Field-Programmable Gate Ar
`rays (FPGAs) have enabled the realization of high-
`performance and affordable reconfigurable comput
`ing engines. We examine the architectural tradeoffs
`involved in designing general purpose FPGA-based
`computing systems with field-programmable gate ar
`rays and field-programmable interconnects. The
`fact
`that FPGAs provide both programmable logic and
`programmable interconnects raises numerous design
`issues that need to be considered with care. Factors
`that influence the tradeoffs are routability, rearrange-
`ability, and speed.
`
`I FPGA-based computing systems
`Reprogrammable FPGAs form the basis of high-
`performance and affordable reconfigurable comput
`ing, prototyping and emulation systems [1, 2, 3, 4, 5,
`6, 7]. Interconnections schemes in these systems vary
`quite a bit. In SPLASH, PAM, QUICKTURN GANGLION
`and ANYBOARD
`[1,3, 4, 5, 7],
`the FPGAs are directly
`connected to ea
`ch other in a
`fixed pattern. In BORG
`and REALIZER [6, 2], logic and interconnection are
`separate. The REALIZER uses the partial crossbar
`interconnection architecture to connect the FPGAs,
`while the borg has complete interconnection. In
`this paper, we shall investigate the design of inter
`connection architectures using multiple crossbars and
`discuss the architectural tradeoffs in the design of
`such FPGA-based computing systems.
`We view FPGA-based computing systems as con
`sisting of reprogrammable FPGAs, and reprogram
`mable interconnects. A number of commercial de
`vices can serve as field-programmable interconnects:
`Aptix FPlCs have a place and route architecture
`which is not a crossbar 18]. Routing delay
`is not
`entirely controllable and predictable as a
`result
`of its place-and-route architecture. Each FPIC
`chip has a maximum of 940 user programmable
`I/Os.
`IQ160 Each FP1D chip has a maximum of 160 user
`programmable I/Os. The heart of the IQ160 is a
`176 x 176 crossbar. Every port can be configured
`to connect to any port. Routing delay is entirely
`’Supported in p*rt by NSF Grant MIP-9196276 Si MICRO
`MICRC?POrted iU P,k^, b> NSF PY' Gr<UU MIp-8896276 4
`0-818S-3890-7/93 *03.00 © 1993 IEEE
`
`132
`
`controllable and predictable as a result of this
`crossbar architecture (9].
`TI crossbar The Texas Instrument SN74ACT8841
`is a 16 x 16 crossbar with 4-bit word length.
`The crossbar is organized as 16 ports of 4-
`bit each. Routing delay is entirely controllable
`and predictable as a result of this crossbar
`architecture. Each crossbar has a maximum of
`128 user programmable I/Os.
`FPGAs have a place and route architecture. Rout
`ing delay is not entirely controllable and pre
`dictable as a result of this place-and-route ar
`chitecture [10]. Each FPGA typically has a
`maximum of 200 user programmable I/Os.
`Function-wise, the FPGAs and Aptix chips can be
`conceived of as devices capable of performing as
`crossbars under many circumstances (ignoring their
`unpredictable
`delays),
`This will be the view of
`this paper and
`we shs
`all use switches and crossbars
`interchangeably in our discussion.
`While all of the devices mentioned above can serve
`as programmable interconnects to provide flexible
`interconnection between computing elements in an
`FPGA-based computing system, the number of inter
`connections required may easily exceed the number of
`user programmable I/Os on a single device. We need
`to consider building a larger interconnection structure
`from these basic device packages which will be viewed
`as small crossbars. In this regard, the problem of
`building large crossbars from smaller crossbars is
`well studied in the context of telephone switching
`networks. Clos and Bcnes networks are representative
`of rearrangeable networks made up of small crossbars.
`We shall review the basic terminology, concepts and
`known results related to Clos networks in Section 11.
`The notion of foldcd-Clos networks is introduced
`to facilitate the application of known results and
`a control algorithm to configure the crossbars in
`an FPGA-based system consisting of multiple small
`crossbars. This is one of the contributions of this
`paper. Then, we show that the fact that FPGAs have
`dual personality (both as programmable logic and
`programmable interconnects) which can be exploited
`to enable system tradeoffs in the design of an
`FPGA-based system using both FPGAs and field-
`programmable interconnect. The issues of speed,
`rearrangeability and routability are considered in
`Section IV.
`
`IPR2018-01600
`
`EXHIBIT
`2064
`
`PATENT OWNER DIRECTSTREAM, LLC
`EX. 2078, p. 1
`
`
`
`middle switches output switches
`r = 4
`---------
`
`X =
`XE
`X
`X
`
`i
`
`* 4
`
`X
`X
`
`input switches
`m = 3
`
`3
`
`- {EiX:
`X
`EX:
`EX
`
`Figure 1: A three-stage Clos network C(3,3,4).
`
`II Rearrangeable CLOS switching
`networks
`We review some basic terminology and concepts
`related to Clos networks in this section.
`A connecting network is an arrangement of switches
`and interconnects allowing a set of input terminals to
`be connected to a set of output terminals in various
`combinations [11, 12]. All connections are point-
`to-point connections. An assignment for a given
`network is a list of input/output pairs which are
`to be connected; each terminal appears in at most
`one pair. An assignment is realizable if there exist
`disjoint paths in the network connecting all pairs
`of input^/output terminals in the assignment. A
`network is rearrangeable if any assignment is realiz
`able. Nonblocking networks have the property that in
`addition to being rearrangeable, any connection can
`be established between any idle pair of input/output
`terminals without rearrangements.
`Three-stage symmetrical Clos network consists of
`two symmetrical outer stages of r n x m switches and
`an middle stage of m r x r switches. The network
`is completely characterized by three parameters n, m
`and r, or C(n,m,r) collectively. A C(3,3,4) Clos
`network is shown in Figure 1. The Slepian-Duguid
`theorem states that a 3-stage Clos network C(n, m, r)
`is rearrangeable if and only if m > n [11, 13]. For our
`purpose, we assume n = m, hence all switches arc
`square and our Clos networks are rearrangeable.
`1
`Clos shows that a 3-stage Clos network C(», m, r
`is nonblocking in the strict sense if m > 2n —
`[14]. Clos also illustrates that it is possible to
`apply the principle recursively to form multi-stage
`rearrangeable or nonblocking networks. Clos assumes
`that all connections are point-to-point connections.
`In Section V, we shall consider an extension to multi
`pin connections.
`Although rearrangeable networks generally use
`
`153
`
`fewer switches, they require careful and sometimes
`hard routing ^ontrol algorithms. On the other hand,
`non-blocking, networks require less demanding or no
`control algorithm [15]. The next section presents
`a known control algorithm for rearrangeable Clos
`networks based on bipartite matching [12j.
`II-A A control algorithm for rearrangeable
`Clos networks
`Given an assignment A, a bipartite graph G(A) =
`can be constructed, where the vertex set
`<,VUV2,E)
`V\ consists of the input switches, and the vertex set
`Vi consists of the output switches. There is an edge
`between switch S 6 Vj and T 6 Vi for each pair of
`input/output terminals in the assignment where the
`input terminal belongs to S and the output terminal
`belongs to T. The edges of this bipartite graph can
`always be partitioned into m disjoint matchings, and
`then each matching provides the at most r pairs
`of input/output terminals which can be realized by
`a single r x r switch in the middle stage of the
`three-stage Clos network. Thus, the problem of
`setting up the switches in a three-stage rearrangeable
`Clos network C(n, m, r) can be formulated as edge
`coloring with m colors (or equivalently partitioning
`the edges into m matchings) for the bipartite graph
`G(-4) where edges incident to the same vertex must
`be of distinct colors [12]. This problem can be solved
`in polynomial time [16].
`Ill Architecture of FPGA-based
`systems
`With the basics defined in the previous sections, we
`are now in a position to discuss FPGA-based systems
`consisting of the following components:
`FPGAs: which can serve as computing elements,
`programmable interconnect or both. Figure 2
`
`PATENT OWNER DIRECTSTREAM, LLC
`EX. 2078, p. 2
`
`
`
`I/O pads
`
`ni = 3
`
`LOGIC
`
`ar®
`i®M SI
`
`Figure 2: An FPGA modeled as consisting of programmable logic and switches connecting the nets to I/O pads.
`
`Rl
`
`Routing chip
`
`"1
`
`------- 1
`i
`
`I
`
`h
`
`I-------
`i
`
`r ~
`I
`
`*“1
`i
`
`i--------
`
`L
`
`u
`
`L
`
`R2
`
`____ i
`
`—i
`
`J
`
`J
`
`LOGIC
`
`FPGA
`
`LOGIC
`
`FPGA
`
`Routing chip
`Figure 3: The BORG prototyping board viewed as as a three-stage C(2,2,4) Clos rearrangeable network.
`
`illustrates the last case in which an FPGA is
`viewed as a device consisting of programmable
`logic and a number of m x nj “switches”
`connecting the logic to the programmable I/O
`pads. The «i x nj “switches” are realized by
`igning the nets among the ti\
`rerouting or reassi
`pads. Th
`e size o
`f these switches (the amount
`
`of routing resources used to implement them)
`affects the routability of the FPGAs and hence
`the amount of logic that can be implemented on
`the device.
`The software implication of this model is that
`we do not lock the pins to conform with board-
`level constraints. This is advantageous since
`
`154
`
`PATENT OWNER DIRECTSTREAM, LLC
`EX. 2078, p. 3
`
`
`
`outer stage
`
`middle stage
`
`X
`
`computing
`element
`
`computing
`element
`
`X
`
`<x
`
`computing
`element
`
`computing
`
`x
`X
`element X
`Figure 4: A folded-Clos network with n = m = 3, and r = 4.
`
`reassigning or locking pins generally affects the
`routability of the FPGAs [10]. Instead, we allow
`the automatic placement and routing tool to
`choose freely the pin assignment. Afterward,
`one more routing step is required to use the
`“switches” to reroute the nets to conform with
`the board-level constraints.
`Memory elements serve as local memories to the
`computing elements.
`Host interface serves to communicate the FPGA-
`based computing system with the host machine
`to which the FPGA-based system is attached.
`Programmable routine whose sole purpose is to
`provide programmable interconnect amongst the
`computing elements and/or memory elements.
`We assume that the number of interconnec
`tions required exceeds the number of user pro
`grammable I/Os on a single programmable rout
`ing device. The major issue is then the prob
`lem of building a large switch from the smaller
`switches.
`We shall defer until Section VI the discussion of the
`impact of introducing other devices such as memory
`elements and the host interface to the system. So
`until then, we assume that there are only FPGAs
`and programmable interconnect in the system. As
`an example, consider the small programmable proto
`typing board BORG [6] which has two routing chips;
`it can be viewed as a three-stage rcarrangeablc Clos
`network. The FPGAs serve the dual purpose of
`programmable logic and programmable interconnec
`tion. The switches inside the FPGAs, provided by
`
`155
`
`interchanging the nets among the programmable I/O
`pads, form the outer stages of the Clos network.
`The middle stage is realized by the FPGAs R1 and
`R2 which serve as routing chips, as illustrated in
`Figure 3. The fact that the I/O pads of the FPGA
`are connected to the routing chips in an alternating
`fashion makes the system depicted in Figure 3 a three-
`stage rearrangeable Clos network with n = m = 2
`and r = 4. (The actual
`borg uses Xilinx XC3030s
`as routing chips resulting in a C(2,2,27) Clos rear
`rangeable network.)
`The scheme shown in Figure 3 only works for
`m = 2. In next section, we shall discuss how to use
`Clos networks when there are more than two FPGAs,
`for m > 2.
`XII-A Folded-Clos networks
`The Clos network and control algorithm outlined
`in Section II has a definite notion of input and
`output terminals. Therefore, in order to apply the
`control algorithm discussed in Section II one must
`identify the two terminals of a net either as input
`or output (irrespective of signal flow). This notion
`does not necessarily exist in an FPGA-based system,
`in which the computing elements have pads which
`can be reconfigured to be either input, output or
`bidirectional pads. The numbers and distribution of
`inputs/outputs change from application to applica
`tion.
`In this case it is most convenient to describe
`the programmable-routing component of an FPGA-
`based com;
`puling
`; system as a folded-Clos network,
`as depicted
`I in Fi
`igure 4. A three-stage folded-Clos
`network resembles a three-stage Clos network, but
`there is no distinction between input and output
`
`PATENT OWNER DIRECTSTREAM, LLC
`EX. 2078, p. 4
`
`
`
`outer stage
`nr-r.X
`
`r-z^zX
`
`FPGAi
`
`LOGIC
`
`FPGAj
`
`Xl'
`X
`zn
`LOGIC X
`X
`
`middle stage
`
`X
`
`X
`
`X
`
`Figure 5: A folded-Clos network using the FPGAs as both computing element and programmable interconnect.
`
`stages. We simply call the stage which connects
`directly to the computing elements the “outer” stage;
`it consists of r pairs of m x m switches. The “middle"
`stage consists of m r x r crossbars.
`
`III-B A control algorithm for folded-Clos
`networks
`The control algorithm for the folded-Clos network
`depicted in Figure 4 can be adapted from the control
`algorithm for the three-stage Clos network depicted
`in Figure 1 as follows. First, directions are assigned
`to each of the nets so that each pair of m x m switches
`has m in-terminals and m out-terminals. To do this,
`we form a net-graph from the required connections,
`where each node corresponds to a group of 2m pins,
`and an edge is added for each net between the two
`nodes(groups) to which its two pins belong. We then
`traverse this net-graph visiting each edge once to
`partition the edges into edge-disjoint circuits (a node
`may appear more than once). We assign a direction
`to the edges as we traverse the net-graph. After the
`traversal, each node will have m in-edges and m out-
`edges.
`Once the 2m edges
`have been divided, we can
`unfolded the foled-Clos
`network by separating the
`2m terminals into two m
`a x m switches and obtain
`a network identical to the three-stage Clos network.
`The Clos network control algorithm can be used
`to configure the m x m switches to realize any
`assignment. Notice that the direction assigned to
`an edge is purely for the purpose of using the edge-
`colorability control algorithm described in Section II-
`A, the direction is unrelated to the actual signal flow.
`
`IV Architectural tradeoffs: FPGAs as
`both computing element and
`programmable interconnect
`In this section, we examine the architectural tradeoffs
`involved in the organization of an FPGA-based
`computing systems. The issues
`are;
`1. rearrangeability
`2. routability, and
`3. speed.
`Any point-to-point connection realized by the
`architecture presented in Figure 4 has the delay
`of 3 switches. It is possible to trade routability
`with speed by exploiting the dual personality of
`FPGAs. Figure 5 shows a system resembling a
`folded-Clos network with the FPGAs serving as both
`programmable interconnect and computing .1......... .
`elements.
`It is not difficult to show the network depicted in
`Figure 5 satisfies the Slepian-Duguid rearrangeability
`criterion (see Section II) providing the algorithm
`outlined in Section Ill-B has been used to identify
`the input and output stages. We eliminate two stages
`of switch delay in Figure 5 but risk the danger of
`having a non-rearrangeable network. This possibility
`is illustrated as follows. Assume that we control
`the utilization of the FPGAs such that they can be
`modeled as shown in Figure 2 with m = 3, r = 4.
`suppose that the utilizations of the FPGAs
`Now
`increase such that n = m decreases from 3 to 2 while
`the external connections remain unchanged, as shown
`in Figure 6. Keep in mind that the interconnections
`between the outer stage and the middle stage are
`fixed. It is not hard to see that the network now is
`
`156
`
`PATENT OWNER DIRECTSTREAM, LLC
`EX. 2078, p. 5
`
`
`
`outer stage
`
`middle stage
`
`•m
`
`d,
`B,
`
`A? —
`b2 —S3
`
`;x
`
`)x
`
`'X
`
`Figure 6: An example to illustrate the effect of increase in FPGA utilization on rearrangeability.
`
`a non-Clos non-rearrangeable network. For example,
`the connections A\ —< Ai and B\ — Bj cannot be
`realized simultaneously.
`From an architectural standpoint, keeping the size
`of the outer switches, m, small (say 3) enhances the
`routability (and hence utilization) of the FPGAs. On
`the other hand rearrangeability requires that m be
`the number of switches in the middle stage of the
`folded-Clos network. For a given FPGA and crossbar
`device packages, this would determine the proper
`configuration of the FPGA-based system. If there
`are a total of y user I/Os per FPGA, a total of z
`user I/Os per crossbar, and F FPGAs as computing
`elements in the system, then since the total number
`of I/Os is roughly conserved (in the absence of “other
`devices”), we have
`xm
`(1)
`Fy
`m[xly\.
`F
`(2)
`For example, IQ160 FP1D has 160 user-l/Os per
`chip, i.e., z = 160. Given that m = 3, if we
`are using 80-user-I/O FPGAs, then F can be at
`most 6 in order to maintain rearrangeability. As
`another example, Aptix has 940 user-I/Os per chip.
`Given that m = 3, P can be at most 33 in order
`to maintain rearrangeability. Incorporating more
`FPGAs while maintaining rearrangeability requires
`increasing m beyond 3. As more FPGAs are added,
`the additional resources consumed by the “outer”
`switches will result in a diminishing logic capacity per
`FPGA. Conceptually, m should not reach the point
`where FPGA utilization is the same independent of
`whether the I/O pins were locked or not.
`
`V Considerations for multi-pin nets.
`The development so far had assumed that all connec
`tions are point-to-point connections. However in an
`FPGA-based system, if a signal is the source of mul
`tiple loads at different FPGAs, it is only reasonable
`to assume (with pin-limited packages) that this signal
`has only one source pin originating from one FPGA,
`and it fanouts in a miidh crossbar to reach different
`destinations. Figure 7 illustrates a net N\ with 3 pins
`fanout inside a crossbar.
`Given this assumption, the problem of configuring
`the crossbars to realize a given set of connections has
`no known solution. The closest related work in this
`area is for configuring non-blocking networks with
`definite signal direction (17). Our folded-Clos network
`doesn't have any definite signal flow direction and we
`only require rearrangeability.
`We can formulate the problem of configuring
`the crossbars to realize a given set of multi-pin
`connections in the following manner. Let
`• P be the set of all pins (terminals)
`• Oj be the set of pins (terminals) on outer
`switch j of the folded-Clos network
`« Al, be a net, Ni = (P,,i, P,,j,. •., P,.,.) C P
`where each Pjj is a pin, we have ((_!,_, A, C P).
`Since no two nets can share a pin,
`Vij (A/,nWj) = d
`(3)
`and only one pin per outer-stage switch for each
`net Ni is needed, we have
`v.-,/ |A,no,| < i.
`
`(4)
`
`167
`
`PATENT OWNER DIRECTSTREAM, LLC
`EX. 2078, p. 6
`
`
`
`outer stage
`
`middle stage
`
`FPGA
`
`I
`
`LOGIC
`
`W,
`
`FPGA2
`
`LOGIC
`
`fpga3
`
`.LOGIC
`
`Figure 7: A multipin-net N\ with 3 terminals.
`We assume that the m middle switches are capable
`of realizing any interconnection pattern among their
`R = 2r pins. The problem is solved by assigning
`the nets to the m switches (coloring the nets with m
`colors)
`/ : Nets —• (1,2,..., m)
`while avoiding contention among nets
`V,,i f(Ni) = f(N,) => Vy|Oj-n(A'iUA'i)| < I. (5)
`Constraint (5) ensures that each outer switch has at
`most one net of each color, that is, at most one net
`per middle switch. This constraint also ensures that
`each middle switch will not be assigned too many
`connections. That is,
`E/<*.)=)
`(6)
`The multi-pin control problem is more difficult
`than its point-to-point counterpart. To illustrate,
`for example consider the following seven nets to be
`connected in the network as shown in Figure 8.
`A, = (P,.P,2,
`= (Pr, Pr), N, = (Ps, P9, Pis}.
`N, = (P., P», Pi.), Ns = {Ps.Pio. Pis). Ns = (P., P„),
`A'7 = (P,,,P„)
`A contention graph Q can be constructed according
`to incompatibility among the nets. The vertex set V
`in 0 arc the nets, there is an edge joining Uj and
`Vj if net Ni and Nj cannot be in the same middle
`switch. The contention graph for our example is
`shown on the left in Figure 9. Since there are m = 3
`
`middle switches, a feasible 3-coloring of the vertices
`of the contention graph should provide a valid routing
`connection. This is, however, not possible in this
`example. Hence the folded-Clos network is no longer
`rearrangcable when multi-pin nets are allowed.
`Note that in this particular example allowing
`slightly more flexibility in reassigning of the nets
`would admit a solution. If we swap the nets on
`pins P9 and Pio (which is the same as setting
`Ws = {P3.Px0.Pn} and Nr, = {Ps, P9, f’is}) the
`new contention graph shown on the right in Figui
`re 9
`is 3-colorabIe. Since rearrangeability is no longer
`guaranteed, it may be necessary to consider larger
`“outer" switches while limiting the amount of internal
`rerouting of nets. Another possibility is to
`consider
`the m different alignments of the “outer”
`switches
`with respect to the FPGA pi--
`pins.
`On the other hand non-blocking n
`blocking networks offer a
`guarantee, but require a considerable
`: increase in the
`number of middle switches m. For instance, the best
`widc-sense non-blocking network with broadcasting
`capability requires
`mi > (n - l)logP + 2(n - 1)
`(7)
`for a system with F FPGAs where n is the width
`of each switch in the outer stage [17]. Whereas a
`rearrangeable network only requires
`mj = ri
`(8)
`for point-to-point connections. Given n = 3, and
`F = 16, mi is 12 but mj is 3.
`VI How should other devices be
`connected in the system ?
`Finally we consider the impact of the placement of
`other elements (memory, host interface) in an FPGA-
`
`V,
`
`153
`
`PATENT OWNER DIRECTSTREAM, LLC
`EX. 2078, p. 7
`
`
`
`outer stage
`Pi
`Pi
`
`middle stage
`
`X
`
`X
`
`X
`
`FPGAj
`
`LOGIC
`
`p6
`
`X Pa
`p9
`
`FPGAj
`
`LOGIC
`
`Pio
`
`P,2
`
`Xp5
`Pi X
`XPi,
`P,3 X
`XP,7
`
`\ Pm
`Pi s
`P,6
`
`LOGIC
`
`P,8
`
`Figure 8: A folded-Clos network.
`
`N,
`
`Nj
`
`W3
`
`We
`
`W5
`
`Ae
`
`W7
`
`Ns
`
`We
`
`w.
`
`FPGA3
`
`N,
`
`/Vj
`
`Ns
`
`W4
`
`if p9 e w5
`if P9 6 w3
`if P,o e w3
`if Pio G Ws
`Figure 9: Contention graphs for the given connections.
`
`169
`
`PATENT OWNER DIRECTSTREAM, LLC
`EX. 2078, p. 8
`
`
`
`RAM
`
`LOOIC
`
`RAM
`
`>oaic
`
`I
`
`i
`
`'X
`
`X
`
`X
`
`Figure 10: Direct connection between the RAMs and the FPGAs in a folded-Clos network.
`
`LOGIC
`
`LOCK
`
`X
`
`X lalerUc
`
`X
`
`Figure 11: Connection between the Host Interface and the FPGAs in a folded-Clos network.
`
`X
`X
`
`X
`
`Xi
`
`XX
`
`FPGA
`
`LOGIC
`
`FPGA
`
`LOGIC X
`X!
`RAM NX
`
`Figure 12: Connection between the RAMs and the FPGAs in a folded-Clos network.
`
`160
`
`PATENT OWNER DIRECTSTREAM, LLC
`EX. 2078, p. 9
`
`
`
`[5] C. E. Cox and W. E. Blanz, “GANGLION-
`A Fast Field-Programmable Gate Array Imple
`mentation of a Connectionist Classifier," IEEE
`Journal on Solid-Slate Circuits, vol. 27, pp. 288-
`299, March 1992.
`[6] P. K. Chan, M. Schlag, and M. Martin, “BORG:
`A reconfigurable prototyping board using Field-
`Programmable Gate Arrays,” in Proceedings
`of 1“ International ACM/SIGDA Workshop
`on Field-Programmable Gate Arrays, (Berkeley,
`California, USA), pp. 47-51, Feb. 1992.
`[7] D. A. Thomae, T. A. Petersen, and D. E. Van
`den Bout, “The Anyboard rapid prototyping
`environment,” in Advanced Research in VLSI,
`Proceedings of the 1991 UC Santa Cm Confer
`ence, (Santa Cruz, CA), pp. 356-370, 1991.
`[8] Aptix Corp. 225 Charcot Avenue, San Jose, CA
`95131.
`[9] I-Cube Design Systems. 2328-C Walsh Avenue,
`Santa Clara, CA 95051.
`[10] XILINX: The Programmable Gate Array Data
`Book. 2100 Logic Drive, San Jose, CA 95124,
`1992.
`[11] V. E. Benes, Mathematical Theory of Connecting
`Networks and Telephone Traffic. Academic
`Press, 1965.
`[12] F. K. Hwang, “Control Algorithms for Rear-
`rangeable Clos Networks," IEEE Transactions
`on Communications, vol. COM-31, pp. 952-954,
`Aug. 1983.
`[13] D. Slepian, “Two Theorems on a Particu
`lar Crossbar Switching Network.” Unpublished
`Memo, 1952.
`[14] C. Clos, “A study of non-blocking switching net
`works,” Bell System Technical Journal, vol. 32,
`pp. 406-424, Mar. 1953.
`[15] A. Varma and C. S. Raghavendra, Interconnec
`tion Networks for Multiprocessors and Multicom
`puters: Theory and Practice. IEEE Press, 1993.
`[16] H. N. Gabow and O. Kariv, “Algorithms for
`edge coloring bipartite graphs and multigraphs,"
`SIAM Journal of Computing, vol, II, no. 1,
`pp. 117-129, 1982.
`[17] Y. Yang and G. M. Masson, “Nonblocking
`Broadcast Switching Networks," IEEE Transac
`tions on Computers, vol. C-40, pp. 1005-1015,
`Sept. 1991.
`
`based computing system connected as a folded-Clos
`network. We assume that each memory element
`serves as local memory to the corresponding FPGA.
`There are several possibilities:
`1. Attach the other element (such as memory
`element) directly to the FPGA, as illustrated in
`Figure 10. This involves locking some of the pads
`of the FPGA, and may impair the routability of
`the FPGA. This scheme has the advantage of
`incurring no additional memory access time.
`2. Attach the other element to the middle stage of
`the folded-Clos network, as illustrated in Figure
`11. This scheme incurs higher (one switch delay)
`access delay to the elements, and creates forcea-
`(6]. That is, some nets must go through
`nets
`inc switches. In this case, the control
`speci
`problem is NP-complete, even for m = 2. Yet
`in our experience with our small prototyping
`we have not encountered an
`system BORG [6]
`assignment which could
`3. Treat the other elements as computing elements,
`as illustrated in Figure 12. This scheme incurs
`the largest additional access delay while enabling
`the use of a known polynomial time control
`algorithm. It will also increase the size of
`outer switches to maintain rearrangeability of
`the network.
`VII Conclusions
`We have addressed architectural issues involved in the
`design of an FPGA-based computing system. Some
`of the principles discussed in this paper are being
`applied in the design of a FPGA-based computing
`system at University of California at Santa Cruz. We
`are using the Xilinx XC4010S as the FPG As and IQ160
`as programmable interconnects.
`Acknowledgment
`The authors would like to acknowledge Professor
`Anujan Varma for his discussions on Clos networks.
`The authors are grateful for the generous support of
`Xilinx Inc. and 1-Cube Design Systems.
`
`References
`[1] M. Gokhale, W. Holmes, A. Kopser, S. Lucas,
`R. Minnich, D. Sweely, and D. Lopresti, “Build
`ing and using a highly parallel programmable
`logic array,” Computer, vol. 24, pp. 81-89, Jan.
`1991.
`[2] M. Butts, J. Batcheller, and J. Varghese, An
`Efficient Logic Emulation System, in Proceed
`ings ICCD 92, pg. 138-141, Cambridge MA, Oct,
`1992.
`[3] P. Berlin, D. Roncin, and J. Vuillemin, “Pro-
`ble Active Memories: a Performance
`gramma
`Assessment,” in li! International ACM/SIGDA
`Workshop on Field Programmable Gate Arrays,
`(Berkeley, California), pp. 57-59, February 1992.
`[4] Quickturn Systems Inc. 325 East Middlefield
`Road, Mountain View, CA 94043, 1991.
`
`161
`
`PATENT OWNER DIRECTSTREAM, LLC
`EX. 2078, p. 10
`
`