CSci 493.65 Parallel Computing
Prof. Stewart Weiss

Chapter 2 Parallel Architectures and Interconnection Networks

"The interconnection network is the heart of parallel architecture."
- Chuan-Lin Wu and Tse-Yun Feng [1]

2.1 Introduction

You cannot really design parallel algorithms or programs without an understanding of some of the key properties of various types of parallel architectures and the means by which components can be connected to each other. Parallelism has existed in computers since their inception, at various levels of design. For example, bit-parallel memory has been around since the early 1970s, and simultaneous I/O processing (using channels) has been used since the 1960s. Other forms of parallelism include bit-parallel arithmetic in the arithmetic-logic unit (ALU), instruction look-ahead in the control unit, direct memory access (DMA), data pipelining, and instruction pipelining. However, the parallelism that we will discuss is at a higher level; in particular we will look at processor arrays, multiprocessors, and multicomputers. We begin, however, by exploring the mathematical concept of a topology.

2.2 Network Topologies

We will use the term network topology to refer to the way in which a set of nodes are connected to each other. In this context, a network topology is essentially a discrete graph: a set of nodes connected by edges. Distance does not exist and there is no notion of the length of an edge.¹ You can think of each edge as being of unit length.

Network topologies arise in the context of parallel architectures as well as in parallel algorithms. In the domain of parallel architectures, network topologies describe the interconnections among multiple processors and memory modules. You will see in subsequent chapters that a network topology can also describe the communication patterns among a set of parallel processes. Because they can be used in these two different ways, we will first examine them purely as mathematical entities, divorced from any particular application.

Formally, a network topology < S, E > is a finite set S of nodes together with an adjacency relation E ⊆ S × S on the set. If v and w are nodes such that (v, w) ∈ E, we say that there is a directed edge from v to w.² Sometimes all of the edges in a particular topology will be undirected, meaning that both (v, w) ∈ E and (w, v) ∈ E. Unless we state otherwise, we will treat all edges as undirected. When two nodes are connected by an edge, we say they are adjacent. An example of a network topology that should be quite familiar to the reader is the binary tree shown in Figure 2.1. Notice that the nodes in that figure are labeled. A label is an arbitrary symbol used to refer to the node, nothing more.

We will examine the following topologies in these notes:

• Binary tree
• Fully-connected (also called completely-connected)
• Mesh and torus
• Hypercube (also called a binary n-cube)
• Butterfly

¹Network topologies are a kind of mathematical topological space, for those familiar with this concept, but this is of no importance to us.
²The reader familiar with graphs will notice that a network topology is essentially a discrete graph.

Page 1 of 17 IPR2020-00262
VENKAT KONDA EXHIBIT 2006

Figure 2.1: Binary tree topology with 7 nodes.

2.2.1 Network Topology Properties

Network topologies have properties that determine their usefulness for particular applications, which we now define.

Definition 1. A path from node n_1 to node n_k is a sequence of nodes n_1, n_2, ..., n_k such that, for 1 ≤ i < k, n_i is adjacent to n_(i+1). The length of a path is the number of edges in the path, not the number of nodes.

Definition 2. The distance between a pair of nodes is the length of the shortest path between the nodes.

For example, in Figure 2.1, the distance between nodes 3 and 7 is 4, whereas the distance between nodes 3 and 5 is 2.

Definition 3. The diameter of a network topology is the largest distance between any pair of nodes in the network.

The diameter of the network in Figure 2.1 is 4, since the distance between nodes 3 and 7 is 4, and there is no pair of nodes whose distance is greater than 4. Diameter is important because, if nodes represent processors that must communicate via the edges, which represent communication links, then the diameter determines a lower bound on the communication time. (Note that it is a lower bound and not an upper bound; if a particular algorithm requires, for example, that all pairs of nodes send each other data before the next step of a computation, then the diameter determines how much time will elapse before that step can begin.)

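To make these definitions concrete, here is a small sketch (Python, added for illustration, not part of the original notes). It builds the tree of Figure 2.1 as an adjacency list, using a labeling consistent with the distances quoted above (root 1 with children 2 and 4; leaves 3, 5, 6, 7), and computes distances and the diameter by breadth-first search:

```python
from collections import deque

# Edges of the binary tree in Figure 2.1, with a labeling consistent
# with the distances quoted in the text (root 1; leaves 3, 5, 6, 7).
edges = [(1, 2), (1, 4), (2, 3), (2, 5), (4, 6), (4, 7)]

adj = {}
for u, v in edges:                       # undirected: record both directions
    adj.setdefault(u, []).append(v)
    adj.setdefault(v, []).append(u)

def distance(src, dst):
    """Length of the shortest path from src to dst, found by BFS."""
    seen, frontier = {src}, deque([(src, 0)])
    while frontier:
        node, d = frontier.popleft()
        if node == dst:
            return d
        for w in adj[node]:
            if w not in seen:
                seen.add(w)
                frontier.append((w, d + 1))

# Diameter: the largest distance over all pairs of nodes.
nodes = list(adj)
diameter = max(distance(u, v) for u in nodes for v in nodes)

print(distance(3, 7), distance(3, 5), diameter)   # 4 2 4
```

Running it confirms the distances above and a diameter of 4.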
Definition 4. The bisection width of a network topology is the smallest number of edges that must be deleted to sever the set of nodes into two sets of equal size, or size differing by at most one node.

In Figure 2.1, edge (1,2) can be deleted to split the set of nodes into two sets {2,3,5} and {1,4,6,7}. Therefore, the bisection width of this network is 1. Bisection width is important because it can determine the total communication time. Low bisection width is bad, and high is good. Consider the extreme case, in which a network can be split by removing one edge. This means that all data that flows from one half to the other must pass through this edge. This edge is a bottleneck through which all data must pass sequentially, like a one-lane bridge in the middle of a four-lane highway. In contrast, if the bisection width is high, then many edges must be removed to split the node set. This means that there are many paths from one side of the set to the other, and data can flow with a high degree of parallelism from any one half of the nodes to the other.

Definition 5. The degree of the network topology is the maximum number of edges that are incident to a node in the topology.

The maximum number of edges per node can affect how well the network scales as the number of processors increases, because of physical limitations on how the network is constructed. A binary tree, for example, has the property that the maximum number of edges per node is 3, regardless of how many nodes are in
the tree. This is good, because the physical design need not change to accommodate the increase in number of processors. Not all topologies have a constant degree. If the degree increases with network size, this generally means that more connections need to be made to each node. Nodes might represent switches, or processors, and in either case they have a fixed pin-out, implying that the connections between processors must be implemented by a complex fan-out of the wires, a very expensive and potentially slow mechanism.

Although the edges in a network topology do not have length, we assume that nodes cannot be infinitely small. As a consequence, the definition of the topology itself can imply that, as the number of nodes increases, the physical distance between them must increase. Maximum edge length is a measure of this property. It is important because the communication time is a function of how long the signals must travel. It is best if the network can be laid out in three-dimensional space so that the maximum edge length is a constant, independent of network size. If not, and the edge length increases with the number of processors, then communication time increases as the network grows. This implies that expanding the network to accommodate more processors can slow down communication time. The binary tree in Figure 2.1 does not have a constant maximum edge length, because as the size of the tree gets larger, the leaf nodes must be placed further apart, which in turn implies that eventually the edges that leave the root of the tree must get longer.

2.2.2 Binary Tree Network Topology

In a binary tree network, the 2^k − 1 nodes are arranged in a complete binary tree of depth k − 1, as in Figure 2.1. The depth of a binary tree is the length of a path from the root to a leaf node. Each interior node is connected to two children, and each node other than the root is connected to its parent. Thus the degree is 3. The diameter of a binary tree network with 2^k − 1 nodes is 2(k − 1), because the longest path in the tree is any path from a leaf node that must go up to the root of the tree and then down to a different leaf node. If we let n = 2^k − 1, then 2(k − 1) is approximately 2 log2 n; i.e., the diameter of a binary tree network with n nodes is a logarithmic function of network size, which is very low.

The bisection width is low, which means it is poor. It is possible to split the tree into two sets differing by at most one node in size by deleting either edge incident to the root; the bisection width is 1. As discussed above, maximum edge length is an increasing function of the number of nodes.

2.2.3 Fully-Connected Network Topology

In a fully-connected network, every node is connected to every other node, as in Figure 2.2. If there are n nodes, there will be n(n − 1)/2 edges. Suppose n is even. Then there are n/2 even-numbered nodes and n/2 odd-numbered nodes. If we remove every edge that connects an even node to an odd node, then the even nodes will form a fully-connected network and so will the odd nodes, but the two sets will be disjoint. There are n/2 edges from each even node, one to each odd node, so there are (n/2)^2 edges that connect these two sets. If any one of these edges remains, the two sets are still connected, so no smaller set of edges suffices. Therefore, the bisection width is (n/2)^2. The diameter is 1, since there is a direct link from any node to every other node. The degree is proportional to n, so this network does not scale well. Lastly, the maximum edge length will increase as the network grows, because nodes are not arbitrarily small. (Think of the nodes as lying on the surface of a sphere, and the edges as chords connecting them.)

Figure 2.2: Fully-connected network with 6 nodes.

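The counting argument above is easy to check by brute force for a small even n. The following sketch (illustrative Python, not part of the original notes) builds the complete edge set and counts the edges joining the even-numbered half to the odd-numbered half:

```python
from itertools import combinations

n = 6                                     # any even n works here
edges = list(combinations(range(n), 2))   # fully connected: n(n-1)/2 edges
assert len(edges) == n * (n - 1) // 2

# Edges crossing between the even-numbered and odd-numbered halves:
crossing = [(u, v) for u, v in edges if u % 2 != v % 2]
print(len(crossing))   # (n/2)^2 = 9 when n = 6
```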
2.2.4 Mesh Network Topology

In a mesh network, nodes are arranged in a q-dimensional lattice. A 2-dimensional lattice with 36 nodes is illustrated in Figure 2.3. The mesh in that figure is square; unless stated otherwise, meshes are usually square. In general, there are k^2 nodes in a 2-dimensional mesh. A 3-dimensional mesh is the logical extension of a 2-dimensional one: it consists of the lattice points in a 3-dimensional grid, with edges connecting adjacent points. A 3-dimensional mesh, assuming the same number of nodes in all dimensions, must have k^3 nodes. While we cannot visually depict q-dimensional mesh networks when q > 3, we can describe their properties. A q-dimensional mesh network has k^q nodes, where k is the number of nodes in a single dimension of the mesh. Henceforth we let q denote the dimension of the mesh.

Figure 2.3: A two-dimensional square mesh with 36 nodes.

The diameter of a q-dimensional mesh network with k^q nodes is q(k − 1). To see this, note that the farthest distance between nodes is from one corner to the diagonally opposite one. An inductive argument is as follows. In a 2-dimensional lattice with k^2 nodes, you have to travel (k − 1) edges horizontally and (k − 1) edges vertically to get to the opposite corner, in any order; thus you must traverse 2(k − 1) edges. Suppose we have a mesh of dimension q − 1, q ≥ 3. By assumption its diameter is (q − 1)(k − 1). A mesh of one higher dimension consists of k copies of the (q − 1)-dimensional mesh, side by side. To get from one corner to the opposite one, you have to travel to the far corner of one (q − 1)-dimensional copy, which requires crossing (q − 1)(k − 1) edges by hypothesis. Then you have to get to the k-th copy of the mesh in the new dimension, crossing (k − 1) more edges. Thus you travel a total of (q − 1)(k − 1) + (k − 1) = q(k − 1) edges. This is not rigorous, but it is the idea of the proof.
If k is an even number, the bisection width of a q-dimensional mesh network with k^q nodes is k^(q-1). Consider the 2D mesh of Figure 2.3. To split it into two halves, you can delete 6 = 6^1 edges. Imagine the 3D mesh with 216 nodes. To split it into two halves, you can delete the 36 = 6^2 vertical edges connecting the 36 nodes in the third plane to the 36 nodes in the fourth plane. In general, one can delete the edges that connect the two adjacent copies of the (q − 1)-dimensional lattice in the middle of the q-dimensional lattice. There are k^(q-1) such edges. This is a very high bisection width. One can prove by an induction argument that the bisection width when k is odd is (k^q − 1)/(k − 1). Thus, whether k is even or odd, the bisection width is Θ(k^(q-1)). As there are n = k^q nodes in the mesh, this is roughly n^((q-1)/q), which is a very high bisection width. (When q = 2, it is √n.)

The degree in a mesh is fixed for each given q: it is always 2q. The maximum edge length is also a constant, independent of the mesh size, for two- and three-dimensional meshes. For higher-dimensional meshes, it is not constant.

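These formulas are easy to check for small meshes. The sketch below (Python, added for illustration) builds a q-dimensional mesh on coordinate tuples, measures the diameter by BFS from a corner (whose farthest node is the opposite corner), and counts the edges crossing the middle plane of the last dimension:

```python
from collections import deque
from itertools import product

def mesh_edges(k, q):
    """Edges of a q-dimensional mesh with k nodes per dimension.
    Each edge is generated once, in the +1 direction along one dimension."""
    edges = []
    for node in product(range(k), repeat=q):
        for dim in range(q):
            if node[dim] + 1 < k:
                nbr = node[:dim] + (node[dim] + 1,) + node[dim + 1:]
                edges.append((node, nbr))
    return edges

def diameter(k, q):
    """BFS from one corner; the farthest node in a mesh is the opposite corner."""
    adj = {}
    for u, v in mesh_edges(k, q):
        adj.setdefault(u, []).append(v)
        adj.setdefault(v, []).append(u)
    start = (0,) * q
    dist = {start: 0}
    frontier = deque([start])
    while frontier:
        u = frontier.popleft()
        for w in adj[u]:
            if w not in dist:
                dist[w] = dist[u] + 1
                frontier.append(w)
    return max(dist.values())

# Diameter q(k - 1) for two small cases:
assert diameter(4, 2) == 2 * 3 and diameter(3, 3) == 3 * 2

# Bisection width k^(q-1) for even k: edges crossing the middle plane
# of the last dimension.
k, q = 6, 2
cut = [e for e in mesh_edges(k, q)
       if e[0][q - 1] == k // 2 - 1 and e[1][q - 1] == k // 2]
print(len(cut))   # k^(q-1) = 6, as in Figure 2.3
```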
An extension of a mesh is a torus. A torus, the 2-dimensional version of which is illustrated in Figure 2.4, is an extension of a mesh by the inclusion of edges between the exterior nodes in each row and those in each column. In higher dimensions, it includes edges between the exterior nodes in each dimension. It is called a torus because the surface that would be formed if it were wrapped around the nodes and edges with a thin film would be a mathematical torus, i.e., a doughnut. A torus, or toroidal mesh, has lower diameter than a non-toroidal mesh, by a factor of 2.

Figure 2.4: Two-dimensional mesh with toroidal connections.

Figure 2.5: Hypercubes of dimensions 0, 1, 2, and 3.

2.2.5 Hypercube (Binary n-Cube)

A binary n-cube or hypercube network is a network with 2^n nodes arranged as the vertices of an n-dimensional cube. A hypercube is simply a generalization of an ordinary cube, the three-dimensional shape which you know. Although you probably think of a cube as a rectangular prism whose edges are all equal length, that is not the only way to think about it.

To start, a single point can be thought of as a 0-cube. Suppose its label is 0. Now suppose that we replicate this 0-cube, putting the copy at a distance of one unit away, and connecting the original and the copy by a line segment of length 1, as shown in Figure 2.5. We will give the duplicate node the label 1.

We extend this idea one step further. We will replicate the 1-cube, putting the copy parallel to the original at a distance of one unit away in an orthogonal direction, and connect corresponding nodes in the copy to those in the original. We will use binary numbers to label the nodes, instead of decimal. The nodes in the copy will be labeled with the same labels as those of the original except for one change: the most significant bit will be changed from 0 in the original to 1 in the copy, as shown in Figure 2.5. Now we repeat this procedure to create a 3-cube: we replicate the 2-cube, putting the copy parallel to the original at a distance of 1 unit away in the orthogonal direction, connect nodes in the copy to the corresponding nodes in the original, and relabel all nodes by adding another significant bit, 0 in the original and 1 in the copy.

It is now not hard to see how we can create hypercubes of arbitrary dimension, though drawing them becomes a bit cumbersome. A 4-cube is illustrated in Figure 2.6.

Figure 2.6: A 4-dimensional hypercube.

The node labels will play an important role in our understanding of the hypercube. Observe that:

• The labels of two nodes differ by exactly one bit change if and only if they are connected by an edge.

• In an n-dimensional hypercube, each node label is represented by n bits. Each of these bits can be
inverted (0 → 1 or 1 → 0), implying that each node has exactly n incident edges. In the 4D hypercube, for example, each node has 4 neighbors. Thus the degree of an n-cube is n.

• The diameter of an n-dimensional hypercube is n. To see this, observe that a given integer represented with n bits can be transformed to any other n-bit integer by changing at most n bits, one bit at a time. This corresponds to a walk across at most n edges in a hypercube from the first label to the second.

• The bisection width of an n-dimensional hypercube is 2^(n-1). One way to see this is to realize that all nodes can be thought of as lying in one of two planes: pick any bit position and call it b. The nodes whose b-bit = 0 are in one plane, and those whose b-bit = 1 are in the other. To split the network into two sets of nodes, one in each plane, one has to delete the edges connecting the two planes. Every node in the 0-plane is attached to exactly one node in the 1-plane by one edge. There are 2^(n-1) such pairs of nodes, and hence 2^(n-1) edges. No smaller set of edges can be cut to split the node set.

• The number of edges in an n-dimensional hypercube is n · 2^(n-1). To see this, note that it is true when n = 0, as there are 0 edges in the 0-cube. Assume it is true for all k < n. A hypercube of dimension n consists of two hypercubes of dimension n − 1 with one edge between each pair of corresponding nodes in the two smaller hypercubes. There are 2^(n-1) such edges. Thus, using the inductive hypothesis, the hypercube of dimension n has 2 · (n − 1) · 2^(n-2) + 2^(n-1) = (n − 1) · 2^(n-1) + 2^(n-1) = (n − 1 + 1) · 2^(n-1) = n · 2^(n-1) edges. By the axiom of induction, the claim is proved.

The bisection width is very high (one half the number of nodes), and the diameter is low. This makes the hypercube an attractive organization. Its primary drawbacks are that (1) the number of edges per node is a (logarithmic) function of network size, making it difficult to scale up, and (2) the maximum edge length increases as network size increases.

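All of these properties can be verified mechanically from the labels alone. The sketch below (Python, added for illustration, built on the one-bit-difference rule above) checks the degree, edge count, and bisection width for n = 4:

```python
from collections import Counter
from itertools import combinations

def hypercube_edges(n):
    """Edges of the n-cube: nodes are labels 0 .. 2^n - 1, with an edge
    exactly when two labels differ in a single bit."""
    return [(u, v) for u, v in combinations(range(2 ** n), 2)
            if bin(u ^ v).count("1") == 1]

n = 4
edges = hypercube_edges(n)

# Edge count n * 2^(n-1):
assert len(edges) == n * 2 ** (n - 1)

# Degree: every node has exactly n neighbors.
degree = Counter()
for u, v in edges:
    degree[u] += 1
    degree[v] += 1
assert set(degree.values()) == {n}

# Bisection width 2^(n-1): edges joining the two "planes" determined
# by the most significant bit b.
cut = [(u, v) for u, v in edges if (u >> (n - 1)) != (v >> (n - 1))]
print(len(edges), len(cut))   # 32 8
```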
2.2.6 Butterfly Network Topology

A butterfly network topology consists of (k + 1) · 2^k nodes arranged in k + 1 ranks, each containing n = 2^k nodes. k is called the order of the network. The ranks are labeled 0 through k. Figure 2.7 depicts a butterfly network of order 3, meaning that it has 4 ranks with 2^3 = 8 nodes in each rank. The columns in the figure are labeled 0 through 7.

We describe two different methods for constructing a butterfly network of order k.

Method 1:

• Create k + 1 ranks labeled 0 through k, each containing 2^k nodes, labeled 0 through 2^k − 1.
• Let [i, j] denote the node in rank i, column j.
Figure 2.7: Butterfly network of order 3.

• For each rank i, from 0 through k − 1, connect each node [i, j] to node [i + 1, j]. In other words, draw the straight lines down the columns as shown in Figure 2.7.
• For each rank i, from 0 through k − 1, connect each node [i, j] to node [i + 1, (j + 2^(k-i-1)) % 2^k]. This creates the diagonal edges that form the butterfly pattern. For example, if k = 3, in rank 0, the node in column j is connected to the node in rank 1 in column (j + 2^(k-1)) % 2^k = (j + 4) % 8, so the nodes 0, 1, 2, and 3 in rank 0 are connected to the nodes 4, 5, 6, and 7 respectively in rank 1, and nodes 4, 5, 6, and 7 in rank 0 are connected to nodes 0, 1, 2, and 3 respectively in rank 1.

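Method 1 translates directly into code. The sketch below (illustrative Python, not part of the original notes) constructs the edge set for k = 3 and checks the node and edge counts:

```python
def butterfly(k):
    """Edge set of an order-k butterfly built by Method 1.
    Nodes are (rank, column) pairs, ranks 0..k, columns 0..2^k - 1."""
    cols = 2 ** k
    edges = set()
    for i in range(k):
        for j in range(cols):
            edges.add(((i, j), (i + 1, j)))   # straight edge down the column
            edges.add(((i, j), (i + 1, (j + 2 ** (k - i - 1)) % cols)))  # diagonal
    return edges

k = 3
edges = butterfly(k)
nodes = {v for e in edges for v in e}

assert len(nodes) == (k + 1) * 2 ** k    # 32 nodes in 4 ranks of 8
assert len(edges) == 2 * k * 2 ** k      # one straight + one diagonal per node per rank pair
assert ((0, 0), (1, 4)) in edges         # rank-0 diagonal: column 0 to column (0+4)%8

print(len(nodes), len(edges))   # 32 48
```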
Method 2: This is a recursive definition of a butterfly network.

• A butterfly network of order 0 consists of a single node, labeled [0, 0].
• To form a butterfly network of order k + 1, replicate the butterfly network of order k, labeling the corresponding nodes in the copy by adding 2^k to their column numbers. Place the copy to the right of the original so that nodes remain in the same ranks. Add one to all rank numbers and create a new rank 0. For each j = 0, 1, ..., 2^(k+1) − 1, connect the node in column j of the new rank 0 to two nodes: the node in rank 1, column j, and the corresponding node in the copy.

This recursive sequence is illustrated in Figures 2.8 and 2.9.

Figure 2.8: Recursive construction of butterfly network of order 1 from order 0.

Figure 2.9: Recursive construction of butterfly network of order 2 from order 1.

It can be shown that there is a path from any node in the first rank to any node in the last rank. If this is the case, then the diameter is 2k: to get from node 0 to node 7 in the first rank requires descending to rank 3 and returning along a different path. If, however, the last rank is really the same as the first rank, which is sometimes the case, then the diameter is k. Figure 2.10 schematically represents this type of network with dashed lines between the first and last ranks.

Figure 2.10: Butterfly network with first and last ranks representing the same node.
The bisection width is 2^k. To split the network requires deleting all edges that cross between columns 2^(k-1) − 1 and 2^(k-1). Only the nodes in rank 0 have connections that cross this divide. There are 2^k nodes in rank 0, and one diagonal edge from each to a node on the other side of the imaginary dividing line. This network topology has a fixed number of edges per node, 4 in total; however, the maximum edge length will increase as the order of the network increases.

One last observation: in Figure 2.7, imagine taking each column and enclosing it in a box, and call each box a node. There are 2^k such nodes. Any edge that was incident to any node within the box is now considered incident to the new node (i.e., the box). The resulting network contains 2^k nodes connected in a k-dimensional hypercube. This is the relationship between butterfly and hypercube networks.

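This collapsing argument can be tested in code. The sketch below (Python, added for illustration) uses the bit-complement form of the diagonal edges, i.e. node [i, j] connects to [i+1, j] and to [i+1, j XOR 2^(k-i-1)], which matches the crossing pattern drawn in Figure 2.7 and agrees with the Method 1 formula at rank 0. Contracting each column to a single node then yields exactly the k-dimensional hypercube:

```python
from itertools import combinations

def butterfly_edges(k):
    """Order-k butterfly; the diagonal at rank i complements bit k-i-1
    of the column number (the classic butterfly wiring)."""
    cols = 2 ** k
    edges = set()
    for i in range(k):
        for j in range(cols):
            edges.add(((i, j), (i + 1, j)))                     # straight
            edges.add(((i, j), (i + 1, j ^ 2 ** (k - i - 1))))  # diagonal
    return edges

k = 3

# Contract every column into one node: keep only the pair of column
# numbers of each edge, discarding the self-loops from straight edges.
collapsed = {frozenset((j1, j2))
             for ((i1, j1), (i2, j2)) in butterfly_edges(k) if j1 != j2}

# The k-dimensional hypercube on column labels 0 .. 2^k - 1:
hypercube = {frozenset((u, v)) for u, v in combinations(range(2 ** k), 2)
             if bin(u ^ v).count("1") == 1}

print(collapsed == hypercube)   # True
```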
2.3 Interconnection Networks

We now focus on actual, physical connections. An interconnection network is a system of links that connects one or more devices to each other for the purpose of inter-device communication. In the context of computer architecture, an interconnection network is used primarily to connect processors to processors, or to allow multiple processors to access one or more shared memory modules. Sometimes they are used to connect processors with locally attached memories to each other. The way that these entities are connected to each other has a significant effect on the cost, applicability, scalability, reliability, and performance of a parallel computer. In general, the entities that are connected to each other, whether they are processors or memories, will be called nodes.

An interconnection network may be classified as shared or switched. A shared network can have at most one message on it at any time. For example, a bus is a shared network, as is traditional Ethernet. In contrast, a switched network allows point-to-point messages among pairs of nodes and therefore supports the transfer of multiple concurrent messages. Switched Ethernet is, as the name implies, a switched network. Shared networks are inferior to switched networks in terms of performance and scalability. Figure 2.11 illustrates a shared network. In the figure, one node is sending a message to another, and no other message can be on the network until that communication is completed.

Figure 2.11: A shared network connecting 4 nodes.

Figure 2.12 depicts a switched network in which two simultaneous connections are taking place, indicated by the dashed lines.

Figure 2.12: A switched network connecting 4 nodes.

2.3.1 Interconnection Network Topologies

Notation. When we depict networks in these notes, we will follow the same convention as the Quinn book and use squares to represent processors and/or memories, and circles to represent switches. For example, in Figure 2.13, there are four processors on the bottom row and a binary tree of seven switches connecting them.

We distinguish between direct and indirect topologies. In a direct topology, there is exactly one switch for each processor node, whereas in an indirect topology, the number of switches is greater than the number of processor nodes. In Figure 2.13, the topology is indirect because there are more switches than processor nodes.

Figure 2.13: Binary tree interconnection network. The circles are switches and the squares are processors.

Certain topologies are usually used as direct topologies, others as indirect topologies. In particular,

• The 2D mesh is almost always used as a direct topology, with a processor attached to each switch, as shown in Figure 2.14.
• Binary trees are always indirect topologies, acting as a switching network to connect a bank of processors to each other, as shown in Figure 2.13.
• Butterfly networks are always indirect topologies; the processors are connected to rank 0, and either memory modules or switches back to the processors are connected to the last rank.
• Hypercube networks are always direct topologies.

2.4 Vector Processors and Processor Arrays

There are two essentially different models of parallel computers: vector computers and multiprocessors. A vector computer is simply a computer that has an instruction that can operate on a vector. A pipelined vector processor is a vector processor that can issue a vector instruction that operates on all of the elements of the vector in parallel by sending those elements through a highly pipelined functional unit with a fast clock. A processor array is a vector processor that achieves parallelism by having a collection of identical, synchronized processing elements (PEs), each of which executes the same instruction on different data, and which are controlled by a single control unit. Because all PEs execute the same instruction at the same time, this type of architecture is suited to problems with data parallelism. Processor arrays are often used in scientific applications because these often contain a high amount of data parallelism.

Figure 2.14: 2D mesh interconnection network, with processors (squares) attached to each switch.

2.4.1 Processor Array Architecture

In a processor array, each PE has a unique identifier, its processor id, which can be used during the computation. Each PE has a small, local memory in which its own private data can be stored. The data on which each PE operates is distributed among the PEs' local memories at the start of the computation, provided it fits in that memory. The control unit, which might be a full-fledged CPU, broadcasts the instruction to be executed to the PEs, which execute it on data from their local memories, and can store the results in their local memories or can return global results back to the CPU. A global result line is usually a separate, parallel bus that allows each PE to transmit values back to the CPU to be combined by a parallel, global operation, such as a logical-and or a logical-or, depending upon the hardware support in the CPU. Figure 2.15 contains a schematic diagram of a typical processor array. The PEs are connected to each other through an interconnection network that allows them to exchange data with each other as well.

If the processor array has N PEs, then the time it takes to perform the same operation on N elements is the same as to perform it on one element. If the number of data items to be manipulated is larger than N, then it is usually the programmer's job to arrange for the additional elements to be stored in the PEs' local memories and operated on in the appropriate sequence. For example, if the machine has 1024 PEs and an array of size 5000 must be processed, then since 5000 = 4 · 1024 + 904, the 5000 elements would be distributed among the PEs by giving 5 elements to 904 PEs and 4 elements to the remaining 120 PEs.
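The arithmetic behind such an even distribution can be sketched as a small helper (illustrative Python; the function name is ours, not from the notes):

```python
def distribute(m, p):
    """Spread m data items over p PEs as evenly as possible.
    Returns (low, n_low, high, n_high): n_low PEs hold `low` items each
    and n_high PEs hold `high` = low + 1 items each."""
    low, r = divmod(m, p)    # m = low * p + r, with r PEs taking one extra item
    return low, p - r, low + 1, r

low, n_low, high, n_high = distribute(5000, 1024)
assert low * n_low + high * n_high == 5000   # every item is assigned
print(low, n_low, high, n_high)   # 4 120 5 904
```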
The topology of the interconnection network determines how easy it is to perform different types of computations. If the interconnection network is a butterfly network, for example, then messages between PEs will travel many links in each communication, slowing down the computation greatly. If the machine is designed for fast manipulation of two-dimensional data sets, such as images or matrices, then the interconnection network would be a 2D mesh arranged as a direct topology, as shown in Figure 2.14.

2.4.2 Processor Array Performance

Although the subject of parallel algorithm performance will be covered in depth in a later chapter, we introduce a measure of computer performance here, namely the number of operations per second that a computer can execute. There are many other metrics by which a computer's performance can be evaluated; this is just one common one.

Figure 2.15: Typical processor array architecture.

If a processor array has 100 processing elements, and each is kept busy all of the time and executes H operations per second, then the processor array executes 100H operations per second. If, at any moment in time, on average, only a fraction f of the PEs are active, then the performance is reduced to 100fH operations per second. The average fraction of PEs that are actively executing instructions at any time is called the utilization of the processor array.

Several factors influence this utilization. One, as alluded to above, is whether or not the number of data elements to be processed is a multiple of the number of PEs. If it is not, then at some point there will be idle PEs while others are active.

Example 6. A processor array has 512 PEs. Two arrays A and B of 512 floating point numbers each are to be added and stored into a third array C. The PEs can execute a floating point addition (including fetching and storing) in 100 nanoseconds. The performance of this processor array on this problem would be

    512 operations / 100 nanosecs = 512 operations / (100 × 10^-9 seconds) = 5.12 × 10^9 flops.

The term flops is short for floating-point operations per second.

Suppose the size of the arrays is 700, instead. In this case the first 512 elements will be added in parallel by all PEs, taking 100 nanoseconds in total, but the remaining 188 elements will require only 188 PEs to be active, and another 100 nanoseconds will elapse. Therefore the performance will be

    700 operations / 200 nanosecs = 3.5 operations / 10^-9 seconds = 3.5 × 10^9 flops.

This is less than 70% of peak performance.

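The calculation in Example 6 generalizes: with more elements than PEs, the operation runs in ⌈elements/PEs⌉ parallel steps. A quick sketch (illustrative Python, added for these notes):

```python
import math

def flops(n_elements, n_pes, op_time_ns):
    """Sustained operations per second when n_elements identical operations
    are executed on n_pes PEs, each parallel step taking op_time_ns ns."""
    steps = math.ceil(n_elements / n_pes)   # passes over the PE array
    return n_elements / (steps * op_time_ns * 1e-9)

peak = flops(512, 512, 100)   # one full step:  5.12e9 flops
sub = flops(700, 512, 100)    # two steps:      3.5e9 flops
print(peak, sub, sub / peak)  # utilization just under 70% of peak
```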
2.4.3 Processor Masking

A processor array also has to handle what PEs do when they do not need to participate in a computation, as in the preceding example. This problem also arises when conditional instructions such as if-then-else instructions are executed. In this case, it is possible that the true-branch will need to be taken in some PEs and the false-branch in others. The problem is that the control unit issues a single instruction to all PEs