`
ess of near-neighbor connections, but the utility of such devices is limited to the calculations that fit the specific array geometries. Nevertheless, such devices could be put in high production and swing the cost-performance pendulum to favor near-neighbor communication. On the other hand, breakthroughs in optical transmission and optical switching may swing the balance towards the perfect shuffle. Such advances would make communication faster over the longer interconnections, and reduce the cost of sending data much further than the nearest neighbor.
Consequently, new advances in architectures for the continuum model are driven by the advances yet to come in devices and communications.
`
4.6 Architectures for the Continuum Model-Which Direction?
The continuum model is a natural model for parallelism. Near-neighbor interactions can be modeled by networks of processors connected together as near-neighbors. The advantage of the near-neighbor structure is very strong for those problems that are ideally matched to such a structure.
In a broad spectrum of problems, as the fit becomes less ideal, the performance of near-neighbor connections becomes poorer and poorer, to the extent that gains due to parallel execution are offset by the inefficient use of hardware.
`Here are the basic choices available to the architect:
`
1. Build a highly specialized, near-neighbor architecture that is very fast and effective for some class of problems within the continuum model.
2. Build a somewhat more general machine, but maintain high speed for the continuum model. Provide extra capability through richer interconnections, such as the perfect shuffle, and through other mechanisms that provide speed enhancement for problems that fall outside the continuum model.
3. Build a very general parallel machine that has broad applicability, including the continuum model, although its speed for continuum calculations may not be as high as for an architecture specialized for the class of problems.
`
The potential size of the user community increases by one to two orders of magnitude as you move from the first to the second choice, and again as you move from the second to the third choice. A large user base tends to provide cost reductions to each user because each user has to support a much smaller share of the hardware and software development costs.
A large demand also provides greater profit motivation, but if a designer chooses to serve the large community and produces a fairly general architecture, the users who absolutely need a machine for the continuum model will be unsatisfied if the general architecture is significantly slower than an architecture specialized for the continuum model. Moreover, this same user group will question the value to themselves of the hardware and software that support the more general classes of problems, since this group of users may be paying for these aspects of the computer system and yet derive no discernible benefit from them.
Which community should the architect serve? There is no obvious answer to this question. The architect should be prepared to build any of the possible machines, from the most specialized to the most general, each optimized for the best possible cost and performance for that architecture.
Market forces and other priorities will dictate which machine actually gets built. Some developers will choose the most general approach, and hope to install many copies of a machine. Some developers will choose to carve a niche for their ideas by producing a relatively small number of copies of a highly specialized machine. Yet other developers may choose a design that falls in between.
Whichever choice is made, the architecture has to be cost-effective for the user community. For the smaller markets, a significant portion of the challenge is to keep hardware and software development costs low, so that these costs when amortized over copies actually sold are still within reasonable bounds. Thus, not only must the architect produce a cost-effective design, but the design process itself must be done efficiently.
One important observation from this chapter is that what appears to be an ideal architecture for a class of problems may not be ideal at all. An architect who produces a machine that executes a particular code very efficiently may be somewhat disappointed when research advances in basic algorithms produce a new, efficient solution technique not at all suited to the specific architecture. In such a case the very specialized machine may have difficulty competing with a less specialized machine that happens to be able to run the more efficient algorithm.
Breakthroughs do occur from time to time, such as with the formulation of the fast Fourier transform [Cooley and Tukey 1965]. Prior to their work, the best algorithm required N² multiplications and required a particular type of access to data. The newer, faster algorithm requires only N log N multiplications and uses a very different data flow. A machine built for the older algorithm would not serve the newer one well. The more specialized the architecture, the more susceptible it is to competitive methods when breakthroughs do occur. The architect of the specialized machine has to assess the risk of a breakthrough. For the continuum model, the risks are high enough to merit attention.
In recent years, algorithm improvements have changed the basic flow of data in various solution techniques, have altered the grid structure that models the continuum, and have even provided for multiple grid spacing. A machine built specifically for algorithms of 20 years ago would do relatively poorly when executing some of the new algorithms for the same problems.
As an example of the evolution of parallel algorithms, the fast algorithms for the continuum model described earlier in this chapter may make better use of connection patterns like the perfect shuffle than of connection patterns that use near-neighbor mesh connections, but the near-neighbor connections were the backbone for the first large-scale computers for the continuum. Another step in the evolution is represented by the Cosmic Cube described earlier in this chapter, which in a sense combines the near-neighbor interconnections and the perfect shuffle. It uses near-neighbor connections in six dimensions, but at best only three of those dimensions lead to short interconnections in a three-dimensional packaging world. The other three dimensions force interconnections to have relatively long physical lengths.
The six-dimensional connection structure of the Cosmic Cube gives the same adjacency pattern achieved by the perfect shuffle. The difference is that all dimensions are adjacent at all times in a Cosmic Cube, whereas the adjacency changes in time in a perfect shuffle structure. Because processors that are directly connected within a Cosmic Cube have indices that differ by a single power of 2, this structure is well suited for recursive doubling, cyclic reduction, Fourier transforms, and other applications mentioned in this section.
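The power-of-2 adjacency is easy to see in a short sketch. The following minimal Python fragment (an illustration added here, not from the text; the function name is ours) enumerates the nodes directly connected to a given node in a d-dimensional hypercube such as the six-dimensional Cosmic Cube:

# In a d-dimensional hypercube, the neighbors of a node are the nodes
# whose indices differ from it in exactly one bit, i.e., by a power of 2.
def hypercube_neighbors(node, dimensions):
    """Return the indices of the nodes directly connected to node."""
    return [node ^ (1 << k) for k in range(dimensions)]

# Node 5 (binary 000101) in a 6-cube is adjacent to the six nodes whose
# indices differ from 5 by 1, 2, 4, 8, 16, or 32:
print(hypercube_neighbors(5, 6))   # [4, 7, 1, 13, 21, 37]

Stepping through the dimensions in order visits partners whose index distance doubles at every step, which is exactly the communication pattern that recursive doubling and the fast Fourier transform require.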
Hoshino [1989], on the other hand, has shown that for the general class of scientific calculations the overwhelming majority of processor-to-processor interactions occur across near-neighbor links on a two-dimensional mesh. The additional connectivity provided by a hypercube and the greater distances spanned by the perfect shuffle rarely come into play, and provide only a marginal decrease in the number of operations while contributing greatly to cost. He provides a strong case for two-dimensional mesh connections based on extensive experience in implementing scientific applications. Even though his applications occasionally force some processors to communicate over long distances, this happens sufficiently infrequently that it degrades performance only slightly. Hence, Hoshino's case rests on the fact that the communication constraints imposed by a two-dimensional mesh do not degrade performance of actual programs. Indeed, his PAX architecture is a compromise between the ILLIAC IV and Cosmic Cube architectures, incorporating some good features of each together with some features unique to PAX.
Nevertheless, experience with parallel applications is still rather limited, though it is growing every year. New techniques and new algorithms are still appearing in abundance. As these appear, they force us to rethink our conclusions on what combination of algorithms, architecture, and applications produces an efficient way to solve problems.
In summary, there is no obvious best design for parallel processors for the continuum model. The available approaches depend on how specialized the processing system can be. A processor for the continuum model undoubtedly will be somewhat specialized-it will probably have an interconnection system to speed up typical programs for this model. Which approach, if any, becomes dominant is most likely to depend on the directions of device technology in the coming years, with near-neighbor structures dependent on VLSI advances and perfect-shuffle structures dependent on advances in interconnection technology.
`
`Exercises
4.1 The object of this exercise is to explore calculations for the continuum model. Assume that you have a square array of points, 9 x 9, and that the value of the potential function on the boundary is 0 on the top row, and is 10 along all other boundary points.
a) Initialize the potential function to 0 on all interior points. Calculate the Poisson solution for the values of all interior points by replacing each interior point with the average value of each of its neighboring points. Compute the new values for all interior points before updating any interior points. Run this simulation for five iterations and show the answers you obtain at this point. Then run until no interior point changes by more than 0.1 percent, and count the total number of iterations until convergence. This method is usually called the Jacobi method. Note: The values on the boundary are fixed and do not change during the computation.
b) Repeat the process in the previous problem, except update a point as soon as you have computed the new value and use the new value when you reach a neighboring point. You should scan the interior points row by row from top to bottom and from left to right within rows. This method is usually called the Gauss-Seidel method.
c) The second process seems to converge faster. Give an intuitive explanation of why this might be the case.
d) How do your findings relate to the interconnection structure of a parallel processor designed to solve this problem?
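For readers who want to run the comparison, here is a minimal Python sketch of the two update disciplines (one possible reading of the exercise, not a supplied solution; the grid indexing and names are ours):

# Exercise 4.1 setup: 9 x 9 grid, potential 0 on the top boundary row,
# 10 on all other boundary points, interior initialized to 0.
def make_grid():
    g = [[10.0] * 9 for _ in range(9)]
    g[0] = [0.0] * 9                        # top boundary row is 0
    for i in range(1, 8):
        for j in range(1, 8):
            g[i][j] = 0.0                   # interior starts at 0
    return g

def jacobi_step(g):
    """Part (a): compute every new interior value before storing any."""
    new = [row[:] for row in g]
    for i in range(1, 8):
        for j in range(1, 8):
            new[i][j] = (g[i-1][j] + g[i+1][j] + g[i][j-1] + g[i][j+1]) / 4
    return new

def gauss_seidel_step(g):
    """Part (b): update in place, row by row from top to bottom and left
    to right, using each new value as soon as it is available."""
    for i in range(1, 8):
        for j in range(1, 8):
            g[i][j] = (g[i-1][j] + g[i+1][j] + g[i][j-1] + g[i][j+1]) / 4
    return g

g = make_grid()
for _ in range(5):
    g = jacobi_step(g)                      # five Jacobi iterations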
4.2 The purpose of this exercise is to show the effect of information propagation within a calculation. Use the Poisson problem of Exercise 4.1(b) and write a computer program using the Gauss-Seidel method that iterates until no interior point value changes by more than 0.1 percent. Let this be the initial state of the problem for the following exercises.
a) Increase the boundary point on the top row next to the upper left corner to a new value of 20. Perform five iterations of the Gauss-Seidel Poisson solver and observe the values obtained. Then run the algorithm until no interior point value changes by more than 0.1 percent and count the total number of iterations to reach this point.
b) Now restore the mesh to the initial state for a. Change the program so that, in effect, the upper left corner is rotated to the bottom right corner. To do this, scan the rows from right to left instead of left to right and scan from bottom to top instead of from top to bottom. Perform five iterations of the Poisson solver and observe the values obtained. Run the program until no interior point changes by more than 0.1 percent, and count the number of iterations to reach this point.
`
`
`
`
`
c) Both a and b eventually converge to the same solution because the initial data are the same and the physical process modeled is the same. However, the results obtained from a and b are different after five iterations. Explain why they are different. Which of the two problems has faster convergence? Why?
4.3 The purpose of this exercise is to examine the cyclic-reduction algorithm. Explore the solution of a one-dimensional Poisson problem by treating 15 points on a line. Let the left boundary point, point 0, have the value 10 and the right boundary point, point 16, have the value 0. Each of the 15 intermediate points has a value that is the average of its immediate neighbors.
a) Write a matrix equation of the form Ax = b that describes this problem.
b) Simulate an iterative process that updates each interior point with the average of its neighboring points. Obtain the interior values of points for the first three iterations of the technique previously used, in which each interior point is updated by the average of its neighbors.
c) Now apply the cyclic-reduction algorithm in the text for three iterations to find one equation for the point in the middle. Solve this equation and use three iterations of back substitution to find the remainder of the points. Show your solution and the equations you obtain after each iteration. (Hint: The first iteration should produce new equations for points 2, 4, 6, 8, 10, 12, and 14. The second iteration produces new equations for 4, 8, and 12.)
d) Compare the results produced in b and c with respect to the precision obtained. Count and compare the total number of additions, multiplications, and divisions for each algorithm after three iterations.
e) Explain from an intuitive point of view why cyclic reduction yields high speed and high precision as compared to the near-neighbor iteration. What implications can you draw with regard to interconnections for processors for solving the Poisson problem?
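The reduction of part (c) can be carried out mechanically. The Python sketch below is one common formulation of cyclic reduction (added as an illustration; the array names and the folding of the boundary values into the right-hand side are ours). Each interior point satisfies -x[i-1] + 2x[i] - x[i+1] = d[i], with x[0] = 10 and x[16] = 0:

n = 15
a = [0.0] + [-1.0] * n + [0.0]   # coefficient of x[i - stride] in equation i
b = [0.0] + [2.0] * n + [0.0]    # coefficient of x[i]
c = [0.0] + [-1.0] * n + [0.0]   # coefficient of x[i + stride]
d = [0.0] * 17
d[1] = 10.0                      # equation 1 becomes 2x[1] - x[2] = x[0] = 10
a[1] = 0.0                       # boundary terms are known, not unknowns
c[15] = 0.0

# Forward reduction: each pass doubles the stride between coupled unknowns,
# producing equations for points 2, 4, ..., 14, then 4, 8, 12, then 8 alone.
s = 1
while s < 8:
    for i in range(2 * s, 16, 2 * s):
        al = -a[i] / b[i - s]
        ga = -c[i] / b[i + s]
        a[i] = al * a[i - s]
        c[i] = ga * c[i + s]
        b[i] = b[i] + al * c[i - s] + ga * a[i + s]
        d[i] = d[i] + al * d[i - s] + ga * d[i + s]
    s *= 2

# Solve the single equation for the middle point, then back-substitute.
x = [0.0] * 17
x[0], x[16] = 10.0, 0.0
x[8] = d[8] / b[8]
for s in (4, 2, 1):
    for i in range(s, 16, 2 * s):
        x[i] = (d[i] - a[i] * x[i - s] - c[i] * x[i + s]) / b[i]

Because cyclic reduction is a direct method, the values it produces are exact up to roundoff, which bears on the precision comparison in parts (d) and (e).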
4.4 The purpose of this exercise is to investigate how to implement conditional branches in an array computer. Program 4.1 does not show instructions that determine if convergence has been reached. The instructions should determine if every processor has obtained a satisfactory solution, and, if not, the program should branch back to the top of the loop.
a) Write the instructions that do this job, inventing the instructions as you need them. Describe the operation of each instruction that you invent.
b) Redraw the block diagram of the ILLIAC IV computer and describe the data flow on the block diagram necessary to support the test for termination.
c) Assume that the control processor of the ILLIAC IV can execute its instructions in parallel with instructions that are broadcast to the 64 numerical processors. Can any or all instructions of the termination test be overlapped with the calculation of a loop iteration? If so, describe how to implement the instructions in your program and in Program 4.1 to facilitate this overlapped execution.
4.5 The purpose of this exercise is to explore the interconnection structure of a hypercube computer such as the Cosmic Cube. Assume that you are to calculate all partial sums of i items up to the sum of 64 items.
`
`
a) Construct a program for a Cosmic Cube computer system that performs this operation in a time that grows as O(log N) if the number of processors is N. Assume that every node in the computer executes the same program, although the program can be slightly different from node to node since the processors in a Cosmic Cube are independent. Show explicitly the instructions that send and receive data between processors. Invent instructions as you need them and describe what the instructions do. Include some type of instruction for synchronization that forces a processor to be idle until a neighboring processor sends a message or a datum that enables computation to continue.
b) Which communication steps, if any, in your answer require communications with processors that are not among the six processors directly connected to a given processor? How do you propose to implement such communication in software (assuming that the hardware itself does not provide remote communication as a basic instruction)?
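A serial simulation illustrates the communication pattern that part (a) calls for. The following minimal Python sketch (our illustration, not Cosmic Cube code) computes all 64 partial sums in log2 64 = 6 exchange steps, each across one hypercube dimension, so every exchange is with a directly connected processor:

d = 6
N = 1 << d                        # 64 processors, one item apiece
item = list(range(1, N + 1))      # processor p holds item[p]
prefix = item[:]                  # partial sum accumulated at processor p
total = item[:]                   # sum over p's current subcube

for k in range(d):
    new_prefix, new_total = prefix[:], total[:]
    for p in range(N):
        partner = p ^ (1 << k)    # neighbor across dimension k
        if partner < p:           # partner's subcube precedes ours
            new_prefix[p] = prefix[p] + total[partner]
        new_total[p] = total[p] + total[partner]
    prefix, total = new_prefix, new_total

assert prefix == [p * (p + 1) // 2 for p in range(1, N + 1)]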
4.6 The purpose of this exercise is to examine the recursive-doubling solution to a linear tridiagonal system of equations. Consider the solution of the equation Ax = b, where A is a tridiagonal matrix.
a) Prove that the recurrence in Eq. (4.15) is a correct expression for the major diagonal of matrix U in an LU decomposition of A.
b) Using recursive doubling, show all of the steps required to factor A into LU and to solve the equations Ly = b and Ux = y. For each major step of the algorithm, show the basic recurrence solution. Show the mathematical formulation of your solution and indicate the basic operation in the recursive-doubling iteration.
`
`4.7 Find a recursive-doubling technique for solving Eq. (4.13).
`
4.8 The purpose of this exercise is to explore some of the properties of the perfect-shuffle interconnection scheme.
a) Consider a processor that has the perfect shuffle and pair-wise exchange connections shown in Fig. 4.16. For an eight-processor system, show that the permutation that cyclically shifts the input vector by three positions is realizable by some setting of the exchange modules. Draw the network unrolled in time to show the setting that realizes this permutation.
b) Repeat a to show that a cyclical shift of two positions is realizable.
c) Prove that a shuffle-exchange network can realize any cyclical shift in log₂ N iterations for an N-processor system when N is a power of 2.
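The realizability claims in parts (a) and (b) can be checked mechanically. The Python sketch below (our illustration; it assumes each stage is a perfect shuffle followed by a rank of pair-wise exchange switches, one plausible reading of Fig. 4.16) searches for exchange settings that realize a given cyclic shift:

from itertools import product

def shuffle(v):
    """Perfect shuffle: interleave the two halves of v (length a power of 2)."""
    n = len(v) // 2
    return [v[j // 2 + (n if j % 2 else 0)] for j in range(2 * n)]

def exchange(v, mask):
    """Bit i of mask tells exchange module i to swap positions 2i and 2i+1."""
    out = v[:]
    for i in range(len(v) // 2):
        if (mask >> i) & 1:
            out[2 * i], out[2 * i + 1] = out[2 * i + 1], out[2 * i]
    return out

def find_settings(target, n=8, stages=3):
    """Brute-force one mask per stage so that the shuffle-exchange passes
    map the identity arrangement to target; returns None if none works."""
    for masks in product(range(1 << (n // 2)), repeat=stages):
        v = list(range(n))
        for m in masks:
            v = exchange(shuffle(v), m)
        if v == target:
            return masks
    return None

# Part (a): a cyclic shift by three positions on eight items.
shift3 = [(p - 3) % 8 for p in range(8)]
print(find_settings(shift3))      # a triple of switch masks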
`
`4.9 Find a means for evaluating a polynomial of degree N - 1 in the variable x in
`parallel on an N-processor computer that uses the shuffle-exchange
`interconnection
`pattern. Assume that N is a power of 2.
`
4.10 Prove that the scheme shown in Fig. 4.18 produces a sorted sequence of length N from a bitonic sequence of length N. Specifically, prove that after the comparison and exchange is performed, each sequence of length N/2 is bitonic and all elements of one sequence do not exceed the value of any element of the other sequence.
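The following minimal recursive sketch (ours; it collapses the unrolled network of Fig. 4.18 into recursion) shows the property in action: one compare-exchange stage leaves two bitonic halves, with no element of the low half exceeding any element of the high half, so recursing on the halves sorts the sequence.

def bitonic_merge(seq):
    """Sort a bitonic sequence by repeated compare-exchange splits."""
    if len(seq) == 1:
        return seq
    half = len(seq) // 2
    lo = [min(seq[i], seq[i + half]) for i in range(half)]   # smaller of each pair
    hi = [max(seq[i], seq[i + half]) for i in range(half)]   # larger of each pair
    return bitonic_merge(lo) + bitonic_merge(hi)             # both halves bitonic

# 1 3 5 7 8 6 4 2 rises and then falls, so it is bitonic:
print(bitonic_merge([1, 3, 5, 7, 8, 6, 4, 2]))   # [1, 2, 3, 4, 5, 6, 7, 8]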
4.11 Consider a tridiagonal linear system such as that described in Section 4.4.4. Assume that the problem is symmetric about the major diagonal so that a_{i,j} = a_{j,i}. (The indices i and j lie in the range 1 ≤ i, j ≤ N, where N is a power of 2.) The Sturm polynomials for the matrix A are the polynomials Q_i(x) defined by the recurrence

Q_0(x) = 1
Q_1(x) = a_{1,1} - x
Q_i(x) = (a_{i,i} - x) Q_{i-1}(x) - (a_{i,i-1})^2 Q_{i-2}(x)

For a very important matrix computation it is necessary to find the number of changes of sign for a given value of x in the sequence of values Q_0(x), Q_1(x), ..., Q_{N-1}(x).
a) Assume that you wish to find the number of sign changes for a single value of x, and you have N processors available to do the calculation in parallel. Work out a recursive-doubling algorithm for the calculation.
b) Show a block diagram of a connection pattern suitable for this algorithm within which each processor is connected to a fixed constant number of processors regardless of the size of N, and for which at each step of the algorithm the data are accessible in a constant number of steps from neighboring processors, regardless of the size of N.
c) Now assume that you wish to find the number of sign changes for N different values of x. Compare the time taken by running your recursive-doubling algorithm N times to the time required to obtain values of the Sturm polynomials serially for each of the N values of x. Which of the two methods is preferred?
d) Now assume that you wish to compute the number of sign changes for a number of values of x much larger than the value of N. Which of the two methods is better?
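For experimentation, here is a minimal serial evaluation of the Sturm sequence (our illustration; the argument names are ours, and exact zero values are ignored for brevity). It is the baseline against which the recursive-doubling algorithm of parts (a) through (d) is compared:

def sturm_sign_changes(diag, sub, x):
    """Count sign changes in Q0(x), Q1(x), ... for the symmetric
    tridiagonal matrix with diagonal diag and subdiagonal sub
    (sub[i-1] couples rows i and i+1, numbering rows from 1)."""
    q = [1.0, diag[0] - x]                  # Q0(x) = 1, Q1(x) = a11 - x
    for i in range(1, len(diag)):
        q.append((diag[i] - x) * q[-1] - sub[i - 1] ** 2 * q[-2])
    return sum(1 for u, v in zip(q, q[1:]) if u * v < 0)

# For the 3 x 3 matrix with diagonal 2 and subdiagonal -1, the eigenvalues
# are 2 - sqrt(2), 2, and 2 + sqrt(2); the count equals the number below x:
print(sturm_sign_changes([2.0, 2.0, 2.0], [-1.0, -1.0], 0.0))   # 0
print(sturm_sign_changes([2.0, 2.0, 2.0], [-1.0, -1.0], 4.0))   # 3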
4.12 Figure 4.15 shows a shuffle-exchange network with a cyclical shift interconnection pattern superimposed. Show that it is possible to compute the same set of partial sums computed in the figure without the cyclic-shift pattern, using only the perfect-shuffle and the pair-wise-exchange patterns of Fig. 4.16. Your algorithm will need to send more than one datum from one cell to a cell in the next column, but the number of different data transmitted from column to column is a constant that is independent of N.
4.13 a) Show the switch settings for a shuffle-exchange network as depicted in Fig. 4.16 that send input cell i to output cell 3i mod N for N = 16.
b) For each integer in the range 0 ≤ i ≤ 15, write the value of i in binary followed by the value of 3i mod 16 in binary. Start a new row for each integer and align the binary values to create a table of size 16 rows by 8 columns. Examine row i for each i. Show that the last four bits in each row are related to the switch settings from part a. In fact, these bits show the switch settings for input i as it passes through the network. (Hint: Use the shift-register analogy.)
4.14 a) Prove that the function that takes i into pi mod N for i ≤ N is a permutation when N is a power of 2 and p is odd. (Hint: The function is a permutation if you can show that when pi = pj mod N, this implies that i = j.)
b) Apply your reasoning from b of Exercise 4.13 to show that a shuffle-exchange network has switch settings that realize the permutation that takes i to pi mod N for every odd value of p.
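Part (a) is easy to spot-check numerically; a minimal sketch (ours):

# For N = 16, verify that i -> p*i mod N is a permutation for every odd p:
# odd p is relatively prime to a power of 2, so pi = pj (mod N) forces i = j.
N = 16
for p in range(1, N, 2):
    assert sorted(p * i % N for i in range(N)) == list(range(N)), p
print("i -> p*i mod 16 is a permutation for every odd p")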
`
`
`
`
`5
`
`The tucked-up sempstress walks with
`hasty strides, Wh!le ~treams run down
`her oil' d umbrella s sides.
`-Jonathan Swift, 1711
`
`Vector Computers
`
`5.1 A Generic Vector Processor
`5.2 Access Patterns for Numerical Algorithms
`.
`5.3 Data-Structuring Techniques for Vector Machines
`5.4 Attached Vector-Processors
`5.5 Sparse-Matrix Techniques
`5.6 The GF-11-A Very High-Speed Vector Processor
`5. 7 Final Comments on Vector Computers
`
The last chapter introduces the idea of building a parallel architecture matched to a specific class of problems. The discussion there mentions that there are two major models of numerical processes-a continuum model based on near-neighbor interactions and a particle model based on discrete point-to-point interactions. The major emphasis of Chapter 4 is the continuum model, together with the architectures that support processing of near-neighbor interactions for that model.
This chapter extends the discussion of numerical architectures to vector computers with the idea that these computers can be used for the majority of continuum-model problems, as well as for many particle-model problems. The vector computer has emerged as the most important high-performance architecture for numerical problems. It has the two key qualities of efficiency and wide applicability.
Most vector computers have a pipelined structure. When one pipeline is not sufficient to achieve desired performance, designers have occasionally provided multiple pipelines. Such processors not only support a streaming mode of data flow through a single pipeline, they also support fully parallel operation by allowing multiple pipelines to execute concurrently on independent streams of data.
By the mid-1980s, more than twenty manufacturers offered vector processors based on pipeline arithmetic units. They ranged from relatively inexpensive auxiliary processors attached to microcomputers to high-speed supercomputers with computation rates from 100 Mflops to rates in excess of 1000 Mflops. (One Mflops is 10⁶ floating-point operations per second.)
The price-performance ratio of these vector processors is rather remarkable because they yield one to two orders of magnitude increased throughput for vector computations when compared to serial processors of equal cost. But this throughput increase is limited to the problems that fit the architecture-that is, to problems that can be structured as a sequence of vector operations whose characteristics make efficient use of the facilities available.
Many of the supercomputers are also high-performance serial processors for general-purpose problems, but the throughput of these supercomputers on nonvector problems is only a few times greater than the throughput of more conventional high-speed serial processors. In fact, although throughput might be high because of fast device technology, if a vector-structured supercomputer is used exclusively on nonvector problems, the computational cost may be excessive because this cost includes the cost of the vector facilities, which presumably are left idle by scalar computations.
The purpose of this chapter is to describe the general architecture of vector machines and then describe how algorithms and architecture can be matched to each other to obtain efficient processing over large classes of computations.
`
`5.1 A Generic Vector Processor
The basic idea of a vector processor is to combine two vectors, element by element, to produce an output vector. Thus, if A, B, and C are vectors, each with N elements, a vector processor can perform the operation

C := A + B

which is interpreted to mean

c_i := a_i + b_i,  0 ≤ i ≤ N - 1

where the vector C can be written in component form as (c_0, c_1, ..., c_{N-1}). The form is similar for vectors A and B.
A very simplified way to implement this operation with a pipelined arithmetic unit is shown in Fig. 5.1. The two streams of data supplied to the arithmetic unit carry the streams for A and B, respectively.
`
`
[Fig. 5.1 A processor that is capable of adding two vectors by streaming the two vectors through a pipelined adder.]
The memory system supplies one element of A and B on every clock cycle, one element to each input stream. The arithmetic unit produces one output value during each clock cycle. (Actually, the input data rate need be only as fast as the output data rate. If the arithmetic unit can produce results at a rate of one output value every d cycles, then the input data rate need be only one input value on each stream every d cycles.)
Figure 5.1 shows only the barest details of the vector processor to indicate the general flow of data through the pipelines. The pipelined arithmetic unit is discussed in Section 3.4, and that unit is the core of the architecture in Fig. 5.1. The difficulty, however, is the design of the memory system to sustain a continuous flow of data from memory to the arithmetic unit and the return flow of results from the arithmetic unit to memory. The majority of the architectural tricks used in vector processors are devoted to sustaining that flow of data and to scheduling sequences of operations to reduce the flow requirements.
In this example we assume a basic one-cycle rate for the delivery of operands, production of results, and restoring of the result data into memory. This calls for a memory system that can read two operands and write one operand in a single cycle.
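A small timing model (our illustration, not a description of any particular machine) makes the flow requirement concrete: if one pair of operands is read per cycle and the adder pipeline has a fixed depth, N sums finish in about N plus the pipeline depth cycles, and in the steady state the memory performs two reads and one write on every cycle.

def stream_add(a, b, stages=4):
    """Model of Fig. 5.1: operand pair i is read on cycle i, and its sum
    is written back on cycle i + stages; once the pipeline is full, one
    result emerges per cycle."""
    c = [x + y for x, y in zip(a, b)]       # the sums themselves
    cycles = len(a) + stages                # fill time, then one result per cycle
    return c, cycles

c, cycles = stream_add(list(range(100)), list(range(100)))
print(cycles)                               # 104 cycles for 100 elements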
Conventional random-access memories can perform at most one READ or one WRITE per cycle, so the memory system in Fig. 5.1 has at least three times the bandwidth of a conventional memory system. Of course this ignores any additional requirement for bandwidth for input/output operations. Also, we have ignored the bandwidth for instruction fetches, but a major advantage of a vector architecture is that a single instruction fetch can initiate a very long vector operation. Consequently, the bandwidth required to fetch instructions for a vector architecture is negligible as compared to the 20 to 50 percent of the bandwidth used for instruction fetches in conventional architectures.
The major problem facing the architect is to design a memory system that can meet the bandwidth requirements imposed by the arithmetic unit. Two major approaches have emerged in commercial vector machines.
1. Build the necessary bandwidth in main memory by using several independent memory modules to support concurrent access to independent data; or
2. Build an intermediate high-speed memory with the necessary bandwidth and provide a means for high-speed transfers between high-speed memory and main memory.
`
The first approach acknowledges that if one memory module can access at most one datum per access cycle, then to access N independent data in one access cycle requires N independent memory modules. The second approach produces higher bandwidth by shortening the access cycle in a small memory. But the small memory is loaded from a large memory, and the large memory can still be the ultimate bottleneck in the system in spite of the high bandwidth of the small memory.
To make best use of the small high-speed memory, we should make multiple use of operands transferred to this memory. In this way the net demand by the processor on the large memory is reduced, and the bandwidth of the large memory need not be as large as the peak bandwidth required by the processor.
In the latter part of this chapter we see that another use of the high-speed memory is to provide for access patterns not available in main memory. Thus, we can move a data structure such as a matrix from main memory to intermediate memory by using the access patterns supported by main memory. When the matrix is stored in intermediate memory, we can provide for efficient access to rows, columns, diagonals, or subarrays of the matrix, not all of which can be done efficiently when the matrix is stored in main memory.
The second approach has been embellished in some cases by providing more than one level of intermediate memory, with the size, cost, and performance of each level selected to give a good cost-performance ratio of the total memory system.
`
`5.1.1 Multiple Memory Modules
`
The first approach is illustrated in Fig. 5.2. In this figure main memory is composed of multiple modules. Eight modules are shown; they compose a system with eight times the bandwidth of a single module. Each of the three data streams associated with the arithmetic pipeline has an independent path to the memory system so that each stream can be active simultaneously, provided that each individual module serves only one path at a time.
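A minimal sketch (our reading of the layout described for Fig. 5.3 below, in which vectors A, B, and C start in Modules 0, 2, and 4) shows why spreading successive elements across successive modules keeps the three streams out of each other's way:

MODULES = 8

def module_of(start, i):
    """Module holding element i of a vector whose element 0 resides in
    module start, with successive elements in successive modules."""
    return (start + i) % MODULES

# References to a[i], b[i], and c[i] always fall in three distinct modules,
# so no module is asked to serve two of the three ports at once:
for i in range(16):
    mods = {module_of(0, i), module_of(2, i), module_of(4, i)}
    assert len(mods) == 3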
[Fig. 5.2 A vector processor with a memory system composed of eight 3-port memory modules.]

Consider how this system can be used to implement vector arithmetic. (We assume that a basic memory cycle takes two processor cycles, so the bandwidth required to service the pipeline in Fig. 5.2 is at least six times the bandwidth of a single memory module.) Figure 5.3 illustrates an ideal solution to our vector arithmetic example. The vectors A, B, and C are laid out in memory so that they start respectively in Modules 0, 2, and 4, and their successive elements lie in successive memories at addresses that are