…ess of near-neighbor connections, but the utility of such devices is limited to the calculations that fit the specific array geometries. Nevertheless, such devices could be put in high production and swing the cost-performance pendulum to favor near-neighbor communication. On the other hand, breakthroughs in optical transmission and optical switching may swing the balance towards the perfect shuffle. Such advances would make communication faster over the longer interconnections, and reduce the cost of sending data much further than the nearest neighbor.

Consequently, new advances in architectures for the continuum model are driven by the advances yet to come in devices and communications.
4.6 Architectures for the Continuum Model-Which Direction?
The continuum model is a natural model for parallelism. Near-neighbor interactions can be modeled by networks of processors connected together as near neighbors. The advantage of the near-neighbor structure is very strong for those problems that are ideally matched to such a structure.

In a broad spectrum of problems, as the fit becomes less ideal, the performance of near-neighbor connections becomes poorer and poorer, to the extent that gains due to parallel execution are offset by the inefficient use of hardware. Here are the basic choices available to the architect:
1. Build a highly specialized, near-neighbor architecture that is very fast and effective for some class of problems within the continuum model.
2. Build a somewhat more general machine, but maintain high speed for the continuum model. Provide extra capability through richer interconnections, such as the perfect shuffle, and through other mechanisms that provide speed enhancement for problems that fall outside the continuum model.
3. Build a very general parallel machine that has broad applicability, including the continuum model, although its speed for continuum calculations may not be as high as for an architecture specialized for the class of problems.
The potential size of the user community increases by one to two orders of magnitude as you move from the first to the second choice, and again as you move from the second to the third choice. A large user base tends to provide cost reductions to each user, because each user has to support a much smaller share of the hardware and software development costs.
A large demand also provides greater profit motivation, but if a designer chooses to serve the large community and produces a fairly general architecture, the users who absolutely need a machine for the continuum model will be unsatisfied if the general architecture is significantly slower than an architecture specialized for the continuum model. Moreover, this same user group will question the value to themselves of the hardware and software that support the more general classes of problems, since this group of users may be paying for these aspects of the computer system and yet derive no discernible benefit from them.
Which community should the architect serve? There is no obvious answer to this question. The architect should be prepared to build any of the possible machines, from the most specialized to the most general, each optimized for the best possible cost and performance for that architecture.

Market forces and other priorities will dictate which machine actually gets built. Some developers will choose the most general approach, and hope to install many copies of a machine. Some developers will choose to carve a niche for their ideas by producing a relatively small number of copies of a highly specialized machine. Yet other developers may choose a design that falls in between.

Whichever choice is made, the architecture has to be cost-effective for the user community. For the smaller markets, a significant portion of the challenge is to keep hardware and software development costs low, so that these costs, when amortized over copies actually sold, are still within reasonable bounds. Thus, not only must the architect produce a cost-effective design, but the design process itself must be done efficiently.
One important observation from this chapter is that what appears to be an ideal architecture for a class of problems may not be ideal at all. An architect who produces a machine that executes a particular code very efficiently may be somewhat disappointed when research advances in basic algorithms produce a new, efficient solution technique not at all suited to the specific architecture. In such a case the very specialized machine may have difficulty competing with a less specialized machine that happens to be able to run the more efficient algorithm.
Breakthroughs do occur from time to time, such as with the formulation of the fast Fourier transform [Cooley and Tukey 1965]. Prior to their work, the best algorithm required N² multiplications and required a particular type of access to data. The newer, faster algorithm requires only N log N multiplications and uses a very different data flow. (For N = 1024, that is roughly a hundredfold reduction in multiplications.) A machine built for the older algorithm would not serve the newer one well. The more specialized the architecture, the more susceptible it is to competitive methods when breakthroughs do occur. The architect of the specialized machine has to assess the risk of a breakthrough. For the continuum model, the risks are high enough to merit attention.
In recent years, algorithm improvements have changed the basic flow of data in various solution techniques, have altered the grid structure that models the continuum, and have even provided for multiple grid spacing. A machine built specifically for algorithms of 20 years ago would do relatively poorly when executing some of the new algorithms for the same problems.
As an example of the evolution of parallel algorithms, the fast algorithms for the continuum model described earlier in this chapter may make better use of connection patterns like the perfect shuffle than of connection patterns that use near-neighbor mesh connections, but the near-neighbor connections were the backbone for the first large-scale computers for the continuum. Another step in the evolution is represented by the Cosmic Cube described earlier in this chapter, which in a sense combines the near-neighbor interconnections and the perfect shuffle. It uses near-neighbor connections in six dimensions, but at best only three of those dimensions lead to short interconnections in a three-dimensional packaging world. The other three dimensions force interconnections to have relatively long physical lengths.
The six-dimensional connection structure of the Cosmic Cube gives the same adjacency pattern achieved by the perfect shuffle. The difference is that all dimensions are adjacent at all times in a Cosmic Cube, whereas the adjacency changes in time in a perfect-shuffle structure. Because processors that are directly connected within a Cosmic Cube have indices that differ by a single power of 2, this structure is well suited for recursive doubling, cyclic reduction, Fourier transforms, and other applications mentioned in this section.
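A minimal sketch of this adjacency rule, in Python (an illustration under the stated assumptions, not a reproduction of any program from the text): each of a node's six neighbors differs from it in exactly one bit of its index, that is, by a single power of 2.

```python
# Sketch: enumerate hypercube neighbors in a 64-node, six-dimensional
# Cosmic-Cube-style machine. Flipping one of the six index bits yields
# a directly connected processor.

DIM = 6                      # six dimensions -> 2**6 = 64 processors

def neighbors(node):
    """Indices directly connected to `node`; each differs by one power of 2."""
    return [node ^ (1 << d) for d in range(DIM)]

print(neighbors(0))          # [1, 2, 4, 8, 16, 32]
print(neighbors(21))         # each entry differs from 21 in exactly one bit
```

Recursive doubling pairs processors whose indices differ by 2^k at step k, which is exactly the set of links the hypercube provides directly.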
Hoshino [1989], on the other hand, has shown that for the general class of scientific calculations the overwhelming majority of processor-to-processor interactions occur across near-neighbor links on a two-dimensional mesh. The additional connectivity provided by a hypercube and the greater distances spanned by the perfect shuffle rarely come into play, and provide only a marginal decrease in the number of operations while contributing greatly to cost. He provides a strong case for two-dimensional mesh connections based on extensive experience in implementing scientific applications. Even though his applications occasionally force some processors to communicate over long distances, this happens sufficiently infrequently that it degrades performance only slightly. Hence, Hoshino's case rests on the fact that the communication constraints imposed by a two-dimensional mesh do not degrade performance of actual programs. Indeed, his PAX architecture is a compromise between the ILLIAC IV and Cosmic Cube architectures, incorporating some good features of each together with some features unique to PAX.
Nevertheless, experience with parallel applications is still rather limited, though growing every year. New techniques and new algorithms are still appearing in abundance. As these appear, they force us to rethink our conclusions on what combination of algorithms, architecture, and applications produces an efficient way to solve problems.
In summary, there is no obvious best design for parallel processors for the continuum model. The available approaches depend on how specialized the processing system can be. A processor for the continuum model undoubtedly will be somewhat specialized; it will probably have an interconnection system to speed up typical programs for this model. Which approach, if any, becomes dominant is most likely to depend on the directions of device technology in the coming years, with near-neighbor structures dependent on VLSI advances and perfect-shuffle structures dependent on advances in interconnection technology.
Exercises
4.1 The object of this exercise is to explore calculations for the continuum model. Assume that you have a square array of points, 9 × 9, and that the value of the potential function on the boundary is 0 on the top row, and is 10 along all other boundary points.
a) Initialize the potential function to 0 on all interior points. Calculate the Poisson solution for the values of all interior points by replacing each interior point with the average value of each of its neighboring points. Compute the new values for all interior points before updating any interior points. Run this simulation for five iterations and show the answers you obtain at this point. Then run until no interior point changes by more than 0.1 percent, and count the total number of iterations until convergence. This method is usually called the Jacobi method. Note: The values on the boundary are fixed and do not change during the computation.
b) Repeat the process in the previous problem, except update a point as soon as you have computed the new value, and use the new value when you reach a neighboring point. You should scan the interior points row by row from top to bottom and from left to right within rows. This method is usually called the Gauss-Seidel method.
c) The second process seems to converge faster. Give an intuitive explanation of why this might be the case.
d) How do your findings relate to the interconnection structure of a parallel processor designed to solve this problem?
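A minimal sketch of the two sweeps in parts a and b, in plain Python (the grid size, boundary values, and the 0.1 percent stopping rule come from the exercise; the helper names are ours):

```python
# Sketch for Exercise 4.1: Jacobi (part a) vs. Gauss-Seidel (part b)
# relaxation on a 9 x 9 grid. Boundary: 0 across the top row, 10 on the
# other three sides; all interior points start at 0.

def make_grid():
    g = [[10.0] * 9 for _ in range(9)]
    g[0] = [0.0] * 9                      # top boundary row is 0
    for r in range(1, 8):
        for c in range(1, 8):
            g[r][c] = 0.0                 # interior initialized to 0
    return g

def sweep(g, in_place):
    """One relaxation pass; returns the largest relative change seen."""
    new = g if in_place else [row[:] for row in g]
    worst = 0.0
    for r in range(1, 8):                 # top to bottom, left to right
        for c in range(1, 8):
            src = new if in_place else g  # Gauss-Seidel reads fresh values
            v = (src[r-1][c] + src[r+1][c] + src[r][c-1] + src[r][c+1]) / 4
            old = g[r][c]
            worst = max(worst, abs(v - old) / old if old else 1.0)
            new[r][c] = v
    if not in_place:                      # Jacobi: publish all values at once
        for r in range(1, 8):
            g[r] = new[r]
    return worst

for label, in_place in (("Jacobi", False), ("Gauss-Seidel", True)):
    g, iterations = make_grid(), 0
    while True:
        iterations += 1
        if sweep(g, in_place) <= 0.001:   # no point moved more than 0.1%
            break
    print(label, "converged in", iterations, "iterations")
```

The only difference between the two methods is which array the neighbor values are read from, which is the point of part c.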
4.2 The purpose of this exercise is to show the effect of information propagation within a calculation. Use the Poisson problem of Exercise 4.1(b) and write a computer program using the Gauss-Seidel method that iterates until no interior point value changes by more than 0.1 percent. Let this be the initial state of the problem for the following exercises.
a) Increase the boundary point on the top row next to the upper left corner to a new value of 20. Perform five iterations of the Gauss-Seidel Poisson solver and observe the values obtained. Then run the algorithm until no interior point value changes by more than 0.1 percent and count the total number of iterations to reach this point.
b) Now restore the mesh to the initial state for a. Change the program so that, in effect, the upper left corner is rotated to the bottom right corner. To do this, scan the rows from right to left instead of left to right, and scan from bottom to top instead of from top to bottom. Perform five iterations of the Poisson solver and observe the values obtained. Run the program until no interior point changes by more than 0.1 percent, and count the number of iterations to reach this point.
c) Both a and b eventually converge to the same solution because the initial data are the same and the physical process modeled is the same. However, the results obtained from a and b are different after five iterations. Explain why they are different. Which of the two problems has faster convergence? Why?
4.3 The purpose of this exercise is to examine the cyclic-reduction algorithm. Explore the solution of a one-dimensional Poisson problem by treating 15 points on a line. Let the left boundary point, point 0, have the value 10 and the right boundary point, point 16, have the value 0. Each of the 15 intermediate points has a value that is the average of its immediate neighbors.
a) Write a matrix equation of the form Ax = b that describes this problem.
b) Simulate an iterative process that updates each interior point with the average of its neighboring points. Obtain the interior values of points for the first three iterations of the technique previously used, in which each interior point is updated by the average of its neighbors.
c) Now apply the cyclic-reduction algorithm in the text for three iterations to find one equation for the point in the middle. Solve this equation and use three iterations of back substitution to find the remainder of the points. Show your solution and the equations you obtain after each iteration. (Hint: The first iteration should produce new equations for points 2, 4, 6, 8, 10, 12, and 14. The second iteration produces new equations for 4, 8, and 12.)
d) Compare the results produced in b and c with respect to the precision obtained. Count and compare the total number of additions, multiplications, and divisions for each algorithm after three iterations.
e) Explain from an intuitive point of view why cyclic reduction yields high speed and high precision as compared to the near-neighbor iteration. What implications can you draw with regard to interconnections for processors for solving the Poisson problem?
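For part c, a minimal cyclic-reduction sketch in Python (our own formulation of the standard tridiagonal reduction; the index sets match the hint in the exercise):

```python
# Sketch for Exercise 4.3(c): cyclic reduction on the 15-point line.
# Point i (1..15) satisfies -x[i-1] + 2*x[i] - x[i+1] = 0, with x[0] = 10
# and x[16] = 0 folded into the right-hand sides of equations 1 and 15.

N = 15
a = [0.0] + [-1.0] * N          # coefficient of x[i - h] in equation i
b = [0.0] + [2.0] * N           # coefficient of x[i]
c = [0.0] + [-1.0] * N          # coefficient of x[i + h]
d = [0.0] * (N + 1)             # right-hand sides
a[1], d[1] = 0.0, 10.0          # left boundary folded in
c[15], d[15] = 0.0, 0.0         # right boundary folded in

levels, h = [], 1
while 2 * h <= N:               # h = 1, 2, 4: the three reduction steps
    levels.append((a[:], b[:], c[:], d[:], h))
    for i in range(2 * h, N + 1 - h, 2 * h):   # points 2,4,..; 4,8,12; 8
        # Combine equation i with equations i-h and i+h to eliminate
        # x[i-h] and x[i+h]; equation i then couples x[i-2h] and x[i+2h].
        al, be = -a[i] / b[i - h], -c[i] / b[i + h]
        b[i] += al * c[i - h] + be * a[i + h]
        d[i] += al * d[i - h] + be * d[i + h]
        a[i], c[i] = al * a[i - h], be * c[i + h]
    h *= 2

x = [0.0] * (N + 2)
x[0], x[N + 1] = 10.0, 0.0
x[8] = d[8] / b[8]              # the single equation for the middle point
for aL, bL, cL, dL, hL in reversed(levels):
    for i in range(hL, N + 1, 2 * hL):   # back-substitute odd multiples of h
        x[i] = (dL[i] - aL[i] * x[i - hL] - cL[i] * x[i + hL]) / bL[i]

print(x)   # the exact answer is linear: x[i] = 10 * (16 - i) / 16
```

Because the reduction is exact rather than iterative, the answer comes out to full precision in three steps, which bears on parts d and e.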
4.4 The purpose of this exercise is to investigate how to implement conditional branches in an array computer. Program 4.1 does not show instructions that determine if convergence has been reached. The instructions should determine if every processor has obtained a satisfactory solution, and, if not, the program should branch back to the top of the loop.
a) Write the instructions that do this job, inventing the instructions as you need them. Describe the operation of each instruction that you invent.
b) Redraw the block diagram of the ILLIAC IV computer and describe the data flow on the block diagram necessary to support the test for termination.
c) Assume that the control processor of the ILLIAC IV can execute its instructions in parallel with instructions that are broadcast to the 64 numerical processors. Can any or all instructions of the termination test be overlapped with the calculation of a loop iteration? If so, describe how to implement the instructions in your program and in Program 4.1 to facilitate this overlapped execution.
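One common shape for such a test is a global-AND reduction: each processor raises a flag when its own points have converged, and the control unit branches on the conjunction of all flags. A sketch of that control flow in Python (the per-processor flag and the global-AND "instruction" are invented here for illustration; they are not the ILLIAC IV's actual instruction set):

```python
# Sketch: the shape of a termination test for an array computer.
# Each of the 64 processors sets a local flag; the control unit ANDs them.

NPROC = 64

def relax_and_test(pe_id, iteration):
    """Stand-in for one processor's sweep plus its 0.1% convergence check.
    Here we simply pretend processor pe_id converges after pe_id/2 + 5 sweeps."""
    return iteration > pe_id // 2 + 5

iteration = 0
while True:
    iteration += 1
    flags = [relax_and_test(pe, iteration) for pe in range(NPROC)]  # in parallel
    if all(flags):        # the control unit's global AND over the 64 flags
        break             # convergence reached: fall out of the loop
    # otherwise: branch back to the top of the loop
print("terminated after", iteration, "iterations")
```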
4.5 The purpose of this exercise is to explore the interconnection structure of a hypercube computer such as the Cosmic Cube. Assume that you are to calculate all partial sums of i items up to the sum of 64 items.
a) Construct a program for a Cosmic Cube computer system that performs this operation in a time that grows as O(log N) if the number of processors is N. Assume that every node in the computer executes the same program, although the program can be slightly different from node to node since the processors in a Cosmic Cube are independent. Show explicitly the instructions that send and receive data between processors. Invent instructions as you need them and describe what the instructions do. Include some type of instruction for synchronization that forces a processor to be idle until a neighboring processor sends a message or a datum that enables computation to continue.
b) Which communication steps, if any, in your answer require communications with processors that are not among the six processors directly connected to a given processor? How do you propose to implement such communication in software (assuming that the hardware itself does not provide remote communication as a basic instruction)?
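For part a, a sketch of the standard hypercube prefix-sum pattern, simulated sequentially in Python (the send/receive pairing across one dimension per step is the point, not the Python itself):

```python
# Sketch for Exercise 4.5(a): all partial sums of 64 items on a 6-cube.
# At step k, each node exchanges with the neighbor whose index differs in
# bit k. Every node accumulates a running "total" of its subcube and adds
# the partner's total to its "prefix" only when the partner's index is lower.

DIM, N = 6, 64
items = list(range(1, N + 1))        # item i+1 lives on processor i
total = items[:]                     # subcube sums, one per processor
prefix = items[:]                    # partial sums being built, one per node

for k in range(DIM):                 # log2(N) = 6 exchange steps
    received = [total[i ^ (1 << k)] for i in range(N)]  # pairwise exchange
    for i in range(N):
        if (i >> k) & 1:             # partner's index is lower than ours
            prefix[i] += received[i]
        total[i] += received[i]

# prefix[i] now holds items[0] + ... + items[i]
assert prefix == [n * (n + 1) // 2 for n in range(1, N + 1)]
print(prefix[:8])
```

Every exchange at step k is with the neighbor differing in bit k, so all communication stays on directly connected hypercube links, which is the starting observation for part b.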
4.6 The purpose of this exercise is to examine the recursive-doubling solution to a linear tridiagonal system of equations. Consider the solution of the equation Ax = b, where A is a tridiagonal matrix.
a) Prove that the recurrence in Eq. (4.15) is a correct expression for the major diagonal of matrix U in an LU decomposition of A.
b) Using recursive doubling, show all of the steps required to factor A into LU and to solve the equations Ly = b and Ux = y. For each major step of the algorithm, show the basic recurrence solution. Show the mathematical formulation of your solution and indicate the basic operation in the recursive-doubling iteration.
4.7 Find a recursive-doubling technique for solving Eq. (4.13).
4.8 The purpose of this exercise is to explore some of the properties of the perfect-shuffle interconnection scheme.
a) Consider a processor that has the perfect shuffle and pair-wise exchange connections shown in Fig. 4.16. For an eight-processor system, show that the permutation that cyclically shifts the input vector by three positions is realizable by some setting of the exchange modules. Draw the network unrolled in time to show the setting that realizes this permutation.
b) Repeat a to show that a cyclical shift of two positions is realizable.
c) Prove that a shuffle-exchange network can realize any cyclical shift in log2 N iterations for an N-processor system when N is a power of 2.
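A sketch for experimenting with part a, in Python (the encoding is our assumption: the shuffle sends position i to the cyclic left rotation of its 3-bit index, and exchange module j may swap positions 2j and 2j+1; the search below simply finds one realizing setting by brute force):

```python
# Sketch for Exercise 4.8(a): find exchange-module settings that make an
# 8-way shuffle-exchange network realize a cyclic shift by three positions.

from itertools import product

N, BITS = 8, 3

def shuffle(v):
    out = [None] * N
    for i in range(N):
        out[((i << 1) | (i >> (BITS - 1))) & (N - 1)] = v[i]  # rotate index
    return out

def exchange(v, settings):          # one swap bit per exchange module
    out = v[:]
    for j, swap in enumerate(settings):
        if swap:
            out[2 * j], out[2 * j + 1] = out[2 * j + 1], out[2 * j]
    return out

for s in product([0, 1], repeat=3 * (N // 2)):      # 3 unrolled stages
    stages = [s[0:4], s[4:8], s[8:12]]
    v = list(range(N))
    for st in stages:
        v = exchange(shuffle(v), st)
    if all(v[j] == (j - 3) % N for j in range(N)):  # shifted by three
        print("one realizing setting per stage:", stages)
        break
```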
4.9 Find a means for evaluating a polynomial of degree N − 1 in the variable x in parallel on an N-processor computer that uses the shuffle-exchange interconnection pattern. Assume that N is a power of 2.
4.10 Prove that the scheme shown in Fig. 4.18 produces a sorted sequence of length N from a bitonic sequence of length N. Specifically, prove that after the comparison and exchange is performed, each sequence of length N/2 is bitonic and all elements of one sequence do not exceed the value of any element of the other sequence.
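The comparison-and-exchange step being described is Batcher's half-cleaner; a small demonstration in Python (our own example, under the assumption that Fig. 4.18 pairs element i with element i + N/2):

```python
# Sketch: one compare-exchange stage on a bitonic sequence of length N.
# Element i is compared with element i + N/2; the smaller goes to the
# lower half. Afterward each half is bitonic and max(lower) <= min(upper).

def half_clean(seq):
    n = len(seq)
    lo, hi = seq[:n // 2], seq[n // 2:]
    for i in range(n // 2):
        if lo[i] > hi[i]:
            lo[i], hi[i] = hi[i], lo[i]
    return lo, hi

bitonic = [3, 5, 8, 9, 7, 4, 2, 1]      # rises then falls: bitonic
lo, hi = half_clean(bitonic)
print(lo, hi)                            # [3, 4, 2, 1] [7, 5, 8, 9]
assert max(lo) <= min(hi)                # the property to be proved
```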
4.11 Consider a tridiagonal linear system such as that described in Section 4.4.4. Assume that the problem is symmetric about the major diagonal so that ai,j = aj,i. (The indices i and j lie in the range 1 ≤ i, j ≤ N, where N is a power of 2.) The Sturm polynomials for the matrix A are the polynomials Qi(x) defined by the recurrence

Q0(x) = 1
Q1(x) = a1,1 − x
Qi(x) = (ai,i − x)Qi−1(x) − (ai,i−1)²Qi−2(x)

For a very important matrix computation it is necessary to find the number of changes of sign for a given value of x in the sequence of values Q0(x), Q1(x), …, QN−1(x).
a) Assume that you wish to find the number of sign changes for a single value of x, and you have N processors available to do the calculation in parallel. Work out a recursive-doubling algorithm for the calculation.
b) Show a block diagram of a connection pattern suitable for this algorithm within which each processor is connected to a fixed constant number of processors regardless of the size of N, and for which at each step of the algorithm the data are accessible in a constant number of steps from neighboring processors, regardless of the size of N.
c) Now assume that you wish to find the number of sign changes for N different values of x. Compare the time taken by running your recursive-doubling algorithm N times to the time required to obtain values of the Sturm polynomials serially for each of the N values of x. Which of the two methods is preferred?
d) Now assume that you wish to compute the number of sign changes for a number of values of x much larger than the value of N. Which of the two methods is better?
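For reference when answering parts c and d, a serial evaluation of the recurrence and its sign-change count, in Python (a sketch; the 2/−1 example matrix is our own choice, borrowed from the Poisson exercises above):

```python
# Sketch: serial evaluation of the Sturm sequence and its sign-change count.
# diag holds the main diagonal of A; off[i] is the subdiagonal entry that
# couples unknowns i-1 and i (0-based; off[0] is unused).

def sign_changes(diag, off, x):
    q_prev, q = 1.0, diag[0] - x          # Q0(x) = 1, Q1(x) = a11 - x
    changes = 1 if q < 0.0 else 0         # compare Q1 with Q0 = 1
    for i in range(1, len(diag)):
        q_prev, q = q, (diag[i] - x) * q - off[i] ** 2 * q_prev
        if (q < 0.0) != (q_prev < 0.0):   # sign differs from previous value
            changes += 1                  # (a careful version also needs a
    return changes                        # convention for exact zeros)

n = 8
diag, off = [2.0] * n, [0.0] + [-1.0] * (n - 1)
print(sign_changes(diag, off, 0.5))
```

The loop is a first-order linear recurrence in the pair (Qi, Qi−1), which is what makes the recursive-doubling formulation of part a possible.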
4.12 Figure 4.15 shows a shuffle-exchange network with a cyclical shift interconnection pattern superimposed. Show that it is possible to compute the same set of partial sums computed in the figure without the cyclic-shift pattern, using only the perfect-shuffle and the pair-wise-exchange patterns of Fig. 4.16. Your algorithm will need to send more than one datum from one cell to a cell in the next column, but the number of different data transmitted from column to column is a constant that is independent of N.
4.13 a) Show the switch settings for a shuffle-exchange network as depicted in Fig. 4.16 that send input cell i to output cell 3i mod N for N = 16.
b) For each integer in the range 0 ≤ i ≤ 15, write the value of i in binary followed by the value of 3i mod 16 in binary. Start a new row for each integer and align the binary values to create a table of size 16 rows by 8 columns. Examine row i for each i. Show that the last four bits in each row are related to the switch settings from part a. In fact, these bits show the switch settings for input i as it passes through the network. (Hint: Use the shift-register analogy.)
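A quick way to produce the table in part b (a sketch; the eight columns are the two four-bit fields):

```python
# Sketch for Exercise 4.13(b): i and 3i mod 16 in binary, one row per i.
N = 16
for i in range(N):
    print(f"{i:04b} {3 * i % N:04b}")
```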
4.14 a) Prove that the function that takes i into pi mod N for i ≤ N is a permutation when N is a power of 2 and p is odd. (Hint: The function is a permutation if you can show that when pi = pj mod N, this implies that i = j.)
b) Apply your reasoning from b of Exercise 4.13 to show that a shuffle-exchange network has switch settings that realize the permutation that takes i to pi mod N for every odd value of p.
5

The tucked-up sempstress walks with hasty strides,
While streams run down her oil'd umbrella's sides.
-Jonathan Swift, 1711

Vector Computers

5.1 A Generic Vector Processor
5.2 Access Patterns for Numerical Algorithms
5.3 Data-Structuring Techniques for Vector Machines
5.4 Attached Vector-Processors
5.5 Sparse-Matrix Techniques
5.6 The GF-11-A Very High-Speed Vector Processor
5.7 Final Comments on Vector Computers
The last chapter introduces the idea of building a parallel architecture matched to a specific class of problems. The discussion there mentions that there are two major models of numerical processes: a continuum model based on near-neighbor interactions and a particle model based on discrete point-to-point interactions. The major emphasis of Chapter 4 is the continuum model, together with the architectures that support processing of near-neighbor interactions for that model.

This chapter extends the discussion of numerical architectures to vector computers with the idea that these computers can be used for the majority of continuum-model problems, as well as for many particle-model problems. The vector computer has emerged as the most important high-performance architecture for numerical problems. It has the two key qualities of efficiency and wide applicability.

Most vector computers have a pipelined structure. When one pipeline is not sufficient to achieve desired performance, designers have occasionally provided multiple pipelines. Such processors not only support a streaming mode of data flow through a single pipeline, they also support fully parallel operation by allowing multiple pipelines to execute concurrently on independent streams of data.
By the mid-1980s, more than twenty manufacturers offered vector processors based on pipelined arithmetic units. They ranged from relatively inexpensive auxiliary processors attached to microcomputers to high-speed supercomputers with computation rates from 100 Mflops to rates in excess of 1000 Mflops. (One Mflops is 10⁶ floating-point operations per second.)

The price-performance ratio of these vector processors is rather remarkable because they yield one to two orders of magnitude increased throughput for vector computations when compared to serial processors of equal cost. But this throughput increase is limited to the problems that fit the architecture, that is, to problems that can be structured as a sequence of vector operations whose characteristics make efficient use of the facilities available.
Many of the supercomputers are also high-performance serial processors for general-purpose problems, but the throughput of these supercomputers on nonvector problems is only a few times greater than the throughput of more conventional high-speed serial processors. In fact, although throughput might be high because of fast device technology, if a vector-structured supercomputer is used exclusively on nonvector problems, the computational cost may be excessive because this cost includes the cost of the vector facilities, which presumably are left idle by scalar computations.

The purpose of this chapter is to describe the general architecture of vector machines and then describe how algorithms and architecture can be matched to each other to obtain efficient processing over large classes of computations.
5.1 A Generic Vector Processor

The basic idea of a vector processor is to combine two vectors, element by element, to produce an output vector. Thus, if A, B, and C are vectors, each with N elements, a vector processor can perform the operation

C := A + B

which is interpreted to mean

ci := ai + bi,  0 ≤ i ≤ N − 1

where the vector C written in component form is (c0, c1, …, cN−1); the form is similar for vectors A and B.

Fig. 5.1 A processor that is capable of adding two vectors by streaming the two vectors through a pipelined adder.

A very simplified way to implement this operation with a pipelined arithmetic unit is shown in Fig. 5.1. The two streams of data supplied to the arithmetic unit carry the streams for A and B, respectively. The memory system supplies
one element of A and B on every clock cycle, one element to each input stream. The arithmetic unit produces one output value during each clock cycle. (Actually, the input data rate need be only as fast as the output data rate. If the arithmetic unit can produce results at a rate of one output value every d cycles, then the input data rate need be only one input value on each stream every d cycles.)
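To see why only the steady-state rates matter, here is a toy simulation in Python (our own illustration with an assumed stage count) of a pipelined adder streaming two vectors: after an initial fill, one result emerges per cycle.

```python
# Sketch: cycle-by-cycle behavior of an S-stage pipelined vector adder.
# One (a, b) pair enters per cycle; after the pipe fills, one sum leaves
# per cycle, so N additions finish in N + S - 1 cycles instead of N * S.

S = 4                                   # assumed number of pipeline stages
A = [1, 2, 3, 4, 5, 6, 7, 8]
B = [10, 20, 30, 40, 50, 60, 70, 80]
N = len(A)

pipe = [None] * S                       # contents of the S pipeline stages
pending = list(zip(A, B))
C, cycle = [], 0
while len(C) < N:
    cycle += 1
    entering = None
    if pending:
        a, b = pending.pop(0)
        entering = a + b                # modeled as entering stage 0 here;
                                        # in hardware the add spreads over S stages
    pipe = [entering] + pipe[:-1]       # everything advances one stage
    if pipe[-1] is not None:
        C.append(pipe[-1])
print(C)                                # [11, 22, ..., 88]
print("cycles:", cycle)                 # 8 + 4 - 1 = 11
```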
Figure 5.1 shows only the barest details of the vector processor to indicate the general flow of data through the pipelines. The pipelined arithmetic unit is discussed in Section 3.4, and that unit is the core of the architecture in Fig. 5.1. The difficulty, however, is the design of the memory system to sustain a continuous flow of data from memory to the arithmetic unit and the return flow of results from the arithmetic unit to memory. The majority of the architectural tricks used in vector processors are devoted to sustaining that flow of data and to scheduling sequences of operations to reduce the flow requirements.

In this example we assume a basic one-cycle rate for the delivery of operands, production of results, and restoring of the result data into memory. This calls for a memory system that can read two operands and write one operand in a single cycle.
Conventional random-access memories can perform at most one READ or one WRITE per cycle, so the memory system in Fig. 5.1 has at least three times the bandwidth of a conventional memory system. Of course this ignores any additional requirement for bandwidth for input/output operations. Also, we have ignored the bandwidth for instruction fetches, but a major advantage of a vector architecture is that a single instruction fetch can initiate a very long vector operation. Consequently, the bandwidth required to fetch instructions for a vector architecture is negligible as compared to the 20 to 50 percent of the bandwidth used for instruction fetches in conventional architectures.

The major problem facing the architect is to design a memory system that can meet the bandwidth requirements imposed by the arithmetic unit. Two major approaches have emerged in commercial vector machines:

1. Build the necessary bandwidth in main memory by using several independent memory modules to support concurrent access to independent data; or
2. Build an intermediate high-speed memory with the necessary bandwidth and provide a means for high-speed transfers between high-speed memory and main memory.
The first approach acknowledges that if one memory module can access at most one datum per access cycle, then to access N independent data in one access cycle requires N independent memory modules. The second approach produces higher bandwidth by shortening the access cycle in a small memory. But the small memory is loaded from a large memory, and the large memory can still be the ultimate bottleneck in the system in spite of the high bandwidth of the small memory.

To make best use of the small high-speed memory, we should make multiple use of operands transferred to this memory. In this way the net demand by the processor on the large memory is reduced, and the bandwidth of the large memory need not be as large as the peak bandwidth required by the processor.

In the latter part of this chapter we see that another use of the high-speed memory is to provide for access patterns not available in main memory. Thus, we can move a data structure such as a matrix from main memory to intermediate memory by using the access patterns supported by main memory. When the matrix is stored in intermediate memory, we can provide for efficient access to rows, columns, diagonals, or subarrays of the matrix, not all of which can be done efficiently when the matrix is stored in main memory. The second approach has been embellished in some cases by providing more than one level of intermediate memory, with the size, cost, and performance of each level selected to give a good cost-performance ratio of the total memory system.
5.1.1 Multiple Memory Modules

The first approach is illustrated in Fig. 5.2. In this figure main memory is composed of multiple modules. Eight modules are shown; they compose a system with eight times the bandwidth of a single module. Each of the three data streams associated with the arithmetic pipeline has an independent path to the memory system, so that each stream can be active simultaneously, provided that each individual module serves only one path at a time.

Fig. 5.2 A vector processor with a memory system composed of eight 3-port memory modules.

Consider how this system can be used to implement vector arithmetic. We assume that a basic memory cycle takes two processor cycles, so the bandwidth required to service the pipeline in Fig. 5.2 is at least six times the bandwidth of
a single memory module. Figure 5.3 illustrates an ideal solution to our vector arithmetic example. The vectors A, B, and C are laid out in memory so that they start respectively in Modules 0, 2, and 4, and their successive elements lie in successive memories at addresses that are…
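The layout being described is the usual interleaved one; a sketch of the idea in Python (our own illustration, assuming element k of a vector whose starting module is m lives in module (m + k) mod 8):

```python
# Sketch: interleaved layout of vectors A, B, C across eight memory modules.
# With starting modules 0, 2, and 4, the three streams hit three distinct
# modules on every cycle, so no module must serve two streams at once.

MODULES = 8
start = {"A": 0, "B": 2, "C": 4}          # starting modules from Fig. 5.3

def module(vec, k):
    return (start[vec] + k) % MODULES     # successive elements, successive modules

for k in range(4):                        # first few elements of each vector
    print(k, {v: module(v, k) for v in start})
# element k of A, B, and C always lands in three modules spaced two apart
```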
