United States Patent [19]
Khan

[11] Patent Number: 5,274,832
[45] Date of Patent: Dec. 28, 1993

US005274832A
[54] SYSTOLIC ARRAY FOR MULTIDIMENSIONAL MATRIX COMPUTATIONS
[75] Inventor: Emdadur R. Khan, San Jose, Calif.

[73] Assignee: National Semiconductor Corporation, Santa Clara, Calif.

[21] Appl. No.: 592,954

[22] Filed: Oct. 4, 1990
[51] Int. Cl.5 ........................... G06F 15/347
[52] U.S. Cl. ............. 395/800; 364/271.2; 364/258;
         364/258.2; 364/276.8; 364/DIG. 1; 364/754
[58] Field of Search ......... 364/754, 845, 728.05, 841;
         395/800
[56] References Cited

U.S. PATENT DOCUMENTS

4,493,048   1/1985  Kung ................. 364/754
4,603,398   7/1986  Bocker ............... 364/845
4,701,876  10/1987  McCanny .............. 364/728
5,107,452   4/1992  Karmarkar ............ 364/754
5,146,543   9/1992  Vassiliadis .......... 395/27
OTHER PUBLICATIONS

S. Y. Kung & J. N. Hwang, "Parallel Architectures for Artificial Neural Nets," IJCNN, 1989, pp. II-165-II-172.
N. Ling & M. A. Bayoumi, "Algorithms for High Speed Multi-Dimensional Arithmetic and DSP Systolic Arrays," Proceedings of the 1988 International Conference on Parallel Processing, pp. 367-374.
S. Y. Kung & J. N. Hwang, "A Unifying Algorithm/Architecture for Artificial Neural Networks," IEEE Magazine, Feb. 1989, pp. 2505-2508.
Primary Examiner—Eric Coleman
Attorney, Agent, or Firm—Limbach & Limbach
[57] ABSTRACT
A multidimensional systolic array processor uses a multidimensional array of systolically coupled processing elements to perform matrix-vector multiplication of matrix and vector signal sets. A two-dimensional array uses a P×Q matrix (P rows and Q columns) of processing elements which are coupled to systolically process the signals, e.g. via multiplication and accumulation. The processing elements are coupled both row-to-row and column-to-column for pipeline processing within each row and each column, i.e. multidimensional pipelining, thereby increasing processing parallelism and speed. Interconnectivity of the processing elements is minimized by forming separate column and row signal subsets of the vector signal set which are coupled simultaneously to each processing element in the first row and first column, respectively. Size of the processing elements is minimized by reducing local storage of matrix signal subsets within each processing element. Separate column and row signal subsets of the matrix signal set are formed and coupled into each processing element of the first row and first column, respectively. As the matrix column and row signal subsets are systolically processed and transferred row-to-row and column-to-column, respectively, each signal subset is reduced in size by one signal, thereby requiring the transfer and temporary local storage of successively smaller matrix signal subsets. A three-dimensional processor uses a P×Q×T array (T planes of P rows and Q columns) of processing elements which are coupled plane-to-plane-to-plane.

45 Claims, 18 Drawing Sheets
[Drawing Sheets 1 through 18 (U.S. Patent, Dec. 28, 1993, 5,274,832): the scanned figures do not survive text extraction. Recoverable labels include FIG. 1 (Prior Art), FIG. 2 (Prior Art), FIG. 4 (Prior Art), FIG. 8B, Key to FIG. 8A, and FIG. 17.]
SYSTOLIC ARRAY FOR MULTIDIMENSIONAL MATRIX COMPUTATIONS
BACKGROUND OF THE INVENTION

1. Field of the Invention

The present invention relates to array processors, and in particular, to systolic array processors which process multiple signals in parallel in multiple dimensions.

2. Description of the Related Art

Systolic processors, i.e. processors which systolically "pump," or transfer, data from one processing element to another, are well known in the art. Systolic arrays have been used to increase the pipelined computing capability, and therefore the computing speed, of various types of signal processors.
Systolic array processors are particularly useful for processing, e.g. multiplying, two signal sets, where the first signal set represents a matrix parameter set and the second signal set represents a vector parameter set. In other words, the first signal set represents a matrix parameter set which can be represented as an M-by-K ("M×K") matrix having M rows and K columns of parameters, and the second signal set represents a K×1 vector having K rows and 1 column of parameters.
Referring to FIG. 1, a representation of the matrix-vector multiplication of two such signal sets can be seen. The matrix signal set W has matrix signals W_I,J and the vector signal set V has vector signals V_J, where I is an element of the set {1,2,3, . . . ,M} and J is an element of the set {1,2,3, . . . ,K}. This can be expressed mathematically by the following formula:

O_I = \sum_{J=1}^{K} W_{I,J} V_J
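By way of illustration only, the following Python sketch evaluates this product directly; the function name, variable names and sample values are chosen here for readability and are not taken from the patent.

```python
# Illustrative only: a direct, non-systolic evaluation of the formula
# O_I = sum over J of W[I][J] * V[J], using plain Python lists.

def matvec(W, V):
    """Multiply an M x K matrix W by a K x 1 vector V, returning the M outputs O."""
    M, K = len(W), len(V)
    O = [0.0] * M
    for I in range(M):          # one output signal O_I per matrix row
        for J in range(K):      # accumulate W_I,J * V_J over the K columns
            O[I] += W[I][J] * V[J]
    return O

# Example: a 2 x 3 matrix times a 3-element vector.
W = [[1.0, 2.0, 3.0],
     [4.0, 5.0, 6.0]]
V = [1.0, 0.5, 2.0]
print(matvec(W, V))             # [8.0, 18.5]
```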
Such signal sets are also found in many artificial neural network models, including the Hopfield neural network model. Referring to FIG. 2, a simple artificial neural network with its associated signal sets can be seen. The first layer of neurons n_1,J, or nodes, receives some form of input signals I_J, and based thereon, generates a number of voltage signals V_J, which can be represented as a voltage vector V.

Coupling the respective voltage signals V_J to the second layer of neurons n_2,I are a number of scaling elements (e.g. "adaptive weights"), which introduce scaling, or "weight," signals W_I,J for scaling or "weighting" the voltage signals V_J prior to their being received by the second layer neurons n_2,I. It will be understood that, with respect to the subscripted notation for representing the scaling or weighting signals W_I,J, the first subscripted character "I" represents the destination neuron n_2,I in the second layer, and the second subscripted character "J" represents the source neuron n_1,J of the voltage signal V_J in the first layer.
The simplest form of systolic processing array used for performing the matrix-vector multiplication of signal sets, as discussed above, is one-dimensional. One type of one-dimensional systolic array is a "ring" systolic array, shown in FIG. 3.

The systolically coupled processing elements N_J are interconnected as shown, with signal flow represented by the arrows. First, the corresponding voltage signals V_J are initially coupled into their corresponding processing elements N_J. Then, following the application of each clock pulse (not shown, but common to each processing element N_J), the matrix signals W_I,J are sequentially inputted to their corresponding processing elements N_J, as shown. Therein, each matrix signal W_I,J is multiplied by its corresponding voltage signal V_J and accumulated, i.e. stored, within the processing element N_J.
Following the next clock signal, the foregoing is repeated, with the voltage signals V_J being transferred to subsequent processing elements N_J to be multiplied by the corresponding matrix signal W_I,J therein. For example, the voltage signals V_J which are transferred between the processing elements N_J are shown in parentheses. This is repeated K-1 times, i.e. for a total of K times, to produce the final matrix-vector product outputs O_I. The "ring" configuration facilitates multiple iterations of the matrix-vector products, a desirable feature used in the learning phase of an artificial neural network. Further discussions of the ring systolic array can be found in "Parallel Architectures for Artificial Neural Nets," by S. Y. Kung and J. N. Hwang, IJCNN 1989, pp. II-165 through II-172.
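A hedged, software-only sketch of how such a ring might be simulated is given below. It assumes a square K×K matrix so that processing element N_J accumulates output O_J, and the weight-feeding schedule shown is one workable choice rather than the exact ordering of FIG. 3.

```python
# Clocked simulation of a one-dimensional "ring" systolic array (sketch only).
# Assumption: M == K, so element N_j produces output O_j after K clock cycles.

def ring_systolic_matvec(W, V):
    K = len(V)
    assert len(W) == K and all(len(row) == K for row in W), "sketch assumes K x K"
    held = list(V)          # V_J initially loaded into its processing element N_J
    acc = [0] * K           # running accumulation inside each processing element
    for t in range(K):      # K clock cycles in total
        for j in range(K):
            w = W[j][(j + t) % K]    # matrix signal fed to N_j on this clock
            acc[j] += w * held[j]    # multiply and accumulate locally
        # systolic transfer: every voltage moves one position around the ring
        held = held[1:] + held[:1]
    return acc              # acc[j] now equals O_j

W = [[1, 2, 3], [4, 5, 6], [7, 8, 9]]
V = [1, 1, 1]
print(ring_systolic_matvec(W, V))    # [6, 15, 24], matching a direct product
```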
A second type of one-dimensional systolic array relies on a configuration in accordance with the "STAMS" (Systematic Transformation of Algorithms for Multidimensional Systolic arrays) technique. The STAMS technique is discussed in detail in "Algorithms for High Speed Multidimensional Arithmetic and DSP Systolic Arrays," by N. Ling and M. A. Bayoumi, Proceedings of the 1988 International Conference on Parallel Processing, Vol. I, pp. 367-374. An example of a one-dimensional STAMS systolic array is shown in FIG. 4.
First, just as in the ring systolic array of FIG. 3, the voltage signals V_J are initially inputted into their respective processing elements N_J. Then, the matrix signals W_I,J are inputted into the processing elements N_J, with each respective processing element N_J receiving one column of the matrix of matrix signals W_I,J, as shown. The weight-voltage products are summed with the corresponding weight-voltage products from the preceding processing element N_J-1 and then systolically transferred to the next processing element N_J+1, and the process continues.

The inputting of the matrix signals W_I,J into each successive processing element N_J is delayed by one additional clock pulse per processing element stage to allow for the delays associated with the systolic transferring of the accumulated products. This delay can be accomplished by inputting zeros to a processing element N_J until the systolically transferred accumulated products begin to arrive. However, this delay adversely affects the processing speed. As compared to the ring systolic array of FIG. 3 which requires K clock cycles, the STAMS systolic array requires 2K-1 clock cycles to obtain the product outputs O_I of this matrix-vector multiplication.
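The delayed, zero-padded feeding described above can be sketched in software as follows. This is an illustrative simulation only (the 2K-1 count applies when M = K), not the patent's hardware.

```python
def stams_1d_matvec(W, V):
    # Sketch of a one-dimensional STAMS-style pipeline (in the spirit of FIG. 4).
    # Processing element N_j (0-indexed) permanently holds V[j] and is fed the
    # j-th column of W, delayed by j clock cycles (zeros are fed until real data
    # arrives), so that partial sums and weights line up.
    M, K = len(W), len(V)
    pipeline = [0] * K              # partial sum register between stages
    outputs = []
    total_clocks = M + K - 1        # 2K - 1 clocks when M == K
    for t in range(total_clocks):
        new_pipeline = [0] * K
        for j in range(K):
            i = t - j               # which output row's weight reaches stage j now
            w = W[i][j] if 0 <= i < M else 0       # zero padding models the delay
            incoming = pipeline[j - 1] if j > 0 else 0
            new_pipeline[j] = incoming + w * V[j]  # multiply-accumulate, pass right
        if t - (K - 1) >= 0:        # a finished product leaves the last stage
            outputs.append(new_pipeline[K - 1])
        pipeline = new_pipeline
    return outputs

W = [[1, 2, 3], [4, 5, 6], [7, 8, 9]]
print(stams_1d_matvec(W, [1, 1, 1]))   # [6, 15, 24] after 2K - 1 = 5 clocks
```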
A number of problems are associated with using these one-dimensional systolic arrays. One problem involves the inputting of the voltage signals V_J. If the voltages V_J are to be loaded simultaneously in parallel, global interconnects are required to accomplish this. If they are to be loaded sequentially in serial, numerous local interconnects are required, as well as K clock cycles.

Another problem involves the inputting of the matrix signals W_I,J. If the matrix signals W_I,J are stored locally within each processing element N_J, the processing elements N_J must be large enough to provide sufficient storage, i.e. memory, therefor. On the other hand, if the matrix signals W_I,J are not stored locally within each processing element N_J, but instead inputted as needed, the global interconnections necessary to do this become complex and impractical. Either many parallel input lines, e.g. a wide signal bus structure, or a large number of clock cycles must be provided.

A third problem involves the amount of time required to perform the matrix-vector multiplication, i.e. 2K-1 clock cycles for the STAMS systolic array. Although the ring systolic array requires only K clock cycles, the problem remains, as discussed immediately above, of providing either sufficient local matrix signal storage or complex global interconnections.
One approach to addressing these problems of interconnects, storage area and processing time involves the use of multidimensional systolic processing arrays. For example, parallelism, i.e. parallel processing, can be introduced by subdividing the matrix signals W_I,J and vector signals V_J. This can be diagrammatically visualized as seen in FIGS. 5A-5B. This can be expressed mathematically by the following formula:

O_I = \sum_{E=0}^{P-1} \sum_{F=0}^{Q-1} W_{I,QE+F+1} V_{QE+F+1}

Each row I of the matrix W is divided into P groups of Q signals W_I,J. In other words, the first of the P groups of Q signals W_I,J contains the matrix signals W_1,1-W_1,Q. Similarly, the vector V is divided into P groups of Q voltages V_J. For example, the first of the P groups of Q voltages V_J includes the voltages V_1-V_Q. This can be visualized in even simpler form as shown in FIG. 5B.
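A brief sketch, for illustration only, confirming that this regrouping of the K terms (with K = P·Q) reproduces the direct product; zero-based Python indexing absorbs the "+1" of the one-based subscripts.

```python
# Checks that splitting the K-term sum into P groups of Q terms (K = P * Q)
# reproduces the direct matrix-vector product.

def grouped_matvec(W, V, P, Q):
    assert len(V) == P * Q, "the decomposition assumes K = P * Q"
    M = len(W)
    O = []
    for I in range(M):
        subproducts = []
        for E in range(P):                       # one subproduct per group E
            s = 0.0
            for F in range(Q):                   # Q terms inside the group
                s += W[I][Q * E + F] * V[Q * E + F]
            subproducts.append(s)
        O.append(sum(subproducts))               # final output sums the P subproducts
    return O

K, P, Q = 6, 2, 3
W = [[(I + 1) * (J + 1) for J in range(K)] for I in range(2)]
V = [1.0] * K
direct = [sum(W[I][J] * V[J] for J in range(K)) for I in range(2)]
print(grouped_matvec(W, V, P, Q) == direct)      # True
```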
The processing of these P groups of Q signals W_I,J, V_J can be accomplished by using several one-dimensional STAMS systolic arrays, such as that shown in FIG. 4, in parallel, as shown in FIG. 6A. The operation of each separate systolic array is in accordance with that described for the one-dimensional STAMS systolic array of FIG. 4 above, with the exception that only Q, rather than K, processing (i.e. clock) cycles are required for each systolic array to complete one subproduct. The subproducts of each array are then summed together to provide the final product outputs O_I. Visualizing this systolic array configuration as two-dimensional is perhaps more easily done by referring to FIG. 6B.
This two-dimensional systolic array configuration is an improvement over the one-dimensional STAMS configuration with respect to processing time. Processing time is reduced since each one-dimensional array, i.e. each pipeline of processors, within the two-dimensional array is shorter and more processing is done in parallel. This configuration requires only K+Q-1 clock cycles to obtain the product outputs O_I of the matrix-vector multiplication.
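The parallel arrangement can be sketched as follows; the code models only the grouping of terms and the quoted cycle count, not the registers and delays of the actual array.

```python
# Sketch of the quasi two-dimensional arrangement of FIG. 6A: P copies of the
# one-dimensional pipeline run side by side, each over its own group of Q
# matrix columns and Q voltages, and their subproducts are added.

def stams_2d_matvec(W, V, P, Q):
    M, K = len(W), len(V)
    assert K == P * Q
    outputs = []
    for I in range(M):
        # each of the P one-dimensional pipelines produces one subproduct for row I
        subproducts = [
            sum(W[I][Q * E + F] * V[Q * E + F] for F in range(Q))
            for E in range(P)
        ]
        outputs.append(sum(subproducts))   # the P subproducts are added together
    clocks = K + Q - 1                     # cycle count quoted in the text (M = K case)
    return outputs, clocks

W = [[1, 2, 3, 4, 5, 6], [6, 5, 4, 3, 2, 1]]
O, clocks = stams_2d_matvec(W, [1] * 6, P=2, Q=3)
print(O, clocks)                           # [21, 21] 8
```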
Further improvement has been achieved by extending the two-dimensional STAMS systolic array of FIG. 6A to a three-dimensional systolic array. This can be done by further subdividing the matrix W and vector V signals into T groups of P groups of Q signals W_I,J, V_J. This can be visualized diagrammatically by referring to FIGS. 7A-7B. This can be expressed mathematically by the following formula:

O_I = \sum_{G=0}^{T-1} \sum_{E=0}^{P-1} \sum_{F=0}^{Q-1} W_{I,PQG+QE+F+1} V_{PQG+QE+F+1}

As seen in FIG. 7A, each row I of the matrix W and the vector V is divided into T groups, which in turn are divided into P groups of Q signals W_I,J, V_J. For example, the first of the P groups within the first of the T groups contains the matrix signals W_1,1-W_1,Q and the vector signals V_1-V_Q. FIG. 7B represents a more simplified depiction of this multiple subdivision of the matrix W and vector V signals.
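Again for illustration only, a sketch confirming the triple regrouping (K = T·P·Q), with zero-based indexing absorbing the "+1".

```python
# Checks that subdividing the K terms into T groups of P groups of Q terms
# reproduces the direct product, mirroring the triple-sum formula above.

def triple_grouped_matvec(W, V, T, P, Q):
    assert len(V) == T * P * Q
    O = []
    for I in range(len(W)):
        total = 0.0
        for G in range(T):
            for E in range(P):
                for F in range(Q):
                    J = P * Q * G + Q * E + F   # flat index of this term
                    total += W[I][J] * V[J]
        O.append(total)
    return O

T, P, Q = 2, 2, 3
K = T * P * Q
W = [[J + 1 for J in range(K)]]                 # single row, values 1..12
V = [1.0] * K
print(triple_grouped_matvec(W, V, T, P, Q))     # [78.0], the direct sum of 1..12
```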
Referring to FIG. 8A, a realization of such a three-dimensional systolic array is illustrated. Two-dimensional systolic arrays, similar to that illustrated in FIG. 6A, are disposed as if on T parallel planes. The operation of each of the T two-dimensional systolic arrays is similar to that described above for FIG. 6A. The subproduct outputs of each of the T two-dimensional arrays are summed together to produce the full product outputs O_I. The three-dimensional nature of this array can perhaps be better visualized by referring to FIG. 8B.

This three-dimensional STAMS systolic array configuration is an improvement over the two-dimensional configuration inasmuch as fewer processing, i.e. clock, cycles are required to complete each product output O_I. Processing time is reduced since each one-dimensional array, i.e. each pipeline of processors, within the three-dimensional array is shorter and more processing is done in parallel. This three-dimensional configuration requires only T+K-1 clock cycles.

Even though the two-dimensional and three-dimensional STAMS systolic array configurations discussed above provide improvements with respect to processing speed, minimal if any improvement is provided with respect to the number and complexity of the local or global interconnections required for inputting the matrix W and vector V signals. Furthermore, even though the one-dimensional ring systolic array already provides reasonable processing speed, its requisite global interconnections are complex and impractical. Moreover, no improvements are provided by any of the foregoing arrays with respect to matrix signal W_I,J storage requirements.

Moreover, the two- and three-dimensional STAMS systolic array configurations described above are not truly two- or three-dimensional, respectively. The two-dimensional array, as well as each two-dimensional array plane within the three-dimensional array, have their processing elements N_A,B interconnected along one dimension only, e.g. left to right. Therefore, the systolic processing actually occurs in one dimension only. Thus, full multidimensional parallelism or pipelining is not achieved and maximum processing speed, i.e. minimum processing time, cannot be achieved.

It would be desirable to have a true multidimensional systolic array configuration providing true multidimensional pipeline operation to maximize processing speed. It would be further desirable to have such a multidimensional systolic processing array in which minimal global or local interconnects are required for inputting the matrix and vector signals. It would be still further desirable to have such a multidimensional systolic processing array with minimal matrix signal storage requirements for each processing element.
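For comparison, the clock-cycle counts quoted above for a square (M = K) problem can be tabulated with a small helper; the numbers below merely restate the text.

```python
# Restates the clock-cycle counts quoted in the background discussion for a
# square problem.  Nothing here models the hardware itself.

def quoted_cycle_counts(K, Q, T):
    return {
        "ring (FIG. 3)": K,
        "1-D STAMS (FIG. 4)": 2 * K - 1,
        "quasi 2-D (FIG. 6A)": K + Q - 1,
        "quasi 3-D (FIG. 8A)": T + K - 1,
    }

# Example: K = 64 reorganized with Q = 8 columns per group and T = 4 planes.
for name, cycles in quoted_cycle_counts(K=64, Q=8, T=4).items():
    print(f"{name}: {cycles} clock cycles")
```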
SUMMARY OF THE INVENTION

A multidimensional systolic array processor in accordance with the present invention has an architecture which maximizes processing parallelism and minimizes global interconnections. Further, the present invention minimizes local matrix signal storage requirements within each processing element.
The present invention maximizes processing parallelism by interconnecting its processing elements along multiple dimensions. Therefore, systolic processing occurs along multiple dimensions. For example, a two-dimensional systolic array processor in accordance with the present invention includes a P×Q matrix having P rows and Q columns of processing elements, each of which is systolically coupled row-to-row and column-to-column for full pipeline processing within each row and each column. A three-dimensional systolic array processor has a P×Q×T array with T planes of P rows and Q columns of processing elements, each of which is systolically coupled plane-to-plane-to-plane for full pipeline processing.
The present invention minimizes global interconnections of the processing elements. For the two-dimensional case, appropriate matrix and vector signal subsets are inputted to only one row and one column of the two-dimensional processing array. These matrix and vector signal subsets are specifically formed so that they need to be inputted to only one row and one column, and yet still be properly processed systolically along all dimensions within the array.

For the three-dimensional case, appropriate matrix and vector signal subsets are inputted to three perpendicular planes of the three-dimensional processing array. For higher-dimensional cases, appropriate matrix and vector signal subsets are similarly inputted to the higher-dimensional processing arrays.
The present invention minimizes local matrix signal storage requirements by inputting specifically formed matrix signal subsets to only one row and one column of the two-dimensional processing array, and to three perpendicular planes of the three-dimensional processing array. These matrix signal subsets are formed to allow the sizes of the matrix signal subsets to be reduced as they are systolically transferred to subsequent processing elements along each dimension within the array. As the matrix signal subsets decrease in size through the array, the local storage, e.g. memory, needed for temporarily storing each matrix signal subset within each processing element is successively reduced. Processing speed is not sacrificed since the matrix signal subsets are transferred to the subsequent processing element at a clock rate higher than that used for systolically transferring the vector signal subsets.
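The storage saving can be illustrated with a little arithmetic; the initial subset size used below is a hypothetical parameter chosen for the example, not a value taken from the patent.

```python
# Illustrative arithmetic only.  Each matrix signal subset shrinks by one signal
# every time it is passed to the next processing element, so the temporary
# storage needed falls off along the row (or column).

def local_storage_profile(S, stages):
    """Words of temporary weight storage at each of `stages` processing elements
    when the subset entering stage 0 holds S signals and loses one per hop.
    S is a hypothetical initial subset size."""
    return [max(S - k, 0) for k in range(stages)]

profile = local_storage_profile(S=8, stages=8)
print(profile)                 # [8, 7, 6, 5, 4, 3, 2, 1]
print(sum(profile))            # 36 words in total, versus 8 * 8 = 64 if every
                               # element had to buffer the full subset
```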
These and other objectives, features and advantages of the present invention will be understood upon consideration of the following detailed description of the invention and the accompanying drawings.
BRIEF DESCRIPTION OF THE DRAWINGS

FIG. 1 illustrates diagrammatically a conventional matrix-vector multiplication.
FIG. 2 illustrates a simple conventional two-layer artificial neural network.
FIG. 3 illustrates a conventional one-dimensional "ring" systolic array.
FIG. 4 illustrates an alternative conventional one-dimensional systolic array.
FIGS. 5A-5B illustrate diagrammatically a conventional matrix-vector multiplication, wherein the matrix and vector are subdivided into matrix and vector subsets, respectively.
FIGS. 6A-6B illustrate a conventional quasi two-dimensional systolic array.
FIGS. 7A-7B illustrate diagrammatically a conventional matrix-vector multiplication, wherein the matrix and vector of FIGS. 5A-5B are further subdivided into matrix and vector subsets, respectively.
FIGS. 8A-8B illustrate a conventional quasi three-dimensional systolic array.
FIG. 9 illustrates the Layer 1 and 2 neurons of FIG. 2 reconfigured as two-dimensional neural arrays in accordance with the present invention.
FIG. 10 illustrates diagrammatically the reconfiguration of the one-dimensional vector signal set of FIG. 1 into a two-dimensional vector signal set in accordance with the present invention.
FIG. 11 illustrates diagrammatically the matrix-vector multiplication of the matrix and vector column signal subsets in accordance with the present invention.
FIG. 12 illustrates diagrammatically the matrix-vector multiplication of the matrix and vector row signal subsets in accordance with the present invention.
FIG. 13 illustrates a block diagram of a two-dimensional systolic array processor in accordance with the present invention.
FIG. 14 illustrates a single processing element of the two-dimensional processor of FIG. 13.
FIG. 15 illustrates a functional block diagram of a single processing element of the two-dimensional processor of FIG. 13.
FIG. 16 illustrates the reduced local matrix signal storage requirements of the two-dimensional processor of FIG. 13.
FIG. 17 further illustrates the reduced local matrix signal storage requirements of the two-dimensional processor of FIG. 13.
FIG. 18 illustrates a block diagram of a three-dimensional systolic array processor in accordance with the present invention.
DETAILED DESCRIPTION OF THE INVENTION

Referring to FIG. 9, the Layer 1 and 2 neurons n_1, n_2 of an artificial neural network (as shown in FIG. 2) are reconfigured into two-dimensional neural arrays. The original input signals I_J now have double subscripted notation to reflect the two-dimensional set of input signals I_Y,Z. Similarly, the original voltage signals V_J now have double subscripted notation to reflect the two-dimensionality of the set of voltage signals V_Y,Z. Indicated in brackets for some of the Layer 1 neurons in FIG. 9 are the original Layer 1 neuron and voltage identifiers (the remaining identifiers being left out for clarity).

The Layer 2 neurons also now have double subscripted notation to reflect the new two-dimensionality of the set of Layer 2 neurons n_A,B. Indicated in brackets for some of the Layer 2 neurons are the original Layer 2 neuron identifiers (the remaining identifiers being left out for clarity).

The original matrix, e.g. "weight," signals W_I,J now have quadruple subscripted notation to reflect their new multidimensionality, i.e. W_A,B;Y,Z. The first subscripted pair of characters "A,B" represents the destination neuron n_A,B in the second layer. The second subscripted pair of characters "Y,Z" represents the source of the voltage signal V_Y,Z in the first layer. Indicated in brackets for some of the matrix signals W_A,B;Y,Z are the original matrix signal identifiers W_I,J (the remaining identifiers being left out for clarity).
It should be understood that the representations of the Layer 1 and 2 neurons, along with their subscripted notations, were selected arbitrarily. They can be reconfigured as desired, provided that the resulting subscripted notation for the matrix signals be consistent therewith.

It should be further understood that it has been assumed for the sake of simplicity that M=K for the reconfigured array of Layer 2 neurons as represented in FIG. 9. However, this is not necessary to the present invention. Ideally, the numbers of rows and columns should be equal, i.e. P=Q. This is to maximize the improvements in processing speed and to minimize the local matrix signal storage requirements in accordance with the present invention. However, if K cannot be expressed as a square, i.e. if P≠Q, then the numbers of neurons in both Layers 1 and 2 can be approximated to the nearest square. Extra processing cycles would be required because of such an approximation, but significant improvements in processing speed and local matrix signal storage requirements would still be realized.
Referring to FIG. 10, the reconfiguration of the one-dimensional vector signal set V_J of FIG. 1 into a two-dimensional vector signal set V_Y,Z can be understood. The one-dimensional vector signal set V_J is initially mapped into a two-dimensional vector signal set with the original subscripted notation left intact. This two-dimensional vector signal set is then given new double subscripted notation to reflect the two-dimensionality of the reconfigured vector signal set V_Y,Z.
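A hedged sketch of such a re-indexing is shown below. It uses a simple row-major assignment of J to (Y,Z) for concreteness, whereas the patent's assignment is defined by the subset formulas that follow and need not be row-major.

```python
# Sketch: re-indexing the one-dimensional vector signal set V_J onto a P x Q
# grid V_{Y,Z}, in the spirit of FIG. 10.  Row-major mapping is an assumption
# made here for illustration only.

def reshape_vector(V, P, Q):
    assert len(V) == P * Q
    return [[V[Q * (Y - 1) + (Z - 1)] for Z in range(1, Q + 1)]
            for Y in range(1, P + 1)]          # 1-based Y, Z as in the patent

V = [f"V{J}" for J in range(1, 13)]            # V1 .. V12, so K = 12
grid = reshape_vector(V, P=3, Q=4)
for row in grid:
    print(row)
# ['V1', 'V2', 'V3', 'V4']
# ['V5', 'V6', 'V7', 'V8']
# ['V9', 'V10', 'V11', 'V12']
```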
Also indicated in FIG. 10 (with dashed lines within the two-dimensional vector signal sets) are two vector signal subsets. One is referred to as the vector column signal subset V_C and the other is referred to as the vector row signal subset V_R. These vector signal subsets V_C, V_R are multiplied by the matrix signals W_A,B;Y,Z as represented in FIG. 9. The matrix signals W_A,B;Y,Z are separated into corresponding matrix column W_C and row W_R signal subsets. The vector column V_C and row V_R signal subsets are multiplied by the matrix column W_C and row W_R signal subsets, respectively, as shown in FIGS. 11 and 12. Therefore, the matrix-vector products O_I are identified according to the following formulas:

O_I = O_C + O_R

O_C = \sum_{Z=1}^{Q} \sum_{Y=X}^{P} W_{A,B;Y,Z} V_J

where

J = Y - Z - P + \sum_{Z=1}^{Z} (P - Z + 2)

X = Z, for odd Z
  = Z + 1, for even Z;

and

O_R = \sum_{Y=1}^{P} \sum_{Z=X}^{Q} W_{A,B;Y,Z} V_J

where

J = Z - Y - Q + \sum_{Y=1}^{Y} (Q - Y + 2)

X = Y, for even Y
  = Y + 1, for odd Y.
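The additive split O_I = O_C + O_R can be illustrated generically: any partition of the K index positions into two disjoint subsets yields two partial products that sum to the full product. The partition used in the sketch below is arbitrary and is not the patent's V_C/V_R partition, which is defined by the index formulas above and below.

```python
# Demonstrates that a partial "column" product and a partial "row" product over
# complementary index subsets always add up to the full product for one row.

def split_product(W_row, V, column_indices):
    col_set = set(column_indices)
    O_C = sum(W_row[J] * V[J] for J in range(len(V)) if J in col_set)
    O_R = sum(W_row[J] * V[J] for J in range(len(V)) if J not in col_set)
    return O_C, O_R

W_row = [1, 2, 3, 4, 5, 6]
V = [1.0, 0.5, 1.0, 0.5, 1.0, 0.5]
O_C, O_R = split_product(W_row, V, column_indices=range(0, 6, 2))
direct = sum(w * v for w, v in zip(W_row, V))
print(O_C + O_R == direct)      # True for any partition of the index set
```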
Based upon the foregoing, the vector V_C, V_R and matrix W_C, W_R signal subsets can be identified according to the following formulas:
V_C = V_{Y,Z} = V_J
W_C = W_{A,B;Y,Z}

where:

J = Y - Z - P + \sum_{Z=1}^{Z} (P - Z + 2)

Z = 1, 2, 3, . . . , Q
Y = X, X+1, X+2, . . . , P
X = Z, for odd Z
  = Z + 1, for even Z;

and

V_R = V_{Y,Z} = V_J
W_R = W_{A,B;Y,Z}

where:

J = Z - Y - Q + \sum_{Y=1}^{Y} (Q - Y + 2)

Y = 1, 2, 3, . . . , P
Z = X, X+1, X+2, . . . , Q
X = Y, for even Y
  = Y + 1, for odd Y.
Referring to FIG. 13, a two-dimensional systolic processing array 100 in accordance with the present invention includes K processing elements 102, designated by N_A,B where A is an element of the set {1,2,3, . . . ,P}, B is an element of the set {1,2,3, . . . ,Q} and K=PQ. The processing elements 102 are interconnected in a two-dimensional matrix 100 having P rows and Q columns.
The processing elements 102 are mutually coupled column-to-column via matrix row subset signal lines 104 and vector row subset signal lines 106. These signal lines 104, 106 provide the means by which the matrix W_R and vector V_R row signal subsets are systolically transferred column-to-column among the processing elements 102. The processing elements 102 are further mutually coupled row-to-row via matrix column subset signal lines 108 and vector column subset signal lines 110. It is by these signal lines 108, 110 that the matrix W_C and vector V_C column signal subsets are systolically transferred row-to-row among the processing elements 102.

All processing elements 102 in the first row 112 receive a vector signal input 114 which is a vector column signal subset V_C of the vector signal set V. All processing elements 102 in the first row 112 further receive a matrix signal input 116 which is a matrix column signal subset W_C of the matrix signal set W.

All processing elements 102 in the first column 118 receive a vector signal input 120 which is a vector row signal subset V_R of the vector signal set V. All processing elements 102 in the first column 118 further receive a matrix signal input 122 which is a matrix row signal subset W_R of the matrix signal set W.
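A structural sketch of this interconnection pattern is given below; it models connectivity only (which neighbour each element feeds and where the external inputs enter), with hypothetical helper names, and does not model the arithmetic or the clocking.

```python
# Connectivity-only sketch of the P x Q array of FIG. 13: element N_{A,B}
# passes row-subset signals (lines 104/106) to its right-hand neighbour and
# column-subset signals (lines 108/110) to the neighbour below, while only the
# first row and first column receive external inputs.

def build_array(P, Q):
    links = []
    for A in range(1, P + 1):
        for B in range(1, Q + 1):
            if B < Q:   # column-to-column transfer of W_R / V_R row subsets
                links.append(((A, B), (A, B + 1), "row-subset lines 104/106"))
            if A < P:   # row-to-row transfer of W_C / V_C column subsets
                links.append(((A, B), (A + 1, B), "column-subset lines 108/110"))
    inputs = [("first row", (1, B)) for B in range(1, Q + 1)] + \
             [("first column", (A, 1)) for A in range(1, P + 1)]
    return links, inputs

links, inputs = build_array(P=3, Q=3)
print(len(links), "internal links for K =", 3 * 3, "processing elements")
print(inputs[:3])
```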
All processing elements 102 within the matrix 100 further receive two clock signals. The