`United States Patent
`5,274,832
`[11] Patent Number:
`[45] Date of Patent: Dec. 28, 1993
`Khan
`
`
`
`||||||llllllllllllllllIlllllllllIlllllllllIllllllllllllllllllllllllllllllll
`USOOS274832A
`
`[s4] svsrouc ARRAY FOR
`MULTIDIMENSIONAL MATRIX
`COMPUTATIONS
`
`[75]
`
`Inventor:
`
`Emdadur R. Khan, San Jose. Calif.
`
`[73] Assignee: National Semiconductor Corporation,
`Santa Clara. Calif.
`
`[21] App]. No.: 592,954
`
`[22] Filed:
`
`Oct. 4, 1990
`
`[51]
`[52]
`
`Int. Cl.5 ............................................ cosF 15/347
`
`[1.8. CI. ............. 395/800; 364/271.2;
`364/258; 364/2582; 364/2768; 364/DIG. 1;
`364/754
`[58] Field of Search ................... 364/754. 845,728.05,
`364/841; 395/800
`
`[56]
`
`References Cited
`U.S. PATENT DOCUMENTS
`
`364/754
`1/1985 Kung
`4,493,048
`.. 364/845
`7/1986 Bocker ......
`4,603.398
`364/728
`4,701,876 10/1987 McCanny ......
`
`364/754
`5,107,452
`4/1992 Karmarkar
`5,146,543
`9/1992 Vassiliadis ............................. 395/27
`
`OTHER PUBLICATIONS
`
`S. Y. Kung & J. N. Hwang, “Parallel Architecture for
`Artificial Neural Nets,” lJCNN, 1989, pp. 11—165—11-
`—l72.
`
`N. Ling & M. A. Bayoumi, “Algorithms for High Speed
`Mum-Dimensional Arithmetic and DSP Systolic Ar-
`rays," Proceedings of the 1988 International Confer-
`ence on Parallel Processing. pp. 367-374.
`S. Y. Kung & J. N. Hwang, “A Unifying Algorith-
`m/Architecture for Aritificial Neural Networks,"
`IEEE Magazine, Feb. 1989, pp. 2505-2508.
`
`Primary Examiner—Eric Coleman
`Attorney. Agent, or Firm—Limbach & Limbach
`
`ABSTRACT
`[57]
`A multidimensional systolic array processor uses a mul-
`tidimensional array of systolically coupled processing
`elements to perform matrix-vector multiplication of
`matrix and vector signal sets. A two-dimensional array
`uses a PxQ matrix (P rows and Q columns) of process-
`ing elements which are coupled to systolically process
`the signals, e.g. via multiplication and accumulation.
`The processing elements are coupled both row-to-row
`and column-to—column for pipeline processing within
`each row and each column, i.e. multidimensional pipe-
`lining,
`thereby increasing processing parallelism and
`speed. Interconnectivity of the processing elements is
`minimized by forming separate column and row signal
`subsets of the vector signal set which are coupled simul-
`taneously to each processing element in the first row
`and first column, respectively. Size of the processing
`elements is minimized by reducing local storage of ma-
`trix signal subsets within each processing element. Sepa-
`rate column and row signal subsets of the matrix signal
`set are formed and coupled into each processing ele-
`ment of the first row and first column, respectively. As
`the matrix column and row signal subsets are systoli-
`cally processed and transferred row-to-row and co-
`lumn-to-column, respectively, each signal subset is re-
`duced in size by one signal, thereby requiring the trans-
`fer and temporary local storage of successively smaller
`matrix signal subsets. A three-dimensional processor
`uses a PXQXT array (T planes of P rows and Q col-
`umns) of processing elements which are coupled plane-
`to-plane-to-plane.
`
`45 Claims, 18 Drawing Sheets
`
`
`
`Petitioner Microsoft Corporation - Ex. 1020, p. 1
`
`Petitioner Microsoft Corporation - EX. 1020, p. 1
`
`
`
`U.S. Patent
`
`Dec. 28, 1993
`
`Sheet 1 of 18
`
`5,274,832
`
`"1.1
`
`W1,2
`
`Wu '
`
`° "W
`
`W22
`
`w2.1
`
`“3.1
`.
`
`' .
`
`WI";
`
`w2.I<
`
`"3.x
`'
`
`Wm
`
`Wm
`
`WMJ'
`
`'
`
`' Wm
`
`V1
`V2
`v3
`
`;
`VJ
`
`VK
`
`=
`
`01
`02
`03
`
`;
`OI
`
`0M
`
`(PRIOR ART)
`FIG.
`1
`
`"1.1
`
`V1
`
`n2.1
`
`"2.3
`
`13
`
`IJ
`
`14
`
`'
`
`"1.3 V3
`0
`
`O
`
`O
`
`’“2.I
`"1,J'VJ mp.
`"”6VA 2 "2’“
`
`® .
`
`‘
`
`an O
`
`(m
`
`0
`
`LAYER 1
`
`LAYER 2
`
`(PRIOR ART)
`FIG. 2
`
`Petitioner Microsoft Corporation - Ex. 1020, p. 2
`
`Petitioner Microsoft Corporation - EX. 1020, p. 2
`
`
`
`US. Patent
`
`Dec. 28, 1993
`
`Sheet 2 of 18
`
`5,274,832
`
`Tx
`ggooox‘
`B:
`3
`
`N
`Soo-i‘sS-n
`
`E o
`
`o
`
`0
`
`£1.
`
`I—
`
`’6’
`
`B 3” x
`
`n.
`; .
`
`.
`
`x_
`. ;
`
`EN)
`(.52:
`
`Petitioner Microsoft Corporation - Ex. 1020, p. 3
`
`Petitioner Microsoft Corporation - EX. 1020, p. 3
`
`
`
`US. Patent
`
`Dec. 28, 1993
`
`Sheet 3 of 13
`
`5,274,832
`
`x.
`x
`X.
`x_
`00.000 ; :;000:
`
`x>
`
`mower
`
`cums 01
`
`o o o
`
`"7
`2
`
`7')
`>
`
`o
`
`o
`
`n
`:3- 3 3000:.
`3
`3
`3
`3
`
`“1,1
`
`W11
`
`“3.1
`
`“MI
`
`v1
`
`(PRIORART)4
`
`FIG.
`
`Petitioner Microsoft Corporation - Ex. 1020, p. 4
`
`Petitioner Microsoft Corporation - EX. 1020, p. 4
`
`
`
`US. Patent
`
`Dec. 28, 1993
`
`Sheet 4 of 18
`
`5,274,832
`
`ooo01H30oo
`
`KN’oooOlv—Na;oooooo0NH3—.N3
`
`oofin;
`
`E:SE
`
`é;...if.......Ni;5;
`
`Petitioner Microsoft Corporation - Ex. 1020, p. 5
`
`Petitioner Microsoft Corporation - EX. 1020, p. 5
`
`
`
`US. Patent
`
`Dec. 28, 1993
`
`Sheet 5 of 18
`
`5,274,832
`
`w“n+2 '
`
`Petitioner Microsoft Corporation - Ex. 1020, p. 6
`
`Petitioner Microsoft Corporation - EX. 1020, p. 6
`
`V’zx—on
`
`. “
`
`wk-0+1
`
`Wux-o
`
`V1 W1"
`
`V2 0
`
`w1.2
`
`”:12
`
`. “
`
`m
`
`w2.1
`
`O.
`
`“um
`
`Vo+1 w1.0+1 WM 0
`
`“n+1
`
`'
`
`"1.0+2
`
`w2.c)+2
`
`Wu.o+1
`
`.
`
`.
`
`.
`
`
`
`US. Patent
`
`Dec. 23, 1993
`
`Sheet 6 of 18
`
`5,274,832
`
`HG.BB
`
`z
`
`2
`
`Petitioner Microsoft Corporation - Ex. 1020, p. 7
`
`Petitioner Microsoft Corporation - EX. 1020, p. 7
`
`
`
`US. Patent
`
`Dec. 28, 1993
`
`Sheet 7 of 13
`
`5,274,832
`
`0(JIIIIIK{{—omoooN—>b.oooN—
`
`
`
`
`
`a.>95;..A175;...EN.;..:538;...oc-n_v.£...o~.£..._+o.£3;...E;
`
`9;
`
`
`
`a:E”.H:9;EE...!...E...EB...EB
`
`nE5.:<5.8
`
`
`2—Hmv>amp-2’OOOIIO‘23
`
`—+O>oooo0FN:
`
`ON>oo
`
`...cEE...E
`
`E:83E.o:
`
`Petitioner Microsoft Corporation - Ex. 1020, p. 8
`
`Petitioner Microsoft Corporation - EX. 1020, p. 8
`
`
`
`US. Patent
`
`Dec. 28, 1993
`
`Sheet 8 of 18
`
`5,274,832
`
`
`
`Petitioner Microsoft Corporation - Ex. 1020, p. 9
`
`Petitioner Microsoft Corporation - EX. 1020, p. 9
`
`
`
`US. Patent
`
`Dec.28, 1993
`
`Sheet 9 of 18
`
`5,274,832
`
`f"
`
`V0+1""1,c;+1"q+2w °
`VIZ-'0.”
`1.0+2
`"gm
`w ‘
`.
`“'0“
`
`wM'.0+2
`
`KEYTOHGBA
`
`Petitioner Microsoft Corporation - Ex. 1020, p. 10
`
`Petitioner Microsoft Corporation - EX. 1020, p. 10
`
`
`
`US. Patent
`
`Dec. 28, 1993
`
`Sheet 10 of 18
`
`5,274,832
`
`
`
`Petitioner Microsoft Corporation - Ex. 1020, p. 11
`
`Petitioner Microsoft Corporation - EX. 1020, p. 11
`
`
`
`US. Patent
`
`Dec. 28, 1993
`
`Sheet 11 of 18
`
`5,274,832
`
`"may..EEO£0
`
`.zwIIIIIIEWJ
`2.22:0«.50«SQ_3.“rIOIO
`
`Petitioner Microsoft Corporation - Ex. 1020, p. 12
`
`Petitioner Microsoft Corporation - EX. 1020, p. 12
`
`
`
`
`US. Patent
`
`238,4n
`
`5,2.2...
`
`m3,u>
`
`hoIsn..unnu..nun0.II.oao0.Il:ooo..
`
`
`1...m,.3,...n;u;m},23;...13>E;m;miiiDfE,
`.mxiiWiJn.3.3.o3.Sw>mooo>N>>¥>mooo>ON>O>
`
`«+3;...«a:“+3,"5,n_.n>ooa“n.n>Nfi>—.n>fln+0lx>ooomn+ON>n+0>n>m.5...a:q:“a:
`
`Petitioner Microsoft Corporation - Ex. 1020, p. 13
`
`Petitioner Microsoft Corporation - EX. 1020, p. 13
`
`
`
`US. Patent
`
`Dec. 28, 1993
`
`Sheet 13 of 18
`
`5,274,832
`
`fits;
`
`$3.;
`
`flan—#3o
`.oflan—£3
`
`tun—J;
`
`3.;
`
`$5,
`
`«fin—N;
`
`flan—figo
`..E3:
`
`is.
`
`Eng;
`
`$2.;
`
`«3...;
`
`..33.;
`
`flu"fin:
`
`sen:
`
`o>o;
`
`z.0:
`
`oofinu—fit
`
`as;
`
`3.3,
`
`Petitioner Microsoft Corporation - Ex. 1020, p. 14
`
`Petitioner Microsoft Corporation - EX. 1020, p. 14
`
`
`
`US. Patent
`
`Dec. 28, 1993
`
`Sheet 14 of 18
`
`5,274,832
`
`Qua—.5»
`
`«Nu—#3
`
`Gig;
`
`..33.;
`
`mat—.5»
`
`52.;
`
`Ea;
`
`3a.;
`
`3a.;
`
`ootwuugr
`3a.;
`
`NJNJB
`
`23.;
`
`«Nari;
`
`0.33;
`
`ao*.—un.—3
`min...’
`
`«.333
`
`s,a;
`
`25;
`
`«N"a:3
`
`35;
`
`o0*.—£.—’
`3a.;
`
`NJEJ?
`
`N—.0:
`
`Petitioner Microsoft Corporation - Ex. 1020, p. 15
`
`Petitioner Microsoft Corporation - EX. 1020, p. 15
`
`
`
`US. Patent
`
`Dec. 28, 1993
`
`Sheet 15 of 18
`
`5,274,832
`
`WTCLOCK
`
`Petitioner Microsoft Corporation - Ex. 1020, p. 16
`
`Petitioner Microsoft Corporation - EX. 1020, p. 16
`
`
`
`. US. Patent
`
`Dec. 28, 1993
`
`Sheet 16 of 18
`
`5,274,832
`
`$6—
`
`8:£2
`
`o>o;
`
`was;3%;
`
`859is._.l.IIlllllllllL
`--i§§_asa.
`3%25m6%“.
`
`o>0;B.0:
`
`8:£28—
`
`l.1..Illlllllll.\|_5EN590
`
`_ P
`
`E-2
`
`
`II-III”flaw
`.mrllll'_
`
`z
`
`§
`
`I
`
`,r
`
`E
`
`05%
`
`.229m3
`
`32
`
`a8—
`
`>“
`
`Petitioner Microsoft Corporation - Ex. 1020, p. 17
`
`Petitioner Microsoft Corporation - EX. 1020, p. 17
`
`
`
`US. Patent
`
`Dec. 28, 1993
`
`Sheet 17 of 18
`
`5,274,832
`
`
`
`O
`
`O
`
`O m
`
`\ 1
`
`18
`
`a
`
`
`
`I]
`II
`
`122
`"2*“
`VR
`
`104
`I -"13
`'
`
`o
`
`o
`
`o
`
`120
`
`102
`
`105
`
`FIG. 17
`
`Petitioner Microsoft Corporation - Ex. 1020, p. 18
`
`Petitioner Microsoft Corporation - EX. 1020, p. 18
`
`
`
`
`
`
`r'l‘l'r'fi'l'r
`
`
`léflMEj
`
`lily/5.1%
`‘l-l‘lnu'
`gala-'11;
`
`Petitioner Microsoft Corporation - Ex. 1020, p. 19
`
`
`
`1
`
`5,274,832
`
`5
`
`10
`
`15
`
`2O
`
`25
`
`30
`
`35
`
`45
`
`50
`
`55
`
`65
`
`SYSTOLIC ARRAY FOR MULTIDIMENSIONAL
`MATRIX COMPUTATIONS
`
`BACKGROUND OF THE INVENTION
`1. Field of the invention
`The present invention relates to array processors, and
`in particular, to systolic array processors which process
`multiple signals in parallel in multiple dimensions.
`2. Description of the Related Art
`Systolic processors, i.e. processors which systolically
`“pump," or transfer, data from one processing element
`to another, are well known in the art. Systolic arrays
`have been used to increase the pipelined computing
`capability, and therefore the computing speed, of vari-
`ous types of signal processors.
`Systolic array processors are particularly useful for
`processing, e.g. multiplying, two signal sets, where the
`first signal set represents a matrix parameter set and the
`second signal set represents a vector parameter set. In
`other words, the first signal set represents a matrix pa-
`rameter set which can be represented as an M-by-K
`("MXK") matrix having M rows and K columns of
`parameters, and the second signal set represents a le
`vector having K rows and 1 column of parameters.
`Referring to FIG. 1, a representation of the matrix-
`vector multiplication of two such signal sets can be
`seen. The matrix signal set W has matrix signals W;J
`and the vector signal set V has vector signals VJ, where
`l is an element of the set {l,2,3, .
`.
`.
`,M} and J is an
`element of the set {1,2,3, .
`.
`. ,K}. This can be expressed
`mathematically by the following formula:
`
`KI
`
`J: 1 WIJVJ
`
`01
`
`Such signal sets are also found in many artificial neu-
`ral network models, including the Hopfield neural net-
`work model. Referring to FIG. 2, a simple artificial
`neural network with its associated signal sets can be
`seen. The first layer of neurons nu. or nodes, receives
`some form of input signals lJ, and based thereon, gener-
`ates a number of voltage signals VJ, which can be repre-
`sented as a voltage vector V.
`Coupling the respective voltage signals VJ to the
`second layer of neurons nu are a number of scaling
`elements (e.g. “adaptive weights”), which introduce
`scaling, or “weigh,” signals Wu for
`scaling or
`“weighting” the voltage signals VJ prior to their being
`received by the second layer neurons nu. It will be
`understood that, with respect to the subscripted nota-
`tion for representing the scaling or weighting signals
`Wu, the first subscripted character “I” represents the
`destination neuron an in the second layer, and the sec-
`ond subscripted character “J” represents the source
`neuron nu of the voltage signal VJ in the first layer.
`The simplest form of systolic processing array used
`for performing the matrix-vector multiplication of sig-
`nal sets, as discussed above,
`is one-dimensional. One
`type of one-dimensional systolic array is a "ring“ sys-
`tolic array, shown in FIG. 3.
`The systolically coupled processing elements NJ are
`interconnected as shown, with signal flow represented
`by the arrows. First, the corresponding voltage signals
`VJ are initially coupled into their corresponding pro-
`cessing elements NJ. Then, following the application of
`each clock pulse (not shown, but common to each pro-
`cessing element NJ), the matrix signals WJJ are sequen-
`
`2
`tially inputted to their corresponding processing ele-
`ment NJ, as shown. Therein, each matrix signal WIIJ is
`multiplied by its corresponding voltage signal VJ and
`accumulated, i.e. stored, within the processing element
`NJ.
`the foregoing is
`Following the next clock signal.
`repeated, with the voltage signals VJ being transferred
`to subsequent processing elements NJ to be multiplied
`by the corresponding matrix signal Wu therein. For
`example, the voltage signals V] which are transferred
`between the processing elements NJ are shown in paren-
`theses. This is repeated K—l times, i.e. for a total of K
`times, to produce the final matrix-vector product out-
`puts 01. The “ring" configuration facilitates multiple
`iterations of the matrix-vector products, a desirable
`feature used in the learning phase of an artificial neural
`network. Further discussions of the ring systolic array
`can be found in “Parallel Architectures for Artificial
`Neural Nets," by S. Y. Kung and J. N. Hwang, UCNN
`1989, pp. 11-165 through Il-l72.
`A second type of one-dimensional systolic array re-
`lies on a configuration in accordance with the
`“STAMS” (Systematic Transformation of Algorithms
`for Multidimensional Systolic arrays) technique. The
`STAMS technique is discussed in detail in “Algorithms
`for High Speed Multidimensional Arithmetic and DSP
`Systolic Arrays,” by N. Ling and M. A. Bayoumi, Pro-
`ceedings of the 1988 lntemational Conference on Paral-
`lel Processing, Vol. I, pp. 367—374. An example of a
`one-dimensional STAMS systolic array is shown in
`FIG. 4. dimensional arrays is shorter and more process-
`ing is done in parallel. This three-dimensional configu-
`ration requires only T+K—1 clock cycles.
`Even though the two-dimensional and three-dimen-
`sional STAMS systolic array configurations discussed
`above provide improvements with respect to processing
`speed, minimal if any improvement is provided with
`respect to the number and complexity of the local or
`global interconnections required for inputting the ma—
`trix W and vector V signals. Furthermore, even though
`the one—dimensional ring systolic array already provides
`reasonable processing speed, its requisite global inter-
`connections are complex and impractical. Moreover, no
`improvements are provided by any of the foregoing
`arrays with respect to matrix signal Wu storage re-
`quirements.
`Moreover, the two- and three-dimensional STAMS
`systolic array configurations described above are not
`truly two- or three-dimensional, respectively. The two-
`dimensional array, as well as each two-dimensional
`array plane within the three-dimensional array, have
`their processing elements NAB interconnected along
`one dimension only, e.g. left to right. Therefore, the
`systolic processing actually occurs in one dimension
`only. Thus, full multidimensional parallelism or pipelin-
`ing is not achieved and maximum processing speed, i.e.
`minimum processing time, cannot be achieved.
`It would be desirable to have a true multidimensional
`systolic array configuration providing true multidimen-
`sional pipeline operation to maximize processing speed.
`It would be further desirable to have such a multidimen-
`sional systolic processing array in which minimal global
`or local interconnects are required for inputting the
`matrix
`
`First, just as in the ring systolic array of FIG, 3. the
`voltage signals VJ are initially inputted into their re-
`spective processing elements NJ. Then, the matrix sig-
`
`Petitioner Microsoft Corporation - Ex. 1020, p. 20
`
`Petitioner Microsoft Corporation - EX. 1020, p. 20
`
`
`
`3
`nals W]J are inputted into the processing elements NJ,
`with each respective processing element NJ receiving
`one column of the matrix of matrix signals W“, as
`shown. The weight-voltage products are summed with
`the corresponding weight-voltage products from the
`preceding processing element NJ_1 and then systoli-
`cally transferred to the next processing element NJ+],
`and the process continues.
`The inputting of the matrix signals Wu into each
`successive processing element NJ is delayed by one
`additional clock pulse per processing element stage to
`allow for the delays associated with the systolic trans-
`ferring of the accumulated products. This delay can be
`accomplished by inputting zeros to a processing ele.
`ment NJ until the systolically transferred accumulated
`products begin to arrive. However, this delay adversely
`affects the processing speed. As compared to the ring
`systolic array of FIG. 3 which requires K clock cycles,
`the STAMS systolic array requires ZK-l clock cycles
`to obtain the product outputs O] of this matrix-vector
`multiplication.
`A number of problems are associated with using these
`one—dimensional systolic arrays. One problem involves
`the inputting of the voltage signals VJ. If the voltages
`VJ are to be loaded simultaneously in parallel, global
`interconnects are required to accomplish this. If they
`are to be loaded sequentially in serial, numerous local
`interconnects are required, as well as K clock cycles.
`Another problem involves the inputting of the matrix
`signals Wu. If the matrix signals Wu are stored locally
`within each processing element NJ, the processing ele-
`ments NJ must be large enough to provide sufficient
`storage, i.e. memory, therefor. On the other hand, if the
`matrix signals Wu are not stored locally within each
`processing element NJ, but instead inputted as needed,
`the global interconnections necessary to do this become
`complex and impractical. Either many parallel
`input
`lines, e. g. a wide signal bus structure, or a large number
`of clock cycles must be provided.
`A third problem involves the amount of time required
`to perform the matrix-vector multiplication, i.e. ZK—l
`clock cycles for the STAMS systolic array. Although
`the ring systolic array requires only K clock cycles, the
`problem remains, as discussed immediately above, of
`providing either sufficient local matrix signal storage or
`complex global interconnections.
`One approach to addressing these problems of inter-
`connects, storage area and processing time involves the
`use of multidimensional systolic processing arrays. For
`example, parallelism,
`i.e. parallel processing, can be
`introduced by subdividing the matrix signals WU and
`vector signals VJ. This can be diagrammatically visual-
`ized as seen in FIGS. SA—sB. This can be expressed
`mathematically by the following formula:
`
`P—lQ—l
`V
`0 = 2
`2 W
`l E=0F=0 LQE+F+I QE+F+I
`
`Each row I of the matrix W is divided into P groups
`of Q signals Wu. In other words, the first of the P
`groups of Q signals Wu contains the matrix signals
`W1'|~W1,Q. Similarly, the vector V is divided into P
`groups of Q voltages VJ. For example, the first of the P
`groups of Q voltages VJ includes the voltages VPVQ.
`This can be visualized in even simpler form as shown in
`FIG. SB.
`The processing of these P groups of Q signals W;J,
`VJ can be accomplished by using several one-
`
`5
`
`10
`
`15
`
`20
`
`25
`
`30
`
`35
`
`45
`
`50
`
`55
`
`6O
`
`65
`
`5,274,832
`
`4
`dimensional STAMS systolic arrays, such as that shown
`in FIG. 4, in parallel, as shown in FIG. 6A. The opera-
`tion of each separate systolic array is in accordance
`with that described for the one-dimensional STAMS
`systolic array of FIG. 4 above, with the exception that
`only Q, rather than K, processing (i.e. clock) cycles are
`required for each systolic array to complete one sub-
`product. The subproducts of each array are then
`summed together to provide the final product outputs
`01. Visualizing this systolic array configuration as two-
`dimensional is perhaps more easily done by referring to
`FIG. 6B.
`
`This two-dimensional systolic array configuration is
`an improvement over the one-dimensional STAMS
`configuration, with respect to processing time. Process-
`ing time is reduced since each one-dimensional array,
`i.e. each pipeline of processors, within the two-dimen-
`sional array is shorter and more processing is done in
`parallel. This configuration requires only K+Q—l
`clock cycles to obtain the product outputs 01 of the
`matrix-vector multiplication.
`Further improvement has been achieved by extend-
`ing the twoodimensional STAMS systolic array of FIG.
`6A to a three-dimensional systolic array. This can be
`done by further subdividing the matrix W and vector V
`signals into T groups of P groups of Q signals W1,J, VJ.
`This can be visualized diagrammatically by referring to
`FIGS. 7A—7B. This can be expressed mathematically by
`the following formula:
`
`0,:
`
`T—lP—lQ—l
`2
`2
`2
`6:0 E=D F=0 WI.PQG+ DE+F+ l VPQG+QE+F+1
`
`As seen in FIG. 7A, each row I of the matrix W and
`the vector V is divided into T groups, which in turn are
`divided into P groups of Q signals Wu. VJ. For exam-
`ple, the first of the P groups within the first of the T
`groups contain the matrix signals W1,i-W1,Q and the
`vector signals Vl-VQ. FIG. 78 represents a more sim-
`plified depiction of this multiple subdivision of the ma-
`trix W and vector V signals.
`Referring to FIG. 8A, a realization of such a three-di-
`mensional systolic array is illustrated. Two-dimensional
`systolic arrays, similar to that illustrated in FIG. 6A, are
`disposed as if on T parallel planes. The operation of
`each of the T two-dimensional systolic arrays is similar
`to that as s described above for FIG. 6A. The sub-
`product outputs of each of the T two-dimensional ar-
`rays are summed together to produce the full product
`outputs 01. The threedimensional nature of this array
`can perhaps be better visualized by referring to FIG.
`SB.
`This three-dimensional STAMS systolic array con-
`figuration is an improvement over the two-dimensional
`configuration inasmuch as fewer processing, i.e. clock,
`cycles are required to complete each product output 0].
`Processing time is reduced since each one-dimensional
`array,
`i.e. each and vector signals. It would be still
`further desirable to have such a multidimensional sys-
`tolic processing array with minimal matrix signal stor-
`age requirements for each processing element.
`SUMMARY OF THE INVENTION
`
`A multidimensional systolic array processor in accor-
`dance with the present invention has an architecture
`which maximizes processing parallelism and minimizes
`global interconnections. Funher, the present invention
`
`Petitioner Microsoft Corporation - Ex. 1020, p. 21
`
`Petitioner Microsoft Corporation - EX. 1020, p. 21
`
`
`
`5
`minimizes local matrix signal storage requirements
`within each processing element.
`The present invention maximizes processing parallel-
`ism by interconnecting its processing elements along
`multiple dimensions. Therefore,
`systolic processing
`occurs along multiple dimensions. For example, a two-
`dimensional systolic array processor in accordance with
`the present invention includes a PxQ matrix having P
`rows and Q columns of processing elements, each of
`which is systolically coupled row-to-row and column-
`to-column for full pipeline processing within each row
`and each column. A three-dimensional systolic array
`processor has a PxQxT array with T planes of P rows
`and Q columns of processing elements, each of which is
`systolically coupled plane-to-plane-to-plane for
`full
`pipeline processing.
`The present invention minimizes global interconnec-
`tions of the processing elements. For the two-dimen-
`sional case, appropriate matrix and vector signal subsets
`are inputted to only one row and one column of the
`two-dimensional processing array. These matrix and
`vector signal subsets are specifically formed so that they
`need to be inputted to only one row and one column,
`and yet still be properly processed systolically along all
`dimensions within the array.
`For the three-dimensional case, appropriate matrix
`and vector signal subsets are inputted to three perpen-
`dicular planes of the three-dimensional processing ar-
`ray. For higher-dimensional cases, appropriate matrix
`and vector signal subsets are similarly inputted to the
`higher-dimensional processing arrays.
`The present invention minimizes local matrix signal
`storage requirements by inputting specifically formed
`matrix signal subsets to only one row and one column of
`the two-dimensional processing array, and to three
`perpendicular planes of the three-dimensional process-
`ing array. These matrix signal subsets are formed to
`allow the sizes of the matrix signal subsets to be reduced
`as they are systolically transferred to subsequent pro-
`cessing elements along each dimension within the array.
`As the matrix signal subsets decrease in size through the
`array, the local storage, e.g. memory, needed for tempo-
`rarily storing each matrix signal subset within each
`processing element is successively reduced. Processing
`speed is not sacrificed since the matrix signal subsets are
`transferred to the subsequent processing element at a
`clock rate higher than that used for systolically transfer-
`ring the vector signal subsets.
`These and other objectives, features and advantages
`of the present invention will be understood upon con-
`sideration of the following detailed description of the
`invention and the accompanying drawings.
`BRIEF DESCRIPTION OF THE DRAWINGS
`
`two-layer
`
`1 illustrates diagrammatically a conventional
`FIG.
`matrix-vector multiplication.
`FIG. 2 illustrate a simple conventional
`artificial neural network.
`FIG. 3 illustrates a conventional one-dimensional
`“ring" systolic array.
`FIG. 4 illustrates an alternative conventional one-
`dimensional systolic array.
`FIGS. SA-SB illustrate diagrammatically a conven»
`tional matrix-vector multiplication, wherein the matrix
`and vector are subdivided into matrix and
`FIGS. 6A-68 illustrate a conventional quasi two-di-
`mensional systolic array.
`
`l0
`
`IS
`
`20
`
`25
`
`30
`
`35
`
`45
`
`50
`
`55
`
`60
`
`65
`
`5,274,832
`
`6
`FIGS. 7A—7B illustrate diagrammatically a conven-
`tional matrix-vector multiplication. wherein the matrix
`and vector of FIGS. SA—SB are further subdivided into
`matrix and vector subsets, respectively.
`FIGS. 8A—88 illustrate a conventional quasi three-di-
`mensional systolic array.
`FIG. 9 illustrates the Layer 1 and 2 neurons of FIG.
`2 reconfigured as two-dimensional neural arrays in ac-
`cordance with the present invention.
`FIG. 10 illustrates diagrammatically the reconfigura-
`tion of the one—dimensional vector signal set of FIG. 1
`into a two-dimensional vector signal set in accordance
`with the present invention.
`FIG. 11 illustrates diagrammatically the matrix-vec-
`tor multiplication of the matrix and vector column sig-
`nal subsets in accordance with the present invention.
`FIG. 12 illustrates diagrammatically the matrix~vec-
`tor multiplication of the matrix and vector row signal
`subsets in accordance with the present invention.
`FIG. 13 illustrates a block diagram of a two-dimen-
`sional systolic array processor in accordance with the
`present invention.
`FIG. 14 illustrates a single processing element of the
`two-dimensional processor of FIG. 13.
`FIG. 15 illustrates a functional block diagram of a
`single processing element of the two-dimensional pro-
`cessor of FIG. 13.
`FIG. 16 illustrates the reduced local matrix signal
`storage requirements of the two-dimensional processor
`of FIG. 13.
`FIG. 17 further illustrates the reduced local matrix
`signal storage requirements of the two-dimensional
`processor of FIG. 13.
`FIG. 18 illustrates a block diagram of a three~dimen-
`sional systolic array processor in accordance with the
`present invention.
`DETAILED DESCRIPTION OF THE
`INVENTION
`
`Referring to FIG. 9, the Layer 1 and 2 neurons n1. it:
`of an artificial neural network (as shown in FIG. 1) are
`reconfigured into two-dimensional neural arrays. The
`original input signals 1; now have double subscripted
`notation to reflect
`the two-dimensional set of input
`signals Ixz. Similarly, the original voltage signals VJ
`now have double subscripted notation to reflect
`the
`two—dimensionality of the set of voltage signals sz.
`Indicated in brackets for some of the Layer 1 neurons in
`FIG. 9 are the original Layer 1 neuron and voltage
`identifiers (the remaining identifiers being left out for
`clarity).
`The Layer 2 neurons also now have double sub-
`scripted notation to reflect the new two-dimensionality
`of the set of Layer 2 neurons n43. Indicated in brackets
`for some of the Layer 2 neurons are the original Layer
`2 neuron identifiers (the remaining identifiers being left
`out for clarity).
`The original matrix, e.g. “weight.” signals Wu now
`have quadruple subscripted notation to reflect their new
`multidimensionality, i.e. WM; “2. The first subscripted
`pair of characters “AB" represents the destination neu-
`ron n,“ in the second layer. The second subscripted
`pair of characters “Y,Z" represents the source of the
`voltage signal ng in the first layer. Indicated in brack-
`ets for some of the matrix signals W43; Y.Z are the origi-
`nal matrix signal identifiers Wu (the remaining identi-
`fiers being left out for clarity).
`
`Petitioner Microsoft Corporation - Ex. 1020, p. 22
`
`Petitioner Microsoft Corporation - EX.1020,p.22
`
`
`
`5,274,832
`
`8
`~continued
`= Y + l, for odd Y.
`
`7
`It should be understood that the representations of
`the Layer 1 and 2 neurons, along with their subscripted
`notations, were selected arbitrarily. They can be recon-
`figured as desired. provided that
`the resulting sub-
`scripted notation for the matrix signals be consistent
`therewith.
`It should be further understood that it has been as-
`sumed for the sake of simplicity that M=K for the
`reconfigured array of Layer 2 neurons as represented in
`FIG. 9. However, this is not necessary to the present
`invention. Ideally, the numbers of rows and columns
`should be equal, i.e. P=Q. This is to maximize the im-
`provements in processing speed and to minimize the
`local matrix signal storage requirements in accordance
`with the present invention. However,
`if K cannot be expressed as a square, i.e. if P¢Q, then
`the numbers of neurons in both Layers 1 and 2 can be
`approximated to the nearest square. Extra processing
`cycles would be required because of such an approxima-
`tion, but significant improvements in processing speed
`and local matrix signal storage requirements would still
`be realized.
`
`Referring to FIG. 10, the reconfiguration of the one-
`dimensional vector signal set VJOf FIG. 1 into a two-di—
`mensional vector signal set ng can be understood.
`The one-dimensional vector signal set V] is initially
`mapped into a two-dimensional vector signal set with
`the original subscripted notation left intact. This two-di-
`mensional vector signal set is then given new double
`subscripted notation to reflect the two—dimensionality
`of the reconfigured vector signal set Vy,z.
`Also indicated in FIG. 10 (with dashed lines within
`the two-dimensional vector signal sets) are two vector
`signal subsets. One is referred to as the vector column
`signal subset Vcand the other is referred to as the vec-
`tor row signal subset VR. These vector signal subsets
`V5. V}; are multiplied by the matrix signals W431)»; as
`represented in FIG. 9. The matrix signals wA.B;Y.Z are
`separated into corresponding matrix column WC and
`row W}; signal subsets. The vector column Vc and row
`VR signal subsets are multiplied by the matrix column
`WC and row WR signal subsets, respectively, as shown
`in FIGS. 11 and 12. Therefore, the matrix—vector prod-
`ucts 01 are identified according to the following formu—
`las:
`
`01: Oc+ OR
`
`V
`E W .
`9
`2:] Y=X A.B,Y,Z J
`2
`J=Y—Z—P+ 2 (P—Z+2)
`Z=l
`
`X=Z.foroddZ
`=Z+ l,foreven Z;
`
`and
`
`O =
`9
`£
`R Y=IZ=X
`
`WAJ; Y.ZVJ
`
`Y
`
`J=Z—Y—Q+Y_I_l(Q-Y+2l
`X = Y. for even Y
`
`5
`
`Based upon the foregoing. the vector Vc, VR and
`matrix WC, W3 signal subsets can be identified accord-
`ing to the following formulas:
`
`10
`
`IS
`
`20
`
`25
`
`30
`
`35
`
`45
`
`50
`
`SS
`
`65
`
`where:
`
`Z
`J=Y—Z—P+ I (P—Z+2)
`Z=l
`
`Z=l.2.3,.-..Q
`
`Y=X.X+I.X+2.....P
`
`X = Z. for odd Z
`= Z + l. for even Z;
`
`and
`
`VR = Vy'z = V]
`
`WR = WA.B,Y.Z
`where:
`
`Y
`
`J=Z——Y—Q+YII(Q-—Y+2)
`Y=l.2,3....,P
`
`Z=X,X+LX+2.r-wQ
`
`X: Y.foreven Y
`= Y+ l,foroddY.
`
`Referring to FIG. 13, a two-dimensional systolic
`processing array 100 in accordance with the present
`invention includes K processing elements 102, desig—
`nated by NAB where A is an element of the set {l,2,3, .
`.
`. ,P}, B is an element of the set {l,2,3. .
`.
`.
`,Q} and
`K=PQ. The processing elements 102 are intercon—
`nected in a two-dimensional matrix 100 having P rows
`and Q columns.
`The processing elements 102 are mutually coupled
`column-to-column via matrix row subset signal lines 104
`and vector row subset signal
`lines 106. These signal
`lines 104, 106 provide the means by which the matrix
`W); and vector VR row signal subsets are systolically
`transferred column-to—column among the processing
`elements 102. The processing elements 102 are further
`mutually coupled row-to-row via matrix column subset
`signal lines 108 and vector column subset signal lines
`110. It is by these signal lines 108, 110 that the matrix
`WC and vector Vc signal subsets are systolically trans-
`ferred row-to-row among the processing elements 102.
`All processing elements 102 in the first row 112 re-
`ceive a vector signal input 114 which is a vector column
`signal subset Vc of the vector signal set V. All process-
`ing elements 102 in the first row 112 further receive a
`matrix signal input 116 which is a matrix column signal
`subset We of the matrix signal set W.
`All processing elements 102 in the first column 118
`receive a vector signal input 120 which is a vector row
`signal subset VR of the vector signal set V. All process-
`ing elements 102 in the first column 118 further receive
`a matrix signal input 122 which is a matrix row signal
`subset WR of the matrix signal set W.
`All processing elements 102 within the matrix 100
`further receive two clock signals. The