`119)
`United States Patent
`5,274,832
`[11] Patent Number:
`Dec. 28, 1993
`[45] Date of Patent:
`Khan
`
`UNAMANAA
`
`{54] SYSTOLIC ARRAY FOR
`MULTIDIMENSIONAL MATRIX
`COMPUTATIONS
`
`(75]
`
`Inventor:
`
`Emdadur R. Khan, San Jose, Calif.
`
`[73] Assignee: National Semiconductor Corporation,
`Santa Clara, Calif.
`
`[21] Appl. No.: 592,954
`[22] Filed:
`Oct. 4, 1990
`
`[51] Et, CUS vecccsssccsssscssssessssssenssseesssessese GOGF 15/347
`
`[52] US. CD. coeeecccccceccesesssessssseee 395/800; 364/271.2;
`364/258; 364/258.2; 364/276.8; 364/DIG. 1;
`364/754
`[58] Field of Search .........0..060. 364/754, 845, 728.05,
`364/841; 395/800
`
`[56]
`
`References Cited
`U.S. PATENT DOCUMENTS
`
`1/1985 Kung ounces, 364/754
`4,493,048
`7/1986 Bocker......
`.. 364/845
`4,603,398
`
`4,701,876 10/1987 McCanny......
`.- 364/728
`§,107,452
`4/1992 Karmarkar....
`. 364/754
`5,146,543
`9/1992 Vassiliadis ssotvtnarinesasenien 395/27
`OTHER PUBLICATIONS
`
`S. Y. Kung & J. N. Hwang, “Parallel Architecture for
`Artificial Neural Nets,” IJCNN, 1989, pp. I-165-II-
`-172.
`N.Ling & M. A. Bayoumi, “Algorithms for High Speed
`Multi-Dimensional Arithmetic and DSP Systolic Ar-
`rays,” Proceedings of the 1988 International Confer-
`ence on Parallel Processing. pp. 367-374.
`S. Y. Kung & J. N. Hwang, “A Unifying Algorith-
`m/Architecture for Aritificial Neural Networks,”
`IEEE Magazine, Feb. 1989, pp. 2505-2508.
`
`Primary Examiner—Eric Coleman
`Attorney, Agent, or Firm—Limbach & Limbach
`
`ABSTRACT
`[57]
`A multidimensional systolic array processor uses a mul-
`tidimensional array of systolically coupled processing
`elements to perform matrix-vector multiplication of
`matrix and vector signal sets. A two-dimensional array
`uses a PQ matrix (P rows and Q columns) of process-
`ing elements which are coupled to systolically process
`the signals, e.g. via multiplication and accumulation.
`The processing elements are coupled both row-to-row
`and column-to-column for pipeline processing within
`each row and each column,i.e. multidimensional pipe-
`lining,
`thereby increasing processing parallelism and
`speed. Interconnectivity of the processing elements is
`minimized by forming separate column and row signal
`subsets of the vector signal set which are coupled simul-
`taneously to each processing element in the first row
`and first column, respectively. Size of the processing
`elements is minimized by reducing local storage of ma-
`trix signal subsets within each processing element. Sepa-
`rate column and row signal subsets of the matrix signal
`set are formed and coupled into each processing ele-
`mentofthe first row and first column, respectively. As
`the matrix column and row signal subsets are systoli-
`cally processed and transferred row-to-row and co-
`lumn-to-column, respectively, each signal subset is re-
`ducedin size by one signal, thereby requiring the trans-
`fer and temporary local storage of successively smaller
`matrix signal subsets. A three-dimensional processor
`uses a PXQxXT array (T planes of P rows and Q col-
`umns) of processing elements which are coupled plane-
`to-plane-to-plane.
`
`45 Claims, 18 Drawing Sheets
`
`Petitioner Microsoft Corporation - Ex. 1020, p. 1
`
`Petitioner Microsoft Corporation - Ex. 1020, p. 1
`
`
`
`U.S. Patent
`
`Dec. 28, 1993
`
`Sheet 1 of 18
`
`5,274,832
`
`Wy
`
`Wyn Wye 8 Wy
`
`Wor Wa
`
`Wok
`
`Vy
`Vo
`V3
`
`01
`09
`05
`
`W531
`.
`
`-
`
`W3K
`J °
`
`7} a]
`Vy
`Oy
`
`
`
`Wat==¥w20Wz 8° WK VK Oy
`
`
`
`(PRIOR ART)
`FIG.
`1
`
`nia
`
`V4
`
`Noy
`
`eeOO
`O
`(2.2)
`CO)
`
`2
`
`13
`
`ty
`
`1
`
`ny3 V3
`OC
`
`CO)
`
`nie Vy BD “not
`
`(W35
`
`no3
`
`CO)
`
`<———(W >
`
`@
`
`(Wusk)
`
`@
`
`(PRIOR ART)
`FIG. 2
`
`Petitioner Microsoft Corporation - Ex. 1020, p. 2
`
`Petitioner Microsoft Corporation - Ex. 1020, p. 2
`
`
`
`U.S. Patent
`
`Dec, 28, 1993
`
`Sheet 2 of 18
`
`5,274,832
`
`1=<
`3s $+ ee =
`=
`Fs
`=
`
`L>|
`seeese
`=
`x=
`=
`
`F =<
`
`a e
`
`°
`
`-_
`
`e = x
`
`m
`= o
`
`e
`
`=
`e =
`
`Er
`5cE
`
`Petitioner Microsoft Corporation - Ex. 1020, p. 3
`
`Petitioner Microsoft Corporation - Ex. 1020, p. 3
`
`
`
`U.S. Patent
`
`Dec. 28, 1993
`
`Sheet 3 of 18
`
`5,274,832
`
`a p34
`=
`=
`Oo © #884 © = os gece S
`
`ac
`>
`
`PRODUCT
`
`OUTPUTS Or
`
`e e ®
`
`So
`
`©
`
`nm
`=
`
`wr
`nt eee
`=
`=
`=
`
`->
`
`m
`=z
`
`W44
`
`Wo4
`
`W314
`
`Wut
`
`Vi,
`
`(PRIORART)4
`
`FIG.
`
`Petitioner Microsoft Corporation - Ex. 1020, p. 4
`
`Petitioner Microsoft Corporation - Ex. 1020, p. 4
`
`
`
`U.S. Patent
`
`Dec. 28, 1993
`
`Sheet 4 of 18
`
`5,274,832
`
`°eee"Tyeee
`
`Why255ONy28©Oye©+HyOlyee8Cyhy
`
`WZA°ee0-NZA°°°°ee°ZZA2M
`
`e°itA
`
`(LUYUOLid)
`
`My888OWNeee,6+NyINy
`
`Giae:[wy]++=[ie-1)]
`
`
`
`Hla..low
`
`Petitioner Microsoft Corporation - Ex. 1020, p. 5
`
`Petitioner Microsoft Corporation - Ex. 1020, p. 5
`
`
`
`U.S. Patent
`
`Dec. 28, 1993
`
`Sheet 5 of 18
`
`5,274,832
`
`Vy Wa
`Wa
`
`Wat
`
`V9 0
`
`Wi,2
`M02
`
`Wy.2
`
`Vat W1.0+1 Varo 0
`Wyat2
`Watt
`Woa42
`
`Wy.a+2
`
`V K-0 Wika K-Q+1 0
`WiK-QH1
`Wok-
`WoK-41
`°
`°
`
`Wyk-a
`
`Wuk-o+
`
`
`
`
`PRODUCT
`OUTPUTS
`OT
`
`Vx
`
`0
`°
`°
`0
`
`(PRIOR ART)
`FIG.
`6A
`
`1K
`Wok
`
`Wuk
`
`Petitioner Microsoft Corporation - Ex. 1020, p. 6
`
`Petitioner Microsoft Corporation - Ex. 1020, p. 6
`
`
`
`U.S. Patent
`
`Dec. 28, 1993
`
`Sheet 6 of 18
`
`5,274,832
`
`FIG.6B
`
`
`
`Petitioner Microsoft Corporation - Ex. 1020, p. 7
`
`Petitioner Microsoft Corporation - Ex. 1020, p. 7
`
`
`
`U.S. Patent
`
`Dec. 28, 1993
`
`Sheet 7 of 18
`
`5,274,832
`
`A
`
`
`O(t=d),OdL'Ny...eee\'Ny
`140AeeeeetCu
`
` [isau(i-n)|we,[(a]---(ei]---(a]---Pafe)-ey
`02A°e
`
`
`
`
`
`OADdl'lye©eOd(I-L)'lys©eOdZ‘lye©ebtid'lyOd'byessO(I-d)'lyss*O'byseeltOlyUlyssethy
`eoy——yo”
`
`1)[aun]oe+
`Odi4(LuvYowd)G/“914
`edeeeZ1ihpoyOO1°°eé1
`°(Layold)WZ“OL
`
`Petitioner Microsoft Corporation - Ex. 1020, p. 8
`
`Petitioner Microsoft Corporation - Ex. 1020, p. 8
`
`
`
`. D
`
`YPQ+2rp
`Vege
`WyPQ+1 i\P
`W2paHt
`. Wipout
`
`ow(r-i
`W2,(1P--1)0
`Wi(TP=10
`
`az(1P-1)0+1
`W2,(1P-1)Q41
`Wicm-tyos
`
`Vep+ija 9
`Li(P+1)0
`W2(P+1)9
`
`(PRIOR ART)
`FIG. 8A-1
`
`
`Petitioner Microsoft Corporation - Ex. 1020, p. 9
`
`Petitioner Microsoft Corporation - Ex. 1020, p. 9
`
`U.S. Patent
`
`ec. 28, 1993
`
`Sheet 8 of 18
`
`5,274,832
`
`V¢r-1)Pq|Y(t-1)Pa+1
`0
`—_—V(r-1)paea 9
`will“post virngova
`Ws(T-1)PQ
`"2(T-1)POH 2(T--1)PO+0
`(S)
`Wa(t-~1)PQ
`wirTPO wy(-npa+a
`. Wuir-1p9 .
`
`PRODUCT
`OUTPUTS
`07
`
`
`
`U.S. Patent
`
`Dec. 28, 1993
`
`Sheet 9 of 18
`
`5,274,832
`
`Vater 1.091 Voe2
`
`0
`wat?
`
`Wil,(P—1)a
`
`0
`Wy(P-1)0+1
`
`W1(P-1)Q
`Wo
`(p-
`4 "0 Watpotja
`Wil(P-O}+1
`
`KEY TO FIG. 8A
`
`Petitioner Microsoft Corporation - Ex. 1020, p. 10
`
`Petitioner Microsoft Corporation - Ex. 1020, p. 10
`
`
`
`U.S. Patent
`
`Dec. 28, 1993
`
`Sheet 10 of 18
`
`5,274,832
`
`
`Oy
`
`PRODUCT
`OUTPUTS
`
`Petitioner Microsoft Corporation - Ex. 1020, p. 11
`
`Petitioner Microsoft Corporation - Ex. 1020, p. 11
`
`
`
`U.S. Patent
`
`Dec. 28, 1993
`
`SloLee
`alyztIo[zz
`
`u
`
`Sheet 11 of 18
`
`5,274,832
`
`Petitioner Microsoft Corporation - Ex. 1020, p. 12
`
`Petitioner Microsoft Corporation - Ex. 1020, p. 12
`
`
`
`U.S. Patent
`
`Dec. 28, 1993
`
`Sheet 12 of 18
`
`5,274,832
`
`re°fou,e°e(dt,eee!Cytt,It,|<dqeeewaryaNbeA(
`dD,teoeLDQZOQND,
`
`d'laeeeCh,Uh,ilk»
`O1‘OLcy
`CHOeeeC+0Zc+,
`AaieeeOf,024DA
`
`JA
`
`e
`
`7(4
`
`ee
`
`eee
`
`ee
`
`3“
`
`Ne
`-
`
`Petitioner Microsoft Corporation - Ex. 1020, p. 13
`
`Petitioner Microsoft Corporation - Ex. 1020, p. 13
`
`
`
`U.S. Patent
`
`Dec. 28, 1993
`
`Sheet 13 of 18
`
`5,274,832
`
`‘¢AyAAitA
`
`zeA0A
`
`oYA
`
`CPthy
`ZC'l'Ly
`UO-L'by
`
`UriZp
`CoS
`ONT
`
`VEU
`OUby
`Uhh
`
`1%
`ZZ
`wan
`
`ZVICy
`celoy
`VOCy
`
`City
`ZSy
`
`Ur'l'dy
`vS'dy
`
`US't'dA
`VZ'I'dy
`U'Ll'dy
`
`Lb
`
`Ou
`
`bey|=IAM
`Petitioner Microsoft Corporation - Ex. 1020, p. 14
`
`Petitioner Microsoft Corporation - Ex. 1020, p. 14
`
`
`
`U.S. Patent
`
`Dec. 28, 1993
`
`Sheet 14 of 18
`
`5,274,832
`
`ZlA
`
`Cha
`
`rlA
`
`dla
`
`ZZ,
`
`CZA
`
`wala
`CClhy
`DLNb
`oerity
`Chhy
`Z'll'ly
`
`cZ2'ly
`CZ2'ly
`OZLy
`eevbCly
`tae
`Thehy
`
`c7C'ly
`CCLly
`OUby
`eeHVChy
`Chehy
`
`CZid'ly
`T2d'tMA
`O'ld'ly
`eeV'bed'by
`Cid'hy
`Uhd'hy
`
`Cl‘OW
`
`Petitioner Microsoft Corporation - Ex. 1020, p. 15
`
`che'ly|=Yad
`Petitioner Microsoft Corporation - Ex. 1020, p. 15
`
`
`
`U.S. Patent
`
`Dec. 28, 1993
`
`WTCLOCK
`
`Sheet 15 of 18
`
`5,274,832
`
`Petitioner Microsoft Corporation - Ex. 1020, p. 16
`
`Petitioner Microsoft Corporation - Ex. 1020, p. 16
`
`
`
`| U.S. Patent
`
`Dec. 28, 1993
`
`Sheet 16 of 18
`
`5,274,832
`
`dm
`
`G01
`
`q901
`
`821
`
`7-4TTTTTTTTTT7=|Sse|fi11ee|34|ah|=}Yanai|_|eee||Ce||UIoleLoeOSWNOISWNOIS|YOLD9A
`XiMLVIN|aee(nna
`JAoaGt“OU
`qOLt801Bl
`
`O01}Dg0l
`
`JASa
`
`Dy0t
`
`um
`
`921
`
`010
`
`0901
`
`YA
`
`vel
`
`OVA
`
`Petitioner Microsoft Corporation - Ex. 1020, p. 17
`
`Petitioner Microsoft Corporation - Ex. 1020, p. 17
`
`
`
`U.S. Patent
`
`Dec. 28, 1993
`
`Sheet 17 of 18
`
`5,274,832
`
`
`
`
`122
`112—— Wp
`Vp
`120
`
`104
`N
`| | 1,3
`106
`
`102
`
`eee
`eee
`
`FIG. 17
`
`Petitioner Microsoft Corporation - Ex. 1020, p. 18
`
`Petitioner Microsoft Corporation - Ex. 1020, p. 18
`
`
`
`
`
`ee
`
`ledleyle
`Le
`
`
`Frege
`AH48pI
`
`
`Petitioner Microsoft Corporation - Ex. 1020, p. 19
`
`
`
`r|
`
`Petitioner Microsoft Corporation-Ex. 1020, p. 19
`
`
`
`1
`
`5,274,832
`
`SYSTOLIC ARRAY FOR MULTIDIMENSIONAL
`MATRIX COMPUTATIONS
`
`BACKGROUNDOF THE INVENTION
`1. Field of the invention
`The present invention relates to array processors, and
`in particular, to systolic array processors which process
`multiple signals in paralle] in multiple dimensions.
`2. Description of the Related Art
`Systolic processors, i.e. processors which systolically
`“pump,” or transfer, data from one processing element
`to another, are well known in the art. Systolic arrays
`have been used to increase the pipelined computing
`capability, and therefore the computing speed, of vari-
`ous types of signal processors.
`Systolic array processors are particularly useful for
`processing, e.g. multiplying, two signal sets, where the
`first signal set represents a matrix parameterset and the
`second signal set represents a vector parameterset. In
`other words, the first signal set represents a matrix pa-
`rameter set which can be represented as an M-by-K
`(“MxK") matrix having M rows and K columns of
`parameters, and the second signal set represents a Kx1
`vector having K rows and 1 column of parameters.
`Referring to FIG. 1, a representation of the matrix-
`vector multiplication of two such signal sets can be
`seen. The matrix signal set W has matrix signals Wz)
`and the vector signa] set V has vectorsignals Vy, where
`1 is an element of the set {1,2,3,...,M} and J is an
`elementof the set {1,2,3,...,K}. This can be expressed
`mathematically by the following formula:
`
`xz
`
`J= 1 Wry
`
`Oo;
`
`Such signal sets are also found in manyartificial neu-
`ral network models, including the Hopfield neural net-
`work model. Referring to FIG. 2, a simple artificial)
`neural network with its associated signal sets can be
`seen. Thefirst layer of neurons nj, y, or nodes, receives
`some form ofinput signals ly, and based thereon, gener-
`ates a numberof voltage signals Vy, which can be repre-
`sented as a voltage vector V.
`Coupling the respective voltage signals Vy to the
`second layer of neurons n2,7 are a number of scaling
`elements (e.g. “adaptive weights”), which introduce
`scaling, or “weigh,” signals Wy,
`for
`scaling or
`“weighting” the voltage signals Vy prior to their being
`received by the second layer neurons n2. It will be
`understood that, with respect to the subscripted nota-
`tion for representing the scaling or weighting signals
`Wz, the first subscripted character “I” represents the
`destination neuron n2,;in the second layer, and the sec-
`ond subscripted character “J” represents the source
`neuron ny of the voltage signal Vin the first layer.
`The simplest form of systolic processing array used
`for performing the matrix-vector multiplication of sig-
`nal sets, as discussed above,
`is one-dimensional. One
`type of one-dimensional systolic array is a “ring” sys-
`tolic array, shown in FIG. 3.
`The systolically coupled processing elements Ny are
`interconnected as shown, with signal flow represented
`by the arrows. First, the corresponding voltage signals
`Vj are initially coupled into their corresponding pro-
`cessing elements Ny. Then, following the application of
`each clock pulse (not shown, but commonto each pro-
`cessing element N,), the matrix signals W,,7 are sequen-
`
`2
`tially inputted to their corresponding processing ele-
`ment Ny, as shown. Therein, each matrix signal Wz,; is
`multiplied by its corresponding voltage signal Vy and
`accumulated, i.e. stored, within the processing element
`Ny.
`the foregoing is
`Following the next clock signal,
`repeated, with the voltage signals V, being transferred
`to subsequent processing elements Ny to be multiplied
`by the corresponding matrix signal Wy, therein. For
`example, the voltage signals Vz which are transferred
`between the processing elements N,are shown in paren-
`theses. This is repeated K — 1 times,i.e. for a total of K
`times, to produce the final matrix-vector product out-
`puts O;. The “ring” configuration facilitates multiple
`iterations of the matrix-vector products, a desirable
`feature used in the learning phase of anartificial neural
`network. Further discussions of the ring systolic array
`can be found in “Parallel Architectures for Artificial
`Neural Nets,” by S. Y. Kung and J. N. Hwang, JCNN
`1989, pp. II-165 through I-172.
`A second type of one-dimensional systolic array re-
`lies on a configuration in accordance with the
`“STAMS”(Systematic Transformation of Algorithms
`for Multidimensional Systolic arrays) technique. The
`STAMS techniqueis discussed in detail in “Algorithms
`for High Speed Multidimensional Arithmetic and DSP
`Systolic Arrays,” by N. Ling and M. A. Bayoumi, Pro-
`ceedings of the 1988 International Conference on Paral-
`lel Processing, Vol. I, pp. 367-374. An example of a
`one-dimensional STAMS systolic array is shown in
`FIG. 4. dimensional arrays is shorter and more process-
`ing is done in parallel. This three-dimensional configu-
`ration requires only T+ K—1 clock cycles.
`Even though the two-dimensional and three-dimen-
`sional STAMSsystolic array configurations discussed
`above provide improvements with respect to processing
`speed, minimal if any improvement is provided with
`respect to the number and complexity of the local or
`49 global interconnections required for inputting the ma-
`trix W and vector V signals. Furthermore, even though
`the one-dimensional ring systolic array already provides
`reasonable processing speed, its requisite global inter-
`connections are complex and impractical. Moreover, no
`improvements are provided by any of the foregoing
`arrays with respect to matrix signal Wy), Storage re-
`quirements.
`Moreover, the two- and three-dimensional STAMS
`systolic array configurations described above are not
`truly two- or three-dimensional, respectively. The two-
`dimensional array, as well as each two-dimensional
`array plane within the three-dimensional array, have
`their processing elements N4.z interconnected along
`one dimension only, e.g. left to right. Therefore, the
`systolic processing actually occurs in one dimension
`only. Thus, full multidimensional parallelism or pipelin-
`ing is not achieved and maximum processing speed,i.e.
`minimum processing time, cannot be achieved.
`It would be desirable to have a true multidimensional
`systolic array configuration providing true multidimen-
`sional pipeline operation to maximize processing speed.
`It would be further desirable to have such a multidimen-
`sional systolic processing array in which minimal global
`or local interconnects are required for inputting the
`matrix
`First, just as in the ring systolic array of FIG. 3, the
`voltage signals Vy are initially inputted into their re-
`spective processing elements Ny. Then, the matrix sig-
`
`45
`
`10
`
`20
`
`25
`
`35
`
`45
`
`&
`
`65
`
`Petitioner Microsoft Corporation - Ex. 1020, p. 20
`
`Petitioner Microsoft Corporation - Ex. 1020, p. 20
`
`
`
`3
`nals W,y are inputted into the processing elements Ny,
`with each respective processing element Ny receiving
`one column of the matrix of matrix signals Wz, as
`shown. The weight-voltage products are summed with
`the corresponding weight-voltage products from the
`preceding processing element Ny—1 and then systoli-
`cally transferred to the next processing element N7¥+1,
`and the process continues.
`The inputting of the matrix signals W;, into each
`successive processing element N, is delayed by one
`additional clock pulse per processing element stage to
`allow for the delays associated with the systolic trans-
`ferring of the accumulated products. This delay can be
`accomplished by inputting zeros to a processing ele-
`ment Ny until the systolically transferred accumulated
`products begin to arrive. However, this delay adversely
`affects the processing speed. As compared to the ring
`systolic array of FIG. 3 which requires K clock cycles,
`the STAMS systolic array requires 2K — 1 clock cycles
`to obtain the product outputs O7 of this matrix-vector
`multiplication.
`A numberof problemsare associated with using these
`one-dimensional systolic arrays. One problem involves
`the inputting of the voltage signals Vy. If the voltages
`Vy are to be loaded simultaneously in parailel, global
`interconnects are required to accomplish this. If they
`are to be loaded sequentially in serial, numerous local
`interconnects are required, as well as K clock cycles.
`Another problem involves the inputting of the matrix
`signals Wy. If the matrix signals W;,7are stored locally
`within each processing element Ny, the processing ele-
`ments Ns must be large enough to provide sufficient
`storage, i.e. memory, therefor. On the other hand, if the
`matrix signals W7;, are not stored locally within each
`processing element Ny, but instead inputted as needed,
`the global interconnections necessary to do this become
`complex and impractical. Either many parallel
`input
`lines, e.g. a wide signal bus structure, or a large number
`of clock cycles must be provided.
`A third problem involves the amountof time required
`to perform the matrix-vector multiplication, ie. 2K —1
`clock cycles for the STAMSsystolic array. Although
`the ring systolic array requires only K clock cycles, the
`problem remains, as discussed immediately above, of
`providing either sufficient local matrix signal] storage or
`complex global interconnections.
`One approach to addressing these problemsofinter-
`connects, storage area and processing time involves the
`use of multidimensional systolic processing arrays. For
`example, parallelism,
`i.e. parallel processing, can be
`introduced by subdividing the matrix signals Wy;and
`vector signals Vy. This can be diagrammatically visual-
`ized as seen in FIGS. 5A-5B. This can be expressed
`mathematically by the following formula:
`
`Each row I of the matrix W is divided into P groups
`of Q signals Wzy. In other words, the first of the P
`groups of Q signals Wy;contains the matrix signals
`W1,1-W1,9. Similarly, the vector V is divided into P
`groups of Q voltages Vy. For example, the first of the P
`groups of Q voltages Vs includes the voltages V|-Vo.
`This can be visualized in even simpler form as shown in
`FIG. 5B.
`The processing of these P groups of Q signals Wz,
`Vy can be accomplished by using several one-
`
`5,274,832
`
`4
`dimensional STAMS systolic arrays, such as that shown
`in FIG. 4, in parallel, as shown in FIG. 6A. The opera-
`tion of each separate systolic array is in accordance
`with that described for the one-dimensional STAMS
`systolic array of FIG. 4 above, with the exception that
`only Q, rather than K, processing (i.e. clock) cycles are
`required for each systolic array to complete one sub-
`product. The subproducts of each array are then
`summed together to provide the final product outputs
`O;. Visualizing this systolic array configuration as two-
`dimensional is perhaps moreeasily done by referring to
`FIG.6B.
`This two-dimensional systolic array configuration is
`an improvement over the one-dimensional STAMS
`configuration, with respect to processing time. Process-
`ing time is reduced since each one-dimensional array,
`i.e. each pipeline of processors, within the two-dimen-
`sional array is shorter and more processing is done in
`parallel. This configuration requires only K+Q—1
`clock cycles to obtain the product outputs O; of the
`matrix-vector multiplication.
`Further improvement has been achieved by extend-
`ing the two-dimensional STAMS systolic array of FIG.
`6A to a three-dimensional systolic array. This can be
`done by further subdividing the matrix W and vector V
`signals into T groups of P groups of Q signals Wz, Vy.
`This can be visualized diagrammatically by referring to
`FIGS. 7A~-7B. This can be expressed mathematically by
`the following formula:
`
`T-1 P-1Q-1
`= 2
`z 5
`Oo; = G=0 E=0 F=0 W1POG+ QE+F+1¥PQG+Q0F+F+1
`
`20
`
`25
`
`30
`
`As seen in FIG. 7A, each row I of the matrix W and
`the vector V is divided into T groups, which in turn are
`divided into P groups of Q signals Wz, Vy. For exam-
`ple, the first of the P groups within the first of the T
`groups contain the matrix signals W1,:-W1,9 and the
`vector signals V}-Vg. FIG. 7B represents a more sim-
`plified depiction of this multiple subdivision of the ma-
`trix W and vector V signals.
`Referring to FIG. 8A,a realization of such a three-di-
`mensional]systolic array is illustrated. Two-dimensional
`systolic arrays, similar to that illustrated in FIG. 6A, are
`disposed as if on T parallel planes. The operation of
`each of the T two-dimensional systolic arrays is similar
`to that as s described above for FIG. 6A. The sub-
`product outputs of each of the T two-dimensional ar-
`Tays are summed together to produce the full product
`outputs O;. The three-dimensional nature of this array
`can perhaps be better visualized by referring to FIG.
`8B
`This three-dimensionat STAMS systolic array con-
`355
`figuration is an improvementover the two-dimensional
`configuration inasmuch as fewer processing, i.e. clock,
`oy=7s! 8s! wy v,
`
`= E=0 Fa0 1LQE+ Ft" OE+F+t
`cycles are required to complete each product output O,.
`Processing time is reduced since each one-dimensional
`array,
`i.e. each and vector signals. It would be still
`further desirable to have such a multidimensional sys-
`tolic processing array with minimal matrix signal stor-
`age requirements for each processing element.
`SUMMARYOF THE INVENTION
`
`45
`
`50
`
`60
`
`65
`
`A multidimensional systolic array processor in accor-
`dance with the present invention has an architecture
`which maximizes processing parallelism and minimizes
`global interconnections. Further, the present invention
`
`Petitioner Microsoft Corporation - Ex. 1020, p. 21
`
`Petitioner Microsoft Corporation - Ex. 1020, p. 21
`
`
`
`5,274,832
`
`10
`
`20
`
`25
`
`30
`
`5
`minimizes local matrix signal storage requirements
`within each processing element.
`Thepresent invention maximizes processing parallel-
`ism by interconnecting its processing elements along
`multiple dimensions. Therefore,
`systolic processing
`occurs along multiple dimensions. For example, a two-
`dimensional systolic array processor in accordance with
`the present invention includes a PxQ matrix having P
`rows and Q columns of processing elements, each of
`which is systolically coupled row-to-row and column-
`to-column for full pipeline processing within each row
`and each column. A three-dimensional systolic array
`processor has a PxQxT array with T planes of P rows
`and Q columnsof processing elements, each of which is
`systolically coupled plane-to-plane-to-plane for
`full
`pipeline processing.
`Thepresent invention minimizes global interconnec-
`tions of the processing elements. For the two-dimen-
`sicnal case, appropriate matrix and vector signal subsets
`are inputted to only one row and one column of the
`two-dimensional processing array. These matrix and
`vectorsignal subsets are specifically formed so that they
`need to be inputted to only one row and one column,
`and yet still be properly processed systolically along all
`dimensions within the array.
`For the three-dimensional case, appropriate matrix
`and vector signa] subsets are inputted to three perpen-
`dicular planes of the three-dimensional processing ar-
`ray. For higher-dimensional cases, appropriate matrix
`and vector signal subsets are similarly inputted to the
`higher-dimensional processing arrays.
`The present invention minimizes local matrix signal
`storage requirements by inputting specifically formed
`matrix signal subsets to only one row and one columnof 35
`the two-dimensional processing array, and to three
`perpendicular planes of the three-dimensional process-
`ing array. These matrix signal subsets are formed to
`allow thesizes of the matrix signal subsets to be reduced
`as they are systolically transferred to subsequent pro-
`cessing elements along each dimension within the array.
`Asthe matrix signal subsets decrease in size through the
`array, the local storage, e.g. memory, needed for tempo-
`rarily storing each matrix signal subset within each
`processing elementis successively reduced. Processing
`speedis not sacrificed since the matrix signal subsets are
`transferred to the subsequent processing element at a
`clock rate higher than that used for systolically transfer-
`ring the vector signa] subsets.
`These and other objectives, features and advantages
`of the present invention will be understood upon con-
`sideration of the following detailed description of the
`invention and the accompanying drawings.
`BRIEF DESCRIPTION OF THE DRAWINGS
`
`6
`FIGS. 7A-7Billustrate diagrammatically a conven-
`tional matrix-vector multiplication, wherein the matrix
`and vector of FIGS. 5A-5B are further subdivided into
`matrix and vector subsets, respectively.
`FIGS. 8A-8Billustrate a conventional quasi three-di-
`mensional systolic array.
`FIG. 9 illustrates the Layer 1 and 2 neurons of FIG.
`2 reconfigured as two-dimensional! neural arrays in ac-
`cordance with the present invention.
`FIG.10 illustrates diagrammatically the reconfigura-
`tion of the one-dimensional vector signal set of FIG. 1
`into a two-dimensional vector signal set in accordance
`with the present invention.
`FIG. 11 illustrates diagrammatically the matrix-vec-
`tor multiplication of the matrix and vector columnsig-
`nal subsets in accordance with the present invention.
`FIG. 12 illustrates diagrammatically the matrix-vec-
`tor multiplication of the matrix and vector row signal
`subsets in accordance with the present invention.
`FIG, 13illustrates a block diagram of a two-dimen-
`sional systolic array processor in accordance with the
`present invention.
`FIG. 14illustrates a single processing element of the
`two-dimensional processor of FIG. 13.
`FIG. 15 illustrates a functional block diagram of a
`single processing element of the two-dimensional pro-
`cessor of FIG.13.
`FIG. 16 illustrates the reduced local matrix signal
`storage requirements of the two-dimensional processor
`of FIG. 13.
`FIG. 17 further illustrates the reduced local matrix
`signal storage requirements of the two-dimensional
`processor of FIG. 13.
`FIG. 18 illustrates a block diagram of a three-dimen-
`sional systolic array processor in accordance with the
`present invention.
`DETAILED DESCRIPTION OF THE
`INVENTION
`
`45
`
`350
`
`55
`
`two-layer
`
`1 illustrates diagrammatically a conventional
`FIG.
`matrix-vector multiplication.
`FIG. 2 illustrate a simple conventional
`artificial neural network.
`FIG. 3 illustrates a conventional one-dimensional
`“ring” systolic array.
`FIG. 4 illustrates an alternative conventional one-
`dimensional systolic array.
`FIGS. 5A-5B illustrate diagrammatically a conven-
`tional matrix-vector multiplication, wherein the matrix
`and vector are subdivided into matrix and
`FIGS. 6A-6Billustrate a conventional quasi two-di-
`mensional systolic array.
`
`60
`
`65
`
`Referring to FIG. 9, the Layer 1 and 2 neurons nj, n2
`of an artificial neural network (as shown in FIG. 1) are
`reconfigured into two-dimensional neural arrays. The
`original input signals ly now have double subscripted
`notation to reflect
`the two-dimensional set of input
`signals Iy.z. Similarly, the original voltage signals Vz
`now have double subscripted notation to reflect
`the
`two-dimensionality of the set of voltage signals Vy,z.
`Indicated in brackets for some of the Layer 1 neuronsin
`FIG. 9 are the original Layer 1 neuron and voltage
`identifiers (the remaining identifiers being left out for
`clarity).
`The Layer 2 neurons also now have double sub-
`scripted notation to reflect the new two-dimensionality
`of the set of Layer 2 neurons n4_8. Indicated in brackets
`for some of the Layer 2 neuronsare the original Layer
`2 neuron identifiers (the remaining identifiers being left
`out for clarity).
`The original matrix, e.g. “weight,” signals W;, now
`have quadruple subscripted notation to reflect their new
`multidimensionality, i.e. W.4,.y,z. The first subscripted
`pair of characters “‘A,B” represents the destination neu-
`ron nya in the second layer. The second subscripted
`pair of characters “Y,Z" represents the source of the
`voltage signal Vyzin thefirst layer. Indicated in brack-
`ets for someof the matrix signals W4\5.y,zare the origi-
`nal matrix signal identifiers Wy, (the remaining identi-
`fiers being left out for clarity).
`
`Petitioner Microsoft Corporation - Ex. 1020, p. 22
`
`Petitioner Microsoft Corporation - Ex. 1020, p. 22
`
`
`
`7
`It should be understood that the representations of
`the Layer 1 and 2 neurons, along with their subscripted
`notations, were selected arbitrarily. They can be recon-
`figured as desired, provided that
`the resulting sub-
`scripted notation for the matrix signals be consistent
`therewith.
`It should be further understood that it has been as-
`sumed for the sake of simplicity that M=K for the
`reconfigured array of Layer 2 neuronsas represented in
`FIG. 9. However, this is not necessary to the present
`invention. Ideally, the numbers of rows and columns
`should be equal, i.e. P=Q. This is to maximize the im-
`provements in processing speed and to minimize the
`local matrix signal storage requirements in accordance
`with the present invention. However,
`if K cannot be expressed as a square,i.e. if P=+Q, then
`the numbers of neurons in both Layers 1 and 2 can be
`approximated to the nearest square. Extra processing
`cycles would be required because of such an approxima-
`tion, but significant improvements in processing speed
`and local matrix signal storage requirements would still
`be realized.
`Referring to FIG. 10, the reconfiguration of the one-
`dimensional vector signal set Vzof FIG. 1 into a two-di-
`mensional vector signal set Vy,z can be understood.
`The one-dimensional vector signal set Vy is initially
`mapped into a two-dimensional vector signal set with
`the original subscripted notation left intact. This two-di-
`mensional vector signal set is then given new doubie
`subscripted notation to reflect the two-dimensionality
`of the reconfigured vector signal set V y,z.
`Also indicated in FIG. 10 (with dashed lines within
`the two-dimensional vector signal sets) are two vector
`signa] subsets. One is referred to as the vector column
`signal subset Vc and the otheris referred to as the vec-
`tor row signal subset Vr. These vector signal subsets
`Vo. Vrare multiplied by the matrix signals W4,2.y,z as
`represented in FIG. 9. The matrix signals W42.y,z are
`separated into corresponding matrix column Wc and
`row Wpsignal subsets. The vector column Vc and row
`Vr signal subsets are multiplied by the matrix column
`Weand row Wrsignal subsets, respectively, as shown
`in FIGS. 11 and 12. Therefore, the matrix-vector prod-
`ucts O;are identified according to the following formu-
`las:
`
`O;=Oc+OR
`
`where:
`
`OF Wanyzv.
`oc= 2
`CH xe yox ABYZ"
`Zz
`Ja Y-Z-P+ 2 P-24+2)
`X= Z, for odd Z
`= Z + 1, for even Z;
`
`and
`
`0.
`
`V,
`g W,
`£
`R= Yo1Z2=x ABYZ' I
`Yy
`J=Z-Y¥-QO+ 2 @-¥+2)
`X = Y, for even Y
`
`5,274,832
`
`8
`-continued
`= Y + 1, for odd YF.
`
`13
`
`20
`
`35
`
`40
`
`45
`
`50
`
`55
`
`65
`
`Based upon the foregoing. the vector Vc, Vr and
`matrix Wo, Wp signal subsets can be identified accord-
`ing to the following formulas:
`
`where:
`
`Zz
`J=Y-Z—-—P+ X (P-Z+2)
`Z=)
`
`Z=1,3,3,...,9
`
`YSRMX41,¥42,...,P
`
`X = Zfor odd Z
`= Z+ 1, for even Z;
`
`and
`
`Vr = Vyz= ¥3
`
`Wr= WaByz
`where:
`
`Yr
`J=Z-¥-O+ 3 @-¥+d
`Y=o1,2,3,...,P
`
`ZX X4+1,X¥4+2,....9
`
`X = ¥, foreven Y
`= Y+ 1, for odd ¥
`
`Referring to FIG. 13, a two-dimensional systolic
`processing array 100 in accordance with the present
`invention includes K processing elements 102, desig-
`nated by N..7 where A is an elementofthe set {1,2,3,.
`.
`. P}, B is an element of the set {1,2,3,... ,Q} and
`K=PQ. The processing elements 102 are intercon-
`nected in a two-dimensional matrix 100 having P rows
`and Q columns.
`The processing elements 102 are mutually coupled
`column-to-column via matrix row subset signallines 104
`and vector row subset signal
`lines 106. These signal
`lines 104, 106 provide the means by which the matrix
`WR and vector Vr row signal subsets are systolically
`transferred column-to-column among the processing
`elements 102. The processing elements 102 are further
`mutually coupled row-to-row via matrix column subset
`signal lines 108 and vector column subset signal lines
`110. It is by these signal lines 108, 110 that the matrix
`Wceand vector Vc signal subsets are systolically trans-
`ferred row-to-row among the processing elements 102.
`All processing elements 102 in the first row 112 re-
`ceive a vector signal input 114 whichis a vector column
`signal subset Vc of the vector signal set V. All process-
`ing elements 102 in the first row 112 further receive a
`matrix signal input 116 which is a matrix columnsignal
`subset Wc of the matrix signal set W.
`All process