IPR2018-01606, No. 1020 Exhibit - US Patent 5,274,832 to Khan (P.T.A.B. Sep. 6, 2018)

US005274832A
`119)
`United States Patent
`5,274,832
`[11] Patent Number:
`Dec. 28, 1993
`[45] Date of Patent:
`Khan
`
`UNAMANAA
`
`{54] SYSTOLIC ARRAY FOR
`MULTIDIMENSIONAL MATRIX
`COMPUTATIONS
`
`(75]
`
`Inventor:
`
`Emdadur R. Khan, San Jose, Calif.
`
`[73] Assignee: National Semiconductor Corporation,
`Santa Clara, Calif.
`
`[21] Appl. No.: 592,954
`[22] Filed:
`Oct. 4, 1990
`
`[51] Et, CUS vecccsssccsssscssssessssssenssseesssessese GOGF 15/347
`
`[52] US. CD. coeeecccccceccesesssessssseee 395/800; 364/271.2;
`364/258; 364/258.2; 364/276.8; 364/DIG. 1;
`364/754
`[58] Field of Search .........0..060. 364/754, 845, 728.05,
`364/841; 395/800
`
`[56]
`
`References Cited
`U.S. PATENT DOCUMENTS
`
`1/1985 Kung ounces, 364/754
`4,493,048
`7/1986 Bocker......
`.. 364/845
`4,603,398
`
`4,701,876 10/1987 McCanny......
`.- 364/728
`§,107,452
`4/1992 Karmarkar....
`. 364/754
`5,146,543
`9/1992 Vassiliadis ssotvtnarinesasenien 395/27
`OTHER PUBLICATIONS
`
`S. Y. Kung & J. N. Hwang, “Parallel Architecture for
`Artificial Neural Nets,” IJCNN, 1989, pp. I-165-II-
`-172.
`N.Ling & M. A. Bayoumi, “Algorithms for High Speed
`Multi-Dimensional Arithmetic and DSP Systolic Ar-
`rays,” Proceedings of the 1988 International Confer-
`ence on Parallel Processing. pp. 367-374.
`S. Y. Kung & J. N. Hwang, “A Unifying Algorith-
`m/Architecture for Aritificial Neural Networks,”
`IEEE Magazine, Feb. 1989, pp. 2505-2508.
`
`Primary Examiner—Eric Coleman
`Attorney, Agent, or Firm—Limbach & Limbach
`
`ABSTRACT
`[57]
`A multidimensional systolic array processor uses a mul-
`tidimensional array of systolically coupled processing
`elements to perform matrix-vector multiplication of
`matrix and vector signal sets. A two-dimensional array
`uses a PQ matrix (P rows and Q columns) of process-
`ing elements which are coupled to systolically process
`the signals, e.g. via multiplication and accumulation.
`The processing elements are coupled both row-to-row
`and column-to-column for pipeline processing within
`each row and each column,i.e. multidimensional pipe-
`lining,
`thereby increasing processing parallelism and
`speed. Interconnectivity of the processing elements is
`minimized by forming separate column and row signal
`subsets of the vector signal set which are coupled simul-
`taneously to each processing element in the first row
`and first column, respectively. Size of the processing
`elements is minimized by reducing local storage of ma-
`trix signal subsets within each processing element. Sepa-
`rate column and row signal subsets of the matrix signal
`set are formed and coupled into each processing ele-
`mentofthe first row and first column, respectively. As
`the matrix column and row signal subsets are systoli-
`cally processed and transferred row-to-row and co-
`lumn-to-column, respectively, each signal subset is re-
`ducedin size by one signal, thereby requiring the trans-
`fer and temporary local storage of successively smaller
`matrix signal subsets. A three-dimensional processor
`uses a PXQxXT array (T planes of P rows and Q col-
`umns) of processing elements which are coupled plane-
`to-plane-to-plane.
`
`45 Claims, 18 Drawing Sheets
`
`Petitioner Microsoft Corporation - Ex. 1020, p. 1
`
`Petitioner Microsoft Corporation - Ex. 1020, p. 1
`
`

`U.S. Patent
`
`Dec. 28, 1993
`
`Sheet 1 of 18
`
`5,274,832
`
`Wy
`
`Wyn Wye 8 Wy
`
`Wor Wa
`
`Wok
`
`Vy
`Vo
`V3
`
`01
`09
`05
`
`W531
`.
`
`-
`
`W3K
`J °
`
`7} a]
`Vy
`Oy
`
`
`
`Wat==¥w20Wz 8° WK VK Oy
`
`
`
`(PRIOR ART)
`FIG.
`1
`
`nia
`
`V4
`
`Noy
`
`eeOO
`O
`(2.2)
`CO)
`
`2
`
`13
`
`ty
`
`1
`
`ny3 V3
`OC
`
`CO)
`
`nie Vy BD “not
`
`(W35
`
`no3
`
`CO)
`
`<———(W >
`
`@
`
`(Wusk)
`
`@
`
`(PRIOR ART)
`FIG. 2
`
`Petitioner Microsoft Corporation - Ex. 1020, p. 2
`
`Petitioner Microsoft Corporation - Ex. 1020, p. 2
`
`

`U.S. Patent
`
`Dec, 28, 1993
`
`Sheet 2 of 18
`
`5,274,832
`
`1=<
`3s $+ ee =
`=
`Fs
`=
`
`L>|
`seeese
`=
`x=
`=
`
`F =<
`
`a e
`
`°
`
`-_
`
`e = x
`
`m
`= o
`
`e
`
`=
`e =
`
`Er
`5cE
`
`Petitioner Microsoft Corporation - Ex. 1020, p. 3
`
`Petitioner Microsoft Corporation - Ex. 1020, p. 3
`
`

`U.S. Patent
`
`Dec. 28, 1993
`
`Sheet 3 of 18
`
`5,274,832
`
`a p34
`=
`=
`Oo © #884 © = os gece S
`
`ac
`>
`
`PRODUCT
`
`OUTPUTS Or
`
`e e ®
`
`So
`
`©
`
`nm
`=
`
`wr
`nt eee
`=
`=
`=
`
`->
`
`m
`=z
`
`W44
`
`Wo4
`
`W314
`
`Wut
`
`Vi,
`
`(PRIORART)4
`
`FIG.
`
`Petitioner Microsoft Corporation - Ex. 1020, p. 4
`
`Petitioner Microsoft Corporation - Ex. 1020, p. 4
`
`

`U.S. Patent
`
`Dec. 28, 1993
`
`Sheet 4 of 18
`
`5,274,832
`
`°eee"Tyeee
`
`Why255ONy28©Oye©+HyOlyee8Cyhy
`
`WZA°ee0-NZA°°°°ee°ZZA2M
`
`e°itA
`
`(LUYUOLid)
`
`My888OWNeee,6+NyINy
`
`Giae:[wy]++=[ie-1)]
`
`
`
`Hla..low
`
`Petitioner Microsoft Corporation - Ex. 1020, p. 5
`
`Petitioner Microsoft Corporation - Ex. 1020, p. 5
`
`

`U.S. Patent
`
`Dec. 28, 1993
`
`Sheet 5 of 18
`
`5,274,832
`
`Vy Wa
`Wa
`
`Wat
`
`V9 0
`
`Wi,2
`M02
`
`Wy.2
`
`Vat W1.0+1 Varo 0
`Wyat2
`Watt
`Woa42
`
`Wy.a+2
`
`V K-0 Wika K-Q+1 0
`WiK-QH1
`Wok-
`WoK-41
`°
`°
`
`Wyk-a
`
`Wuk-o+
`
`
`
`
`PRODUCT
`OUTPUTS
`OT
`
`Vx
`
`0
`°
`°
`0
`
`(PRIOR ART)
`FIG.
`6A
`
`1K
`Wok
`
`Wuk
`
`Petitioner Microsoft Corporation - Ex. 1020, p. 6
`
`Petitioner Microsoft Corporation - Ex. 1020, p. 6
`
`

`U.S. Patent
`
`Dec. 28, 1993
`
`Sheet 6 of 18
`
`5,274,832
`
`FIG.6B
`
`
`
`Petitioner Microsoft Corporation - Ex. 1020, p. 7
`
`Petitioner Microsoft Corporation - Ex. 1020, p. 7
`
`

`U.S. Patent
`
`Dec. 28, 1993
`
`Sheet 7 of 18
`
`5,274,832
`
`A
`
`
`O(t=d),OdL'Ny...eee\'Ny
`140AeeeeetCu
`
` [isau(i-n)|we,[(a]---(ei]---(a]---Pafe)-ey
`02A°e
`
`
`
`
`
`OADdl'lye©eOd(I-L)'lys©eOdZ‘lye©ebtid'lyOd'byessO(I-d)'lyss*O'byseeltOlyUlyssethy
`eoy——yo”
`
`1)[aun]oe+
`Odi4(LuvYowd)G/“914
`edeeeZ1ihpoyOO1°°eé1
`°(Layold)WZ“OL
`
`Petitioner Microsoft Corporation - Ex. 1020, p. 8
`
`Petitioner Microsoft Corporation - Ex. 1020, p. 8
`
`

`. D
`
`YPQ+2rp
`Vege
`WyPQ+1 i\P
`W2paHt
`. Wipout
`
`ow(r-i
`W2,(1P--1)0
`Wi(TP=10
`
`az(1P-1)0+1
`W2,(1P-1)Q41
`Wicm-tyos
`
`Vep+ija 9
`Li(P+1)0
`W2(P+1)9
`
`(PRIOR ART)
`FIG. 8A-1
`
`
`Petitioner Microsoft Corporation - Ex. 1020, p. 9
`
`Petitioner Microsoft Corporation - Ex. 1020, p. 9
`
`U.S. Patent
`
`ec. 28, 1993
`
`Sheet 8 of 18
`
`5,274,832
`
`V¢r-1)Pq|Y(t-1)Pa+1
`0
`—_—V(r-1)paea 9
`will“post virngova
`Ws(T-1)PQ
`"2(T-1)POH 2(T--1)PO+0
`(S)
`Wa(t-~1)PQ
`wirTPO wy(-npa+a
`. Wuir-1p9 .
`
`PRODUCT
`OUTPUTS
`07
`
`

`U.S. Patent
`
`Dec. 28, 1993
`
`Sheet 9 of 18
`
`5,274,832
`
`Vater 1.091 Voe2
`
`0
`wat?
`
`Wil,(P—1)a
`
`0
`Wy(P-1)0+1
`
`W1(P-1)Q
`Wo
`(p-
`4 "0 Watpotja
`Wil(P-O}+1
`
`KEY TO FIG. 8A
`
`Petitioner Microsoft Corporation - Ex. 1020, p. 10
`
`Petitioner Microsoft Corporation - Ex. 1020, p. 10
`
`

`U.S. Patent
`
`Dec. 28, 1993
`
`Sheet 10 of 18
`
`5,274,832
`
`
`Oy
`
`PRODUCT
`OUTPUTS
`
`Petitioner Microsoft Corporation - Ex. 1020, p. 11
`
`Petitioner Microsoft Corporation - Ex. 1020, p. 11
`
`

`U.S. Patent
`
`Dec. 28, 1993
`
`SloLee
`alyztIo[zz
`
`u
`
`Sheet 11 of 18
`
`5,274,832
`
`Petitioner Microsoft Corporation - Ex. 1020, p. 12
`
`Petitioner Microsoft Corporation - Ex. 1020, p. 12
`
`

`U.S. Patent
`
`Dec. 28, 1993
`
`Sheet 12 of 18
`
`5,274,832
`
`re°fou,e°e(dt,eee!Cytt,It,|<dqeeewaryaNbeA(
`dD,teoeLDQZOQND,
`
`d'laeeeCh,Uh,ilk»
`O1‘OLcy
`CHOeeeC+0Zc+,
`AaieeeOf,024DA
`
`JA
`
`e
`
`7(4
`
`ee
`
`eee
`
`ee
`
`3“
`
`Ne
`-
`
`Petitioner Microsoft Corporation - Ex. 1020, p. 13
`
`Petitioner Microsoft Corporation - Ex. 1020, p. 13
`
`

`U.S. Patent
`
`Dec. 28, 1993
`
`Sheet 13 of 18
`
`5,274,832
`
`‘¢AyAAitA
`
`zeA0A
`
`oYA
`
`CPthy
`ZC'l'Ly
`UO-L'by
`
`UriZp
`CoS
`ONT
`
`VEU
`OUby
`Uhh
`
`1%
`ZZ
`wan
`
`ZVICy
`celoy
`VOCy
`
`City
`ZSy
`
`Ur'l'dy
`vS'dy
`
`US't'dA
`VZ'I'dy
`U'Ll'dy
`
`Lb
`
`Ou
`
`bey|=IAM
`Petitioner Microsoft Corporation - Ex. 1020, p. 14
`
`Petitioner Microsoft Corporation - Ex. 1020, p. 14
`
`

`U.S. Patent
`
`Dec. 28, 1993
`
`Sheet 14 of 18
`
`5,274,832
`
`ZlA
`
`Cha
`
`rlA
`
`dla
`
`ZZ,
`
`CZA
`
`wala
`CClhy
`DLNb
`oerity
`Chhy
`Z'll'ly
`
`cZ2'ly
`CZ2'ly
`OZLy
`eevbCly
`tae
`Thehy
`
`c7C'ly
`CCLly
`OUby
`eeHVChy
`Chehy
`
`CZid'ly
`T2d'tMA
`O'ld'ly
`eeV'bed'by
`Cid'hy
`Uhd'hy
`
`Cl‘OW
`
`Petitioner Microsoft Corporation - Ex. 1020, p. 15
`
`che'ly|=Yad
`Petitioner Microsoft Corporation - Ex. 1020, p. 15
`
`

`U.S. Patent
`
`Dec. 28, 1993
`
`WTCLOCK
`
`Sheet 15 of 18
`
`5,274,832
`
`Petitioner Microsoft Corporation - Ex. 1020, p. 16
`
`Petitioner Microsoft Corporation - Ex. 1020, p. 16
`
`

`| U.S. Patent
`
`Dec. 28, 1993
`
`Sheet 16 of 18
`
`5,274,832
`
`dm
`
`G01
`
`q901
`
`821
`
`7-4TTTTTTTTTT7=|Sse|fi11ee|34|ah|=}Yanai|_|eee||Ce||UIoleLoeOSWNOISWNOIS|YOLD9A
`XiMLVIN|aee(nna
`JAoaGt“OU
`qOLt801Bl
`
`O01}Dg0l
`
`JASa
`
`Dy0t
`
`um
`
`921
`
`010
`
`0901
`
`YA
`
`vel
`
`OVA
`
`Petitioner Microsoft Corporation - Ex. 1020, p. 17
`
`Petitioner Microsoft Corporation - Ex. 1020, p. 17
`
`

`U.S. Patent
`
`Dec. 28, 1993
`
`Sheet 17 of 18
`
`5,274,832
`
`
`
`
`122
`112—— Wp
`Vp
`120
`
`104
`N
`| | 1,3
`106
`
`102
`
`eee
`eee
`
`FIG. 17
`
`Petitioner Microsoft Corporation - Ex. 1020, p. 18
`
`Petitioner Microsoft Corporation - Ex. 1020, p. 18
`
`

`
`
`ee
`
`ledleyle
`Le
`
`
`Frege
`AH48pI
`
`
`Petitioner Microsoft Corporation - Ex. 1020, p. 19
`
`
`
`r|
`
`Petitioner Microsoft Corporation-Ex. 1020, p. 19
`
`

`1
`
`5,274,832
`
`SYSTOLIC ARRAY FOR MULTIDIMENSIONAL
`MATRIX COMPUTATIONS
`
`BACKGROUNDOF THE INVENTION
`1. Field of the invention
`The present invention relates to array processors, and
`in particular, to systolic array processors which process
`multiple signals in paralle] in multiple dimensions.
`2. Description of the Related Art
`Systolic processors, i.e. processors which systolically
`“pump,” or transfer, data from one processing element
`to another, are well known in the art. Systolic arrays
`have been used to increase the pipelined computing
`capability, and therefore the computing speed, of vari-
`ous types of signal processors.
`Systolic array processors are particularly useful for
`processing, e.g. multiplying, two signal sets, where the
`first signal set represents a matrix parameterset and the
`second signal set represents a vector parameterset. In
`other words, the first signal set represents a matrix pa-
`rameter set which can be represented as an M-by-K
`(“MxK") matrix having M rows and K columns of
`parameters, and the second signal set represents a Kx1
`vector having K rows and 1 column of parameters.
`Referring to FIG. 1, a representation of the matrix-
`vector multiplication of two such signal sets can be
`seen. The matrix signal set W has matrix signals Wz)
`and the vector signa] set V has vectorsignals Vy, where
`1 is an element of the set {1,2,3,...,M} and J is an
`elementof the set {1,2,3,...,K}. This can be expressed
`mathematically by the following formula:
`
`xz
`
`J= 1 Wry
`
`Oo;
`
`Such signal sets are also found in manyartificial neu-
`ral network models, including the Hopfield neural net-
`work model. Referring to FIG. 2, a simple artificial)
`neural network with its associated signal sets can be
`seen. Thefirst layer of neurons nj, y, or nodes, receives
`some form ofinput signals ly, and based thereon, gener-
`ates a numberof voltage signals Vy, which can be repre-
`sented as a voltage vector V.
`Coupling the respective voltage signals Vy to the
`second layer of neurons n2,7 are a number of scaling
`elements (e.g. “adaptive weights”), which introduce
`scaling, or “weigh,” signals Wy,
`for
`scaling or
`“weighting” the voltage signals Vy prior to their being
`received by the second layer neurons n2. It will be
`understood that, with respect to the subscripted nota-
`tion for representing the scaling or weighting signals
`Wz, the first subscripted character “I” represents the
`destination neuron n2,;in the second layer, and the sec-
`ond subscripted character “J” represents the source
`neuron ny of the voltage signal Vin the first layer.
`The simplest form of systolic processing array used
`for performing the matrix-vector multiplication of sig-
`nal sets, as discussed above,
`is one-dimensional. One
`type of one-dimensional systolic array is a “ring” sys-
`tolic array, shown in FIG. 3.
`The systolically coupled processing elements Ny are
`interconnected as shown, with signal flow represented
`by the arrows. First, the corresponding voltage signals
`Vj are initially coupled into their corresponding pro-
`cessing elements Ny. Then, following the application of
`each clock pulse (not shown, but commonto each pro-
`cessing element N,), the matrix signals W,,7 are sequen-
`
`2
`tially inputted to their corresponding processing ele-
`ment Ny, as shown. Therein, each matrix signal Wz,; is
`multiplied by its corresponding voltage signal Vy and
`accumulated, i.e. stored, within the processing element
`Ny.
`the foregoing is
`Following the next clock signal,
`repeated, with the voltage signals V, being transferred
`to subsequent processing elements Ny to be multiplied
`by the corresponding matrix signal Wy, therein. For
`example, the voltage signals Vz which are transferred
`between the processing elements N,are shown in paren-
`theses. This is repeated K — 1 times,i.e. for a total of K
`times, to produce the final matrix-vector product out-
`puts O;. The “ring” configuration facilitates multiple
`iterations of the matrix-vector products, a desirable
`feature used in the learning phase of anartificial neural
`network. Further discussions of the ring systolic array
`can be found in “Parallel Architectures for Artificial
`Neural Nets,” by S. Y. Kung and J. N. Hwang, JCNN
`1989, pp. II-165 through I-172.
`A second type of one-dimensional systolic array re-
`lies on a configuration in accordance with the
`“STAMS”(Systematic Transformation of Algorithms
`for Multidimensional Systolic arrays) technique. The
`STAMS techniqueis discussed in detail in “Algorithms
`for High Speed Multidimensional Arithmetic and DSP
`Systolic Arrays,” by N. Ling and M. A. Bayoumi, Pro-
`ceedings of the 1988 International Conference on Paral-
`lel Processing, Vol. I, pp. 367-374. An example of a
`one-dimensional STAMS systolic array is shown in
`FIG. 4. dimensional arrays is shorter and more process-
`ing is done in parallel. This three-dimensional configu-
`ration requires only T+ K—1 clock cycles.
`Even though the two-dimensional and three-dimen-
`sional STAMSsystolic array configurations discussed
`above provide improvements with respect to processing
`speed, minimal if any improvement is provided with
`respect to the number and complexity of the local or
`49 global interconnections required for inputting the ma-
`trix W and vector V signals. Furthermore, even though
`the one-dimensional ring systolic array already provides
`reasonable processing speed, its requisite global inter-
`connections are complex and impractical. Moreover, no
`improvements are provided by any of the foregoing
`arrays with respect to matrix signal Wy), Storage re-
`quirements.
`Moreover, the two- and three-dimensional STAMS
`systolic array configurations described above are not
`truly two- or three-dimensional, respectively. The two-
`dimensional array, as well as each two-dimensional
`array plane within the three-dimensional array, have
`their processing elements N4.z interconnected along
`one dimension only, e.g. left to right. Therefore, the
`systolic processing actually occurs in one dimension
`only. Thus, full multidimensional parallelism or pipelin-
`ing is not achieved and maximum processing speed,i.e.
`minimum processing time, cannot be achieved.
`It would be desirable to have a true multidimensional
`systolic array configuration providing true multidimen-
`sional pipeline operation to maximize processing speed.
`It would be further desirable to have such a multidimen-
`sional systolic processing array in which minimal global
`or local interconnects are required for inputting the
`matrix
`First, just as in the ring systolic array of FIG. 3, the
`voltage signals Vy are initially inputted into their re-
`spective processing elements Ny. Then, the matrix sig-
`
`45
`
`10
`
`20
`
`25
`
`35
`
`45
`
`&
`
`65
`
`Petitioner Microsoft Corporation - Ex. 1020, p. 20
`
`Petitioner Microsoft Corporation - Ex. 1020, p. 20
`
`

`3
`nals W,y are inputted into the processing elements Ny,
`with each respective processing element Ny receiving
`one column of the matrix of matrix signals Wz, as
`shown. The weight-voltage products are summed with
`the corresponding weight-voltage products from the
`preceding processing element Ny—1 and then systoli-
`cally transferred to the next processing element N7¥+1,
`and the process continues.
`The inputting of the matrix signals W;, into each
`successive processing element N, is delayed by one
`additional clock pulse per processing element stage to
`allow for the delays associated with the systolic trans-
`ferring of the accumulated products. This delay can be
`accomplished by inputting zeros to a processing ele-
`ment Ny until the systolically transferred accumulated
`products begin to arrive. However, this delay adversely
`affects the processing speed. As compared to the ring
`systolic array of FIG. 3 which requires K clock cycles,
`the STAMS systolic array requires 2K — 1 clock cycles
`to obtain the product outputs O7 of this matrix-vector
`multiplication.
`A numberof problemsare associated with using these
`one-dimensional systolic arrays. One problem involves
`the inputting of the voltage signals Vy. If the voltages
`Vy are to be loaded simultaneously in parailel, global
`interconnects are required to accomplish this. If they
`are to be loaded sequentially in serial, numerous local
`interconnects are required, as well as K clock cycles.
`Another problem involves the inputting of the matrix
`signals Wy. If the matrix signals W;,7are stored locally
`within each processing element Ny, the processing ele-
`ments Ns must be large enough to provide sufficient
`storage, i.e. memory, therefor. On the other hand, if the
`matrix signals W7;, are not stored locally within each
`processing element Ny, but instead inputted as needed,
`the global interconnections necessary to do this become
`complex and impractical. Either many parallel
`input
`lines, e.g. a wide signal bus structure, or a large number
`of clock cycles must be provided.
`A third problem involves the amountof time required
`to perform the matrix-vector multiplication, ie. 2K —1
`clock cycles for the STAMSsystolic array. Although
`the ring systolic array requires only K clock cycles, the
`problem remains, as discussed immediately above, of
`providing either sufficient local matrix signal] storage or
`complex global interconnections.
`One approach to addressing these problemsofinter-
`connects, storage area and processing time involves the
`use of multidimensional systolic processing arrays. For
`example, parallelism,
`i.e. parallel processing, can be
`introduced by subdividing the matrix signals Wy;and
`vector signals Vy. This can be diagrammatically visual-
`ized as seen in FIGS. 5A-5B. This can be expressed
`mathematically by the following formula:
`
`Each row I of the matrix W is divided into P groups
`of Q signals Wzy. In other words, the first of the P
`groups of Q signals Wy;contains the matrix signals
`W1,1-W1,9. Similarly, the vector V is divided into P
`groups of Q voltages Vy. For example, the first of the P
`groups of Q voltages Vs includes the voltages V|-Vo.
`This can be visualized in even simpler form as shown in
`FIG. 5B.
`The processing of these P groups of Q signals Wz,
`Vy can be accomplished by using several one-
`
`5,274,832
`
`4
`dimensional STAMS systolic arrays, such as that shown
`in FIG. 4, in parallel, as shown in FIG. 6A. The opera-
`tion of each separate systolic array is in accordance
`with that described for the one-dimensional STAMS
`systolic array of FIG. 4 above, with the exception that
`only Q, rather than K, processing (i.e. clock) cycles are
`required for each systolic array to complete one sub-
`product. The subproducts of each array are then
`summed together to provide the final product outputs
`O;. Visualizing this systolic array configuration as two-
`dimensional is perhaps moreeasily done by referring to
`FIG.6B.
`This two-dimensional systolic array configuration is
`an improvement over the one-dimensional STAMS
`configuration, with respect to processing time. Process-
`ing time is reduced since each one-dimensional array,
`i.e. each pipeline of processors, within the two-dimen-
`sional array is shorter and more processing is done in
`parallel. This configuration requires only K+Q—1
`clock cycles to obtain the product outputs O; of the
`matrix-vector multiplication.
`Further improvement has been achieved by extend-
`ing the two-dimensional STAMS systolic array of FIG.
`6A to a three-dimensional systolic array. This can be
`done by further subdividing the matrix W and vector V
`signals into T groups of P groups of Q signals Wz, Vy.
`This can be visualized diagrammatically by referring to
`FIGS. 7A~-7B. This can be expressed mathematically by
`the following formula:
`
`T-1 P-1Q-1
`= 2
`z 5
`Oo; = G=0 E=0 F=0 W1POG+ QE+F+1¥PQG+Q0F+F+1
`
`20
`
`25
`
`30
`
`As seen in FIG. 7A, each row I of the matrix W and
`the vector V is divided into T groups, which in turn are
`divided into P groups of Q signals Wz, Vy. For exam-
`ple, the first of the P groups within the first of the T
`groups contain the matrix signals W1,:-W1,9 and the
`vector signals V}-Vg. FIG. 7B represents a more sim-
`plified depiction of this multiple subdivision of the ma-
`trix W and vector V signals.
`Referring to FIG. 8A,a realization of such a three-di-
`mensional]systolic array is illustrated. Two-dimensional
`systolic arrays, similar to that illustrated in FIG. 6A, are
`disposed as if on T parallel planes. The operation of
`each of the T two-dimensional systolic arrays is similar
`to that as s described above for FIG. 6A. The sub-
`product outputs of each of the T two-dimensional ar-
`Tays are summed together to produce the full product
`outputs O;. The three-dimensional nature of this array
`can perhaps be better visualized by referring to FIG.
`8B
`This three-dimensionat STAMS systolic array con-
`355
`figuration is an improvementover the two-dimensional
`configuration inasmuch as fewer processing, i.e. clock,
`oy=7s! 8s! wy v,
`
`= E=0 Fa0 1LQE+ Ft" OE+F+t
`cycles are required to complete each product output O,.
`Processing time is reduced since each one-dimensional
`array,
`i.e. each and vector signals. It would be still
`further desirable to have such a multidimensional sys-
`tolic processing array with minimal matrix signal stor-
`age requirements for each processing element.
`SUMMARYOF THE INVENTION
`
`45
`
`50
`
`60
`
`65
`
`A multidimensional systolic array processor in accor-
`dance with the present invention has an architecture
`which maximizes processing parallelism and minimizes
`global interconnections. Further, the present invention
`
`Petitioner Microsoft Corporation - Ex. 1020, p. 21
`
`Petitioner Microsoft Corporation - Ex. 1020, p. 21
`
`

`5,274,832
`
`10
`
`20
`
`25
`
`30
`
`5
`minimizes local matrix signal storage requirements
`within each processing element.
`Thepresent invention maximizes processing parallel-
`ism by interconnecting its processing elements along
`multiple dimensions. Therefore,
`systolic processing
`occurs along multiple dimensions. For example, a two-
`dimensional systolic array processor in accordance with
`the present invention includes a PxQ matrix having P
`rows and Q columns of processing elements, each of
`which is systolically coupled row-to-row and column-
`to-column for full pipeline processing within each row
`and each column. A three-dimensional systolic array
`processor has a PxQxT array with T planes of P rows
`and Q columnsof processing elements, each of which is
`systolically coupled plane-to-plane-to-plane for
`full
`pipeline processing.
`Thepresent invention minimizes global interconnec-
`tions of the processing elements. For the two-dimen-
`sicnal case, appropriate matrix and vector signal subsets
`are inputted to only one row and one column of the
`two-dimensional processing array. These matrix and
`vectorsignal subsets are specifically formed so that they
`need to be inputted to only one row and one column,
`and yet still be properly processed systolically along all
`dimensions within the array.
`For the three-dimensional case, appropriate matrix
`and vector signa] subsets are inputted to three perpen-
`dicular planes of the three-dimensional processing ar-
`ray. For higher-dimensional cases, appropriate matrix
`and vector signal subsets are similarly inputted to the
`higher-dimensional processing arrays.
`The present invention minimizes local matrix signal
`storage requirements by inputting specifically formed
`matrix signal subsets to only one row and one columnof 35
`the two-dimensional processing array, and to three
`perpendicular planes of the three-dimensional process-
`ing array. These matrix signal subsets are formed to
`allow thesizes of the matrix signal subsets to be reduced
`as they are systolically transferred to subsequent pro-
`cessing elements along each dimension within the array.
`Asthe matrix signal subsets decrease in size through the
`array, the local storage, e.g. memory, needed for tempo-
`rarily storing each matrix signal subset within each
`processing elementis successively reduced. Processing
`speedis not sacrificed since the matrix signal subsets are
`transferred to the subsequent processing element at a
`clock rate higher than that used for systolically transfer-
`ring the vector signa] subsets.
`These and other objectives, features and advantages
`of the present invention will be understood upon con-
`sideration of the following detailed description of the
`invention and the accompanying drawings.
`BRIEF DESCRIPTION OF THE DRAWINGS
`
`6
`FIGS. 7A-7Billustrate diagrammatically a conven-
`tional matrix-vector multiplication, wherein the matrix
`and vector of FIGS. 5A-5B are further subdivided into
`matrix and vector subsets, respectively.
`FIGS. 8A-8Billustrate a conventional quasi three-di-
`mensional systolic array.
`FIG. 9 illustrates the Layer 1 and 2 neurons of FIG.
`2 reconfigured as two-dimensional! neural arrays in ac-
`cordance with the present invention.
`FIG.10 illustrates diagrammatically the reconfigura-
`tion of the one-dimensional vector signal set of FIG. 1
`into a two-dimensional vector signal set in accordance
`with the present invention.
`FIG. 11 illustrates diagrammatically the matrix-vec-
`tor multiplication of the matrix and vector columnsig-
`nal subsets in accordance with the present invention.
`FIG. 12 illustrates diagrammatically the matrix-vec-
`tor multiplication of the matrix and vector row signal
`subsets in accordance with the present invention.
`FIG, 13illustrates a block diagram of a two-dimen-
`sional systolic array processor in accordance with the
`present invention.
`FIG. 14illustrates a single processing element of the
`two-dimensional processor of FIG. 13.
`FIG. 15 illustrates a functional block diagram of a
`single processing element of the two-dimensional pro-
`cessor of FIG.13.
`FIG. 16 illustrates the reduced local matrix signal
`storage requirements of the two-dimensional processor
`of FIG. 13.
`FIG. 17 further illustrates the reduced local matrix
`signal storage requirements of the two-dimensional
`processor of FIG. 13.
`FIG. 18 illustrates a block diagram of a three-dimen-
`sional systolic array processor in accordance with the
`present invention.
`DETAILED DESCRIPTION OF THE
`INVENTION
`
`45
`
`350
`
`55
`
`two-layer
`
`1 illustrates diagrammatically a conventional
`FIG.
`matrix-vector multiplication.
`FIG. 2 illustrate a simple conventional
`artificial neural network.
`FIG. 3 illustrates a conventional one-dimensional
`“ring” systolic array.
`FIG. 4 illustrates an alternative conventional one-
`dimensional systolic array.
`FIGS. 5A-5B illustrate diagrammatically a conven-
`tional matrix-vector multiplication, wherein the matrix
`and vector are subdivided into matrix and
`FIGS. 6A-6Billustrate a conventional quasi two-di-
`mensional systolic array.
`
`60
`
`65
`
`Referring to FIG. 9, the Layer 1 and 2 neurons nj, n2
`of an artificial neural network (as shown in FIG. 1) are
`reconfigured into two-dimensional neural arrays. The
`original input signals ly now have double subscripted
`notation to reflect
`the two-dimensional set of input
`signals Iy.z. Similarly, the original voltage signals Vz
`now have double subscripted notation to reflect
`the
`two-dimensionality of the set of voltage signals Vy,z.
`Indicated in brackets for some of the Layer 1 neuronsin
`FIG. 9 are the original Layer 1 neuron and voltage
`identifiers (the remaining identifiers being left out for
`clarity).
`The Layer 2 neurons also now have double sub-
`scripted notation to reflect the new two-dimensionality
`of the set of Layer 2 neurons n4_8. Indicated in brackets
`for some of the Layer 2 neuronsare the original Layer
`2 neuron identifiers (the remaining identifiers being left
`out for clarity).
`The original matrix, e.g. “weight,” signals W;, now
`have quadruple subscripted notation to reflect their new
`multidimensionality, i.e. W.4,.y,z. The first subscripted
`pair of characters “‘A,B” represents the destination neu-
`ron nya in the second layer. The second subscripted
`pair of characters “Y,Z" represents the source of the
`voltage signal Vyzin thefirst layer. Indicated in brack-
`ets for someof the matrix signals W4\5.y,zare the origi-
`nal matrix signal identifiers Wy, (the remaining identi-
`fiers being left out for clarity).
`
`Petitioner Microsoft Corporation - Ex. 1020, p. 22
`
`Petitioner Microsoft Corporation - Ex. 1020, p. 22
`
`

`7
`It should be understood that the representations of
`the Layer 1 and 2 neurons, along with their subscripted
`notations, were selected arbitrarily. They can be recon-
`figured as desired, provided that
`the resulting sub-
`scripted notation for the matrix signals be consistent
`therewith.
`It should be further understood that it has been as-
`sumed for the sake of simplicity that M=K for the
`reconfigured array of Layer 2 neuronsas represented in
`FIG. 9. However, this is not necessary to the present
`invention. Ideally, the numbers of rows and columns
`should be equal, i.e. P=Q. This is to maximize the im-
`provements in processing speed and to minimize the
`local matrix signal storage requirements in accordance
`with the present invention. However,
`if K cannot be expressed as a square,i.e. if P=+Q, then
`the numbers of neurons in both Layers 1 and 2 can be
`approximated to the nearest square. Extra processing
`cycles would be required because of such an approxima-
`tion, but significant improvements in processing speed
`and local matrix signal storage requirements would still
`be realized.
`Referring to FIG. 10, the reconfiguration of the one-
`dimensional vector signal set Vzof FIG. 1 into a two-di-
`mensional vector signal set Vy,z can be understood.
`The one-dimensional vector signal set Vy is initially
`mapped into a two-dimensional vector signal set with
`the original subscripted notation left intact. This two-di-
`mensional vector signal set is then given new doubie
`subscripted notation to reflect the two-dimensionality
`of the reconfigured vector signal set V y,z.
`Also indicated in FIG. 10 (with dashed lines within
`the two-dimensional vector signal sets) are two vector
`signa] subsets. One is referred to as the vector column
`signal subset Vc and the otheris referred to as the vec-
`tor row signal subset Vr. These vector signal subsets
`Vo. Vrare multiplied by the matrix signals W4,2.y,z as
`represented in FIG. 9. The matrix signals W42.y,z are
`separated into corresponding matrix column Wc and
`row Wpsignal subsets. The vector column Vc and row
`Vr signal subsets are multiplied by the matrix column
`Weand row Wrsignal subsets, respectively, as shown
`in FIGS. 11 and 12. Therefore, the matrix-vector prod-
`ucts O;are identified according to the following formu-
`las:
`
`O;=Oc+OR
`
`where:
`
`OF Wanyzv.
`oc= 2
`CH xe yox ABYZ"
`Zz
`Ja Y-Z-P+ 2 P-24+2)
`X= Z, for odd Z
`= Z + 1, for even Z;
`
`and
`
`0.
`
`V,
`g W,
`£
`R= Yo1Z2=x ABYZ' I
`Yy
`J=Z-Y¥-QO+ 2 @-¥+2)
`X = Y, for even Y
`
`5,274,832
`
`8
`-continued
`= Y + 1, for odd YF.
`
`13
`
`20
`
`35
`
`40
`
`45
`
`50
`
`55
`
`65
`
`Based upon the foregoing. the vector Vc, Vr and
`matrix Wo, Wp signal subsets can be identified accord-
`ing to the following formulas:
`
`where:
`
`Zz
`J=Y-Z—-—P+ X (P-Z+2)
`Z=)
`
`Z=1,3,3,...,9
`
`YSRMX41,¥42,...,P
`
`X = Zfor odd Z
`= Z+ 1, for even Z;
`
`and
`
`Vr = Vyz= ¥3
`
`Wr= WaByz
`where:
`
`Yr
`J=Z-¥-O+ 3 @-¥+d
`Y=o1,2,3,...,P
`
`ZX X4+1,X¥4+2,....9
`
`X = ¥, foreven Y
`= Y+ 1, for odd ¥
`
`Referring to FIG. 13, a two-dimensional systolic
`processing array 100 in accordance with the present
`invention includes K processing elements 102, desig-
`nated by N..7 where A is an elementofthe set {1,2,3,.
`.
`. P}, B is an element of the set {1,2,3,... ,Q} and
`K=PQ. The processing elements 102 are intercon-
`nected in a two-dimensional matrix 100 having P rows
`and Q columns.
`The processing elements 102 are mutually coupled
`column-to-column via matrix row subset signallines 104
`and vector row subset signal
`lines 106. These signal
`lines 104, 106 provide the means by which the matrix
`WR and vector Vr row signal subsets are systolically
`transferred column-to-column among the processing
`elements 102. The processing elements 102 are further
`mutually coupled row-to-row via matrix column subset
`signal lines 108 and vector column subset signal lines
`110. It is by these signal lines 108, 110 that the matrix
`Wceand vector Vc signal subsets are systolically trans-
`ferred row-to-row among the processing elements 102.
`All processing elements 102 in the first row 112 re-
`ceive a vector signal input 114 whichis a vector column
`signal subset Vc of the vector signal set V. All process-
`ing elements 102 in the first row 112 further receive a
`matrix signal input 116 which is a matrix columnsignal
`subset Wc of the matrix signal set W.
`All process

This document is available on Docket Alarm but you must sign up to view it.

Or .

Accessing this document will incur an additional charge of $.

After purchase, you can access this document again without charge.

Accept $ Charge

Still Working On It

This document is taking longer than usual to download. This can happen if we need to contact the court directly to obtain the document and their servers are running slowly.

Give it another minute or two to complete, and then try the refresh button.

A few More Minutes ... Still Working

It can take up to 5 minutes for us to download a document if the court servers are running slowly.

Thank you for your continued patience.

This document could not be displayed.

We could not find this document within its docket. Please go back to the docket page and check the link. If that does not work, go back to the docket and refresh it to pull the newest information.

Your account does not support viewing this document.

You need a Paid Account to view this document. Click here to change your account type.

Your account does not support viewing this document.

Set your membership status to view this document.

With a Docket Alarm membership, you'll get a whole lot more, including:

Up-to-date information for this case.
Email alerts whenever there is an update.
Full text search for other cases.
Get email alerts whenever a new case matches your search.

Become a Member

One Moment Please

The filing “” is large (MB) and is being downloaded.

Please refresh this page in a few minutes to see if the filing has been downloaded. The filing will also be emailed to you when the download completes.

Your document is on its way!

If you do not receive the document in five minutes, contact support at support@docketalarm.com.

Sealed Document

We are unable to display this document, it may be under a court ordered seal.

If you have proper credentials to access the file, you may proceed directly to the court's system using your government issued username and password.

Access Government Site

We are redirecting you
to a mobile optimized page.

Document Unreadable or Corrupt

Refresh this Document
Go to the Docket

We are unable to display this document.

Refresh this Document
Go to the Docket

Supplemental Search

Search for PTAB Motions

PTAB Analytics

TTAB Analytics

Basic Search

Filters

Party Search

Advanced

Selected Courts

Recently Selected Courts

Find PTAB Decisions

PTAB Analytics

Special PTAB Alerts

Orange Book

Directly Search Federal Courts

Search Trademark ...

This document is available on Docket Alarm but you must sign up to view it.

Accessing this document will incur an additional charge of $.

Still Working On It

A few More Minutes ... Still Working

This document could not be displayed.

Your account does not support viewing this document.

You need a Paid Account to view this document. Click here to change your account type.

Your account does not support viewing this document.

One Moment Please

Your document is on its way!

Sealed Document

We are redirecting youto a mobile optimized page.

Document Unreadable or Corrupt

We are unable to display this document.

STEP 2 of 2

Choose your membership type

Flat-Fee

Pay-As-You-Go

Add your payment information

Login or Join

Enter your corporate Email

Thousands of your peers are saving time and gaining a competitive advantage with Docket Alarm.

Join Docket Alarm to perform smarter legal research.

Download this document and millions of others instantly with a Docket Alarm membership.

Join Docket Alarm and start performing smarter legal research.

Start tracking this docket instantly with a Docket Alarm membership.

Join thousands of your peers and start performing smarter legal research.

STEP 1 of 2

Millions of Documents | 15 Seconds to Signup

Hi !

Welcome to Docket Alarm

Welcome to Docket Alarm!

Explore Litigation Insights andManage Your Cases

Reset Password

What is PACER?

Why do I need it?

What will I be charged?

Do other courts have fees?

Basic Free Access

Welcome

Thank you

Check Firm Account

We are redirecting you
to a mobile optimized page.

Explore Litigation Insights and
Manage Your Cases