throbber
376
`
`IEEE TRANSACTIONS ON COMPUTERS. V01... 42. NO. 3. MARCH 1993
`
`Systolic Modular Multiplication
`
`Colin D. Walter
`
`Abstract—A systolic array [of modular multiplication is presented using
`the ideally suited algorithm of P. 1.. Montgomery. Throughput is one
`modular multiplication every clock cycle, with a latency of 211 + 2 cycles
`for multiplicands having n digits. Its main use would be where many
`consecutive multiplications are done, as in RSA crypbmiysiiems.
`
`Index Terms—Cryptography. digital arithmetic methods, last computer
`arithmetic. modular multiplication, BSA algorithm. systolic array.
`
`i.
`
`ItuTttooucrIoN
`
`Among other reasons, security for an ever—increasing number of
`electronic banking transactions has fuelled research into crypto-
`graphic algorithms and eflicient implementations of them. Some of
`the main systems for encryption or key transfer, such as RSA [1'].
`require fast modular multiplication of numbers containing 500 or
`more bits. Here we describe a systolic array for performing this. It
`provides another alternative for the rapid encryption of built data.
`Two pieces of- related work have appeared during revision of this
`paper. One. by K0; and Hung [4]. is the only attempt so far at a
`systolic algorithm for modular multiplication. The other. by Shand,
`Bertin. and Vuillemin [8]. describes a pipeline similar to one row
`of the array presented here which the authors have programmed
`into their hardware array. The first of these suffers from excessive
`latency and a slow clock,
`the result of the unsuitability of the
`natural algorithm [1] which is based on Homer's nested multipli-
`cation method. It involves repeated additions of the multiplicand and
`repeated subtractions of the modulus. These are interleaved to keep
`register size down. For speed, just a few of the most significant digits
`of the partial product are used to decide the multiple required in the
`next modular subtraction. In a digit-level systolic array, this multiple
`must be pumped down the cells performing the subtraction from the
`cell for the most significant digit to that of the least. But. because
`carry propagation is in the opposite direction, a redundant number
`system has to be used to limit the influence of carries. However. a
`delay is still caused while the limited carries are accumulated at each
`cell. 80. overall. the first digit of the output in [4] appears after about
`131a I 2 clock cycles. where n is the maximum number of digits in any
`input. Moreover, the clock cycle needs to be slow enough to allow
`for calculating the multiple of the modulus.
`The clash in direction between the movement of the carries and
`that of the multiple of the modulus is resolved by using P. L.
`Montgomery’s algorithm [6] which restructures the operation so that
`the modular adjustment depends instead on the least significant digits.
`Like Shand er al. [8]. we make use of this and, like them. we can
`expect similarly impressive computation speeds. The result of this
`choice is a truly systolic algorithm with much reduced latency and a
`faster clock. Indeed the most complex cell is the most common one,
`which performs two digit multiplications and three digit additions,
`and the delay til]
`the first output digit appears is just 2:: + 2
`clock cycles. There is. unfortunately, a price to pay for the greater
`
`Manuscript received May l'i', 1991: revised October 25. [99! and March
`1, 1992.
`The author is with the Computation Department. U.M.I.S.T.. Manchester
`M60 10!). U.K.
`IEEE Log Number 9202845.
`
`the
`efficiency. Some post-processing of the output is required, but
`cost of this is relatively slight if a number of modular multiplications
`with a common modulus are performed sequentially. as in the RSA
`cryptosystem.
`
`ll. THE BASIC Amsrmst
`
`The ideas which are combined here are those of Montgomery
`[6] on modular arithmetic, and of McCanny and McWhitter IS] on
`systolic multiplication. Montgomery shows how to perform modular
`multiplication by a method which includes reversing the order of
`treating the digits of the multiplicand, shifting down instead of up.
`and adding rather than subtracting multiples of the modulus. A Pascal-
`like description of his algorithm for (Ax B} mod M is given below.
`and is followed by a short justification of how it works. Further detail
`is found also in [2]. First, however. we review the notation that is
`used.
`
`Numbers A are written with base 1' and digits .4[i], so that
`A = 23;; A[i]r‘ where it is the number of digits in A. The choice
`of digit range is unimportant, but we interpret modr as yielding a
`digit value. Let m and ll- be the number of digits in the inputs M and
`A, respectively. Normally, B has at most the same number of digits
`as M or, at worst, satisfies B < 2M. We assume the latter. so that B
`has at most m+1 digits. The radix r is chosen before implementation
`so that the operations div 1' and modr are trivial. In particular, here
`they are done by shifting or inspecting the lowest digit, respectively.
`The choice for r will probably be a small, fixed power of 2 to make
`translation to and from binary easy.
`there is no
`A precondition for the algorithm to work is that
`common factor between r and M. in the definition of Q[i] below.
`the multiplicative inverse (r — M [[l]l_1 mod r is required. This is
`the digit which, when multiplied by r — M[0]. gives a product with
`remainder 1 on division by r. The co-primeness of M and r ensures
`that such a number exists. inspection of the assignment to P shows
`that there is a multiple of r to be divided by 1". Hence no information
`is lost in the division, which is therefore exact.
`P := 0:
`Fori
`:= Oton — 1 do
`Begin
`
`QM :: ((P[t‘l]+A[i]xB[0]) x {r—rl'flOll—l) mod r:
`P
`:= (P + AMXB + omxm div 7‘
`
`End
`let P. be the value of the partial
`To see what this code does.
`product P at the start of the loop for which i is the value of the control
`variable. So Pa = D. Let A. = 22;; AMrJ and Q. = 2;; QUIr’
`be the lower parts of A and Q. So :19 = Q0 = 0. Then, by induction,
`r’P. = A.><B + thM'. Thus the final value Pu of P satisfies
`r"P = AXB+Q><M, so that P E {r""ABl mod .M. The systolic
`array described here produces output P with this extra power of r.
`but it may also be used to remove that factor. This is done by an
`extra application of the operation with inputs P and 1-2" mod M,
`the latter being precomputed once for all and stored with M. Further
`detail in the context of the RSA algorithm is provided in Section IV.
`Suppose the standard set of digits {0. l. -
`-
`- . r — 1} is used for
`A and Q. Then the maximum value 8 of P satisfies tr = (h‘ +
`[r —1}B + [r — UM) div 1-. and so a: : M + B. Once the
`digits of A run out, subsequent values of P decrease, being bounded
`above by it. (i = 0, 1. ---] which satisfy so = s = M + B and
`s.+l = {s.+(r—1)M) div r. i.e.. n.- = Ill—HE div 1"}. These yield
`bounds on the number of digits needed to represent the intermediate
`
`0013~9340193$0100 © 1993 IEEE
`
`Petitioner Microsoft Corporation - Ex. 1063, p. 376
`Petitioner Microsoft Corporation - Ex. 1063, p. 376
`
`

`

`[SEE TRANSACTIONS ON COMPUTERS. VOL. 42. NO. 3. MARCH 1993
`
`3??
`
`and final values of P. If initially B «C 2M then P of. 3M always.
`so that P requires up to two bits more than M. Furthermore, in the
`next iteration beyond the one using the last, most significant digit of
`A, the output satisfies P < M + (2M div 1'} 5 2M. This would be
`suitable for reuse as the B input of a further modular multiplication.
`
`Ill. The Svsrotlc ARRAY
`
`A key property of Montgomery’s algorithm is that the choice of
`modulus multiple is based on the lowest digit P[0] of the partial
`product. With this multiple Q[i] determined, the digits of the i + lst
`partial product P.+i can be computed in order staning with the
`lowest. Also, with Q[i] known, digits output from one addition cycle
`enable the corresponding digits for the next cycle to be found. This
`is done by the array illustrated in Fig. 1 where each row performs
`an iteration of the loop and columns compute successive values
`for a single digit position. The typical oell performs a single digit
`calculation of the assignment P := {P+A[-i] >< B+Q[i] x M) div r,
`and generates a carry in the normal way. So it is specified by
`
`R-i—lU ~1]+ er'nrryo... =
`
`Pelt] + A[i]xB[j] + Q[t']xMIj] + Curt-m".
`
`The i + 15: row down computes Qli] and PHI, whilst the j + Ist
`column from the right finds the digit with index j for each Pi. The
`input digits of M, A and B, and also those of Q, are just pumped
`through cells without change until a different modular multiplication
`is commenced. In the cases of .M and B the digits need delaying by
`an additional clock cycle (i.e., the operation time of a cell). so that
`they arrive when needed. This is achieved by the latches indicated
`between the cells.
`
`The cells also forward up the rows the carries from the additions.
`[f the digits of each number lie in the usual range {0,1.---, r—l}
`then the above definition of the cell shows that carries are bounded
`by 2ir—ll. Thus the carry needs one more bit than a digit for its
`representation.
`the end of the previous section
`Suppose the setup described at
`is used so that B initially satisfies B < 2111. Then we wish to
`perform n + 1 addition cycles to obtain output satisfying P 4. 2M.
`For this there must be it + 1
`rows in the array and A must be
`padded at the top end with an extra zero digit, i.e., A[tt] = 0. The
`output we desire is P“. from the row with input A[n.]. It will equal
`{a-"“1 AB} mod M, or be M more than this because it is bounded
`by 2M. So, a further row might usefully be added to subtract the
`possible extra M. with the decision about whether to take PM“ or
`P...“ - it! being made when the sign of the latter is established.
`This is further discussed below.
`Intermediate values of P are bounded by 3M. This entails that
`there is no carry from the column in which each P,[m + 1]
`is
`computed. So the array needs no column to the right of this and,
`since R-I-m + 2] = 0 always, this value can be input at the left side
`of the array. Along the top edge the inputs Pofj] are all {I because
`the initial value of P is 0. Also on this edge. the digits of M [J] and
`BU] are input 3' clock cycles after M10] and B [[1] so that they match
`the progress of the carries.
`The cells are all
`identical with the exception of the rightmost
`column. illustrated in Fig. 2. which has the burden of computing the
`digits of Q. Carries entering the array from the right are, of course,
`zero. Also, the digit A[i] is input 2i clock cycles after A10} since
`the shift down puts the calculation of each P. two cycles behind its
`predecessor. The digits of Q are defined as in the Pascal code by
`
`op] = (tmot + .~t[i]xB[tJ]) x {r — molt”) mod 1'.
`
`
`
`Fig. 2. The rightmost cells.
`
`Although the rightmost cells do not generate output digits 3+11— 1]
`because they are deleted by the div :- operation. nevertheless a carry
`may be generated when evaluating the 0th digit before the shift down:
`
`r x Carryout = Hi9] + Aiilelm + Qii]xi\f![fl].
`
`in [3] Eldridge and Walter describe how to simplify the calculation
`of in] by shifting B up to make Blfil = (I. This is possible
`here too, but at
`the cost of an extra row and an extra column to
`obtain an output less than 2M. The advantage is that it reduces the
`complexity of the rightmost cells. and hence their operation time, to
`at most that of the standard cells. Such a simplification maximizes
`the possible clock speed at very little cost. For this it is assumed that
`(t- — MM)" mod :- is pteoomputed once (a table look-up} and fed
`in like Mm]. There may also be a slight advantage in scaling M to
`31' = ((r — MIOIFl mod rlxM since then M'[0] E —1(modr)
`leads to the simple definition Q[i} = R-[U]. Computing would then
`be done modulo M" giving a result bounded by 2M' rather than 2M.
`So the penalty for that would be extra cleaning up afterwards.
`The first output digit of a modular multiplication appears after
`2n + 2 clock cycles, and successive digits over the next to clock
`cycles. Indeed, the output PU] appears 2n + 2 cycles after B [J]
`is
`input and 2n. + 2 + 35 cycles tailor the very first input, namely B[01.
`The latency is thus 2n+2. However. since to +1 digits are output on
`each cycle. the throughput is equivalent to 1 modular multiplication
`per cycle.
`Comparison with the array of Ken; and Hung [4] is worthwhile. The
`logic of cells is simpler here because there is no need for a redundant
`number representation. Also the clock cycle time is shorter. In [4]. this
`is bounded below by the time to compute the equivalents of the digits
`
`Petitioner Microsoft Corporation - Ex. 1063, p. 377
`Petitioner Microsoft Corporation - Ex. 1063, p. 377
`
`

`

`318
`
`IEEE TRANSACTIONS ON COMPUTERS. VOL 42, NO. 3. MARCH 1993
`
`
`
`Fig. 3. The typical cell for radix r = ‘2.
`
`Q[i] in their super cells Li's and LUS. For radix 2 and a technology
`using only 2-input gates and no buffering to enable outputs to drive
`more than one load,
`this requires a circuit similar to that given in
`[3] Fig. 2, with its critical path length of 13 XOR gates. Here a
`depth of 5 gates sud-ices for the general cell (see Fig. 3), and less
`for the rightmost cells, when the suggested simplifications are made.
`Allowing for register setup and hold times, the clock here should be
`about [13+3}f(5+3) :: 2 times faster. with the same speedup factor
`for the throughput. Since there are 2n -i- 2 clock cycles per operation
`instead of about l3nf2, the latency is reduced roughly sevenfold.
`Different technologies allow tradeoffs here between power, area, and
`speed.
`
`IV. RSA Cavmomnv
`
`In practice the systolic array is likely to he of most use for
`RSA cryptography where modular exponentlatioo is needed. This
`is normally performed by repeated modular multiplication. It has
`already been observed that the array calculates (ABr"‘"} mod M
`rather than AB mod M, and an extra :1! may be present. The extra
`M makes no difl'erence to subsequent modular arithmetic, and so can
`be ignored until the final result of decryption is obtained. The extra
`factor of 1I-'"-I can be removed after each multiplication by using
`the array again to multiply (rifle—““11 mod M and 1'2"“ mod M.
`However. in exponentiation this also can be left till the end of the
`calculation.
`'
`.5 and D,
`Encryption and decryption is done using two keys,
`respectively, with the property ADE E A mod M. Suppose the en—
`cryption of A is done using the systolic array to successively square .4
`modulo M and to multiply these 2-powers as necessary into a running
`total, retaining the unwanted powers of t. Then (.fili'fl“1 )5 mod M
`is produced rather than A5 mod M. Hence, using the array for an
`extra modular multiplication by rl"+lll5+” mod ill will provide
`A5 mod M. The cost of this extra multiplication is minor compared
`to the typical number of modular multiplications in the encryption
`process. Naturally, the owner of the key E only needs to compute
`”Milli-+1] modM once and store it for use until
`the key is
`changed. Decryption is similar. In fact, only the decrypter needs
`to adjust for the power of r, and the property of DE ensures that
`the scaling factor still depends only on his key. Another alternative
`strategy is described in [3].
`The main overhead might be in the over-large residue. Although
`this problem is shared by all
`fast
`implementations of modular
`multiplication (see [3]), it seems worse here because the top output
`digits are generated last of all. Apparently, they need to be known
`in order to decide whether a final subtraction of M is necessary or
`not. This is true in general but false under appropriate conditions!
`
`An easy solution is always to make the message for encryption into
`a multiple of 1-,
`thus ensuring that the lowest digit
`is 0. Then, at
`decryption. the output will have lowest digit M [0] if, and only if, the
`extra multiple of M is present. A bottom row of special cells can be
`added to the systolic array to remove ((P[{]] x M[0]_' ) mod r) ml!
`and produce the required answer. (The same formula works if up to
`(r — 1}!” needs subtracting, and there is an obvious generalization
`for higher multiples.)
`For RSA applications, typical inputs have about 103,! 2 bits. This
`means about 106,121 bit-processing cells, or about 4 x106 gates. As this
`may be beyond available technology for a single chip, suppose that as
`many rows as possible are put onto the chip. Theoretically, a number
`of chips could be used sequentially to construct
`the whole array,
`but in practice it would be impossible to transfer the required 103}?
`bits per cycle between the chips. Alternatively, it is clearly better to
`reroute the output back to the input within a single chip, repeatedly
`feeding it back in at the top digit by digit as it is calculated, until the
`required n + 1 partial products have been computed. If just 11' rows
`of cells are built, then at full capacity the array is simultaneously
`processing 2n’ diflerent modular multiplications.
`it takes 2:; + 2 clock cycles from the input of one digit to the
`output of the corresponding digit after the modular multiplication. it
`is only after such a delay that further progress on an exponentlation
`can take place. (Strictly speaking, for a nonzero exponent bit, there
`is a modular multiplication into a running total and a modular
`squaring which can be done simultaneously.) Hence the modular
`multiplications which the array is performing simultaneously must
`come from different exponentiations. This indicates that for use in
`RSA the messages for encrypting or decrypting should generally be
`numerous. Of course, one interesting choice is to reduce the array to
`a pipeline by just implementing 1 row. This would give a very cheap,
`simple modular multiplying chip for performing single encryptions.
`
`V. CONCLUSION
`
`A systolic array for modular multiplication has been presented.
`Although there is an extra unwanted factor present because of the
`choice of algorithm. this can be dealt with at virtually no cost when
`the algorithm is used in RSA cryptography. Since in the simplest
`case a single digit is used to determine the multiple of the modulus
`to add during iterations of the algorithm,
`the clock can be made
`very fast. Neat solutions have been included to avoid finishing with
`output containing an extra unwanted multiple of the modulus, and to
`achieve the best possible clock speeds.
`
`REFERENCES
`
`[1] E. F. Brickell. "A fast modular multiplication algorithm with application
`to two-key cryptography." infld‘mncer in Cryptology-Prom afcnmc
`82. Chaurn et at. Eda. New York: Plenum, 1933, pp. 51—60.
`[2] S. E. Eldridge. “A faster modular multiplication algorithm," imam. J.
`Contact. Mark, vol, 40, pp. 63—68, 1991.
`[3} 5. E. Eldridge and C. D. Walter, "Hardware implementation of Mont‘
`gomery’s modular multiplication algorithm," IEEE Trans. Compare, to
`be published.
`'
`[4] C. K. Koo and C. Y. Hung, “Bit-level systolic arrays for modular
`multiplication," J. W5! Signal Processing, vol. 3, pp. 215-223. 1991.
`I. V. McCanny and J. G. McWhirter, “Implementation of signal pro-
`cessing functions using 1-bit systolic arrays," Etecirm. Lett, vol. 18.
`pp. 241-243, 1982.
`[6] P. L. Montgomery, "Modular multiplication without trial division,"
`Momma biomass, vol. 44. pp. 519-521. 1935.
`[1'] R. L. Rivest, A. Shamir, and L. Adleman. “A method for obtaining
`digital signatures and public-trey cryptosystems," Commun. ACM, vol.
`21, pp. 120—126, 1978.
`[8] M. Strand. P. Berlin, and .l. Vuiilemin, “Hardware speedups in long
`integer multiplication.” ACM Sigerch, vol. 19, pp. 106413. 1991.
`
`[S]
`
`Petitioner Microsoft Corporation - Ex. 1063, p. 378
`Petitioner Microsoft Corporation - Ex. 1063, p. 378
`
`

This document is available on Docket Alarm but you must sign up to view it.


Or .

Accessing this document will incur an additional charge of $.

After purchase, you can access this document again without charge.

Accept $ Charge
throbber

Still Working On It

This document is taking longer than usual to download. This can happen if we need to contact the court directly to obtain the document and their servers are running slowly.

Give it another minute or two to complete, and then try the refresh button.

throbber

A few More Minutes ... Still Working

It can take up to 5 minutes for us to download a document if the court servers are running slowly.

Thank you for your continued patience.

This document could not be displayed.

We could not find this document within its docket. Please go back to the docket page and check the link. If that does not work, go back to the docket and refresh it to pull the newest information.

Your account does not support viewing this document.

You need a Paid Account to view this document. Click here to change your account type.

Your account does not support viewing this document.

Set your membership status to view this document.

With a Docket Alarm membership, you'll get a whole lot more, including:

  • Up-to-date information for this case.
  • Email alerts whenever there is an update.
  • Full text search for other cases.
  • Get email alerts whenever a new case matches your search.

Become a Member

One Moment Please

The filing “” is large (MB) and is being downloaded.

Please refresh this page in a few minutes to see if the filing has been downloaded. The filing will also be emailed to you when the download completes.

Your document is on its way!

If you do not receive the document in five minutes, contact support at support@docketalarm.com.

Sealed Document

We are unable to display this document, it may be under a court ordered seal.

If you have proper credentials to access the file, you may proceed directly to the court's system using your government issued username and password.


Access Government Site

We are redirecting you
to a mobile optimized page.





Document Unreadable or Corrupt

Refresh this Document
Go to the Docket

We are unable to display this document.

Refresh this Document
Go to the Docket