throbber
Fast Implementations of AES Candidates
`
`Kazumaro Aoki1 and Helger Lipmaa2
`
`1 NTT Laboratories
`1-1 Hikarinooka, Yokosuka-shi, Kanagawa-ken, 239-0847 Japan
`maro@isl.ntt.co.jp
`2 K¨uberneetika AS
`Akadeemia tee 21, 12618 Tallinn, Estonia
`helger@cyber.ee
`
`Abstract. Of the five AES finalists four—MARS, RC6, Rijndael, Twofish—
`have not only (expected) good security but also exceptional performance on the
`PC platforms, especially on those featuring the Pentium Pro, the NIST AES
`analysis platform. In the current paper we present new performance numbers
`of the mentioned four ciphers resulting from our carefully optimized assembly-
`language implementations on the Pentium II, the successor of the Pentium Pro.
`All our implementations follow well-defined API and timing conventions and
`sensible guidelines, like no using of self-modifying code and key-specific static
`data — i.e., tricks that speed up the implementation but at the same time restrict
`the field of application. Our implementations are up to 26% percent faster than
`previous implementations. Our work also shows how a simple change (inclu-
`sion of the MMX technology) in the analysis platform can influence the relative
`encryption speed of different ciphers. To enable everyone to compare their imple-
`mentations to ours, we also fully specify our procedures used to obtain the speed
`numbers.
`
`1
`
`Introduction
`
`For more than 20 years, DES [FIP77] has been a widely employed cryptographic stan-
`dard. While the best cryptanalytic attacks against DES (differential and linear cryptanal-
`ysis) are still highly impractical, during the last years DES has became obsolete for its
`too short key and block sizes, not withstanding the current advances in computing tech-
`nology. Motivated by this, NIST initiated a new effort to replace DES as a standard. 21
`algorithms were submitted and 15 algorithms were accepted as AES (Advanced Encryp-
`tion Standard) candidates, of which 5 candidates—MARS [BCD+98], RC6 [RRSY98],
`Rijndael [DR98], Serpent [ABK98], Twofish [SKW+99b]—were chosen to the second
`round.
`However, the AES process was started not only due to the theoretical reasons: there
`are a few well-known constructions, including 3DES, that seem to have very good secu-
`rity margins. Unfortunately, 3DES, based on the hardware-oriented DES, is unsatisfy-
`ingly slow on the modern 32- and 64-bit computer architectures: modern block ciphers
`are up to 10 times faster than 3DES. Regardless of these ciphers having unproven (even
`by time) security properties, they are widely used in the industry by pragmatic reasons:
`hardware applications like 1 GBits/s Ethernet or on-the-fly encryption of 160 MByte/s
`
`MOBILEIRON, INC. - EXHIBIT 1009
`Page 001
`
`

`

`SCSI hard disks are requesting for faster ciphers. Clearly, the situation of having a
`(moderately) secure and (moderately) fast de jure standard DES, a (probably) secure
`and (clearly) slow de facto standard 3DES and some fast but with unknown security
`margin de facto standards is not acceptable: there should be a single standard that is
`both secure and fast. This is one of the reasons why, when inviting the public to pro-
`pose candidates for the AES, NIST explicitly stated that the new standard should be
`both “more secure and faster” than 3DES.
`While security of the candidates cannot be exactly quantified by the currently known
`methods, it seems to be easier to measure their speed. However, there is still a lot of
`ambiguity in answering the question what AES candidate is the fastest. Several pa-
`pers (including [Lip99,SKW+99a]) have compared AES candidates speed, but since
`the implementations quoted in them are often incomparable (or based on pure estima-
`tions), one cannot make direct conclusions about the efficiency of the ciphers based
`on the published papers. Incomparability stems from the different implementation as-
`sumptions, API’s, hardware (e.g., processors) and software (e.g., compilers) used by
`implementers. Even more, some of the timings presented in previous papers correspond
`to “show-case” (as opposed to practically applicable) implementations, some exam-
`ples of those being the fastest implementation of Twofish [SKW+99b] that uses self-
`modifying code and Brian Gladman’s implementations of AES candidates [Gla99] that
`use a number of key-specific static variables instead of allocating a register to address
`them, therefore effectively freeing some registers for other uses. Especially in the case
`of the Pentium family, where the number of available registers is very restricted, such
`implementations may result in a huge speed up. However, both types of implementation
`tricks restrict the application area of the implementation.
`In the current paper we try to give a satisfactory answer to the question “what cipher
`is the fastest on the Pentium II” by carefully optimizing the 4 fastest AES candidates—
`MARS, RC6, Rijndael and Twofish—in Pentium II assembly, using for all implementa-
`tions exactly the same, reasonable in practice, API and speed measurement conditions
`for all the ciphers. Due to this, our results are much fairer than most of the previously
`known ones: our implementations can be seen as black boxes applicable in almost any
`possible application of block ciphers on an environment featuring Pentium II. Addi-
`tionally, careful optimization process resulted in implementations that are clearly faster
`than the previously known implementations. (Except for Twofish, which has still a faster
`“show-case” implementation.)
`We start the paper by describing our platform of choice (Section 2), implementation
`philosophy and API (Section 3). Section 4 briefly surveys our results, and Section 5
`gives more details on the problems encountered when implementing the ciphers. More
`information about the Pentium II is given in the Appendices.
`
`2 Choice of the Platform
`
`Our first principal choice was the decision what processor to use. By purely pragmatic
`reasons we decided that the implementation environment equips an Intel Pentium family
`CPU: while this family is not the most modern processor family available, it is the most
`widespread one at the moment of writing this paper and most probably also during the
`
`2
`
`MOBILEIRON, INC. - EXHIBIT 1009
`Page 002
`
`

`

`next few years. Therefore, since in the foreseeable future most of the software-based
`commercial security applications run on the Pentium family (as recognized also by the
`AES finalists designers), this family has the most direct impact on the choice of a cipher
`by security consumers.
`At second, from the Pentium family we decided to choose the Pentium II processor.
`At first, it is a more advanced processor than Pentium Pro, the NIST AES analysis
`platform: the Pentium II provides (twice) larger register space due to the added MMX
`technology, and many new MMX-specific commands. Compared to the Pentium Pro,
`the Pentium II is also easier to obtain at the current stage, since Pentium Pro has been
`out of the manufacturing for a while. On the other hand, the Pentium II was preferred
`by the authors to the Pentium III since the latter is somewhat too new and controversial
`due to the privacy issues.
`Another reason to choose Pentium II was that as the successor of the NIST AES
`analysis platform, implementing the AES candidates on the Pentium II could provide
`some insights on how generally suitable are the candidates, some of which were specif-
`ically optimized for the Pentium Pro, on future processors having features unpredicted
`by algorithm designers. While this is not as crucial as withstanding the “future attacks”,
`it still gives some ideas about the possible longevity of the cipher. (We clearly would
`not want the AES in 20 years to have the role the 3DES has today!)
`As shown in [Lip98], the MMX technology can seriously speed up IDEA ([LM90],
`[LMM94]), one of the believably most secure block ciphers with 64-bit block size. As
`stated in [Lip98], this can be done since IDEA has its key attributes similar to those
`of multimedia applications, for which the MMX technology was originally created. An
`open question posed in [Lip98] was how much would the MMX technology help imple-
`menting other ciphers, including the AES candidates. In the following we will partially
`answer to that question, showing that also some ciphers using only “simple” operations
`can greatly benefit from the added MMX technology. A short overview of Pentium II
`that is necessary for implementers and for cryptographers who design ciphers optimized
`for this platform is given in Appendix A. We refer for Intel manuals for a more complete
`overview.
`
`3
`
`Implementation Considerations
`
`Several papers (including, in particular, [Lip99,SKW+99a]) have compared AES can-
`didates speed, but since the implementations quoted in them are often incomparable (or
`based on pure estimations), one cannot make direct conclusions about the efficiency of
`these algorithms based on the published papers. Incomparability stems from the differ-
`ent implementation assumptions, API’s, hardware (processors) and software (compil-
`ers) platforms used by implementers. Even more, some of the numbers there correspond
`to the “show-case” (as opposed to practically applicable) implementations; including
`the bizarre case that one candidate was claimed to be the fastest on its inventors laptop
`under some suitable conditions.
`As another example of the unsuitability of some “show-case” implementations, the
`fastest implementation of Twofish [SKW+99b] uses self-modifying code and therefore
`cannot be used in a number of applications, while Brian Gladman’s implementations of
`
`3
`
`MOBILEIRON, INC. - EXHIBIT 1009
`Page 003
`
`

`

`AES candidates [Gla99] use a number of key-specific static variables instead of allo-
`cating a register to address them, therefore effectively freeing some registers for other
`uses. Especially in the case of the Pentium family, where the number of available reg-
`isters is very restricted, such implementations may result in a huge speed up. On the
`other hand, Gladman’s implementations cannot be used several applications, including
`multithreaded programs and SMP (symmetric multi-processing) systems.
`Most of the security customers need however speed numbers applicable in whatever
`product they use in whatever environment in runs (for example, in a Linux kernel-
`supported IPSEC implementation, secure login or multithreaded access to encrypted
`storage arrays). For users it is necessary to know in what environment the measured
`speed numbers were obtained, to be able to calculate the possible efficiency of the
`ciphers in their own environments. Additionally, full specification is important for other
`implementers to be able to compare their implementations with ours. Hence, apart from
`providing “clean” implementations under some reasonable public assumptions, we shall
`also next fully specify these assumptions:
`
`– We do not use self-modifying code (“code compilation” [SKW+99b]) since it
`makes the implementation inapplicable in a number of situations, e.g., in operation-
`system kernel and ROM-based applications.
`– We additionally decided not to use key-specific static areas since then the imple-
`mentation could not be used, e.g., in SMP-capable systems and multithreaded pro-
`grams.
`– We decided to maximally use the MMX technology since it should not be forbidden
`in any reasonable modern environment. (While using self-modifying code and key-
`specific static areas is generally considered to be a bad programming practice.)
`– We decided to use exactly the same API (specified later in Section 3.1) in all our
`implementations.
`– A number of well-understood assumptions that 1) improve the speed and can be
`easily followed by implementers or 2) are essential to even be able to measure the
`speed:
`• All codes and data are correctly aligned.
`• Input and output texts and codes are preloaded to L1 cache in the possible
`extent to reduce the number of cache misses.
`• Simplicity of code: we tried to reduce time spent during writing and optimiz-
`ing the code. In particular, all our implementations use highly optimized but
`round-number independent round macros. (Hence, our results could be slightly
`bettered if every round would optimized separately to avoid, e.g., delays in
`fetching stage.)
`
`3.1 API
`
`Since a different API can be influence the speed of an implementation severely, we also
`decided to fully specify the API used by us to make for the other implementers easier
`to compare their implementations to the ours. We felt that this is necessary, since AES
`candidate implementations reported in [Lip99] vary greatly in their API’s.
`
`4
`
`MOBILEIRON, INC. - EXHIBIT 1009
`Page 004
`
`

`

`void xxKS(char *master, uint32 bitLen, char *eKey);
`void xxEnc(char *inBlk, uint32 lenBlk, char *eKey,
`char *outBlk);
`void xxDec(char *inBlk, uint32 lenBlk, char *eKey,
`char *outBlk);
`
`where
`
`xx is algorithm name (e.g., Rijndael).
`xxKS is key scheduling subroutine.
`xxEnc is encryption subroutine that encrypts lenBlk blocks of plaintext starting from the
`address inBlk to the ciphertext location outBlk, by using extended key eKey, in ECB
`block cipher mode.
`xxDec is decryption subroutine with the same input conventions as xxEnc.
`uint32 is the type of 32-bit unsigned integers (in the case of Pentium II, equal to unsigned
`long in the case of most compilers).
`master is pointer to the master key bits.
`bitLen is the bit length of a master key.
`eKey is pointer to subkeys and other initialization data, used later by encryption and decryption.
`inBlk is pointer to input texts to be encrypted in the case of xxEnc and to be decrypted in the
`case of xxDec.
`outBlk is pointer to the corresponding output texts.
`lenBlk is number of blocks to be encrypted or decrypted.
`
`Fig. 1. Specification of our API.
`
`Note that our API, depicted in Figure 1, is essentially equivalent to the API’s used
`in most of the commercial applications, specifying only those inputs and outputs to the
`algorithms that are really needed by the algorithms. (Names of the subroutines and their
`parameters of course do not affect the speed, of course.) Our API was fixed for the key
`length of 128-bits due to the feeling that at the time when greater key sizes become
`necessary, our implementation platform would already be a history.
`Here, the key schedule and decryption subroutines are specified only for complete-
`ness. Since in the current paper we are not interested in the optimization of these sub-
`routines, we almost do not mention decryption and key schedules hereafter.
`
`3.2 How to Measure a Number of Cycles
`
`Different time measurement methods may change the speed numbers quite dramati-
`cally. As in the case of the API’s, we decided to use one, sensible published and fully
`specified convention (specified in Figure 2) for all the implementations. (Note that this
`wrapping corresponds almost exactly to the method specified in [Fog00], to which the
`reader is referred for a throughout explanation of the method.) The inputs and key of
`the cipher are generated randomly before the measurement begins, to prevent “opti-
`mization” for specific class of keys. The input variable lenBlk was chosen to be equal
`to 8000 so that the input and output texts would not fit in the L1 cache. Also, time is
`a work area of type uint32, used in later calculations.
`
`5
`
`MOBILEIRON, INC. - EXHIBIT 1009
`Page 005
`
`

`

`movd mm0, dword ptr [time]; /* warm cache and set MMX state */
`xor eax, eax;
`cpuid;
`rdtsc;
`mov dword ptr [time], eax;
`xor eax, eax;
`cpuid;
`/* xxEnc() or xxDec() */
`xor eax, eax;
`cpuid;
`rdtsc;
`sub dword ptr [time], eax;
`emms;
`
`/* serialize instructions */
`/* read time-stamp counter */
`/* save counter */
`
`/* serialize instructions */
`
`/* serialize instructions */
`/* read time-stamp counter */
`/* compute the difference */
`/* empty MMX state */
`
`Note that time is a 4 bytes work area.
`
`Fig. 2. Time measurement code
`
`/* push all used registers */
`cmp dword ptr [lenBlk], 0;
`jz L1;
`align 16;
`
`L0:
`
`L1:
`
`dec dword ptr [lenBlk];
`jnz L0;
`
`/* pop these registers once more */
`
`Fig. 3. Null function
`
`Note that this method has some overhead, due to both high latency of the rdtsc
`instructions and also the overhead caused by looping instructions like jnz which are
`not formally part of the cipher itself. (Looping instructions can be seen as a part of
`the block cipher mode, however.) We measure this overhead by using the null function
`shown in Fig. 3 obtaining a value nulltime, and then we subtract it from the value of
`time obtained by measuring the speeds of different encryption/decryption procedures.
`Finally, this result is divided by the number of blocks encrypted. Intuitively, by using
`this method we obtain the number of cycles corresponding to unrolled implementation
`of the block cipher, or to the implementation where we only care about the time en-
`crypting one block takes without adding any extra overhead. (Note that the subtracted
`overhead number was equal to ≈ 6 in the case n = 8000. One could easily add this
`number to those presented later to get the number of cycles with overhead.)
`Chosen time measurement method is also reasonable in practice: when the value
`of lenBlk was chosen to be different, for most of the implementations (including
`the implementation of null cipher), the execution times increased by almost the same
`constant. Hence, the null cipher proved experimentally to be well-defined.
`
`6
`
`MOBILEIRON, INC. - EXHIBIT 1009
`Page 006
`
`

`

`Cipher
`
`Best previous result Speedup
`
`Cycles per
`Mbits/s on a 450
`block
`MHz Pentium II
`—
`6
`Null cipher —
`243 [Riv98]
`223
`RC6
`258 Mbits/s
`320 [DR98]
`237
`Rijndael
`243 Mbits/s
`315 [SKW+99b]
`282
`Twofish
`204 Mbits/s
`I
`I
`I
`I
`I
`I
`390 [BCD+98]
`306
`MARS
`188 Mbits/s
`Table 1. Performance in clock cycles per block of output of four AES finalists. (Only encryption
`considered)
`
`—
`8%
`26%
`11%
`22%
`
`Finally, we did a loop of 500 times over the described measurements and then chose
`the smallest number for every cipher, since that corresponds most likely to the case
`where most of the data and code are in L1 cache and the branch prediction works suc-
`cessfully: i.e., to the bulk encryption speed of the cipher itself.
`
`4
`
`Implementation Results
`
`From the five AES finalists, one (Serpent) is regarded as a very conservative design
`but at the same time also being clearly slower than the other AES finalists. Rest of the
`finalists have comparable timings on most of the modern computer platforms, where
`one of the ciphers is the fastest in one platform, and another one in another platform.
`Since also on the Pentium II processor, Serpent seems to be very slow by the published
`data, we decided postpone its implementation to the future and concentrate on the fast
`ciphers.
`Timings, obtained by measuring the speed of implementations by following pre-
`viously specified procedures are summarized in Table 11. The numbers in the middle
`columns show how many cycles it takes to encrypt one 128-bit block by using the cho-
`sen cipher with a 128-bit key. These results indicate that the chosen four AES finalists
`are extremely fast. For comparison, the standard hash algorithm SHA-1 hashes a 512-
`bit block in 837 cycles (i.e., 13.1 cycles per byte) and DES and 3DES encrypt a 64-bit
`block respectively in 340 and 928 cycles (resp., 42.5 and 116 cycles per byte) [PRB98],
`while RC6 and Rijndael respectively encrypt a 128-bit block in 223 and 237 cycles
`(resp., 13.9 and 14.8 cycles per byte). However, note that the cited timings in [PRB98]
`were obtained on a plain Pentium and therefore could most probably be improved on
`the Pentium II.
`Our results seem to indicate, that the speed difference between different ciphers is
`less than expected: as before, RC6 is still the fastest cipher on the Pentium II, but the
`difference between it and Rijndael has decreased seriously. Hence we hesitate to say
`that RC6 is the fastest cipher. However, based on the cited results, we can classify the
`ciphers to two groups: blastingly fast ciphers RC6 and Rijndael and somewhat slower,
`but still very fast ciphers Twofish and MARS.
`1 We also started to code the decryption routines, finishing RC6 decryption (209 cycles per
`block) and Twofish decryption (276 cycles per block).
`
`7
`
`MOBILEIRON, INC. - EXHIBIT 1009
`Page 007
`
`

`

`However, one has to keep in mind that RC6 and MARS have design features that
`make them specifically efficient on the Pentium Pro (and its successors), while their
`performance seriously degrades on other processors [Lip99,SKW+99a]. This is due to
`the use of complex instructions (32-bit multiplication and data-dependent rotation) that
`are cheap on the P6 family (Pentium Pro, Pentium II, Celeron, Xeon and Pentium III)
`but very expensive on most of the other platforms. Interestingly, also the next generation
`Pentium processor (code-named “Willamette”, [Int00]) has latency 10 multiplication
`and latency 2 or 4 shifts, as compared to latency 4 multiplication and latency 1 shifts on
`the P6 family [Int00, Section 4.1.3]. Hence, RC6 and MARS would considerably slow
`down on the Willamette, the next generation Pentium family processor. On the other
`hand, Rijndael and Twofish are based on simple operations, and run equally well on
`all platforms. The speed ratio between Rijndael and Twofish seems be remain almost
`the same on the other platforms [Lip99] (namely, Rijndael being 5 . . . 25% faster than
`Twofish).
`Note that the speed up percents in Table 1 correspond to the achieved speed ups
`compared to the fastest “clean” implementations (i.e., those not using key-specific static
`data or self-modifying code). However, these percents do not always mean that our
`implementation techniques were exactly as much better. For example, the best previous
`implementation of Rijndael was done for the plain Pentium, but not for the Pentium Pro:
`a factor that may have negatively affected its performance. The best previous “clean”
`implementation of MARS was written in C, and therefore had also a relatively slow
`performance. However, our own C implementation of MARS is clearly faster than the
`one given in Table 1. In the case of Rijndael, most of the acceleration Rijndael is due
`to the efficient use MMX technology. In general, speed up comes mainly from better
`optimization (elaborated tradeoff between processor operating stages) and full usage of
`the Pentium II possibilities (i.e., the MMX technology).
`To further clarify how does the Pentium II architecture impact the speed, Table 2
`shows the detailed information of our implementations in encryption mode in the micro-
`operation level. Usage of the table is simple. For example, in the intersection point of
`“@round” row and “port 01” column in TwofishEnc table one would find 19. That
`means that there are 19 µoperations in the round function of TwofishEnc which will
`be executed on port 0 or port 1.
`Interestingly, our implementations of MARS, Rijndael and Twofish all require ap-
`proximately the same number of µoperations, while RC6 is about two times “better”
`in this category. On the other hand, RC6 is also the worst cipher to parallelize: while
`in Rijndael, more than 2.5 µoperations are executed per a cycle, RC6 can only mildly
`use the super-scalar parallelism of Pentium II. More cipher-specific comments will be
`given in the next.
`
`5 Cipher-Specific Comments
`
`5.1 MARS
`
`In the case of MARS [BCD+98], the speed difference between a carefully optimized
`C implementation (using a recent snapshot of the gcc compiler) and an optimized as-
`sembly language implementation is only about 11% on the Pentium II. The speedup
`
`8
`
`MOBILEIRON, INC. - EXHIBIT 1009
`Page 008
`
`

`

`MARS encryption
`prewhitening
`forward mixing
`@core (×16)
`backward mixing
`postwhitening
`total
`RC6 encryption
`prewhitening
`@round (×20)
`postwhitening
`total
`Rijndael encryption
`whitening
`@round (×9)
`last round
`total
`Twofish encryption
`prewhitening
`first round
`@round (×15)
`postwhitening
`total
`
`II
`
`II
`
`II
`
`port 0 port 1 port 01 port 2 port 3 port 4 total
`(1.87 µops/cycle)
`13
`125
`18
`125
`21
`4
`4
`4 572
`4
`(1.47 µops/cycle)
`9
`15
`20
`5
`5
`5 329
`5
`(2.54 µops/cycle)
`15
`58
`64
`3
`3
`3 601
`3
`(2.11 µops/cycle)
`13
`34
`35
`23
`4
`4 595
`
`16
`6
`16
`
`128
`
`8
`
`1
`1
`
`160
`
`I
`
`1
`1
`
`I
`
`1
`1
`3
`13
`
`I
`
`I
`
`I
`
`1
`1
`
`I
`
`4
`4
`40
`
`5
`6
`2
`97
`
`5
`77
`9
`85
`8
`319
`
`2
`5
`4
`106
`
`8
`34
`31
`345
`
`5
`19
`19
`8
`317
`
`8
`32
`3
`32
`4
`124
`
`7
`2
`5
`52
`
`I
`
`6
`19
`20
`197
`
`I
`
`8
`10
`10
`4
`172
`
`I
`
`I
`
`I
`
`I
`
`I
`
`I
`
`4
`4
`
`I
`
`II
`
`II
`
`II
`
`Table 2. Number of µoperations in our implementations
`
`comes mainly from a slightly more efficient allocation of the integer registers and some
`(minimal) usage of the MMX instructions in the assembly implementation. However,
`the MMX technology is only moderately useful for MARS, since the complex instruc-
`tions performed in MARS (i.e., 32-bit multiplication, data-dependent rotation and S-
`box lookups) are not available for the MMX registers. Additionally, due to the high
`data-dependency there is very limited freedom in meaningfully rescheduling the in-
`structions in MARS, which also means that one cannot avoid all the delays on all the
`processor operating stages.
`Another drawback is that during MARS encryption, some execution ports are con-
`siderably more overloaded than others. Namely, more than 78% of µoperations go either
`to port 0 or 1. The most overloaded is port 0, since 128 µoperations go only to this port
`— including 16 multiplications and extensively used rotations.
`
`5.2 RC6
`
`From implementers point of view, problems arising when optimizing an RC6 imple-
`mentation are similar to those arising when coding MARS in many aspects: both ci-
`phers rely on the same complex instructions, have long critical paths and overloaded
`
`9
`
`MOBILEIRON, INC. - EXHIBIT 1009
`Page 009
`
`

`

`port 0. However, since RC6 uses multiplications even more extensively, it is even less
`parallelizable. Table 2 shows that our implementation includes 160 port 0 µoperations,
`which includes 40 multiplications with latency 4.
`RC6 is a very Pentium II-friendly cipher, and it is very easy to code it even in the
`assembly language. It can also be very efficiently implemented in C: the speed differ-
`ence between a C implementation and an assembly implementation is about 18%. (The
`difference is bigger than in the case of MARS since gcc, the test compiler, performs
`very poorly in translating the quadratic formulas of type x · (2x + 1) to the Pentium II
`assembly language.) It is straightforward to obtain an optimized assembly language
`implementation from the C implementation: one does not have many possibilities to
`reschedule the code.
`
`5.3 Rijndael
`
`As opposed to MARS and RC6, Rijndael [DR98] is not C-friendly (at least not gcc-
`friendly) in the sense that assembly implementation is about 44% slower than gcc-
`implementation of the same cipher. It is however mainly due to the inefficiency of the
`gcc compiler: our implementation of Rijndael makes very heavy use of the MMX
`technology, but also of 8-bit instructions provided by Pentium family. However, gcc
`cannot efficiently use either of these.
`Rijndael can effectively use the MMX since Rijndael is based only on most simple
`imaginable operations (load, xor), all of which are supported by the MMX technol-
`ogy. Additionally, since Rijndael has large internal parallelism (at least four-times, but
`partially up to 16-times parallelism!), there is a large number of possibilities to resched-
`ule its code. Our implementation was obtained by doing so in a way that all the delays
`in the different stages of the Pentium II operation would be minimized. The final result
`is very impressive for the Pentium II: it executes 2.54 µoperations per a cycle.
`Not the last factor that makes Rijndael suitable for the Pentium II is the fact that
`almost exactly one third of the µoperations in our implementation of Rijndael go to
`port 2, while the remaining 2/3 of µoperations go to ports 0 and 1. Due to this and
`parallelism we get that during the Rijndael encryption 3 µoperations could be executed
`in parallel almost all the time. However, this (not to mention other aspects like decoding
`and fetching delays) also makes 20 cycles per round a lower bound for Rijndael and
`shows that our result may be very close to the optimal one. To facilitate more efficient
`implementations, the Pentium II should feature three ALUs, two concurrent memory
`access ports and also more decoders and retirement units: features that are not cipher-
`specific and would improve the speed of most of the applications.
`Finally, we measured the timings of r-round Rijndael for variable r without any
`additional fine-tuning: those implementations are unoptimized since they use the same
`round macros as the 10-round Rijndael without any additional effort to optimize them
`to reduce, say, fetching delays. In particular it turned out that 8-round Rijndael (essen-
`tially equivalent to the cipher Square [DKR97] from the implementers point of view)
`encrypts a block in 193 cycles. 192-bit Rijndael (12 rounds) took 286 cycles, and 256-
`bit Rijndael (14 rounds)—333 cycles. Note that since 12-round Rijndael is very similar
`to Crypton [Lim98], 286 cycles is also a (hopefully) close approximation for the speed
`of latter.
`
`10
`
`MOBILEIRON, INC. - EXHIBIT 1009
`Page 010
`
`

`

`5.4 Twofish
`
`Twofish is designed to be well-suited on multiple platforms, including also the Pen-
`tium II. From the implementers point of view it resembles Rijndael in many aspects, by
`using only simple instructions but also some large-scale components of the latter (e.g.,
`MDS, to provide diffusion). Due to the use of low-level instructions, Twofish is also
`relatively slow in C compared to the assembly (the difference is about 37%).
`Main difference for implementers between Rijndael and Twofish is the inclusion
`of the Pseudo-Hadamard Transformation that somehow complicates Rijndael’s clear
`structure and makes it less parallelizable: while the number of µoperations in our im-
`plementation of Twofish is less than in our implementation of Rijndael, it turned out
`to be very difficult to use the MMX technology to optimize Twofish. Hence, Twofish
`is only moderately parallelizable, although the parallelism of our implementation (2.11
`µoperations per cycle) is relatively good.
`
`6 Conclusion and Work in Progress
`
`We achieved the fastest implementations of four of the AES finalists on the Pentium II
`processor, obtaining speedup 8% . . . 26% compared to the previously known implemen-
`tations. Since all implementations were coded by using the same sensible assumptions,
`they provide a more adequate efficiency comparison of the AES finalists than the pre-
`vious papers. We demonstrated that MMX can be quite efficiently used to speedup
`Rijndael, but is only moderately useful for other ciphers. (However, our implemen-
`tations depend on the availability of MMX technology to a lesser or greater extent
`and in general do not run on the Pentium Pro.) We provided full specification on our
`time-measurement conditions to simplify for the future implementers to compare their
`implementations to ours.
`Our implementations are not the final: we continue optimizing them. Up-to-date
`results will be available at the AES efficiency table [Lip99].
`
`References
`
`[ABK98]
`
`Ross Anderson, Eli Biham, and Lars Knudsen. Serpent: A Flexible Block Cipher
`With Maximum Assurance. In The First Advanced Encryption Standard Candidate
`Conference, Ventura, California, USA, 20–22 August 1998.
`[BCD+98] Carolynn Burwick, Don Coppersmith, Edward D’Avignon, Rosario Gennaro,
`Shai Halevi, Charanjit Jutla, Stephen M. Matyas Jr., Luke O’Connor, Moham-
`mad Peyravian, David Safford, and Nevenko Zunic. MARS — A Candi-
`date Cipher for AES. Original paper and a tweak to it are available from
`http://www.research.ibm.com/security/mars.html, June 1998.
`Joan Daemen, Lars Knudsen, and Vincent Rijmen. The Block Cipher Square. In
`Eli Biham, editor, Fast Software Encryption ’97, volume 1267 of Lecture Notes in
`Computer Science, pages 149–165, Haifa, Israel, January 1997. Springer-Verlag.
`Joan Daemen and Vincent Rijmen. The Block Cipher Rijndael. In Third Smart Card
`Research and Advanced Applications Conference Proceedings, 1998. To appear.
`
`[DKR97]
`
`[DR98]
`
`11
`
`MOBILEIRON, INC. - EXHIBIT 1009
`Page 011
`
`

`

`[FIP77]
`
`[Fog00]
`
`[Gla99]
`
`[Int99]
`
`[Int00]
`
`[Lim98]
`
`[Lip98]
`
`FIPS. Data Encryption Standard. Technical report, U.S. Department of Com-
`merce/National Bureau of Standards, National Technical Information Service,
`Springfield, Virginia, 1977. FIPS 46.
`Agner Fog. How to Optimize for the Pentium Microprocessors. Available from
`http://www.agner.com/assem/, 11 March 2000.
`Brian Gladman. AES algorithm efficiency. Unpublished. Information available
`from http://www.btinternet.com/˜brian.gladman/ cryptogra-
`phy technology/, January 1999.
`-
`Intel.
`Intel Architecture Optimization. Reference Manual, 1999. Order Number
`245127-001.
`Intel. Willamette Processor Software Developer’s Guide, February 2000. Order
`Number 245355-001.
`Specification and Analysis of CRYPTON Version 1.0.
`Chae Hoon

This document is available on Docket Alarm but you must sign up to view it.


Or .

Accessing this document will incur an additional charge of $.

After purchase, you can access this document again without charge.

Accept $ Charge
throbber

Still Working On It

This document is taking longer than usual to download. This can happen if we need to contact the court directly to obtain the document and their servers are running slowly.

Give it another minute or two to complete, and then try the refresh button.

throbber

A few More Minutes ... Still Working

It can take up to 5 minutes for us to download a document if the court servers are running slowly.

Thank you for your continued patience.

This document could not be displayed.

We could not find this document within its docket. Please go back to the docket page and check the link. If that does not work, go back to the docket and refresh it to pull the newest information.

Your account does not support viewing this document.

You need a Paid Account to view this document. Click here to change your account type.

Your account does not support viewing this document.

Set your membership status to view this document.

With a Docket Alarm membership, you'll get a whole lot more, including:

  • Up-to-date information for this case.
  • Email alerts whenever there is an update.
  • Full text search for other cases.
  • Get email alerts whenever a new case matches your search.

Become a Member

One Moment Please

The filing “” is large (MB) and is being downloaded.

Please refresh this page in a few minutes to see if the filing has been downloaded. The filing will also be emailed to you when the download completes.

Your document is on its way!

If you do not receive the document in five minutes, contact support at support@docketalarm.com.

Sealed Document

We are unable to display this document, it may be under a court ordered seal.

If you have proper credentials to access the file, you may proceed directly to the court's system using your government issued username and password.


Access Government Site

We are redirecting you
to a mobile optimized page.





Document Unreadable or Corrupt

Refresh this Document
Go to the Docket

We are unable to display this document.

Refresh this Document
Go to the Docket