`
`Kazumaro Aoki1 and Helger Lipmaa2
`
`1 NTT Laboratories
`1-1 Hikarinooka, Yokosuka-shi, Kanagawa-ken, 239-0847 Japan
`maro@isl.ntt.co.jp
`2 K¨uberneetika AS
`Akadeemia tee 21, 12618 Tallinn, Estonia
`helger@cyber.ee
`
`Abstract. Of the five AES finalists four—MARS, RC6, Rijndael, Twofish—
`have not only (expected) good security but also exceptional performance on the
`PC platforms, especially on those featuring the Pentium Pro, the NIST AES
`analysis platform. In the current paper we present new performance numbers
`of the mentioned four ciphers resulting from our carefully optimized assembly-
`language implementations on the Pentium II, the successor of the Pentium Pro.
`All our implementations follow well-defined API and timing conventions and
`sensible guidelines, like no using of self-modifying code and key-specific static
`data — i.e., tricks that speed up the implementation but at the same time restrict
`the field of application. Our implementations are up to 26% percent faster than
`previous implementations. Our work also shows how a simple change (inclu-
`sion of the MMX technology) in the analysis platform can influence the relative
`encryption speed of different ciphers. To enable everyone to compare their imple-
`mentations to ours, we also fully specify our procedures used to obtain the speed
`numbers.
`
`1
`
`Introduction
`
`For more than 20 years, DES [FIP77] has been a widely employed cryptographic stan-
`dard. While the best cryptanalytic attacks against DES (differential and linear cryptanal-
`ysis) are still highly impractical, during the last years DES has became obsolete for its
`too short key and block sizes, not withstanding the current advances in computing tech-
`nology. Motivated by this, NIST initiated a new effort to replace DES as a standard. 21
`algorithms were submitted and 15 algorithms were accepted as AES (Advanced Encryp-
`tion Standard) candidates, of which 5 candidates—MARS [BCD+98], RC6 [RRSY98],
`Rijndael [DR98], Serpent [ABK98], Twofish [SKW+99b]—were chosen to the second
`round.
`However, the AES process was started not only due to the theoretical reasons: there
`are a few well-known constructions, including 3DES, that seem to have very good secu-
`rity margins. Unfortunately, 3DES, based on the hardware-oriented DES, is unsatisfy-
`ingly slow on the modern 32- and 64-bit computer architectures: modern block ciphers
`are up to 10 times faster than 3DES. Regardless of these ciphers having unproven (even
`by time) security properties, they are widely used in the industry by pragmatic reasons:
`hardware applications like 1 GBits/s Ethernet or on-the-fly encryption of 160 MByte/s
`
`MOBILEIRON, INC. - EXHIBIT 1009
`Page 001
`
`
`
`SCSI hard disks are requesting for faster ciphers. Clearly, the situation of having a
`(moderately) secure and (moderately) fast de jure standard DES, a (probably) secure
`and (clearly) slow de facto standard 3DES and some fast but with unknown security
`margin de facto standards is not acceptable: there should be a single standard that is
`both secure and fast. This is one of the reasons why, when inviting the public to pro-
`pose candidates for the AES, NIST explicitly stated that the new standard should be
`both “more secure and faster” than 3DES.
`While security of the candidates cannot be exactly quantified by the currently known
`methods, it seems to be easier to measure their speed. However, there is still a lot of
`ambiguity in answering the question what AES candidate is the fastest. Several pa-
`pers (including [Lip99,SKW+99a]) have compared AES candidates speed, but since
`the implementations quoted in them are often incomparable (or based on pure estima-
`tions), one cannot make direct conclusions about the efficiency of the ciphers based
`on the published papers. Incomparability stems from the different implementation as-
`sumptions, API’s, hardware (e.g., processors) and software (e.g., compilers) used by
`implementers. Even more, some of the timings presented in previous papers correspond
`to “show-case” (as opposed to practically applicable) implementations, some exam-
`ples of those being the fastest implementation of Twofish [SKW+99b] that uses self-
`modifying code and Brian Gladman’s implementations of AES candidates [Gla99] that
`use a number of key-specific static variables instead of allocating a register to address
`them, therefore effectively freeing some registers for other uses. Especially in the case
`of the Pentium family, where the number of available registers is very restricted, such
`implementations may result in a huge speed up. However, both types of implementation
`tricks restrict the application area of the implementation.
`In the current paper we try to give a satisfactory answer to the question “what cipher
`is the fastest on the Pentium II” by carefully optimizing the 4 fastest AES candidates—
`MARS, RC6, Rijndael and Twofish—in Pentium II assembly, using for all implementa-
`tions exactly the same, reasonable in practice, API and speed measurement conditions
`for all the ciphers. Due to this, our results are much fairer than most of the previously
`known ones: our implementations can be seen as black boxes applicable in almost any
`possible application of block ciphers on an environment featuring Pentium II. Addi-
`tionally, careful optimization process resulted in implementations that are clearly faster
`than the previously known implementations. (Except for Twofish, which has still a faster
`“show-case” implementation.)
`We start the paper by describing our platform of choice (Section 2), implementation
`philosophy and API (Section 3). Section 4 briefly surveys our results, and Section 5
`gives more details on the problems encountered when implementing the ciphers. More
`information about the Pentium II is given in the Appendices.
`
`2 Choice of the Platform
`
`Our first principal choice was the decision what processor to use. By purely pragmatic
`reasons we decided that the implementation environment equips an Intel Pentium family
`CPU: while this family is not the most modern processor family available, it is the most
`widespread one at the moment of writing this paper and most probably also during the
`
`2
`
`MOBILEIRON, INC. - EXHIBIT 1009
`Page 002
`
`
`
`next few years. Therefore, since in the foreseeable future most of the software-based
`commercial security applications run on the Pentium family (as recognized also by the
`AES finalists designers), this family has the most direct impact on the choice of a cipher
`by security consumers.
`At second, from the Pentium family we decided to choose the Pentium II processor.
`At first, it is a more advanced processor than Pentium Pro, the NIST AES analysis
`platform: the Pentium II provides (twice) larger register space due to the added MMX
`technology, and many new MMX-specific commands. Compared to the Pentium Pro,
`the Pentium II is also easier to obtain at the current stage, since Pentium Pro has been
`out of the manufacturing for a while. On the other hand, the Pentium II was preferred
`by the authors to the Pentium III since the latter is somewhat too new and controversial
`due to the privacy issues.
`Another reason to choose Pentium II was that as the successor of the NIST AES
`analysis platform, implementing the AES candidates on the Pentium II could provide
`some insights on how generally suitable are the candidates, some of which were specif-
`ically optimized for the Pentium Pro, on future processors having features unpredicted
`by algorithm designers. While this is not as crucial as withstanding the “future attacks”,
`it still gives some ideas about the possible longevity of the cipher. (We clearly would
`not want the AES in 20 years to have the role the 3DES has today!)
`As shown in [Lip98], the MMX technology can seriously speed up IDEA ([LM90],
`[LMM94]), one of the believably most secure block ciphers with 64-bit block size. As
`stated in [Lip98], this can be done since IDEA has its key attributes similar to those
`of multimedia applications, for which the MMX technology was originally created. An
`open question posed in [Lip98] was how much would the MMX technology help imple-
`menting other ciphers, including the AES candidates. In the following we will partially
`answer to that question, showing that also some ciphers using only “simple” operations
`can greatly benefit from the added MMX technology. A short overview of Pentium II
`that is necessary for implementers and for cryptographers who design ciphers optimized
`for this platform is given in Appendix A. We refer for Intel manuals for a more complete
`overview.
`
`3
`
`Implementation Considerations
`
`Several papers (including, in particular, [Lip99,SKW+99a]) have compared AES can-
`didates speed, but since the implementations quoted in them are often incomparable (or
`based on pure estimations), one cannot make direct conclusions about the efficiency of
`these algorithms based on the published papers. Incomparability stems from the differ-
`ent implementation assumptions, API’s, hardware (processors) and software (compil-
`ers) platforms used by implementers. Even more, some of the numbers there correspond
`to the “show-case” (as opposed to practically applicable) implementations; including
`the bizarre case that one candidate was claimed to be the fastest on its inventors laptop
`under some suitable conditions.
`As another example of the unsuitability of some “show-case” implementations, the
`fastest implementation of Twofish [SKW+99b] uses self-modifying code and therefore
`cannot be used in a number of applications, while Brian Gladman’s implementations of
`
`3
`
`MOBILEIRON, INC. - EXHIBIT 1009
`Page 003
`
`
`
`AES candidates [Gla99] use a number of key-specific static variables instead of allo-
`cating a register to address them, therefore effectively freeing some registers for other
`uses. Especially in the case of the Pentium family, where the number of available reg-
`isters is very restricted, such implementations may result in a huge speed up. On the
`other hand, Gladman’s implementations cannot be used several applications, including
`multithreaded programs and SMP (symmetric multi-processing) systems.
`Most of the security customers need however speed numbers applicable in whatever
`product they use in whatever environment in runs (for example, in a Linux kernel-
`supported IPSEC implementation, secure login or multithreaded access to encrypted
`storage arrays). For users it is necessary to know in what environment the measured
`speed numbers were obtained, to be able to calculate the possible efficiency of the
`ciphers in their own environments. Additionally, full specification is important for other
`implementers to be able to compare their implementations with ours. Hence, apart from
`providing “clean” implementations under some reasonable public assumptions, we shall
`also next fully specify these assumptions:
`
`– We do not use self-modifying code (“code compilation” [SKW+99b]) since it
`makes the implementation inapplicable in a number of situations, e.g., in operation-
`system kernel and ROM-based applications.
`– We additionally decided not to use key-specific static areas since then the imple-
`mentation could not be used, e.g., in SMP-capable systems and multithreaded pro-
`grams.
`– We decided to maximally use the MMX technology since it should not be forbidden
`in any reasonable modern environment. (While using self-modifying code and key-
`specific static areas is generally considered to be a bad programming practice.)
`– We decided to use exactly the same API (specified later in Section 3.1) in all our
`implementations.
`– A number of well-understood assumptions that 1) improve the speed and can be
`easily followed by implementers or 2) are essential to even be able to measure the
`speed:
`• All codes and data are correctly aligned.
`• Input and output texts and codes are preloaded to L1 cache in the possible
`extent to reduce the number of cache misses.
`• Simplicity of code: we tried to reduce time spent during writing and optimiz-
`ing the code. In particular, all our implementations use highly optimized but
`round-number independent round macros. (Hence, our results could be slightly
`bettered if every round would optimized separately to avoid, e.g., delays in
`fetching stage.)
`
`3.1 API
`
`Since a different API can be influence the speed of an implementation severely, we also
`decided to fully specify the API used by us to make for the other implementers easier
`to compare their implementations to the ours. We felt that this is necessary, since AES
`candidate implementations reported in [Lip99] vary greatly in their API’s.
`
`4
`
`MOBILEIRON, INC. - EXHIBIT 1009
`Page 004
`
`
`
`void xxKS(char *master, uint32 bitLen, char *eKey);
`void xxEnc(char *inBlk, uint32 lenBlk, char *eKey,
`char *outBlk);
`void xxDec(char *inBlk, uint32 lenBlk, char *eKey,
`char *outBlk);
`
`where
`
`xx is algorithm name (e.g., Rijndael).
`xxKS is key scheduling subroutine.
`xxEnc is encryption subroutine that encrypts lenBlk blocks of plaintext starting from the
`address inBlk to the ciphertext location outBlk, by using extended key eKey, in ECB
`block cipher mode.
`xxDec is decryption subroutine with the same input conventions as xxEnc.
`uint32 is the type of 32-bit unsigned integers (in the case of Pentium II, equal to unsigned
`long in the case of most compilers).
`master is pointer to the master key bits.
`bitLen is the bit length of a master key.
`eKey is pointer to subkeys and other initialization data, used later by encryption and decryption.
`inBlk is pointer to input texts to be encrypted in the case of xxEnc and to be decrypted in the
`case of xxDec.
`outBlk is pointer to the corresponding output texts.
`lenBlk is number of blocks to be encrypted or decrypted.
`
`Fig. 1. Specification of our API.
`
`Note that our API, depicted in Figure 1, is essentially equivalent to the API’s used
`in most of the commercial applications, specifying only those inputs and outputs to the
`algorithms that are really needed by the algorithms. (Names of the subroutines and their
`parameters of course do not affect the speed, of course.) Our API was fixed for the key
`length of 128-bits due to the feeling that at the time when greater key sizes become
`necessary, our implementation platform would already be a history.
`Here, the key schedule and decryption subroutines are specified only for complete-
`ness. Since in the current paper we are not interested in the optimization of these sub-
`routines, we almost do not mention decryption and key schedules hereafter.
`
`3.2 How to Measure a Number of Cycles
`
`Different time measurement methods may change the speed numbers quite dramati-
`cally. As in the case of the API’s, we decided to use one, sensible published and fully
`specified convention (specified in Figure 2) for all the implementations. (Note that this
`wrapping corresponds almost exactly to the method specified in [Fog00], to which the
`reader is referred for a throughout explanation of the method.) The inputs and key of
`the cipher are generated randomly before the measurement begins, to prevent “opti-
`mization” for specific class of keys. The input variable lenBlk was chosen to be equal
`to 8000 so that the input and output texts would not fit in the L1 cache. Also, time is
`a work area of type uint32, used in later calculations.
`
`5
`
`MOBILEIRON, INC. - EXHIBIT 1009
`Page 005
`
`
`
`movd mm0, dword ptr [time]; /* warm cache and set MMX state */
`xor eax, eax;
`cpuid;
`rdtsc;
`mov dword ptr [time], eax;
`xor eax, eax;
`cpuid;
`/* xxEnc() or xxDec() */
`xor eax, eax;
`cpuid;
`rdtsc;
`sub dword ptr [time], eax;
`emms;
`
`/* serialize instructions */
`/* read time-stamp counter */
`/* save counter */
`
`/* serialize instructions */
`
`/* serialize instructions */
`/* read time-stamp counter */
`/* compute the difference */
`/* empty MMX state */
`
`Note that time is a 4 bytes work area.
`
`Fig. 2. Time measurement code
`
`/* push all used registers */
`cmp dword ptr [lenBlk], 0;
`jz L1;
`align 16;
`
`L0:
`
`L1:
`
`dec dword ptr [lenBlk];
`jnz L0;
`
`/* pop these registers once more */
`
`Fig. 3. Null function
`
`Note that this method has some overhead, due to both high latency of the rdtsc
`instructions and also the overhead caused by looping instructions like jnz which are
`not formally part of the cipher itself. (Looping instructions can be seen as a part of
`the block cipher mode, however.) We measure this overhead by using the null function
`shown in Fig. 3 obtaining a value nulltime, and then we subtract it from the value of
`time obtained by measuring the speeds of different encryption/decryption procedures.
`Finally, this result is divided by the number of blocks encrypted. Intuitively, by using
`this method we obtain the number of cycles corresponding to unrolled implementation
`of the block cipher, or to the implementation where we only care about the time en-
`crypting one block takes without adding any extra overhead. (Note that the subtracted
`overhead number was equal to ≈ 6 in the case n = 8000. One could easily add this
`number to those presented later to get the number of cycles with overhead.)
`Chosen time measurement method is also reasonable in practice: when the value
`of lenBlk was chosen to be different, for most of the implementations (including
`the implementation of null cipher), the execution times increased by almost the same
`constant. Hence, the null cipher proved experimentally to be well-defined.
`
`6
`
`MOBILEIRON, INC. - EXHIBIT 1009
`Page 006
`
`
`
`Cipher
`
`Best previous result Speedup
`
`Cycles per
`Mbits/s on a 450
`block
`MHz Pentium II
`—
`6
`Null cipher —
`243 [Riv98]
`223
`RC6
`258 Mbits/s
`320 [DR98]
`237
`Rijndael
`243 Mbits/s
`315 [SKW+99b]
`282
`Twofish
`204 Mbits/s
`I
`I
`I
`I
`I
`I
`390 [BCD+98]
`306
`MARS
`188 Mbits/s
`Table 1. Performance in clock cycles per block of output of four AES finalists. (Only encryption
`considered)
`
`—
`8%
`26%
`11%
`22%
`
`Finally, we did a loop of 500 times over the described measurements and then chose
`the smallest number for every cipher, since that corresponds most likely to the case
`where most of the data and code are in L1 cache and the branch prediction works suc-
`cessfully: i.e., to the bulk encryption speed of the cipher itself.
`
`4
`
`Implementation Results
`
`From the five AES finalists, one (Serpent) is regarded as a very conservative design
`but at the same time also being clearly slower than the other AES finalists. Rest of the
`finalists have comparable timings on most of the modern computer platforms, where
`one of the ciphers is the fastest in one platform, and another one in another platform.
`Since also on the Pentium II processor, Serpent seems to be very slow by the published
`data, we decided postpone its implementation to the future and concentrate on the fast
`ciphers.
`Timings, obtained by measuring the speed of implementations by following pre-
`viously specified procedures are summarized in Table 11. The numbers in the middle
`columns show how many cycles it takes to encrypt one 128-bit block by using the cho-
`sen cipher with a 128-bit key. These results indicate that the chosen four AES finalists
`are extremely fast. For comparison, the standard hash algorithm SHA-1 hashes a 512-
`bit block in 837 cycles (i.e., 13.1 cycles per byte) and DES and 3DES encrypt a 64-bit
`block respectively in 340 and 928 cycles (resp., 42.5 and 116 cycles per byte) [PRB98],
`while RC6 and Rijndael respectively encrypt a 128-bit block in 223 and 237 cycles
`(resp., 13.9 and 14.8 cycles per byte). However, note that the cited timings in [PRB98]
`were obtained on a plain Pentium and therefore could most probably be improved on
`the Pentium II.
`Our results seem to indicate, that the speed difference between different ciphers is
`less than expected: as before, RC6 is still the fastest cipher on the Pentium II, but the
`difference between it and Rijndael has decreased seriously. Hence we hesitate to say
`that RC6 is the fastest cipher. However, based on the cited results, we can classify the
`ciphers to two groups: blastingly fast ciphers RC6 and Rijndael and somewhat slower,
`but still very fast ciphers Twofish and MARS.
`1 We also started to code the decryption routines, finishing RC6 decryption (209 cycles per
`block) and Twofish decryption (276 cycles per block).
`
`7
`
`MOBILEIRON, INC. - EXHIBIT 1009
`Page 007
`
`
`
`However, one has to keep in mind that RC6 and MARS have design features that
`make them specifically efficient on the Pentium Pro (and its successors), while their
`performance seriously degrades on other processors [Lip99,SKW+99a]. This is due to
`the use of complex instructions (32-bit multiplication and data-dependent rotation) that
`are cheap on the P6 family (Pentium Pro, Pentium II, Celeron, Xeon and Pentium III)
`but very expensive on most of the other platforms. Interestingly, also the next generation
`Pentium processor (code-named “Willamette”, [Int00]) has latency 10 multiplication
`and latency 2 or 4 shifts, as compared to latency 4 multiplication and latency 1 shifts on
`the P6 family [Int00, Section 4.1.3]. Hence, RC6 and MARS would considerably slow
`down on the Willamette, the next generation Pentium family processor. On the other
`hand, Rijndael and Twofish are based on simple operations, and run equally well on
`all platforms. The speed ratio between Rijndael and Twofish seems be remain almost
`the same on the other platforms [Lip99] (namely, Rijndael being 5 . . . 25% faster than
`Twofish).
`Note that the speed up percents in Table 1 correspond to the achieved speed ups
`compared to the fastest “clean” implementations (i.e., those not using key-specific static
`data or self-modifying code). However, these percents do not always mean that our
`implementation techniques were exactly as much better. For example, the best previous
`implementation of Rijndael was done for the plain Pentium, but not for the Pentium Pro:
`a factor that may have negatively affected its performance. The best previous “clean”
`implementation of MARS was written in C, and therefore had also a relatively slow
`performance. However, our own C implementation of MARS is clearly faster than the
`one given in Table 1. In the case of Rijndael, most of the acceleration Rijndael is due
`to the efficient use MMX technology. In general, speed up comes mainly from better
`optimization (elaborated tradeoff between processor operating stages) and full usage of
`the Pentium II possibilities (i.e., the MMX technology).
`To further clarify how does the Pentium II architecture impact the speed, Table 2
`shows the detailed information of our implementations in encryption mode in the micro-
`operation level. Usage of the table is simple. For example, in the intersection point of
`“@round” row and “port 01” column in TwofishEnc table one would find 19. That
`means that there are 19 µoperations in the round function of TwofishEnc which will
`be executed on port 0 or port 1.
`Interestingly, our implementations of MARS, Rijndael and Twofish all require ap-
`proximately the same number of µoperations, while RC6 is about two times “better”
`in this category. On the other hand, RC6 is also the worst cipher to parallelize: while
`in Rijndael, more than 2.5 µoperations are executed per a cycle, RC6 can only mildly
`use the super-scalar parallelism of Pentium II. More cipher-specific comments will be
`given in the next.
`
`5 Cipher-Specific Comments
`
`5.1 MARS
`
`In the case of MARS [BCD+98], the speed difference between a carefully optimized
`C implementation (using a recent snapshot of the gcc compiler) and an optimized as-
`sembly language implementation is only about 11% on the Pentium II. The speedup
`
`8
`
`MOBILEIRON, INC. - EXHIBIT 1009
`Page 008
`
`
`
`MARS encryption
`prewhitening
`forward mixing
`@core (×16)
`backward mixing
`postwhitening
`total
`RC6 encryption
`prewhitening
`@round (×20)
`postwhitening
`total
`Rijndael encryption
`whitening
`@round (×9)
`last round
`total
`Twofish encryption
`prewhitening
`first round
`@round (×15)
`postwhitening
`total
`
`II
`
`II
`
`II
`
`port 0 port 1 port 01 port 2 port 3 port 4 total
`(1.87 µops/cycle)
`13
`125
`18
`125
`21
`4
`4
`4 572
`4
`(1.47 µops/cycle)
`9
`15
`20
`5
`5
`5 329
`5
`(2.54 µops/cycle)
`15
`58
`64
`3
`3
`3 601
`3
`(2.11 µops/cycle)
`13
`34
`35
`23
`4
`4 595
`
`16
`6
`16
`
`128
`
`8
`
`1
`1
`
`160
`
`I
`
`1
`1
`
`I
`
`1
`1
`3
`13
`
`I
`
`I
`
`I
`
`1
`1
`
`I
`
`4
`4
`40
`
`5
`6
`2
`97
`
`5
`77
`9
`85
`8
`319
`
`2
`5
`4
`106
`
`8
`34
`31
`345
`
`5
`19
`19
`8
`317
`
`8
`32
`3
`32
`4
`124
`
`7
`2
`5
`52
`
`I
`
`6
`19
`20
`197
`
`I
`
`8
`10
`10
`4
`172
`
`I
`
`I
`
`I
`
`I
`
`I
`
`I
`
`4
`4
`
`I
`
`II
`
`II
`
`II
`
`Table 2. Number of µoperations in our implementations
`
`comes mainly from a slightly more efficient allocation of the integer registers and some
`(minimal) usage of the MMX instructions in the assembly implementation. However,
`the MMX technology is only moderately useful for MARS, since the complex instruc-
`tions performed in MARS (i.e., 32-bit multiplication, data-dependent rotation and S-
`box lookups) are not available for the MMX registers. Additionally, due to the high
`data-dependency there is very limited freedom in meaningfully rescheduling the in-
`structions in MARS, which also means that one cannot avoid all the delays on all the
`processor operating stages.
`Another drawback is that during MARS encryption, some execution ports are con-
`siderably more overloaded than others. Namely, more than 78% of µoperations go either
`to port 0 or 1. The most overloaded is port 0, since 128 µoperations go only to this port
`— including 16 multiplications and extensively used rotations.
`
`5.2 RC6
`
`From implementers point of view, problems arising when optimizing an RC6 imple-
`mentation are similar to those arising when coding MARS in many aspects: both ci-
`phers rely on the same complex instructions, have long critical paths and overloaded
`
`9
`
`MOBILEIRON, INC. - EXHIBIT 1009
`Page 009
`
`
`
`port 0. However, since RC6 uses multiplications even more extensively, it is even less
`parallelizable. Table 2 shows that our implementation includes 160 port 0 µoperations,
`which includes 40 multiplications with latency 4.
`RC6 is a very Pentium II-friendly cipher, and it is very easy to code it even in the
`assembly language. It can also be very efficiently implemented in C: the speed differ-
`ence between a C implementation and an assembly implementation is about 18%. (The
`difference is bigger than in the case of MARS since gcc, the test compiler, performs
`very poorly in translating the quadratic formulas of type x · (2x + 1) to the Pentium II
`assembly language.) It is straightforward to obtain an optimized assembly language
`implementation from the C implementation: one does not have many possibilities to
`reschedule the code.
`
`5.3 Rijndael
`
`As opposed to MARS and RC6, Rijndael [DR98] is not C-friendly (at least not gcc-
`friendly) in the sense that assembly implementation is about 44% slower than gcc-
`implementation of the same cipher. It is however mainly due to the inefficiency of the
`gcc compiler: our implementation of Rijndael makes very heavy use of the MMX
`technology, but also of 8-bit instructions provided by Pentium family. However, gcc
`cannot efficiently use either of these.
`Rijndael can effectively use the MMX since Rijndael is based only on most simple
`imaginable operations (load, xor), all of which are supported by the MMX technol-
`ogy. Additionally, since Rijndael has large internal parallelism (at least four-times, but
`partially up to 16-times parallelism!), there is a large number of possibilities to resched-
`ule its code. Our implementation was obtained by doing so in a way that all the delays
`in the different stages of the Pentium II operation would be minimized. The final result
`is very impressive for the Pentium II: it executes 2.54 µoperations per a cycle.
`Not the last factor that makes Rijndael suitable for the Pentium II is the fact that
`almost exactly one third of the µoperations in our implementation of Rijndael go to
`port 2, while the remaining 2/3 of µoperations go to ports 0 and 1. Due to this and
`parallelism we get that during the Rijndael encryption 3 µoperations could be executed
`in parallel almost all the time. However, this (not to mention other aspects like decoding
`and fetching delays) also makes 20 cycles per round a lower bound for Rijndael and
`shows that our result may be very close to the optimal one. To facilitate more efficient
`implementations, the Pentium II should feature three ALUs, two concurrent memory
`access ports and also more decoders and retirement units: features that are not cipher-
`specific and would improve the speed of most of the applications.
`Finally, we measured the timings of r-round Rijndael for variable r without any
`additional fine-tuning: those implementations are unoptimized since they use the same
`round macros as the 10-round Rijndael without any additional effort to optimize them
`to reduce, say, fetching delays. In particular it turned out that 8-round Rijndael (essen-
`tially equivalent to the cipher Square [DKR97] from the implementers point of view)
`encrypts a block in 193 cycles. 192-bit Rijndael (12 rounds) took 286 cycles, and 256-
`bit Rijndael (14 rounds)—333 cycles. Note that since 12-round Rijndael is very similar
`to Crypton [Lim98], 286 cycles is also a (hopefully) close approximation for the speed
`of latter.
`
`10
`
`MOBILEIRON, INC. - EXHIBIT 1009
`Page 010
`
`
`
`5.4 Twofish
`
`Twofish is designed to be well-suited on multiple platforms, including also the Pen-
`tium II. From the implementers point of view it resembles Rijndael in many aspects, by
`using only simple instructions but also some large-scale components of the latter (e.g.,
`MDS, to provide diffusion). Due to the use of low-level instructions, Twofish is also
`relatively slow in C compared to the assembly (the difference is about 37%).
`Main difference for implementers between Rijndael and Twofish is the inclusion
`of the Pseudo-Hadamard Transformation that somehow complicates Rijndael’s clear
`structure and makes it less parallelizable: while the number of µoperations in our im-
`plementation of Twofish is less than in our implementation of Rijndael, it turned out
`to be very difficult to use the MMX technology to optimize Twofish. Hence, Twofish
`is only moderately parallelizable, although the parallelism of our implementation (2.11
`µoperations per cycle) is relatively good.
`
`6 Conclusion and Work in Progress
`
`We achieved the fastest implementations of four of the AES finalists on the Pentium II
`processor, obtaining speedup 8% . . . 26% compared to the previously known implemen-
`tations. Since all implementations were coded by using the same sensible assumptions,
`they provide a more adequate efficiency comparison of the AES finalists than the pre-
`vious papers. We demonstrated that MMX can be quite efficiently used to speedup
`Rijndael, but is only moderately useful for other ciphers. (However, our implemen-
`tations depend on the availability of MMX technology to a lesser or greater extent
`and in general do not run on the Pentium Pro.) We provided full specification on our
`time-measurement conditions to simplify for the future implementers to compare their
`implementations to ours.
`Our implementations are not the final: we continue optimizing them. Up-to-date
`results will be available at the AES efficiency table [Lip99].
`
`References
`
`[ABK98]
`
`Ross Anderson, Eli Biham, and Lars Knudsen. Serpent: A Flexible Block Cipher
`With Maximum Assurance. In The First Advanced Encryption Standard Candidate
`Conference, Ventura, California, USA, 20–22 August 1998.
`[BCD+98] Carolynn Burwick, Don Coppersmith, Edward D’Avignon, Rosario Gennaro,
`Shai Halevi, Charanjit Jutla, Stephen M. Matyas Jr., Luke O’Connor, Moham-
`mad Peyravian, David Safford, and Nevenko Zunic. MARS — A Candi-
`date Cipher for AES. Original paper and a tweak to it are available from
`http://www.research.ibm.com/security/mars.html, June 1998.
`Joan Daemen, Lars Knudsen, and Vincent Rijmen. The Block Cipher Square. In
`Eli Biham, editor, Fast Software Encryption ’97, volume 1267 of Lecture Notes in
`Computer Science, pages 149–165, Haifa, Israel, January 1997. Springer-Verlag.
`Joan Daemen and Vincent Rijmen. The Block Cipher Rijndael. In Third Smart Card
`Research and Advanced Applications Conference Proceedings, 1998. To appear.
`
`[DKR97]
`
`[DR98]
`
`11
`
`MOBILEIRON, INC. - EXHIBIT 1009
`Page 011
`
`
`
`[FIP77]
`
`[Fog00]
`
`[Gla99]
`
`[Int99]
`
`[Int00]
`
`[Lim98]
`
`[Lip98]
`
`FIPS. Data Encryption Standard. Technical report, U.S. Department of Com-
`merce/National Bureau of Standards, National Technical Information Service,
`Springfield, Virginia, 1977. FIPS 46.
`Agner Fog. How to Optimize for the Pentium Microprocessors. Available from
`http://www.agner.com/assem/, 11 March 2000.
`Brian Gladman. AES algorithm efficiency. Unpublished. Information available
`from http://www.btinternet.com/˜brian.gladman/ cryptogra-
`phy technology/, January 1999.
`-
`Intel.
`Intel Architecture Optimization. Reference Manual, 1999. Order Number
`245127-001.
`Intel. Willamette Processor Software Developer’s Guide, February 2000. Order
`Number 245355-001.
`Specification and Analysis of CRYPTON Version 1.0.
`Chae Hoon