`
`A Survey of Lightweight-
`Cryptography
`Implementations
`
`Thomas Eisenbarth
`Ruhr University Bochum
`
`Sandeep Kumar
`Philips Research Europe
`
`Christof Paar and Axel Poschmann
`Ruhr University Bochum
`
`Leif Uhsadel
`Catholic University of Leuven
`
`Editor’s note:
`The tight cost and implementation constraints of high-volume products,
`including secure RFID tags and smart cards, require specialized crypto-
`graphic implementations. The authors review recent developments in this
`area for symmetric and asymmetric ciphers, targeting embedded hardware
`and software.
`
`—Patrick Schaumont, Virginia Tech
`
`three
`to optimize all
`very difficult
`design goals at once. For example,
`a secure and high-performance hard-
`ware implementation can be achieved
`by a pipelined, side-channel-resistant
`architecture, resulting in a high area
`requirement, and thus high costs. On
`the other hand, it’s possible to design
`a secure, low-cost hardware implemen-
`tation with the drawback of limited performance.
`In this article, we present a selection of recently
`published lightweight-cryptography implementations
`and compare them to state-of-the-art results in their
`field. This survey covers recent hardware and software
`implementations of symmetric as well as asymmetric
`ciphers. We will discuss software and hardware
`implementations separately, because they have differ-
`ent and sometimes contrary characteristics. For
`example, bit permutations are virtually free in
`hardware, whereas in software they can significantly
`slow down implementations. Also, large substitution
`tables are often software friendly, but hardware
`realizations can be relatively costly. Finally,
`the
`evaluation metric is different: For software implemen-
`tations, we compare both RAM and ROM requirements
`and the required number of clock cycles. For
`hardware implementations, we focus on the required
`chip size and the number of clock cycles. We don’t
`compare power consumption for
`the hardware
`implementations, because different standard-cell tech-
`nologies were used and estimates from simulating
`environments are not accurate. Software implementa-
`tions let us achieve a rough estimate of power
`
`&THE UPCOMING ERA of pervasive computing will be
`
`characterized by many smart devices that—because
`of the tight cost constraints inherent in mass deploy-
`ments—have very limited resources in terms of
`memory, computing power, and battery supply. Here,
`it’s necessary to interpret Moore’s law differently:
`Rather than a doubling of performance, we see
`a halving of the price for constant computing power
`every 18 months. Because many foreseen applications
`have extremely tight cost constraints—for example,
`RFID in tetrapacks—over
`time, Moore’s law will
`increasingly enable such applications. Many applica-
`tions will process sensitive health-monitoring or bio-
`metric data, so the demand for cryptographic compo-
`nents that can be efficiently implemented is strong and
`growing. For such implementations, as well as for
`ciphers that are particularly suited for this purpose, we
`use the generic term lightweight cryptography in this
`article.
`Every designer of lightweight cryptography must
`cope with the trade-offs between security, cost, and
`performance. It’s generally easy to optimize any two of
`the three design goals—security and cost, security and
`performance, or cost and performance; however, it is
`
`522
`
`0740-7475/07/$25.00 G 2007 IEEE
`
`Copublished by the IEEE CS and the IEEE CASS
`
`IEEE Design & Test of Computers
`
`Authorized licensed use limited to: Ellie Daw. Downloaded on June 16,2020 at 15:31:38 UTC from IEEE Xplore. Restrictions apply.
`
`Juniper Ex. 1043-p. 1
`Juniper v Huawei
`
`
`
`consumption by multiplying the processing time by
`the average power consumption of the target device.
`Another distinction is between symmetric and
`asymmetric ciphers, because the latter offer more
`security functionality and therefore have different
`application scenarios. Symmetric ciphers serve mainly
`for message integrity checks, entity authentication,
`and encryption, whereas asymmetric ciphers addi-
`tionally provide key-management advantages and
`nonrepudiation. Asymmetric ciphers are computation-
`ally far more demanding,
`in both hardware and
`software. The performance gap on constrained
`devices such as 8-bit microcontrollers is huge. For
`example, an optimized asymmetric algorithm such as
`elliptic-curve cryptography (ECC) performs 100 to
`1,000 times more slowly than a standard symmetric
`cipher such as the Advanced Encryption Standard
`(AES) algorithm, which correlates with a two- to three-
`orders-of-magnitude higher power consumption. Un-
`like block ciphers, which are well investigated and
`understood,
`stream ciphers have received little
`attention from the scientific community. Although this
`has recently started to change, we cite stream ciphers
`only for comparison. (The increasing interest in stream
`ciphers is apparent in projects such as eStream, within
`the European Network of Excellence in Cryptography,
`which aims to foster knowledge about stream ciphers.)
`The ‘‘Related work’’ sidebar summarizes some recent
`developments in lightweight cryptography.
`
`Symmetric ciphers
`Many works on symmetric ciphers have been
`published during the past two decades. Because most
`main applications of symmetric ciphers use software
`implementations,
`it’s no surprise that nearly all
`algorithms—for example, the AES—have been de-
`veloped with good software performance in mind. The
`paradigm shift that we foresee will likely lead to an
`increasing demand for lightweight ciphers that per-
`form well in hardware. Therefore, we focus here on
`recently published works on ciphers that have been
`developed for minimal hardware requirements—
`namely, DESL1 and Present.2
`
`Hardware implementations of symmetric ciphers
`The only well-established cipher designed with
`a strong focus on low hardware cost is the Data
`Encryption Standard (DES). Comparing a standard
`one-round implementation of AES and DES, we find
`that the latter consumes only about 6% of the logic
`
`November–December 2007
`
`resources of AES and has a shorter critical path.
`However, researchers have described a low-power,
`low-cost AES implementation that requires only 3,400
`gate equivalents (GEs) and encrypts a plaintext within
`1,032 clock cycles.3 This impressive result seems to
`have achieved the limit in area minimization. This
`implementation, as well as the implementations of
`DESL and Present, features encryption only, because
`encryption is sufficient for many lightweight target
`applications, such as authentication with a challenge-
`response protocol.
`
`Inspired by the one-round implementation
`DES.
`results of AES and DES, we implemented a serialized
`version of DES that processes 4-bit and 6-bit data words
`rather than those with 32 bits and 48 bits. Our
`implementation requires 2,310 GEs and encrypts
`a plaintext within 144 clock cycles.1 To our
`knowledge,
`this
`is
`the smallest
`reported DES
`implementation, sacrificing throughput
`to achieve
`minimal area requirements. However, the 56-bit key
`limits the security provided. Brute-forcing this key
`space using software takes a few months and
`hundreds of PCs, but only a few days with a special-
`purpose machine such as Copacobana.4 Hence, this
`implementation is relevant only for applications
`needing short-term security or where the values
`protected are relatively low.
`In certain low-cost
`applications, such a security level is adequate. When
`a higher security level
`is needed, so-called key
`whitening, can be added to standard DES, yielding
`DESX. The key-whitening technique requires only two
`additional XOR gates: one gate to add a prewhitening
`key to the plaintext before the cipher processes it, and
`another to add a postwhitening key to the resulting
`ciphertext. In the case of DES, this enlarges the key
`space from 56 bits to 184 bits. However, because of
`time-memory trade-offs (birthday attack), the security
`level of DESX is bounded by 118 bits.
`
`DESL and DESXL. In our serialized DES implemen-
`tation, substitution boxes (S-boxes) take up approxi-
`mately 32% of the area. We can further decrease the
`gate complexity of DES by replacing the eight original
`S-boxes with a single new one, eliminating seven S-
`boxes as well as the multiplexer. This lightweight DES
`variant is called DESL and results in an approximately
`20% smaller chip than DES (1,850 GEs versus 2,310
`GEs). The S-box has been carefully selected and highly
`optimized, enabling DESL to resist common attacks
`
`Authorized licensed use limited to: Ellie Daw. Downloaded on June 16,2020 at 15:31:38 UTC from IEEE Xplore. Restrictions apply.
`
`523
`
`Juniper Ex. 1043-p. 2
`Juniper v Huawei
`
`
`
`Design and Test of ICs for Secure Embedded Computing
`
`Related work
`There are many recent symmetric ciphers—for
`and 8051 (the Chipcon CC1010 platform). They show
`example, Hight, Clefia, DESXL,1 and Present2—with
`that for the Atmel ATmega128 clocked at 8 MHz, a point
`special implementation properties proposed. Hight was
`multiplication using a 160-bit ECC GF(p) standard
`first presented at the 2006 Workshop on Cryptographic
`curve required 0.81 second. A security equivalent
`Hardware and Embedded Systems and was designed
`1,024-bit RSA encryption requires about 11 seconds.
`with good hardware performance in mind.
`In their
`Consequently, this article considers only ECC in the
`paper, the authors provide hardware figures for a one-
`asymmetric case. Although hyperelliptic curves hold
`round implementation—that is, one round is performed
`promise too, their lack of standardization makes them
`in one cycle—and they conclude that Hight is well-
`less promising at
`the moment. Optimum extension
`suited for ubiquitous computing devices such as
`fields (OEFs), which can be parameterized for small
`wireless sensor nodes and RFID tags. Their figures
`CPUs, offer an alternative method for
`fast ECC
`implementations.4 Several hardware implementations
`show that Hight requires approximately the same chip
`for standardized ECC have been suggested, but few
`size as the Advanced Encryption Standard (AES)
`are aimed at low-end devices. Most implementations
`algorithm (3,048 versus 3,400 gate equivalents, or
`focus on speed and, owing to their huge area
`GEs) but is much faster. However, figures for imple-
`requirements, are suitable mostly for server-end appli-
`mentations with a smaller footprint in hardware are not
`cations only.5
`yet available. Clefia was designed with a broader
`application range in mind—to perform well
`in both
`hardware and software implementations. Two ciphers
`especially optimized for software architectures are the
`Tiny Encryption Algorithm (TEA) family and the In-
`ternational Data Encryption Algorithm (IDEA). They
`consist only of arithmetic operations on 16-bit words
`(IDEA) and 32-bit words (TEA).
`IDEA consists of
`addition, XOR addition, and multiplication. The TEA
`family uses only addition, XOR addition, and shifts. In
`both cases, each operation can be implemented
`efficiently on 8-bit platforms. Neither cipher uses
`a substitution box (S-box), so they don’t need much
`memory. The Scalable Encryption Algorithm (SEA) can
`be parameterized according to processor size as well
`as plaintext size and key size; the goal
`is to enable
`efficient implementations on different platforms.
`With respect
`to asymmetric algorithms on small
`processors, Gura et al.3 make a key contribution,
`comparing RSA (Rivest, Shamir, Adleman) and elliptic-
`curve cryptography (ECC) on two different, commonly
`used 8-bit CPUs: AVR (the Atmel ATmega128 platform)
`
`References
`
`1. G. Leander et al., ‘‘New Lightweight DES Variants,’’ Proc.
`
`Fast Software Encryption (FSE 07), LNCS 4593, Springer-
`
`Verlag, 2007, pp. 196-210.
`
`2. A. Bogdanov et al., ‘‘PRESENT: An Ultra-Lightweight Block
`
`Cipher,’’ Proc. Workshop Cryptographic Hardware and
`
`Embedded Systems (CHES 07), LNCS 4727, Springer,
`
`2007, pp. 450-466.
`
`3. N. Gura et al., ‘‘Comparing Elliptic Curve Cryptography and
`
`RSA on 8-bit CPUs,’’ Proc. 6th Int’l Workshop Cryptograph-
`
`ic Hardware and Embedded Systems (CHES 04), Springer-
`
`Verlag, LNCS 3156, 2004, pp. 119-132.
`
`4. A. Woodbury, D.V. Bailey, and C. Paar,
`
`‘‘Elliptic Curve
`
`Cryptography on Smart Cards without Coprocessors,’’
`
`Proc. 4th Working IFIP Conf. Smart Card Research and
`
`Advanced Applications (CARDIS 2000), Kluwer Academic,
`
`2000, pp. 71-92.
`
`5. L. Batina et al., ‘‘Hardware Architectures for Public Key
`
`Cryptography Integration,’’ VLSI J., vol. 34, no. 6, 2003,
`
`pp. 1-64.
`
`such as linear and differential cryptanalysis and the
`Davies-Murphy attack. Thus, DESL achieves a security
`level
`appropriate
`for many
`applications. Key
`whitening can be applied to strengthen the cipher,
`yielding the DESXL cipher, with a security level of
`approximately 118 bits. DESXL requires 2,170 GEs and
`encrypts a plaintext within 144 clock cycles. (Further
`details are available elsewhere.1)
`
`Present. Besides efficiently implementing or slightly
`modifying an established cipher, an alternative for
`lightweight cryptography is to design a new hardware-
`optimized cipher from scratch. We followed this
`approach when designing Present, an SPN-based
`(substitution permutation network) block cipher
`with 32 rounds, a block size of 64 bits, and a key size
`of 80 or 128 bits. The main design philosophy was
`
`524
`
`IEEE Design & Test of Computers
`
`Authorized licensed use limited to: Ellie Daw. Downloaded on June 16,2020 at 15:31:38 UTC from IEEE Xplore. Restrictions apply.
`
`Juniper Ex. 1043-p. 3
`Juniper v Huawei
`
`
`
`Figure 1. Data path of the Present implementation. (S: substitution layer; P:
`
`permutation layer.) (Source: Bogdanov et al.2)
`
`simplicity: No part of the cipher was
`added without a good reason, such as
`thwarting an attack. Figure 1 depicts
`Present’s very simple design. Let’s take
`the round counter XOR in the key
`scheduling as an example of how to
`thwart a whole class of attacks with
`minimal
`area overhead. Hardware
`designers favor repetition because it
`lets them reuse parts of the chip and
`hence reduce chip size. However, if the
`key-update functions are similar in each
`round, this property can be exploited by
`related-key attacks. Among the possi-
`bilities for achieving deviating round functions, we
`chose the one that is most hardware efficient and
`simple: adding a round-dependent constant in the key
`scheduling. Because a round counter
`is needed
`anyway,
`this
`constitutes
`no
`additional
`area
`requirements; the XOR requires only 13 GEs.
`Present, like any other SPN, comprises three stages:
`a key-mixing step, a substitution layer, and a permuta-
`tion layer. For the key mixing, we chose a simple XOR
`because this operation can be implemented efficiently
`in both hardware and software. The key schedule
`consists essentially of a 61-bit rotation together with an
`S-box and a round counter. (Present-80 uses a single S-
`box, whereas Present-128 requires two S-boxes.) The
`substitution layer comprises 16 S-boxes with 4-bit
`inputs and 4-bit outputs (4 3 4). We decided to use
`similar S-boxes in both the data path and the key
`scheduling because we learned from DESL that this
`can result in significant area savings when a serialized
`implementation is desired. The choice of 4 3 4 rather
`than 8 3 8 S-boxes was also hardware driven; 8-bit
`S-boxes require about 40 times more area than 4-bit
`S-boxes (1,000 GEs versus 25 GEs). However, 4-bit
`S-boxes must be selected very carefully because they
`are cryptographically weaker
`than 8-bit S-boxes.
`Nevertheless, through careful selection, it’s possible
`to achieve an appropriate security level. The permu-
`tation layer is a very regular and simple bit trans-
`position. It comes virtually free in hardware because it
`is realized by simple wiring and hence needs no
`transistors. The permutation layer ensures that an S-
`box’s four output bits will be distributed to four distinct
`S-boxes in the following round, which ensures the
`avalanche effect. This is required to thwart linear and
`differential cryptanalyses. (Further details are available
`elsewhere.2)
`
`Table 1 compares the implementation results of
`various lightweight ciphers.
`
`Software implementations of symmetric ciphers
`The many design choices within the AES reflect the
`wide discussion of efficient symmetric cryptography
`for software implementations. Yet,
`for constrained
`devices in particular, designers must take into account
`the target platform’s special properties when choosing
`cryptographic algorithms.
`In many areas where cost and energy considera-
`tions dominate, computational power comes in the
`form of a small, inexpensive CPU. By a wide margin,
`8-bit controllers have the largest share of the worldwide
`CPU market. These small microcontrollers are con-
`strained in program memory (flash or ROM), RAM,
`clock speed, register width, and arithmetic capabilities.
`In this context, efficiency means more than simply
`throughput: Resources needed to implement a cipher
`should be kept small.
`In fact,
`in many situations,
`resource efficiency (measured mainly by memory
`consumption) is more critical than throughput, espe-
`cially because many embedded applications encrypt
`only small payloads. Typically, these 8-bit microcon-
`trollers offer as little as tens of kilobytes of program
`memory, and sometimes less than 1 Kbyte of SRAM;
`they usually operate at clock speeds of a few MHz.
`Nevertheless, especially for battery-powered de-
`vices, low computational complexity can be of great
`value too because processing time directly correlates
`with power consumption. Modern microcontrollers
`can enter a variety of power-down and power-saving
`modes as soon as they have finished computation.
`Hence, a fast-executing algorithm can reduce energy
`consumption and lengthen the lifetime of a battery-
`powered device.
`
`November–December 2007
`
`525
`
`Authorized licensed use limited to: Ellie Daw. Downloaded on June 16,2020 at 15:31:38 UTC from IEEE Xplore. Restrictions apply.
`
`Juniper Ex. 1043-p. 4
`Juniper v Huawei
`
`
`
`Design and Test of ICs for Secure Embedded Computing
`
`Table 1. Comparison of lightweight ciphers.
`
`Key
`
`bits
`
`80
`
`128
`
`128
`
`128
`
`96
`
`56
`
`184
`
`Cipher
`
`Block ciphers
`
`Present
`
`AES
`
`Hight
`
`Clefia
`
`mCrypton
`
`DES
`
`DESXL
`
`Stream ciphers
`
`Block
`
`bits
`
`Cycles per
`
`Throughput at
`
`Logic
`
`block
`
`100 kHz (Kbps)
`
`process
`
`64
`
`128
`
`64
`
`128
`
`64
`
`64
`
`64
`
`32
`
`1,032
`
`34
`
`36
`
`13
`
`144
`
`144
`
`200.00
`
`12.40
`
`188.20
`
`355.56
`
`492.30
`
`44.40
`
`44.40
`
`0.18 mm
`0.35 mm
`0.25 mm
`0.09 mm
`0.13 mm
`0.18 mm
`0.18 mm
`
`Area
`
`(GEs)
`
`1,570
`
`3,400
`
`3,048
`
`4,993
`
`2,681
`
`2,309
`
`2,168
`
`Trivium5
`
`80
`
`1
`
`0.13 mm
`0.13 mm
`Grain5
`1,294
`100.00
`1
`1
`80
`*AES: Advanced Encryption Standard; DES: Data Encryption Standard; DESXL: lightweight DES with key whitening.
`
`1
`
`100.00
`
`2,599
`
`In this context, we compare the previously
`discussed ciphers that are primarily optimized for
`hardware with several software-friendly ciphers. For
`comparison, we added two software-oriented stream
`ciphers of the eStream project: Salsa20 and LEX.6 The
`latter is a modified AES in which several bytes of the
`intermediate states are extracted and used as a key
`stream. Salsa20, like the Tiny Encryption Algorithm
`(TEA) and the International Data Encryption Algo-
`rithm (IDEA), is based on simple arithmetic opera-
`tions. We chose these ciphers because their consump-
`tion of ROM and RAM is suited for small embedded
`processors. Stream ciphers usually have lengthy setup
`phases. LEX needs only one AES encryption for setup,
`and Salsa20’s setup phase is even shorter. Hence, these
`ciphers can provide efficient encryption of the small
`payloads often found in embedded systems. All the
`discussed ciphers were implemented for 8-bit AVR
`microcontrollers. AVRs are a popular family of 8-bit
`RISC microcontrollers. The ATmega family offers 8
`Kbytes to 128 Kbytes of flash memory and 1 Kbyte to 8
`Kbytes of SRAM. The devices of the ATmega series
`have 32 general-purpose registers with a word size of 8
`bits. Most of the microcontrollers’ 130 instructions are
`one cycle, and the microcontrollers can be clocked at
`up to 16 MHz.
`To keep the source code small, we used a straight-
`forward approach for all our software implementa-
`tions. Only the substitution tables are realized as
`lookup tables (LUTs), where applicable, because this
`provides an enormous speedup for reasonable mem-
`ory consumption. Many fast software implementations
`of ciphers use larger LUTs to achieve a higher
`
`throughput. Unfortunately, this leads to an unaccept-
`able increase in code size for many embedded
`applications. The LUTs are stored in the program
`memory (ROM). TEA, IDEA, and Salsa20 do not use
`substitution tables, which allows for a smaller code.
`For the inversion needed in IDEA, we used a slow but
`extremely small algorithm. This explains the small
`code as well as the huge discrepancy between
`encryption and decryption time.
`in
`The results of our implementations appear
`Table 2. As expected, the software-oriented ciphers
`perform better on our platform. We had problems
`decreasing the code size of Hight, but it still shows
`good encryption performance. Although Present
`shows poor performance, its code, along with that of
`IDEA, is extremely small. LEX is a modified AES cipher,
`yet its code is smaller than that of AES because it lacks
`the decryption part.
`Considering the trade-offs between security, cost,
`and performance,
`the stream ciphers seem to be
`a good choice. LEX and Salsa20 do well in both
`throughput and size, yet they are good choices only if
`the encrypted payload is sufficiently large. Otherwise,
`they produce a computational overhead because of
`their huge block length and their setup phase.
`When code size is extremely critical, TEA, IDEA,
`and even Present seem to be reasonable choices. For
`most other cases, AES again shows its strength in
`software.
`
`Asymmetric ciphers
`there are three
`Among public-key algorithms,
`established families of practical relevance: ECC, RSA
`
`526
`
`IEEE Design & Test of Computers
`
`Authorized licensed use limited to: Ellie Daw. Downloaded on June 16,2020 at 15:31:38 UTC from IEEE Xplore. Restrictions apply.
`
`Juniper Ex. 1043-p. 5
`Juniper v Huawei
`
`
`
`Table 2. Comparison of software implementations of ciphers.
`
`Block
`
`Encryption
`
`Throughput
`
`Decryption
`
`Relative
`
`(cycles/
`
`at 4 MHz
`
`(cycles/
`
`throughput
`
`Code
`
`size
`
`SRAM
`
`Relative
`
`size
`
`code size
`
`Key
`
`size
`
`Cipher
`
`(bits)
`
`size
`
`(bits)
`
`block)
`
`(Kbps)
`
`block)
`
`(% of AES)
`
`(bytes)
`
`(bytes)
`
`(% of AES)
`
`Hardware-oriented block ciphers
`
`DES
`
`DESXL
`
`Hight
`
`56
`
`184
`
`128
`
`64
`
`64
`
`64
`
`8,633
`
`8,531
`
`2,964
`
`29.6
`
`30.4
`
`80.3
`
`23.7
`
`8,154
`
`7,961
`
`2,964
`
`11,239
`
`38.4
`
`39.4
`
`104.2
`
`30.7
`
`4,314
`
`3,192
`
`5,672
`
`936
`
`0
`
`0
`
`0
`
`0
`
`152.4
`
`112.8
`
`200.4
`
`33.1
`
`Present
`
`80
`
`64
`
`10,723
`
`Software-oriented block ciphers
`
`AES
`
`IDEA
`
`TEA
`
`SEA
`
`128
`
`128
`
`128
`
`96
`
`128
`
`64
`
`64
`
`96
`
`Software-oriented stream ciphers
`
`6,637
`
`2,700
`
`6,271
`
`9,654
`
`77.1
`
`94.8
`
`40.8
`
`39.7
`
`7,429
`
`15,393
`
`6,299
`
`9,654
`
`100.0
`
`123.0
`
`53.0
`
`51.5
`
`2,606
`
`596
`
`1,140
`
`2,132
`
`224
`
`0
`
`0
`
`0
`
`Salsa20
`
`128
`
`512
`
`18,400
`
`111.3
`
`NA
`
`144.4
`
`1,452
`
`280
`
`304
`1,598
`287.3
`NA
`214.6
`5,963
`320
`128
`LEX
`*IDEA: International Data Encryption Algorithm; TEA: Tiny Encryption Algorithm; SEA: Scalable Encryption Algorithm.
`
`100.0
`
`21.1
`
`40.3
`
`75.3
`
`61.2
`
`67.2
`
`(Rivest-Shamir-Adleman), and discrete logarithms.
`ECC is considered the most attractive family for
`embedded environments because of
`its
`smaller
`operand lengths and relatively lower computational
`requirements. ECC has been accepted commercially
`and has also been adopted by standardizing bodies
`such as the American National Standards Institute
`(ANSI), the IEEE, the International Organization for
`Standardization (ISO),
`the Standards for Efficient
`Cryptography Group (SECG), and the National In-
`stitute of Standards and Technology (NIST).
`
`A lightweight elliptic-curve engine
`Interest
`is growing in stand-alone asymmetric
`cipher engines in small, constrained devices for
`applications such as sensor networks and contactless
`smart cards (for example, e-passports). This interest is
`normally dictated by the need for better performance
`to satisfy a communication protocol or energy
`constraints.
`Here, we present a hardware implementation of
`a low-area, stand-alone, public-key processor
`for
`standardized ECC curves, details of which can be
`found elsewhere.7 We sacrifice flexibility to save area
`by setting the design to fit a specific standardized
`binary-field curve that is quite reasonable for con-
`strained devices. Standardized binary fields that pro-
`vide short-term security (113 bits), as well as fields that
`are required for medium-term security applications
`
`(193 bits), are implemented. For some constrained
`applications, 113-bit fields provide adequate security.
`The main reason for choosing a binary field rather
`than a prime field is the carry-free arithmetic, which is
`well-suited for hardware implementations. A second
`reason is the simplified squaring structure, which is
`a central idea used in the algorithms chosen for the
`processor design.
`
`Inversion. Itoh and Tsujii proposed the construction
`of an addition chain such that the inversion could be
`performed in O(log m) multiplications.8 Although the
`algorithm was proposed for optimal normal-basis
`implementations, where squarings are almost
`free
`(cyclic rotations),
`the area requirement
`for
`the
`squaring structure in our implementation is within
`bounds but has the same timing efficiency of 1 clock
`cycle as in the normal basis. Our algorithm exploits the
`fact that squaring is very efficient in the standard basis
`as long as the field is fixed. It’s easy to show that the
`inverse A21 can then be obtained in (tlog2(m 2 1)s +
`Hw(m 2 1) 2 1) multiplications and (m 2 1)
`squarings using this addition chain, where Hw
`denotes
`the Hamming weight of
`the binary
`representation.
`
`Point multiplication. We used a modified version of
`the Montgomery algorithm for implementing the
`point multiplication. We require one inversion and
`
`November–December 2007
`
`527
`
`Authorized licensed use limited to: Ellie Daw. Downloaded on June 16,2020 at 15:31:38 UTC from IEEE Xplore. Restrictions apply.
`
`Juniper Ex. 1043-p. 6
`Juniper v Huawei
`
`
`
`Design and Test of ICs for Secure Embedded Computing
`
`Figure 2. Area-optimized GF(2m) elliptic-curve cryptography (ECC) processor.
`
`four multiplications in each iteration. The algorithm
`also lets us compute each iteration without requiring
`extra temporary memory locations, thus reducing
`area.
`
`ECC processor design. The overall design appears
`in Figure 2. The three units—GF(2 m) addition; GF(2 m)
`multiplication,
`implemented as an MSB-first (most-
`significant-bit) multiplier; and GF(2 m) squaring—are
`closely interconnected inside a single arithmetic unit
`sharing the common input data bus A. The adder
`needs an additional data bus B for the second operand,
`and the multiplier requires a single-bit bi signal for the
`multiplicand. The operands are stored in the memory
`as registers (some of them as cyclic registers) with the
`output being selected for A, B, and bi using
`multiplexers with control signals (Asel, Bsel, and bi_sel)
`from the controller. All
`the operand registers are
`connected in parallel
`to data bus C, with the
`appropriate register being loaded on the basis of the
`controller load signal Cld_reg.
`requires no additional
`Inversion, as described,
`hardware apart from the preexisting multiplying unit
`and squaring unit, with some additional control
`circuitry to enable loading the proper variables to
`the appropriate unit.
`
`The implementation was synthesized for a custom
`ASIC design using AMI Semiconductor 0.35-micron
`CMOS technology. The designs have a total area
`ranging from 10 K GEs for a 113-bit field for short-term
`security to 18 K GEs for a 193-bit field for medium-term
`security applications. This implementation shows that
`an extremely small-area implementation of an ECC
`processor is possible in affine coordinates. Table 3
`shows the ECC processor’s total area in GEs and
`latency in clock cycles for a single scalar multiplica-
`tion and compares them with those of other imple-
`mentations.9-12
`
`Hardware-software codesign for ECC
`An ECC coprocessor, as we’ve defined it, can be
`small; it can also be prohibitively expensive for many
`pervasive applications and can be capable of perfor-
`mance that those applications don’t need. Hardware
`assistance in the form of instruction set extensions
`(ISEs) is more favorable in such situations because the
`cost of extra hardware is quite low compared with that
`of a coprocessor. Here, we present an efficient ISE
`implementation for ECC that
`is a tightly coupled
`hardware and software codesign. As a first step, we
`used a software-only ECC implementation to identify
`the functional elements and code segments that would
`
`528
`
`IEEE Design & Test of Computers
`
`Authorized licensed use limited to: Ellie Daw. Downloaded on June 16,2020 at 15:31:38 UTC from IEEE Xplore. Restrictions apply.
`
`Juniper Ex. 1043-p. 7
`Juniper v Huawei
`
`
`
`Table 3. ECC processor performance for scalar multiplication.
`
`Total area
`
`Technology
`
`Frequency
`
`Source
`
`This work
`
`Batina et al.9
`
`Gaubatz et al.10
`
`Wolkerstorfer11
`O¨ tztu¨ rk et al.12
`
`Field
`GF (2113)
`GF (2131)
`GF (2163)
`GF (2193)
`GF (267)2
`GF (2131)
`GF (p100)
`GF (2191)
`GF (p166)
`
`(GEs)
`
`10,113
`
`11,970
`
`15,094
`
`17,723
`
`12,944
`
`14,735
`
`18,720
`
`23,000
`
`30,333
`
`(mm)
`
`0.35
`
`0.35
`
`0.35
`
`0.35
`
`0.25
`
`0.25
`
`0.13
`
`0.35
`
`0.13
`
`(MHz)
`
`13.560
`
`13.560
`
`13.560
`
`13.560
`
`0.175
`
`0.175
`
`0.500
`
`68.500
`
`20.000
`
`Time
`
`(ms)
`
`14.4
`
`18.0
`
`31.8
`
`41.7
`
`2,390.0
`
`430.0
`
`410.5
`
`6.7
`
`31.9
`
`provide efficiency gains if implemented as an ISE.
`Then, a hardware model of
`the new processor
`determined the effects of the new extension on such
`parameters as execution time, code size, and data
`RAM usage.
`We used the AT94K family of field-programmable,
`system-level
`ICs as a development platform. This
`architecture integrates an AVR 8-bit microcontroller
`core and FPGA resources on a single chip. We chose
`the standardized 163-bit elliptic curve over GF(2 m), as
`recommended by NIST and ANSI. We used scalar
`point multiplication over this curve for determining the
`benefits of the ISE.
`The pure software implementation was done first.
`Table 4 includes the point arithmetic performance; it
`shows that GF(2 m) multiplication is the most costly
`operation with respect to execution time and memory
`requirement. Moreover,
`in the point multiplication
`algorithms,
`field multiplications are extremely fre-
`quent and therefore constitute the bottleneck opera-
`
`tion for ECC. A closer analysis of the multiplication
`block shows that most of the time was spent for load
`and store operations because the small number of
`registers available in the AVR processor could not hold
`the large operands. Therefore, a hardware extension
`for this functional block would potentially reduce the
`memory bottleneck and speed up the ECC.
`We present a complete GF(2163) multiplier as an ISE
`requiring the minimum possible area. We implemen-
`ted two multiplier architectures that provide a trade-off
`between performance and the extra area requirement.
`is a 163 3 163 least-significant-bit-first
`The first
`multiplier. The multiplier requires 163 AND gates, 167
`XOR gates, and 489 flip-flops. A 1633 163 multiplica-
`tion computes in 163 clock cycles, excluding data
`input and output. In our implementation, overhead
`from control and memory access leads to a total
`execution time of 313 clock cycles.
`An additional trade-off between area and speed is
`possible using the second option, a digit-serial multipli-
`
`Table 4. ECC scalar point multiplication performance at 4 MHz.
`
`Time
`
`Code size
`
`Data RAM
`
`Precomputation
`
`Field multiplier
`
`logic blocks
`
`multiplier
`
`(s)
`
`(bytes)
`
`(bytes)
`
`(bytes)
`
`Combinational
`
`Point
`
`Software multiplier
`
`163 3 163 multiplier
`
`245
`
`163 3 163, 4 digits
`
`498
`
`*NAF: nonadjacent form.
`
`Binary
`
`NAF
`
`Montgomery
`
`Binary
`
`NAF
`
`Montgomery
`
`Binary
`
`NAF
`
`Montgomery
`
`6.039
`
`5.031
`
`4.140
`
`0.264
`
`0.224
`
`0.169
`
`0.183
`
`0.115
`
`0.113
`
`10,170
`
`10,654
`
`8,208
`
`2,936
`
`3,014
`
`2,048
`
`2,936
`
`3,014
`
`2,048
`
`379
`
`379
`
`358
`
`294
`
`294
`
`273
`
`294
`
`294
`
`273
`
`544
`
`0
`
`0
`
`November–December 2007
`
`529
`
`Authorized licensed use limited to: Ellie Daw. Downloaded on June 16,2020 at 15:31:38 UTC from IEEE Xplore. Restrictions apply.
`
`Juniper Ex. 1043-p. 8
`Juniper v Huawei
`
`
`
`Design and Test of ICs for Secure Embedded Computing
`
`Table 5. Performance of ECC scalar multiplication in software.
`
`Finite-field multiplication
`
`Binary multiplication, 160-bit prime field14
`
`Binary multiplication, 160-bit prime field13
`
`Binary multiplication, 134-bit OEF15
`
`Window multiplication, 134-bit OEF15
`*OEF: optimum extension field.
`
`Platform
`
`ATmega128L
`
`ATmega128L
`
`8,051
`
`8,051
`
`Time (ms)
`
`Time (103 cycles)
`
`810
`
`1,040
`
`8,370
`
`1,830
`
`6,480
`
`8,383
`
`100,440
`
`21,960
`
`the
`In a bit-serial multiplier, only one bit of
`er.
`multiplicand is used in each iteration; however, in
`a digit-serial multiplier, multiple