Copyright (c) 1996 IEEE. See full copyright notice at the end of this paper.
Parallelized Network Security Protocols

Erich Nahum, David J. Yates, Sean O’Malley, Hilarie Orman, and Richard Schroeppel

Erich Nahum and David J. Yates, Department of Computer Science, University of Massachusetts, Amherst, MA 01003 ({nahum,yates}@cs.umass.edu)

Sean O’Malley, Hilarie Orman, and Richard Schroeppel, Department of Computer Science, University of Arizona, Tucson, AZ 85721 ({sean,ho,rcs}@cs.arizona.edu)
Abstract

Security and privacy are growing concerns in the Internet community, due to the Internet’s rapid growth and the desire to conduct business over it safely. This desire has led to the advent of several proposals for security standards, such as secure IP, secure HTTP, and the Secure Socket Layer. All of these standards propose using cryptographic protocols such as DES and RSA. Thus, the need to use encryption protocols is increasing.

Shared-memory multiprocessors make attractive server platforms, for example as secure World-Wide Web servers. These machines are becoming more common, as shown by recent vendor introductions of platforms such as SGI’s Challenge, Sun’s SPARCCenter, and DEC’s AlphaServer. The spread of these machines is due both to their relative ease of programming and their good price/performance.

This paper is an experimental performance study that examines how encryption protocol performance can be improved by using parallelism. We show linear speedup for several different Internet-based cryptographic protocol stacks running on a symmetric shared-memory multiprocessor using two different approaches to parallelism.
1 Introduction

Security and privacy are growing concerns in the Internet community, due to the Internet’s rapid growth and the desire to conduct business over it safely. This desire has led to the advent of several proposals for security standards, such as secure IP [2], Secure HTTP (SHTTP) [30], and the Secure Socket Layer (SSL) [14]. Thus, the need to use encryption protocols such as DES and RSA is increasing.

(This research was supported in part by NSF under grant NCR-9206908, and by ARPA under contract F19628-92-C-0089. Erich Nahum is supported by a Computer Measurement Group Fellowship. David Yates is the recipient of a Motorola Codex University Partnership in Research Grant. This research was also supported in part by ARPA under contract DABT63-94-C-0002, and by NCSC under contract MDA 904-94-C-6110.)
One problem with using cryptographic protocols is the fact that they are slow. An important question then is whether security can be provided at gigabit speeds. The standard set of algorithms required to secure a connection includes a bulk encryption algorithm such as DES [1], a cryptographic message digest such as MD5 [31], a key exchange algorithm such as Diffie-Hellman [7] to securely distribute a private key, and some form of digital signature algorithm to authenticate the parties, such as RSA [32]. The encryption and hash digest algorithms must be applied to every packet going across a link to ensure confidentiality, and therefore the performance of these algorithms directly affects the achievable throughput of an application. Furthermore, there are some services, such as strong sender authentication in scalable multicast algorithms, that currently require the use of expensive algorithms such as RSA signatures on every packet. For non-multicast services, the most expensive algorithms, RSA and Diffie-Hellman, need only be run at connection set-up time. They only affect the overall bandwidth if the connections are short, the algorithms are particularly expensive, and/or the overhead of using these algorithms on busy servers reduces overall network performance.
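To make the per-packet cost concrete, the send-side fast path of a confidential, authenticated connection applies the digest and the bulk cipher to every byte of every packet, roughly as in the following sketch. The helper names (md5_digest, des_cbc_encrypt, net_send) are hypothetical stand-ins, not an API taken from any of the implementations discussed here:

    /* Sketch: per-packet cryptographic work on a secure send path.
     * Hypothetical helper interfaces; a real stack supplies its own
     * cipher, digest, and link-layer entry points. */
    #include <stddef.h>
    #include <stdint.h>

    void md5_digest(const uint8_t *buf, size_t len, uint8_t out[16]);
    void des_cbc_encrypt(const uint8_t key[8], uint8_t iv[8],
                         uint8_t *buf, size_t len);
    int  net_send(const uint8_t *buf, size_t len);

    int secure_send(const uint8_t key[8], uint8_t iv[8],
                    uint8_t *pkt, size_t len)
    {
        uint8_t mac[16];

        /* Both passes below touch every byte of the packet, so their
         * cost scales with bandwidth, unlike RSA or Diffie-Hellman,
         * which run only at connection setup. */
        md5_digest(pkt, len, mac);           /* integrity/authentication */
        des_cbc_encrypt(key, iv, pkt, len);  /* confidentiality */

        if (net_send(pkt, len) < 0)          /* ciphertext, then digest */
            return -1;
        return net_send(mac, sizeof mac);
    }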
A straightforward approach to improving cryptographic performance is to implement cryptographic algorithms in hardware. This approach has been shown to improve cryptographic performance of single algorithms (e.g., DEC has demonstrated a 1 Gbit/sec DES chip [8]). Unfortunately, there are several problems with this approach that indicate why one would desire to do cryptography in software:

Variability. Systems need to support suites of cryptographic protocols, not just DES. For example, both SSL and SHTTP support using RSA, DES, Triple-DES, MD5, RC4, and IDEA. Hardware support for all of these is unlikely.

Flexibility. Standards change over time, and this is one reason why the IP security architecture and proposed key exchange schemes are designed to handle multiple types of authentication and confidentiality algorithms.

Security. Algorithms can be broken, such as knapsack crypto-systems, 129-digit RSA, and 192-bit DHKX. It is easier to replace software cryptographic algorithms than hardware.

Cost. Custom hardware is not cheap, ubiquitous, or exportable.

Performance. Certain algorithms, such as MD5 and IDEA, are designed to run quickly in software on current microprocessor architectures. Specialized hardware for them may not improve performance much. An analysis of this is given for MD5 in [37].
Many approaches are available for improving software cryptographic support [23], including improved algorithm design and algorithm-independent hardware support. In this paper, we focus on how parallelism can improve software cryptographic performance.

A common approach to parallelism is through symmetric shared-memory multiprocessors. These machines are becoming more common, as shown by recent vendor introductions of machines such as SGI’s Challenge [11], Sun’s SPARCCenter [9], and DEC’s AlphaServer [10]. The spread of these machines is due to a number of factors: binary compatibility with lower-end workstations, good price/performance relative to high-end machines such as Crays, and their ease of programming compared to more elaborate parallel machines such as Hypercubes. In addition, these machines make attractive server platforms, for example as secure HTTP (World-Wide Web) servers.

In this paper we demonstrate that parallelism is an effective vehicle for improving software cryptographic performance. We show linear speedup results for several different Internet-based cryptographic protocol stacks using two different approaches to parallelism. Our implementation consists of parallelized versions of the x-kernel [15] that run in user space on Silicon Graphics shared-memory multiprocessors. We give performance results on a 12-processor 100 MHz MIPS R4400 SGI Challenge XL.

The remainder of the paper is organized as follows: In Section 2, we discuss several approaches to parallelism. Section 3 describes our implementation and experiments. In Section 4 we present our results in detail. Section 5 discusses related issues. In Section 6 we summarize our results.
2 Parallelism in Network Protocol Processing

Parallelism can take many forms in network protocol processing. Many approaches to parallelism have been proposed and are briefly described here; more detailed surveys can be found in [3, 13]. In general, we attempt to classify approaches by the unit of concurrency, or what it is that processing elements do in parallel. Here a processing element is a locus of execution for protocol processing, and can be a dedicated processor, a heavyweight process, or a lightweight thread. Figure 1 illustrates the various approaches to concurrency in host protocol processing. The dashed ovals represent processing elements. Messages marked with a number indicate which connection the message is associated with.

[Figure 1: Approaches to Concurrency. The figure shows layer parallelism, connection-level parallelism, and packet-level parallelism on a TCP/IP/FDDI stack; dashed ovals mark processing elements, and packets are numbered by the connection they belong to.]
In layered parallelism, each protocol layer is a unit of concurrency, where a protocol layer is defined by the ISO reference model. Specific layers are assigned to processing elements, and messages are passed between protocols through interprocess communication. The main advantage of layered parallelism is that it is simple and defines a clean separation between protocol boundaries. The disadvantages are that concurrency is limited to the number of layers in the stack, and that associating processing with layers results in increased context switching and synchronization between layers [5, 6, 34]. Performance gains are limited to throughput, mainly achieved through pipelining effects. An example is found in [12].
Connections form the unit of concurrency in connection-level parallelism, where connections are assigned to processing elements. Speedup is achieved using multiple connections, each of which is processed in parallel. The advantage of this approach is that it exploits the natural concurrency between connections, and only requires one lock acquisition, i.e., for the appropriate connection. Thus, locking is kept to a minimum along the "fast path" of data transfer. The disadvantage with connection-level parallelism is that no concurrency within a single connection can be achieved, which may be a problem if traffic exhibits locality [16, 19, 22, 27], i.e., is bursty. Systems using this approach include [29, 33, 34, 38].
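On the fast path this reduces to a single lock acquisition per packet; a minimal sketch, with illustrative types standing in for real protocol state:

    /* Sketch: connection-level parallelism. Each connection is owned
     * by one processing element, so per-packet processing takes at
     * most the one per-connection lock. Types and helpers are
     * illustrative, not from any of the cited systems. */
    #include <pthread.h>

    struct packet;                      /* opaque message */
    struct connection {
        pthread_mutex_t lock;           /* the only fast-path lock */
        /* ... TCP state: sequence numbers, windows, timers ... */
    };

    /* hypothetical whole-stack routine (link layer -> IP -> TCP) */
    void process_stack(struct connection *c, struct packet *p);

    void clp_input(struct connection *c, struct packet *p)
    {
        /* with one thread per connection this lock is effectively
         * uncontended: no other thread touches this connection */
        pthread_mutex_lock(&c->lock);
        process_stack(c, p);
        pthread_mutex_unlock(&c->lock);
    }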
In packet-level parallelism, packets are the unit of concurrency. Sometimes referred to as thread-per-packet or processor-per-message, packet-level parallelism assigns each packet or message to a single processing element. The advantage of this approach is that packets are processed regardless of their connection or where they are in the protocol stack, achieving speedup both with multiple connections and within a single connection. The disadvantage is that it requires locking shared state, most significantly the protocol state at each layer. Systems using this approach include [3, 13].
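A corresponding sketch of the packet-level fast path, again with illustrative names; note the extra lock traffic on shared state compared with the connection-level sketch above:

    /* Sketch: packet-level parallelism. Any worker thread may pick up
     * any packet, so shared protocol state (the demultiplexing table,
     * the TCP connection block) must be locked at each layer. */
    #include <pthread.h>

    struct packet;                              /* opaque message */
    struct tcp_conn { pthread_mutex_t lock; /* ... TCP state ... */ };

    /* hypothetical per-layer routines and shared demux-table lock */
    extern pthread_mutex_t demux_lock;
    void ip_input(struct packet *p);
    struct tcp_conn *tcp_demux(struct packet *p);
    void tcp_input(struct tcp_conn *c, struct packet *p);

    void plp_input(struct packet *p)            /* run by any thread */
    {
        struct tcp_conn *c;

        ip_input(p);                   /* per-packet work, no conn lock */

        pthread_mutex_lock(&demux_lock);        /* shared demux state */
        c = tcp_demux(p);
        pthread_mutex_unlock(&demux_lock);

        pthread_mutex_lock(&c->lock);  /* serialize TCP state updates */
        tcp_input(c, p);
        pthread_mutex_unlock(&c->lock);
    }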
In functional parallelism, a protocol layer’s functions are the unit of concurrency. Functions within a single protocol layer (e.g., checksum, ACK generation) are decomposed, and each assigned to a processing element. The advantage to this approach is that it is relatively fine-grained, and thus can improve latency as well as throughput. The disadvantage is that it requires synchronizing within a protocol layer, and is dependent upon the concurrency available between the functions of a particular layer. Examples include [17, 28, 25].
In data-level parallelism, the pieces of data are the units of concurrency, analogous to SIMD processing. Processing elements are assigned to the same function of a particular layer, but perform processing on separate pieces of data from the same message. An example would be computing a single message’s checksum using multiple processors. The advantage to this approach is that it is the most fine-grained, and thus has the potential for the greatest improvement in both throughput and latency. The disadvantage is that processing elements must synchronize, which may be expensive. We are unaware of any work using this approach.
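As an illustration of what such a scheme could look like (our construction, not a published system): the 16-bit one’s-complement Internet checksum is associative, so disjoint pieces of one message can be summed by different threads and the partial sums folded together afterwards. The sketch assumes an even number of 16-bit words; packet-sized buffers keep the 32-bit accumulators far from overflow.

    /* Sketch: data-level parallelism applied to the Internet checksum.
     * Two threads sum disjoint halves of one message; the partial sums
     * are folded together at the end. */
    #include <pthread.h>
    #include <stddef.h>
    #include <stdint.h>

    struct chunk { const uint16_t *words; size_t n; uint32_t sum; };

    static void *partial_sum(void *arg)
    {
        struct chunk *c = arg;
        uint32_t s = 0;
        for (size_t i = 0; i < c->n; i++)
            s += c->words[i];           /* defer carry folding */
        c->sum = s;
        return NULL;
    }

    uint16_t parallel_cksum(const uint16_t *words, size_t n)
    {
        pthread_t t;
        struct chunk lo = { words,         n / 2,     0 };
        struct chunk hi = { words + n / 2, n - n / 2, 0 };

        pthread_create(&t, NULL, partial_sum, &hi); /* second half */
        partial_sum(&lo);                           /* first half here */
        pthread_join(t, NULL);

        uint32_t s = lo.sum + hi.sum;
        while (s >> 16)                 /* fold carries back in */
            s = (s & 0xffff) + (s >> 16);
        return (uint16_t)~s;
    }

The synchronization cost here (one thread create/join, or in a real system a barrier) is exactly the overhead that makes this approach attractive only when per-byte processing is expensive.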
The relative merits of one approach over another depend on many factors, including the host architecture, the cost of primitives such as locking and context switching, the workload and number of connections, the thread scheduling policies employed, and whether the implementations are in hardware or software. Most importantly, they depend on the available concurrency within a protocol stack.

The most comprehensive study to date comparing different approaches to parallelism on a shared-memory multiprocessor is by Schmidt and Suda [34, 35]. They show that packet-level parallelism and connection-level parallelism generally perform better than layer parallelism, due to the context-switching overhead incurred crossing protocol boundaries using layer parallelism. In [35], they suggest that packet-level parallelism is preferable when the workload is a relatively small number of active connections, and that connection-level parallelism is preferable for large numbers of connections.

In this paper, we restrict our focus to connection-level and packet-level parallel approaches to network cryptographic protocols on shared-memory multiprocessors. More fine-grained approaches to parallelized security protocols, such as functional parallelism and data-level parallelism, are outside the scope of this work. We do discuss them in more depth in Section 5.
3 Implementation and Experiments

In this section we describe our implementation and experimental environment. Our implementation consists of parallelized versions of the x-kernel [15] extended for packet-level parallelism [24] and connection-level parallelism [38]. Our parallel implementations run in user space on Silicon Graphics shared-memory multiprocessors using the IRIX operating system.

In this study, we examine the impact of cryptographic protocols under the respective paradigms, and show how parallelism improves cryptographic performance. The protocols we examine are those used in typical secure Internet scenarios. We briefly describe them here.
3.1 Protocols

The protocols used in our experiments are those that would be seen along the common case or “fast path” during data transfer of an application such as a secure World-Wide Web server. We focus on available throughput; we do not examine connection setup or teardown, or the attendant issues of key exchange. In these experiments, connections are already established, and keys are assumed to be available as needed.

We use the terminology defined by the proposed secure IP standard [2]. Authentication is the property of knowing that the data received is the same as the data sent by the sender, and that the claimed sender is in fact the actual sender. Integrity is the property that the data is transmitted from source to destination without undetected alteration. Confidentiality is the property that the intended participants know what data was sent, but any unintended parties do not. Encryption is typically used to provide confidentiality.

TCP is the Transmission Control Protocol used by reliable Internet applications such as file transfer, remote login, and HTTP, the protocol used for retrieving hypertext documents in the World-Wide Web. IP is the network-layer protocol that performs routing of messages over the Internet. FDDI is the Fiber Distributed Data Interface, a fiber-optic token-ring based LAN protocol.
MD5 is a message digest algorithm used for authentication and message integrity. MD5 is a “required option” for secure IP; by required option we mean that an application’s use of MD5 in IP is optional, but an implementation must support use of that option. MD5 is also the default message digest algorithm proposed for SHTTP, and is also used in SSL. In our implementation we use the standard MD5 message digest calculation, rather than the keyed one. DES is the ANSI Data Encryption Standard, used for confidentiality, and is one of the required protocols used in secure IP, secure HTTP, and SSL. We use DES in cipher-block-chaining (CBC) mode. 3-DES is "triple DES," which runs the DES algorithm 3 times over the message. DES experiments can produce unrealistically good times on empty data buffers, since only a fraction of the 64KB Sbox is exercised (we use a double-Sbox implementation in our tests), which may lead to artificially good cache hit rates. To guard against this, we fill each buffer at the beginning of a test with a random sequence, with each test using a different initial seed value. Our protocols are taken from the cryptographic suite available with the x-kernel [26].
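The fill itself can be a few lines. A sketch using the standard C srand/rand interface (any seeded generator serves; the point is simply that realistic data exercises the full S-box range):

    /* Sketch: defeat unrealistically good S-box cache hit rates by
     * filling each test buffer from a seeded pseudo-random stream,
     * with a different seed for each test run. */
    #include <stdlib.h>
    #include <stddef.h>

    void fill_random(unsigned char *buf, size_t len, unsigned seed)
    {
        srand(seed);                 /* different initial seed per test */
        for (size_t i = 0; i < len; i++)
            buf[i] = (unsigned char)(rand() & 0xff);
    }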
3.2 Parallel Infrastructure

The implementations for packet-level parallel protocols and connection-level parallel protocols are described in detail in [24, 38]. We briefly outline them here.

In the packet-level parallel protocol testbed, packets are assigned to threads as they arrive, regardless of the connections they are associated with. Correctness and safety are maintained by placing locks around shared data structures, such as the TCP connection state or a protocol’s set of active associations, termed the active map in the x-kernel. Threads must contend with one another to gain access to shared data.

In the connection-level parallel testbed, connections are assigned to threads on a one-to-one basis, called thread-per-connection. The entire connection forms the unit of concurrency, only allowing one thread to perform protocol processing for any particular connection. Arriving packets are demultiplexed to the appropriate thread via a packet-filter mechanism [21]. The notion of a connection is extended through the entire protocol stack. Where possible, data structures are replicated per-thread, in order to avoid locking and therefore contention.

The two schemes do have some implementation differences that are necessitated by their respective approaches to concurrency. However, wherever possible, we have strived to make the implementations consistent. For example, both use the same TCP uniprocessor source code base, which is derived from Berkeley’s Net/2 TCP [18] with BSD 4.4 fixes, but not the RFC 1323 extensions [4].
[Figure 2: Sample Configuration: TCP/MD5/IP. A throughput test drives a TCP/MD5/IP/FDDI stack over an in-memory driver acting as a simulated TCP receiver (SIM-TCP-RECV); data flows down the stack and ACKs flow back up.]
3.3 In-Memory Drivers

Since our platform runs in user space, accessing the FDDI adaptor involves crossing the IRIX socket layer, which is prohibitively expensive. Normally, in a user-space implementation of the x-kernel, a simulated device driver is configured below the media access control layer (in this case, FDDI). The simulated driver uses the socket interface to emulate a network device. To avoid this socket-crossing cost, we replaced the simulated driver with in-memory device drivers for the TCP protocol stacks. The drivers emulate a high-speed FDDI interface, and support the FDDI maximum transmission unit (MTU) of slightly over 4K bytes. This is similar to the approaches taken in [3, 13, 20, 34].

The drivers act as senders or receivers, producing or consuming packets as fast as possible, to simulate the behavior of simplex data transfer over an error-free network. To minimize execution time and experimental perturbation, the receive-side drivers use preconstructed packet templates; they do not calculate TCP checksums or MD5 hash values, nor encrypt packets for decryption. Instead, in experiments that use a simulated sender, checksums, signatures, and decryptions are all performed, but the results are ignored and assumed correct.
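A receive-side driver is essentially the following loop. The names are illustrative (the real drivers are part of our x-kernel testbed), and 4352 bytes is the usual IP-over-FDDI MTU, matching "slightly over 4K bytes":

    /* Sketch: an in-memory receive-side driver. It hands preconstructed
     * packet templates up the stack as fast as the stack will take them,
     * emulating an error-free FDDI link. */
    #include <stddef.h>
    #include <string.h>

    #define FDDI_MTU 4352           /* assumed IP-over-FDDI MTU */

    struct driver {
        unsigned char tmpl[FDDI_MTU];   /* preconstructed template */
        volatile int  running;
    };

    /* hypothetical entry point into the bottom of the protocol stack */
    void deliver_up(unsigned char *pkt, size_t len);

    void recv_driver_loop(struct driver *d)
    {
        unsigned char pkt[FDDI_MTU];

        while (d->running) {
            /* the template carries no real checksum, digest, or
             * ciphertext; the stack above still performs (and ignores)
             * those computations, so measured per-packet cost stays
             * realistic. A real driver would also patch per-packet
             * fields such as TCP sequence numbers. */
            memcpy(pkt, d->tmpl, sizeof pkt);
            deliver_up(pkt, sizeof pkt);
        }
    }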
Figure 2 shows a sample test configuration, in this case a TCP/MD5/IP stack. Our test environment measures throughput through the network subsystem on the multiprocessor; it does not measure external factors such as latency across the wire to a remote host.
4 Results

We present results for packet-level and connection-level parallelism here.

4.1 Packet-Level Parallelism

Figure 3 shows the sender’s throughput rates for our packet-level parallel (PLP) experiments. Throughputs are given in megabits per second for several protocol stacks. Our baseline Internet stack consists of TCP/IP/FDDI, representing protocol processing without any security. A second stack is an Internet stack with MD5 between TCP and IP, representing the work done for an application that requires authentication and integrity but no confidentiality. Our third stack uses DES above TCP and MD5 below TCP, which supports both confidentiality and integrity. Our fourth stack is the same as the third, except that we use triple-DES instead of DES.
These throughputs were measured on our 12-processor Challenge machine, using a single TCP connection with 4 KB packets. For these and all subsequent graphs, each data point is the average of 10 runs, where a run consists of measuring the steady-state throughput for 30 seconds, after an initial 30-second warmup period. Throughput graphs include 90 percent confidence intervals.
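For reference, the statistics behind each plotted point can be computed as follows. This is a sketch of the standard mean and confidence-interval calculation, not code from the testbed; with n = 10 runs, the Student-t multiplier for a 90% interval is t(0.95, 9) ≈ 1.833.

    /* Sketch: mean and 90% confidence interval over 10 run samples. */
    #include <math.h>
    #include <stdio.h>

    void report_point(const double x[], int n)
    {
        const double t90 = 1.833;   /* t-value for n-1 = 9 dof, 90% CI */
        double mean = 0.0, var = 0.0;

        for (int i = 0; i < n; i++) mean += x[i];
        mean /= n;
        for (int i = 0; i < n; i++) var += (x[i] - mean) * (x[i] - mean);
        var /= (n - 1);                      /* sample variance */

        double half = t90 * sqrt(var / n);   /* CI half-width */
        printf("%.1f +/- %.1f Mbits/sec\n", mean, half);
    }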
Figure 3 quantifies the slowdown due to the use of cryptographic protocols. The baseline speed for the send-side TCP stack is roughly 138 Mbits/sec. Adding MD5 to the stack reduces throughput by nearly an order of magnitude, to a mere 18 Mbits/sec (MD5 runs 30-50% slower on big-endian hosts [37], such as our Challenge). Adding DES on top of TCP reduces throughput nearly 2 orders of magnitude, to 4.6 Mbits/sec. Using Triple-DES is 3 times slower still, at 1.5 Mbits/sec.
Figure 4 shows the corresponding relative speedup for the send-side tests, where speedup is throughput normalized relative to the uniprocessor throughput for the appropriate stack. The theoretical ideal linear speedup is included for comparison. Previous work [3, 24] has shown limited performance gains when using packet-level parallelism for a single TCP connection, barring any other protocol processing, and this is reflected by the baseline TCP/IP stack’s minimal speedup. This is because the cost of manipulating a TCP connection’s state is large relative to the IP and FDDI processing, and must occur inside a single locked, serial component. Of course, throughput can be improved by using multiple connections.
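Concretely, the plotted speedup for a stack on n processors is (notation ours)

    S(n) = X(n) / X(1),

where X(n) is the measured steady-state throughput with n processors, so S(1) = 1 by construction and ideal linear speedup is the line S(n) = n.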
However, as more compute-intensive cryptographic protocols are used, while the throughput goes down, the relative speedup improves. For example, the MD5 stack achieves a speedup of 8 with 12 processors, and the DES and Triple-DES stacks produce very close to linear speedup. This is because the cost of cryptographic protocol processing, which outweighs the cost of TCP processing, occurs outside the scope of any locks.

Table 1: PLP Latency Breakdown (usec)

    Protocol Stack      Send Side   Inc. Cost   Recv Side   Inc. Cost
    TCP/IP                    236           -         189           -
    TCP/MD5/IP               1737        1500        1708        1519
    DES/TCP/MD5/IP           8982        8746        9179        8990
    3-DES/TCP/IP            21461       21225       21121       20932
Figures 5 and 6 show the throughput and speedup respectively for the same stacks on the receive side. Again we observe successively lower throughputs but better relative speedups as more compute-intensive cryptographic protocols are used. The speedup curve for 3-DES, which is the most compute-intensive, is essentially linear.
Table 1 gives the latency breakdown for the various protocol stacks. Latency here is calculated from the uniprocessor throughput, to gain an understanding of the relative cost of adding cryptographic protocols. The table includes total time and incremental overhead for adding a protocol in addition to the baseline TCP/IP processing time. The incremental cost for MD5 is about 1500 usec, for DES about 7200 usec, and for 3-DES about 21000 usec. Since these costs are incurred outside the scope of any locks, they can run in parallel on different packets.

Given that the locked component of manipulating the TCP connection state limits the throughput to about 200 Mbits/sec on this platform, we estimate that the TCP/MD5/IP stack would bottleneck at about 16 processors, and that the DES stack would scale to 30 processors. We expect that more compute-intensive protocols, such as RSA, would also scale linearly.
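Both the latency entries and the scaling estimates follow from simple arithmetic (ours, as a sanity check). Latency computed from uniprocessor throughput is packet size over throughput; for 4 KB packets at the 138 Mbits/sec baseline,

    T_pkt = (4 x 1024 x 8 bits) / (138 Mbits/sec) ≈ 237 usec,

consistent with the 236 usec TCP/IP send-side entry in Table 1. The processor estimates follow the usual pipeline bound: if t_s is the per-packet time spent inside the locked, serial TCP component and T is the total per-packet time, adding processors helps only up to roughly

    n_max ≈ T / t_s,

beyond which the serial component saturates; the 16- and 30-processor figures above are estimates of this knee for the MD5 and DES stacks.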
We also ran similarly configured UDP-based stacks, not shown due to space limitations. In general, the results were similar, except that single-connection parallelism with the baseline UDP stacks exhibited much better speedup than the single-connection baseline TCP stacks. However, as cryptographic processing is used, the differences in both throughput and speedup between TCP-based and UDP-based stacks essentially disappear.
[Figure 3: PLP Send-Side Throughput. Throughput in Mbits/sec (0-300) versus number of processors (1-12) for the TCP/IP, TCP/MD5/IP, DES/TCP/MD5/IP, and 3-DES/TCP/IP stacks.]

[Figure 4: PLP Send-Side Speedup. Speedup versus number of processors for the same stacks, with ideal linear speedup shown for comparison.]

[Figure 5: PLP Receive-Side Throughput. As Figure 3, for the receive side (0-400 Mbits/sec).]

[Figure 6: PLP Receive-Side Speedup. As Figure 4, for the receive side.]

4.2 Connection-Level Parallelism

Figures 7 and 8 show send-side throughput and speedup respectively for our Connection-Level Parallel (CLP) experiments. In these experiments, 12 connections are measured, and the number of processors is varied from 1 to 12. The throughput graphs here show aggregate throughput for all 12 connections. We use the same protocols and experimental methodology as with the PLP experiments, with the one exception that in some cases the warmup period is longer than 30 seconds.

[Figure 7: CLP Send-Side Throughput. Aggregate throughput in Mbits/sec (0-1800) versus number of processors (1-12) for the TCP/IP, TCP/MD5/IP, DES/TCP/MD5/IP, and 3-DES/TCP/MD5/IP stacks.]

[Figure 8: CLP Send-Side Speedup. Speedup versus number of processors for the same stacks, with ideal linear speedup shown for comparison.]

[Figure 9: CLP Receive-Side Throughput. As Figure 7, for the receive side.]

[Figure 10: CLP Receive-Side Speedup. As Figure 8, for the receive side.]
As with packet-level parallelism, the throughputs decline as more compute-intensive cryptographic operations are performed. Previous work [38] showed good speedup for CLP, barring any other protocol processing. Figure 8 illustrates this in the baseline case (i.e., the TCP/IP stack). CLP exhibits good speedup when the workload has a sufficient number of active connections, because each connection is serviced by its own thread and there is no explicit synchronization between threads. Figure 8 also shows that when any cryptographic processing is added to the scenario, speedup becomes essentially linear. This is because speedup for CLP depends on the ratio of compute-bound vs. memory-bound processing. The more compute-bound an application is, the better the speedup will be. The more memory-bound an application is, the greater the likelihood that memory contention between processors will limit speedup. Since cryptographic protocols are compute-intensive, CLP stacks using them exhibit better speedup.

An interesting result is that while the two approaches to parallelism behave very differently in the baseline cases (i.e., the standard TCP/IP stacks), as more cryptographic processing is done, the more similar the schemes appear in terms of throughput, speedup, and latency. For example, in the send-side experiment using DES, both schemes have a throughput of roughly 50 Mbits/sec with 12 processors. Again, this is because the cryptographic processing, which is similar in the two schemes, vastly outweighs differences between the two approaches, as well as the differences in the experiments (i.e., single vs. multiple connections).

Figures 9 and 10 show that for the receive side, the baseline experiments show better throughput and speedup than for the send side. However, once cryptographic protocols are added, the same trends are exhibited.
Table 2 gives the corresponding latency breakdown for the connection-level parallel stacks. Again, incremental costs are relative to the baseline TCP/IP stack. The incremental cost for MD5 in this case is about 1550 usec, for DES about 7100 usec, and for 3-DES about 21500 usec.

Table 2: CLP Latency Breakdown (usec)

    Protocol Stack        Send Side   Inc. Cost   Recv Side   Inc. Cost
    TCP/IP                      226           -         214           -
    TCP/MD5/IP                 1772        1546        1769        1555
    DES/TCP/MD5/IP             8694        8468        8985        8771
    3-DES/TCP/MD5/IP          23150       22924       23237       23023

Figure 11 shows how connection-level parallelism scales with large numbers of connections for the send side. In this experiment, 12 processors were used, and the workload varied from 12 to 384 connections. Two forms of connection-level parallelism were used here: thread-per-connection, where the number of threads is equal to the number of connections, and processor-per-connection, where the number of threads equals the number of processors. Note that at 12 connections the two schemes are equivalent.

Previous work [38] has shown that, barring any other protocol processing, thread-per-connection achieves higher aggregate throughput than processor-per-connection, but that processor-per-connection distributes the aggregate bandwidth more fairly between connections. Figure 11 shows that, as cryptographic protocol processing is added, the difference in performance between the thread-per-connection and processor-per-connection versions of CLP disappears. Again, this is because cryptographic computation overwhelms any costs due to memory referencing or increasing the parallelism by using more threads.
5 Discussion

In this paper we have shown how both packet-level and connection-level parallelism can be used to improve cryptographic protocol performance. We have not addressed functional parallelism or more fine-grained approaches to parallelized cryptography, such as using multiple processors to encrypt a single message in parallel. Such an approach would not only improve throughput, but might also reduce the latency as seen by an individual message.

For example, DES in Electronic Code Book (ECB) mode can be run in parallel on different blocks of a single message. However, DES using ECB is susceptible to simple-substitution code attacks and cut-and-paste forgery, both of which are realistic worries in computer systems which send large amounts of known text. Thus, most DES implementations use CBC mode, where a plaintext block is XOR’ed with the ciphertext of the previous block, making each block dependent on the previous one, and preventing a parallelized implementation. However, each 8-byte block of a message encrypted with DES in CBC mode could be decrypted in parallel, since computing the plaintext block requires only the key, the ciphertext block, and the previous ciphertext block.
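In equations (the standard CBC definitions, with C_0 the initialization vector and E_K, D_K encryption and decryption under key K):

    C_i = E_K(P_i \oplus C_{i-1})        (encryption)
    P_i = D_K(C_i) \oplus C_{i-1}        (decryption)

Encryption is inherently serial, because C_i cannot be formed until C_{i-1} exists. Decryption parallelizes freely, because each P_i depends only on the key and on two ciphertext blocks that are all available once the message arrives.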
In practice, DES CBC must be used with some form of message integrity check to thwart cut-and-paste forgeries. MD5 is not amenable to fine-grain parallelism, and this limits the opportunities for applying these methods. Some avenues for research include finding faster or parallelizable message integrity algorithms, and combining these with DES modes that allow finer-grain parallel encryption techniques, and especially modes that allow the sender and receiver to use different processing granularities.