`Proceedings of the USENIX 1996 Annual Technical Conference
`San Diego, California, January 1996
`
`Eliminating Receive Livelock in an
`Interrupt-driven Kernel
`
`Jeffrey Mogul, DEC Western Research Laboratory
`K. K. Ramakrishnan, AT&T Bell Laboratories
`
For more information about USENIX Association contact:
1. Phone: 510 528-8649
2. FAX: 510 548-5738
3. Email: office@usenix.org
4. WWW URL: http://www.usenix.org
`
`Petitioner Apple Inc. - Exhibit 1009, p. 1
`
`
`
`Eliminating Receive Livelock in an Interrupt-driven Kernel
`
`Jeffrey C. Mogul
`Digital Equipment Corporation Western Research Laboratory
`K. K. Ramakrishnan
`AT&T Bell Laboratories
`
`Abstract
`
Most operating systems use interface interrupts to
schedule network tasks. Interrupt-driven systems can
provide low overhead and good latency at low offered
load, but degrade significantly at higher arrival
rates unless care is taken to prevent several
pathologies. These are various forms of receive
livelock, in which the system spends all its time
processing interrupts, to the exclusion of other
necessary tasks. Under extreme conditions, no packets
are delivered to the user application or the output
of the system.
To avoid livelock and related problems, an operating
system must schedule network interrupt handling as
carefully as it schedules process execution. We
modified an interrupt-driven networking implementation
to do so; this eliminates receive livelock without
degrading other aspects of system performance. We
present measurements demonstrating the success of our
approach.
`
`1. Introduction
`Most operating systems use interrupts to inter-
`nally schedule the performance of tasks related to I/O
`events, and particularly the invocation of network
`protocol software. Interrupts are useful because they
`allow the CPU to spend most of its time doing useful
`processing, yet respond quickly to events without
`constantly having to poll for event arrivals.
`Polling is expensive, especially when I/O events
`are relatively rare, as is the case with disks, which
`seldom interrupt more than a few hundred times per
`second. Polling can also increase the latency of
`response to an event. Modern systems can respond to
`an interrupt in a few tens of microseconds; to achieve
`the same latency using polling, the system would
`have to poll tens of thousands of times per second,
`which would create excessive overhead. For a
`general-purpose system, an interrupt-driven design
`works best.
`Most extant operating systems were designed to
`handle I/O devices that interrupt every few mil-
`liseconds. Disks tended to issue events on the order
`
`of once per revolution; first-generation LAN environ-
`ments tend to generate a few hundred packets per
`second for any single end-system. Although people
`understood the need to reduce the cost of taking an
`interrupt, in general this cost was low enough that
`any normal system would spend only a fraction of its
`CPU time handling interrupts.
`The world has changed. Operating systems typi-
`cally use the same interrupt mechanisms to control
`both network processing and traditional I/O devices,
`yet many new applications can generate packets
`several orders of magnitude more often than a disk
`can generate seeks. Multimedia and other real-time
`applications will become widespread. Client-server
`applications, such as NFS, running on fast clients and
`servers can generate heavy RPC loads. Multicast and
`broadcast protocols subject innocent-bystander hosts
`to loads that do not interest them at all. As a result,
`network implementations must now deal with sig-
`nificantly higher event rates.
`Many multi-media and client-server applications
`share another unpleasant property: unlike traditional
`network applications (Telnet, FTP, electronic mail),
`they are not flow-controlled. Some multi-media ap-
`plications want constant-rate, low-latency service;
`RPC-based client-server applications often use
`datagram-style transports, instead of reliable, flow-
`controlled protocols. Note that whereas I/O devices
`such as disks generate interrupts only as a result of
`requests from the operating system, and so are in-
`herently flow-controlled, network interfaces generate
`unsolicited receive interrupts.
`The shift to higher event rates and non-flow-
`controlled protocols can subject a host to congestive
`collapse: once the event rate saturates the system,
`without a negative feedback loop to control the
`sources, there is no way to gracefully shed load. If
`the host runs at full throughput under these con-
`ditions, and gives fair service to all sources, this at
`least preserves the possibility of stability. But if
`throughput decreases as the offered load increases,
`the overall system becomes unstable.
`Interrupt-driven systems tend to perform badly
`under overload. Tasks performed at interrupt level,
`
`
`
`
`by definition, have absolute priority over all other
`tasks. If the event rate is high enough to cause the
`system to spend all of its time responding to inter-
`rupts, then nothing else will happen, and the system
`throughput will drop to zero. We call this condition
`receive livelock: the system is not deadlocked, but it
`makes no progress on any of its tasks.
`Any purely interrupt-driven system using fixed in-
`terrupt priorities will suffer from receive livelock un-
`der input overload conditions. Once the input rate
`exceeds the reciprocal of the CPU cost of processing
`one input event, any task scheduled at a lower
`priority will not get a chance to run.
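The arithmetic behind that threshold can be sketched in a few lines of illustrative (non-kernel) Python; the 50-microsecond per-interrupt cost below is an invented figure, not a measurement from this paper:

```python
# Toy illustration of the livelock threshold: if handling one receive
# interrupt costs c seconds of CPU, an input rate above 1/c leaves no
# cycles for any lower-priority task.

def cpu_fraction_in_interrupts(input_rate_pps, cost_per_event_s):
    """Fraction of the CPU consumed at interrupt level (capped at 1)."""
    return min(1.0, input_rate_pps * cost_per_event_s)

def spare_cpu(input_rate_pps, cost_per_event_s):
    """CPU fraction left over for all lower-priority work."""
    return 1.0 - cpu_fraction_in_interrupts(input_rate_pps, cost_per_event_s)

cost = 50e-6                            # hypothetical: 50 us per interrupt
saturation_rate = 1.0 / cost
print(round(saturation_rate))           # 20000 packets/sec saturates the CPU
print(round(spare_cpu(10_000, cost), 3))  # 0.5: half the CPU left at 10k pps
print(spare_cpu(25_000, cost))          # 0.0: livelock; nothing else runs
```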
`Yet we do not want to lightly discard the obvious
`benefits of an interrupt-driven design. Instead, we
`should integrate control of the network interrupt han-
`dling sub-system into the operating system’s schedul-
`ing mechanisms and policies. In this paper, we
`present a number of simple modifications to the
`purely interrupt-driven model, and show that they
`guarantee throughput and improve latency under
`overload, while preserving the desirable qualities of
`an interrupt-driven system under light load.
`
`2. Motivating applications
`We were led to our investigations by a number of
`specific applications that can suffer from livelock.
`Such applications could be built on dedicated single-
`purpose systems, but are often built using a general-
`purpose system such as UNIX, and we wanted to
`find a general solution to the livelock problem. The
`applications include:
`• Host-based routing: Although inter-network
`routing is traditionally done using special-
`purpose (usually non-interrupt-driven) router
systems, routing is often done using more
conventional hosts. Virtually all Internet
‘‘firewall’’ products use UNIX or Windows
NT systems for routing [7, 13]. Much
experimentation with new routing algorithms is
done on UNIX [2], especially for IP multicast-
ing.
`• Passive network monitoring: network managers,
`developers, and researchers commonly use
`UNIX systems, with their network interfaces in
`‘‘promiscuous mode,’’ to monitor traffic on a
`LAN for debugging or statistics gathering [8].
`• Network file service: servers for protocols such
`as NFS are commonly built from UNIX sys-
`tems.
`These applications (and others like them, such as
`Web servers) are all potentially exposed to heavy,
`non-flow-controlled loads. We have encountered
`livelock in all three of these applications, have solved
`or mitigated the problem, and have shipped the solu-
`
`tions to customers. The rest of this paper con-
`centrates on host-based routing, since this simplifies
`the context of the problem and allows easy perfor-
`mance measurement.
`
`3. Requirements for scheduling network tasks
`Performance problems generally arise when a sys-
`tem is subjected to transient or long-term input over-
`load.
`Ideally, the communication subsystem could
`handle the worst-case input load without saturating,
`but cost considerations often prevent us from build-
`ing such powerful systems. Systems are usually
`sized to support a specified design-center load, and
`under overload the best we can ask for is controlled
`and graceful degradation.
`When an end-system is involved in processing
`considerable network traffic, its performance depends
critically on how its tasks are scheduled. The
mechanisms and policies that schedule packet
`processing and other tasks should guarantee accept-
`able system throughput, reasonable latency and jitter
`(variance in delay), fair allocation of resources, and
`overall system stability, without imposing excessive
`overheads, especially when the system is overloaded.
`We can define throughput as the rate at which the
`system delivers packets to their ultimate consumers.
`A consumer could be an application running on the
`receiving host, or the host could be acting as a router
`and forwarding packets to consumers on other hosts.
`We expect the throughput of a well-designed system
`to keep up with the offered load up to a point called
`the Maximum Loss Free Receive Rate (MLFRR), and
`at higher loads throughput should not drop below this
`rate.
`Of course, useful throughput depends not just on
`successful reception of packets; the system must also
`transmit packets. Because packet reception and
`packet transmission often compete for the same
resources, under input overload conditions the
`scheduling subsystem must ensure that packet trans-
`mission continues at an adequate rate.
`Many applications, such as distributed systems
`and interactive multimedia, often depend more on
`low-latency, low-jitter communications than on high
throughput. Even during overload, we want to avoid
long queues, which increase latency, and bursty
scheduling, which increases jitter.
`When a host is overloaded with incoming network
`packets, it must also continue to process other tasks,
`so as to keep the system responsive to management
`and control requests, and to allow applications to
`make use of the arriving packets. The scheduling
`subsystem must fairly allocate CPU resources among
packet reception, packet transmission, protocol
`
`
`
`
`processing, other I/O processing, system housekeep-
`ing, and application processing.
`A host that behaves badly when overloaded can
`also harm other systems on the network. Livelock in
`a router, for example, may cause the loss of control
`messages, or delay their processing. This can lead
`other routers to incorrectly infer link failure, causing
`incorrect routing information to propagate over the
`entire wide-area network. Worse, loss or delay of
`control messages can lead to network instability, by
`causing positive feedback in the generation of control
`traffic [10].
`
`4. Interrupt-driven scheduling and its
`consequences
`Scheduling policies and mechanisms significantly
`affect the throughput and latency of a system under
`overload.
`In an interrupt-driven operating system,
`the interrupt subsystem must be viewed as a com-
`ponent of the scheduling system, since it has a major
`role in determining what code runs when. We have
`observed that interrupt-driven systems have trouble
`meeting the requirements discussed in section 3.
`In this section, we first describe the characteristics
`of an interrupt-driven system, and then identify three
kinds of problems caused by network input overload
`in interrupt-driven systems:
`• Receive livelocks under overload: delivered
`throughput drops to zero while the input over-
`load persists.
`• Increased latency for packet delivery or for-
`warding: the system delays the delivery of one
`packet while it processes the interrupts for sub-
`sequent packets, possibly of a burst.
`• Starvation of packet transmission: even if the
`CPU keeps up with the input load, strict priority
`assignments may prevent it from transmitting
`any packets.
`
`4.1. Description of an interrupt-driven system
`An interrupt-driven system performs badly under
`network input overload because of the way in which
`it prioritizes the tasks executed as the result of net-
`work input. We begin by describing a typical operat-
`ing system’s structure for processing and prioritizing
`network tasks. We use the 4.2BSD [5] model for our
`example, but we have observed that other operating
`systems, such as VMS, DOS, and Windows NT,
`and even several Ethernet chips, have similar charac-
`teristics and hence similar problems.
`When a packet arrives, the network interface sig-
`nals this event by interrupting the CPU. Device in-
`terrupts normally have a fixed Interrupt Priority
`Level (IPL), and preempt all tasks running at a lower
`
`IPL; interrupts do not preempt tasks running at the
`same IPL. The interrupt causes entry into the as-
`sociated network device driver, which does some in-
`itial processing of the packet. In 4.2BSD, only buffer
`management and data-link layer processing happens
`at ‘‘device IPL.’’ The device driver then places the
`packet on a queue, and generates a software interrupt
`to cause further processing of the packet. The
`software interrupt is taken at a lower IPL, and so this
`protocol processing can be preempted by subsequent
`interrupts. (We avoid lengthy periods at high IPL, to
`reduce latency for handling certain other events.)
`The queues between steps executed at different
`IPLs provide some insulation against packet losses
`due to transient overloads, but typically they have
`fixed length limits. When a packet should be queued
`but the queue is full, the system must drop the packet.
`The selection of proper queue limits, and thus the
`allocation of buffering among layers in the system, is
`critical to good performance, but beyond the scope of
`this paper.
`Note that the operating system’s scheduler does
`not participate in any of this activity, and in fact is
`entirely ignorant of it.
`As a consequence of this structure, a heavy load
`of incoming packets could generate a high rate of
`interrupts at device IPL. Dispatching an interrupt is a
`costly operation, so to avoid this overhead, the net-
`work device driver attempts to batch interrupts. That
`is, if packets arrive in a burst, the interrupt handler
`attempts to process as many packets as possible be-
`fore returning from the interrupt. This amortizes the
`cost of processing an interrupt over several packets.
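The amortization is simple to state numerically; the microsecond costs in this sketch are made up for illustration, not taken from the paper:

```python
# Interrupt batching amortizes the (large) dispatch cost over a burst.
def avg_cost_per_packet_us(dispatch_cost_us, per_packet_cost_us, batch_size):
    """One interrupt dispatch covers batch_size packets, so its cost is shared."""
    return dispatch_cost_us / batch_size + per_packet_cost_us

print(avg_cost_per_packet_us(20.0, 10.0, 1))  # 30.0: each packet pays full dispatch
print(avg_cost_per_packet_us(20.0, 10.0, 8))  # 12.5: dispatch shared across a burst
```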
`Even with batching, a system overloaded with in-
`put packets will spend most of its time in the code
`that runs at device IPL. That is, the design gives
`absolute priority to processing incoming packets. At
`the time that 4.2BSD was developed, in the early
`1980s, the rationale for this was that network adap-
`ters had little buffer memory, and so if the system
`failed to move a received packet promptly into main
`memory, a subsequent packet might be lost. (This is
`still a problem with low-cost interfaces.) Thus, sys-
`tems derived from 4.2BSD do minimal processing at
`device IPL, and give this processing priority over all
`other network tasks.
`Modern network adapters can receive many back-
`to-back packets without host intervention, either
`through the use of copious buffering or highly
`autonomous DMA engines. This insulates the system
`from the network, and eliminates much of the
`rationale for giving absolute priority to the first few
`steps of processing a received packet.
`
`
`
`
`4.2. Receive livelock
`In an interrupt-driven system, receiver interrupts
`take priority over all other activity. If packets arrive
`too fast, the system will spend all of its time process-
`ing receiver interrupts. It will therefore have no
`resources left to support delivery of the arriving
`packets to applications (or, in the case of a router, to
`forwarding and transmitting these packets). The use-
`ful throughput of the system will drop to zero.
`Following [11], we refer to this condition as
`receive livelock: a state of the system where no useful
`progress is being made, because some necessary
`resource
`is entirely consumed with processing
`receiver interrupts. When the input load drops suf-
`ficiently, the system leaves this state, and is again
`able to make forward progress. This is not a dead-
`lock state, from which the system would not recover
`even when the input rate drops to zero.
`A system could behave in one of three ways as the
`input load increases. In an ideal system, the
`delivered throughput always matches the offered
`load. In a realizable system, the delivered throughput
`keeps up with the offered load up to the Maximum
`Loss Free Receive Rate (MLFRR), and then is rela-
`tively constant after that. At loads above the
`MLFRR, the system is still making progress, but it is
`dropping some of the offered input; typically, packets
`are dropped at a queue between processing steps that
`occur at different priorities.
`In a system prone to receive livelock, however,
`throughput decreases with increasing offered load,
`for input rates above the MLFRR. Receive livelock
`occurs at the point where the throughput falls to zero.
`A livelocked system wastes all of the effort it puts
`into partially processing received packets, since they
`are all discarded.
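The three behaviors can be captured in a toy throughput model; the MLFRR and livelock point used below are invented parameters, not measurements:

```python
# Delivered throughput as a function of offered load, in the three
# regimes described above (all rates in packets/sec, chosen arbitrarily).

def ideal(load):
    """Ideal system: delivered throughput always matches offered load."""
    return load

def realizable(load, mlfrr):
    """Well-behaved system: keeps up to the MLFRR, then stays flat."""
    return min(load, mlfrr)

def livelock_prone(load, mlfrr, livelock_at):
    """Throughput falls off above the MLFRR, reaching zero at livelock."""
    if load <= mlfrr:
        return load
    if load >= livelock_at:
        return 0.0
    return mlfrr * (livelock_at - load) / (livelock_at - mlfrr)

print(realizable(8000, mlfrr=5000))        # 5000: excess load is dropped
print(livelock_prone(7500, 5000, 10000))   # 2500.0: declining past the MLFRR
print(livelock_prone(12000, 5000, 10000))  # 0.0: livelocked
```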
`Receiver-interrupt batching complicates the situa-
`tion slightly. By improving system efficiency under
`heavy load, batching can increase the MLFRR.
`Batching can shift the livelock point but cannot, by
`itself, prevent livelock.
`In section 6.2, we present measurements showing
`how livelock occurs in a practical situation. Ad-
`ditional measurements, and a more detailed discus-
`sion of the problem, are given in [11].
`
`4.3. Receive latency under overload
`Although interrupt-driven designs are normally
`thought of as a way to reduce latency, they can ac-
`tually increase the latency of packet delivery. If a
`burst of packets arrives too rapidly, the system will
`do link-level processing of the entire burst before do-
`ing any higher-layer processing of the first packet,
`because link-level processing is done at a higher
`
`priority. As a result, the first packet of the burst is
`not delivered to the user until link-level processing
`has been completed for all the packets in the burst.
`The latency to deliver the first packet in a burst is
`increased almost by the time it takes to receive the
`entire burst. If the burst is made up of several inde-
`pendent NFS RPC requests, for example, this means
`that the server’s disk sits idle when it could be doing
`useful work.
`One of the authors has previously described ex-
`periments demonstrating this effect [12].
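The size of the effect follows from simple arithmetic; the per-packet costs in this sketch (30 us of link-level work, 100 us of higher-layer work) are invented, not figures from the paper:

```python
# Because link-level processing of the whole burst runs at higher
# priority, the first packet waits for all of it before delivery.
def first_packet_latency_us(burst_size, link_cost_us, higher_cost_us):
    """Latency to deliver the first packet of a burst to its consumer."""
    return burst_size * link_cost_us + higher_cost_us

print(first_packet_latency_us(1, 30.0, 100.0))   # 130.0: a lone packet
print(first_packet_latency_us(16, 30.0, 100.0))  # 580.0: a 16-packet burst
```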
`
`4.4. Starvation of transmits under overload
`In most systems, the packet transmission process
`consists of selecting packets from an output queue,
`handing them to the interface, waiting until the inter-
`face has sent the packet, and then releasing the as-
`sociated buffer.
`Packet transmission is often done at a lower
`priority than packet reception. This policy is super-
`ficially sound, because it minimizes the probability of
`packet loss when a burst of arriving packets exceeds
`the available buffer space. Reasonable operation of
`higher level protocols and applications, however, re-
`quires that transmit processing makes sufficient
`progress.
`When the system is overloaded for long periods,
`use of a fixed lower priority for transmission leads to
`reduced throughput, or even complete cessation of
`packet transmission. Packets may be awaiting trans-
`mission, but the transmitting interface is idle. We
`call this transmit starvation.
`Transmit starvation may occur if the transmitter
`interrupts at a lower priority than the receiver; or if
`they interrupt at the same priority, but the receiver’s
`events are processed first by the driver; or if trans-
`mission completions are detected by polling, and the
`polling is done at a lower priority than receiver event
`processing.
This effect has also been described previously [12].
`5. Avoiding livelock through better
`scheduling
`In this section, we discuss several techniques to
`avoid receive livelocks. The techniques we discuss
`in this section include mechanisms to control the rate
`of incoming interrupts, polling-based mechanisms to
`ensure fair allocation of resources, and techniques to
`avoid unnecessary preemption.
`
`
`
`
`5.1. Limiting the interrupt arrival rate
`We can avoid or defer receive livelock by limiting
`the rate at which interrupts are imposed on the sys-
`tem. The system checks to see if interrupt processing
`is taking more than its share of resources, and if so,
`disables interrupts temporarily.
`The system may infer impending livelock because
`it is discarding packets due to queue overflow, or
`because high-layer protocol processing or user code
`are making no progress, or by measuring the fraction
`of CPU cycles used for packet processing. Once the
`system has invested enough work in an incoming
`packet to the point where it is about to be queued, it
`makes more sense to process that packet to comple-
`tion than to drop it and rescue a subsequently-
`arriving packet from being dropped at the receiving
`interface, a cycle that could repeat ad infinitum.
`When the system is about to drop a received
`packet because an internal queue is full, this strongly
`suggests that it should disable input interrupts. The
`host can then make progress on the packets already
`queued for higher-level processing, which has the
`side-effect of freeing buffers to use for subsequent
`received packets. Meanwhile, if the receiving inter-
`face has sufficient buffering of its own, additional
`incoming packets may accumulate there for a while.
`We also need a trigger for re-enabling input inter-
`rupts, to prevent unnecessary packet loss. Interrupts
`may be re-enabled when internal buffer space be-
`comes available, or upon expiration of a timer.
`We may also want the system to guarantee some
`progress for user-level code. The system can observe
`that, over some interval, it has spent too much time
`processing packet input and output events, and tem-
`porarily disable interrupts to give higher protocol
`layers and user processes time to run. On a processor
`with a fine-grained clock register, the packet-input
`code can record the clock value on entry, subtract
`that from the clock value seen on exit, and keep a
`sum of the deltas. If this sum (or a running average)
`exceeds a specified fraction of the total elapsed time,
`the kernel disables input interrupts.
(Digital’s GIGAswitch system uses a similar mechanism [15].)
`On a system without a fine-grained clock, one can
`crudely simulate this approach by sampling the CPU
`state on every clock interrupt (clock interrupts typi-
`cally preempt device interrupt processing). If the
`system finds itself in the midst of processing inter-
`rupts for a series of such samples, it can disable inter-
`rupts for a few clock ticks.
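As a concrete, if simplified, rendering of the fine-grained-clock accounting just described: the sketch below keeps a running sum of time spent in packet-input code and reports when that sum exceeds a threshold fraction of elapsed time. A kernel would read a cycle counter and mask the device interrupt rather than return a boolean; the class name and thresholds are ours, not from the paper.

```python
# Accounting scheme: record the clock on entry and exit of packet-input
# code, sum the deltas, and disable input interrupts when the sum
# exceeds a specified fraction of total elapsed time.
class InputTimeLimiter:
    def __init__(self, max_fraction=0.5):
        self.max_fraction = max_fraction   # CPU share allowed for packet input
        self.busy_time = 0.0               # running sum of entry/exit deltas
        self.window_start = None

    def packet_input(self, now, duration):
        """Charge `duration` seconds of packet-input work ending at `now`."""
        if self.window_start is None:
            self.window_start = now
        self.busy_time += duration

    def interrupts_allowed(self, now):
        """False means the kernel should disable input interrupts."""
        if self.window_start is None:
            return True
        elapsed = now - self.window_start
        if elapsed <= 0:
            return True
        return self.busy_time / elapsed <= self.max_fraction

limiter = InputTimeLimiter(max_fraction=0.5)
limiter.packet_input(now=0.0, duration=0.4)
print(limiter.interrupts_allowed(now=1.0))  # True: 40% of the window used
limiter.packet_input(now=1.0, duration=0.3)
print(limiter.interrupts_allowed(now=1.2))  # False: over half the window used
```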
`
`5.2. Use of polling
`Limiting the interrupt rate prevents system satura-
`tion but might not guarantee progress; the system
`must also fairly allocate packet-handling resources
`between input and output processing, and between
`multiple interfaces. We can provide fairness by care-
`fully polling all sources of packet events, using a
`round-robin schedule.
`In a pure polling system, the scheduler would in-
`voke the device driver to ‘‘listen’’ for incoming
`packets and for transmit completion events. This
`would control the amount of device-level processing,
`and could also fairly allocate resources among event
`sources, thus avoiding livelock. Simply polling at
`fixed intervals, however, adds unacceptable latency
`to packet reception and transmission.
`Polling designs and interrupt-driven designs differ
`in their placement of policy decisions. When the
`behavior of tasks cannot be predicted, we rely on the
`scheduler and the interrupt system to dynamically al-
`locate CPU resources. When tasks can be expected
`to behave in a predictable manner, the tasks them-
`selves are better able to make the scheduling deci-
`sions, and polling depends on voluntary cooperation
`among the tasks.
`Since a purely interrupt-driven system leads to
`livelock, and a purely polling system adds unneces-
`sary latency, we employ a hybrid design, in which the
`system polls only when triggered by an interrupt, and
`interrupts happen only while polling is suspended.
`During low loads, packet arrivals are unpredictable
`and we use interrupts to avoid latency. During high
`loads, we know that packets are arriving at or near
`the system’s saturation rate, so we use polling to en-
`sure progress and fairness, and only re-enable inter-
`rupts when no more work is pending.
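The hybrid scheme can be sketched as a toy event loop: an interrupt only disables further interrupts and starts polling, polling proceeds under a budget, and the device interrupt is re-enabled only once the queue is drained. Device behavior is simulated here; all names are ours, and a real driver would manipulate hardware interrupt masks instead.

```python
from collections import deque

class Device:
    """Simulated interface using interrupt-initiated polling."""
    def __init__(self):
        self.rx_queue = deque()
        self.interrupts_enabled = True
        self.poll_scheduled = False

    def packet_arrives(self, pkt):
        self.rx_queue.append(pkt)
        if self.interrupts_enabled:
            self.interrupt()

    def interrupt(self):
        # The handler does no packet work: it suspends further interrupts
        # and schedules the polling code.
        self.interrupts_enabled = False
        self.poll_scheduled = True

    def poll(self, budget):
        """Process up to `budget` packets; re-enable interrupts when drained."""
        done = []
        while self.rx_queue and budget > 0:
            done.append(self.rx_queue.popleft())
            budget -= 1
        if not self.rx_queue:            # no more work pending
            self.interrupts_enabled = True
            self.poll_scheduled = False
        return done

dev = Device()
for i in range(5):
    dev.packet_arrives(i)          # only the first arrival takes an interrupt
print(dev.interrupts_enabled)      # False: interrupts stay off under load
print(dev.poll(budget=3))          # [0, 1, 2]
print(dev.poll(budget=3))          # [3, 4]; queue drained, interrupts back on
print(dev.interrupts_enabled)      # True
```

The budget argument is what makes round-robin fairness possible: a scheduler can poll each interface for a bounded amount of work before moving on.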
`
`5.3. Avoiding preemption
`As we showed in section 4.2, receive livelock oc-
`curs because interrupt processing preempts all other
`packet processing. We can solve this problem by
making higher-level packet processing
non-preemptable. We observe that this can be done
following one of two general approaches: do (almost)
`everything at high IPL, or do (almost) nothing at high
`IPL.
`Following the first approach, we can modify the
`4.2BSD design (see section 4.1) by eliminating the
`software interrupt, polling interfaces for events, and
`processing received packets to completion at device
`IPL. Because higher-level processing occurs at
`device IPL, it cannot be preempted by another packet
`arrival, and so we guarantee that livelock does not
`occur within the kernel’s protocol stack. We still
`
`
`
`
`need to use a rate-control mechanism to ensure
`progress by user-level applications.
`In a system following the second approach, the
`interrupt handler runs only long enough to set a ‘‘ser-
`vice needed’’ flag, and to schedule the polling thread
`if it is not already running. The polling thread runs at
`zero IPL, checking the flags to decide which devices
`need service. Only when the polling thread is done
`does it re-enable the device interrupt. The polling
`thread can be interrupted at most once by each
`device, and so it progresses at full speed without in-
`terference.
`Either approach eliminates the need to queue
`packets between the device driver and the higher-
`level protocol software, although if the protocol stack
`must block, the incoming packet must be queued at a
`later point. (For example, this would happen when
`the data is ready for delivery to a user process, or
`when an IP fragment is received and its companion
`fragments are not yet available.)
`
`5.4. Summary of techniques
`In summary, we avoid livelock by:
`• Using interrupts only to initiate polling.
`• Using round-robin polling to fairly allocate
`resources among event sources.
`• Temporarily disabling input when feedback
`from a full queue, or a limit on CPU usage,
`indicates that other important tasks are pending.
`• Dropping packets early, rather than late, to
`avoid wasted work. Once we decide to receive
`a packet, we try to process it to completion.
`We maintain high performance by
`• Re-enabling interrupts when no work is pend-
`ing, to avoid polling overhead and to keep
`latency low.
`• Letting the receiving interface buffer bursts, to
`avoid dropping packets.
`• Eliminating the IP input queue, and associated
`overhead.
`We observe, in passing, that inefficient code tends
`to exacerbate receive livelock, by lowering the
`MLFRR of the system and hence increasing the
`likelihood that livelock will occur. Aggressive op-
`timization, ‘‘fast-path’’ designs, and removal of un-
`necessary steps all help to postpone arrival of
`livelock.
`
`6. Livelock in BSD-based routers
`In this section, we consider the specific example
`of an IP packet router built using Digital UNIX
`(formerly DEC OSF/1). We chose this application
`because routing performance is easily measured.
`Also, since firewalls typically use UNIX-based
`
`routers, they must be livelock-proof in order to
`prevent denial-of-service attacks.
`Our goals were to (1) obtain the highest possible
`maximum throughput; (2) maintain high throughput
`even when overloaded; (3) allocate sufficient CPU
`cycles to user-mode tasks; (4) minimize latency; and
`(5) avoid degrading performance in other applica-
`tions.
`
`6.1. Measurement methodology
`Our test configuration consisted of a router-under-
`test connecting two otherwise unloaded Ethernets. A
`source host generated IP/UDP packets at a variety of
`rates, and sent them via the router to a destination
`address.
`(The destination host did not exist; we
`fooled the router by inserting a phantom entry into its
`ARP table.) We measured router performance by
`counting the number of packets successfully for-
`warded in a given period, yielding an average for-
`warding rate.
The router-under-test was a DECstation 3000/300
Alpha-based system running Digital UNIX
`V3.2, with a SPECint92 rating of 66.2. We chose the
`slowest available Alpha host, to make the livelock
`problem more evident. The source host was a
`DECstation 3000/400, with a SPECint92 rating of
`74.7. We slightly modified its kernel to allow more
`efficient generation of output packets, so that we
`could stress the router-under-test as much as possible.
`In all the trials reported on here, the packet gener-
`ator sent 10000 UDP packets carrying 4 bytes of
`data. This system does not generate a precisely
`paced stream of packets; the packet rates reported are
`averaged over several seconds, and the short-term
`rates varied somewhat from the mean. We calculated
`the delivered packet rate by using the ‘‘netstat’’
`program (on the router machine) to sample the output
`interface count (‘‘Opkts’’) before and after each trial.
`We checked, using a network analyzer on the stub
`Ethernet, that this count exactly reports the number of
`packets transmitted on the output interface.
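The forwarding-rate arithmetic is simply the difference between two samples of the interface's output packet count, averaged over the trial; the counter values below are invented for illustration:

```python
# Average forwarding rate from two samples of the ``Opkts'' counter.
def forwarding_rate(opkts_before, opkts_after, trial_seconds):
    """Packets per second forwarded during one trial."""
    return (opkts_after - opkts_before) / trial_seconds

print(forwarding_rate(120_000, 170_000, 10.0))  # 5000.0 pkts/sec
```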
`
`6.2. Measurements of an unmodified kernel
`We started by measuring the performance of the
`unmodified operating system, as shown in figure 6-1.
`Each mark represents one trial. The filled circles
`show kernel-based forwarding performance, and the
`open squares show performance using the screend
`program [7], used in some firewalls to screen out un-
`wanted packets. This user-mode program does one
`system call per packet; the packet-forwarding path
`includes both kernel and user-mode code. In this
`case, screend was configured to accept all packets.
`
`
`
`
[Figure 6-1: Forwarding performance of unmodified kernel.
Output packet rate (pkts/sec, 0 to 5000) plotted against
input packet rate (pkts/sec, 0 to 12000), with and
without the screend program.]
`
`From these tests, it was clear that with screend
`running, the router suffered from poor overload be-
`havior at rates above 2000 packets/sec., and complete
`livelock set in at about 6000 packets/sec. Even with-
`out screend, the router peaked at 4700 packets/sec.,
`and would probably livelock somewhat below the
`maximum Ethernet packet rate of about 14,880
`packets/second.
`
`6.3. Why livelock occurs in the 4.2BSD model
`4.2BSD follows the model described in section
`4.1, and depicted in figure 6-2. The device driver
`runs at interrupt priority level (IPL) = SPLIMP, and
`the IP layer runs via a software interrupt at IPL =
`SPLNET, which is lower than SPLIMP. The queue
`between the driver and the IP code is named
`‘‘ipintrq,’’ and each output interface is buffered by a
`queue of its own. All queues have length limits;
`excess packets are dropped. Device drivers in this
`system implement interrupt batching, so at high input
`rates very few interrupts are actually taken.
`Digital UNIX follows a similar model, with the IP
`layer running as a separately scheduled thread at IPL
`= 0, instead of as a software interrupt handler.
`It is now quite obvious why the system suffers
`from receive livelock. Once the input rate exceeds
`the rate at which the device driver can pull new pack-
`ets out of the interface and add them to the IP input
`queue, the IP code never runs. Thus, it never
`removes packets from its queue (ipintrq), which fills
`up, and all subsequent received packets are dropped.
`The system’s CPU resources are saturated be-
`cause it discards each packet after a lot of CPU time
`has been invested in it at elevated IPL. This is
`foolish; once a packet has made its way through the
`device driver, it represents an investment and should
`be processed to completion if at all possible. In a
`router, this means that the packet should be trans-
`
`mitted on the output interface. When the system is
`overloaded, it should discard packets as early as pos-
`sible (i.e., in the receiving interface), so that dis-
`carded packets do not waste any resources.
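The dynamic just described can be reproduced with a toy simulation: driver work at high IPL always runs first, so once arrivals consume the whole CPU, the IP layer gets no time and throughput collapses. The per-packet costs and rates here are invented, not measurements from this paper.

```python
# Driver processing (high IPL) preempts IP-layer processing, so in each
# second the IP layer only gets whatever CPU time the driver leaves over.
def simulate(arrival_rate, driver_cost, ip_cost, seconds=1.0):
    """Return (forwarded, dropped) packet counts over `seconds` of CPU time."""
    arrivals = int(arrival_rate * seconds)
    driver_time = arrivals * driver_cost           # taken first, at high IPL
    ip_time_left = max(0.0, seconds - driver_time)
    forwarded = min(arrivals, int(ip_time_left / ip_cost))
    return forwarded, arrivals - forwarded

print(simulate(2_000, driver_cost=100e-6, ip_cost=200e-6))   # light load: all forwarded
print(simulate(8_000, driver_cost=100e-6, ip_cost=200e-6))   # overload: heavy drops
print(simulate(10_000, driver_cost=100e-6, ip_cost=200e-6))  # livelock: nothing forwarded
```

Note that every dropped packet still paid the driver-level cost first, which is exactly the wasted work the early-discard argument above is about.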
`
`6.4. Fixing the livelock problem
`We solved the livelock problem by doing as much
`work as possible in a kernel thread, rather than in the
`interrupt handler, and by eliminating the IP input
queue and its associated queue manipulations and
software interrupt (or thread dispatch).
`decide to take a packet from the receiving interface,
`we try not to discard it later on, since this would
`represent wasted effort.
`We also try to carefully ‘‘schedule’’ the work
`done in this thread. It is probably not possible to use
`the system’s real scheduler to control the