`
`Network Working Group John Nagle
`Request For Comments: 896 6 January 1984
` Ford Aerospace and Communications Corporation
`
` Congestion Control in IP/TCP Internetworks
`
`This memo discusses some aspects of congestion control in IP/TCP
`Internetworks. It is intended to stimulate thought and further
`discussion of this topic. While some specific suggestions are
`made for improved congestion control implementation, this memo
`does not specify any standards.
`
` Introduction
`
`Congestion control is a recognized problem in complex networks.
`We have discovered that the Department of Defense's Internet Pro-
tocol (IP), a pure datagram protocol, and Transmission Control
`Protocol (TCP), a transport layer protocol, when used together,
`are subject to unusual congestion problems caused by interactions
`between the transport and datagram layers. In particular, IP
`gateways are vulnerable to a phenomenon we call "congestion col-
`lapse", especially when such gateways connect networks of widely
`different bandwidth. We have developed solutions that prevent
`congestion collapse.
`
`These problems are not generally recognized because these proto-
`cols are used most often on networks built on top of ARPANET IMP
`technology. ARPANET IMP based networks traditionally have uni-
`form bandwidth and identical switching nodes, and are sized with
`substantial excess capacity. This excess capacity, and the abil-
ity of the IMP system to throttle the transmissions of hosts, has
`for most IP / TCP hosts and networks been adequate to handle
`congestion. With the recent split of the ARPANET into two inter-
`connected networks and the growth of other networks with differ-
`ing properties connected to the ARPANET, however, reliance on the
`benign properties of the IMP system is no longer enough to allow
`hosts to communicate rapidly and reliably. Improved handling of
`congestion is now mandatory for successful network operation
`under load.
`
`Ford Aerospace and Communications Corporation, and its parent
`company, Ford Motor Company, operate the only private IP/TCP
`long-haul network in existence today. This network connects four
`facilities (one in Michigan, two in California, and one in Eng-
`land) some with extensive local networks. This net is cross-tied
`to the ARPANET but uses its own long-haul circuits; traffic
`between Ford facilities flows over private leased circuits,
`including a leased transatlantic satellite connection. All
`switching nodes are pure IP datagram switches with no node-to-
`node flow control, and all hosts run software either written or
`heavily modified by Ford or Ford Aerospace. Bandwidth of links
`in this network varies widely, from 1200 to 10,000,000 bits per
`second. In general, we have not been able to afford the luxury
`of excess long-haul bandwidth that the ARPANET possesses, and our
`long-haul links are heavily loaded during peak periods. Transit
`times of several seconds are thus common in our network.
`
`
`Because of our pure datagram orientation, heavy loading, and wide
`variation in bandwidth, we have had to solve problems that the
`ARPANET / MILNET community is just beginning to recognize. Our
`network is sensitive to suboptimal behavior by host TCP implemen-
`tations, both on and off our own net. We have devoted consider-
`able effort to examining TCP behavior under various conditions,
`and have solved some widely prevalent problems with TCP. We
`present here two problems and their solutions. Many TCP imple-
`mentations have these problems; if throughput is worse through an
`ARPANET / MILNET gateway for a given TCP implementation than
`throughput across a single net, there is a high probability that
`the TCP implementation has one or both of these problems.
`
` Congestion collapse
`
`Before we proceed with a discussion of the two specific problems
`and their solutions, a description of what happens when these
`problems are not addressed is in order. In heavily loaded pure
`datagram networks with end to end retransmission, as switching
`nodes become congested, the round trip time through the net
`increases and the count of datagrams in transit within the net
`also increases. This is normal behavior under load. As long as
`there is only one copy of each datagram in transit, congestion is
`under control. Once retransmission of datagrams not yet
`delivered begins, there is potential for serious trouble.
`
`Host TCP implementations are expected to retransmit packets
`several times at increasing time intervals until some upper limit
`on the retransmit interval is reached. Normally, this mechanism
`is enough to prevent serious congestion problems. Even with the
`better adaptive host retransmission algorithms, though, a sudden
`load on the net can cause the round-trip time to rise faster than
the sending host's measurements of round-trip time can be updated.
Such a load occurs when a new bulk transfer, such as a file
`transfer, begins and starts filling a large window. Should the
`round-trip time exceed the maximum retransmission interval for
`any host, that host will begin to introduce more and more copies
`of the same datagrams into the net. The network is now in seri-
`ous trouble. Eventually all available buffers in the switching
`nodes will be full and packets must be dropped. The round-trip
`time for packets that are delivered is now at its maximum. Hosts
`are sending each packet several times, and eventually some copy
`of each packet arrives at its destination. This is congestion
`collapse.
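
As an illustration only, the retransmission behavior just described
can be sketched in a few lines of C. The doubling rule and the
particular interval and ceiling values below are assumptions chosen
for the sketch; real implementations pick these parameters in
various ways.

    /* Sketch of host retransmission backoff: the retransmit
     * interval grows on each attempt up to a fixed ceiling.  The
     * values and the doubling rule are illustrative assumptions.
     */
    #include <stdio.h>

    #define RTO_INITIAL_MS  1000      /* assumed first interval      */
    #define RTO_MAX_MS     60000      /* assumed ceiling on interval */

    long next_rto(long rto_ms)
    {
        long next = rto_ms * 2;                /* back off           */
        return next > RTO_MAX_MS ? RTO_MAX_MS : next;
    }

    int main(void)
    {
        /* Once the round-trip time through the net exceeds the
         * ceiling, every datagram is retransmitted even though the
         * original copy is still in transit: the situation that
         * leads to collapse.
         */
        for (long rto = RTO_INITIAL_MS; ; rto = next_rto(rto)) {
            printf("retransmit after %ld ms\n", rto);
            if (rto == RTO_MAX_MS)
                break;
        }
        return 0;
    }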
`
`This condition is stable. Once the saturation point has been
`reached, if the algorithm for selecting packets to be dropped is
`fair, the network will continue to operate in a degraded condi-
`tion. In this condition every packet is being transmitted
`several times and throughput is reduced to a small fraction of
`normal. We have pushed our network into this condition experi-
`mentally and observed its stability. It is possible for round-
`trip time to become so large that connections are broken because
`the hosts involved time out.
`
`Congestion collapse and pathological congestion are not normally
`seen in the ARPANET / MILNET system because these networks have
`substantial excess capacity. Where connections do not pass
through IP gateways, the IMP-to-host flow control mechanisms usu-
`ally prevent congestion collapse, especially since TCP implemen-
`tations tend to be well adjusted for the time constants associ-
`ated with the pure ARPANET case. However, other than ICMP Source
`Quench messages, nothing fundamentally prevents congestion col-
`lapse when TCP is run over the ARPANET / MILNET and packets are
`being dropped at gateways. Worth noting is that a few badly-
`behaved hosts can by themselves congest the gateways and prevent
`other hosts from passing traffic. We have observed this problem
`repeatedly with certain hosts (with whose administrators we have
`communicated privately) on the ARPANET.
`
`Adding additional memory to the gateways will not solve the prob-
`lem. The more memory added, the longer round-trip times must
`become before packets are dropped. Thus, the onset of congestion
`collapse will be delayed but when collapse occurs an even larger
`fraction of the packets in the net will be duplicates and
`throughput will be even worse.
`
` The two problems
`
`Two key problems with the engineering of TCP implementations have
`been observed; we call these the small-packet problem and the
`source-quench problem. The second is being addressed by several
`implementors; the first is generally believed (incorrectly) to be
`solved. We have discovered that once the small-packet problem
`has been solved, the source-quench problem becomes much more
`tractable. We thus present the small-packet problem and our
`solution to it first.
`
` The small-packet problem
`
`There is a special problem associated with small packets. When
`TCP is used for the transmission of single-character messages
`originating at a keyboard, the typical result is that 41 byte
`packets (one byte of data, 40 bytes of header) are transmitted
`for each byte of useful data. This 4000% overhead is annoying
`but tolerable on lightly loaded networks. On heavily loaded net-
`works, however, the congestion resulting from this overhead can
`result in lost datagrams and retransmissions, as well as exces-
`sive propagation time caused by congestion in switching nodes and
`gateways. In practice, throughput may drop so low that TCP con-
`nections are aborted.
`
`This classic problem is well-known and was first addressed in the
`Tymnet network in the late 1960s. The solution used there was to
`impose a limit on the count of datagrams generated per unit time.
`This limit was enforced by delaying transmission of small packets
`until a short (200-500ms) time had elapsed, in hope that another
`character or two would become available for addition to the same
`packet before the timer ran out. An additional feature to
`enhance user acceptability was to inhibit the time delay when a
`control character, such as a carriage return, was received.
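
As an illustration, a C sketch of such a timer scheme follows. The
200ms delay, the buffer size, and the routine names are assumptions
made for the sketch; they are not drawn from any particular Telnet
or PAD implementation.

    /* Illustrative sketch of the timer-based batching scheme.
     * Typed characters accumulate in a buffer and are flushed when
     * a fixed delay expires or when a control character arrives.
     */
    #include <stdio.h>

    #define FLUSH_DELAY_MS 200         /* assumed 200-500ms delay    */

    static char buf[128];
    static int  buflen = 0;
    static long flush_at_ms = -1;      /* -1: no flush pending       */

    static void send_packet(void)      /* stand-in for transmission  */
    {
        if (buflen > 0)
            printf("send %d byte(s)\n", buflen);
        buflen = 0;
        flush_at_ms = -1;
    }

    void on_character(char c, long now_ms)
    {
        if (buflen < (int)sizeof buf)
            buf[buflen++] = c;
        if (c == '\r' || c == '\n')    /* control char: no delay     */
            send_packet();
        else if (flush_at_ms < 0)
            flush_at_ms = now_ms + FLUSH_DELAY_MS;
    }

    void on_clock_tick(long now_ms)    /* called periodically        */
    {
        if (flush_at_ms >= 0 && now_ms >= flush_at_ms)
            send_packet();
    }

    int main(void)
    {
        on_character('l', 0);
        on_character('s', 120);
        on_clock_tick(250);            /* delay expired: both sent   */
        on_character('\r', 400);       /* carriage return: sent now  */
        return 0;
    }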
`
`This technique has been used in NCP Telnet, X.25 PADs, and TCP
`Telnet. It has the advantage of being well-understood, and is not
`too difficult to implement. Its flaw is that it is hard to come
`up with a time limit that will satisfy everyone. A time limit
`short enough to provide highly responsive service over a 10M bits
`per second Ethernet will be too short to prevent congestion col-
`lapse over a heavily loaded net with a five second round-trip
`time; and conversely, a time limit long enough to handle the
`heavily loaded net will produce frustrated users on the Ethernet.
`
` The solution to the small-packet problem
`
`Clearly an adaptive approach is desirable. One would expect a
`proposal for an adaptive inter-packet time limit based on the
`round-trip delay observed by TCP. While such a mechanism could
`certainly be implemented, it is unnecessary. A simple and
`elegant solution has been discovered.
`
`The solution is to inhibit the sending of new TCP segments when
`new outgoing data arrives from the user if any previously
`transmitted data on the connection remains unacknowledged. This
`inhibition is to be unconditional; no timers, tests for size of
`data received, or other conditions are required. Implementation
`typically requires one or two lines inside a TCP program.
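
A sketch of that test in C follows. The connection record and its
field names are assumptions made for the sketch rather than a
description of any particular TCP; the essential point is the single
comparison that holds back new data while previously sent data
remains unacknowledged.

    /* Sketch of the send-inhibition test.  Names are illustrative. */
    #include <stdio.h>

    struct tcp_conn {
        long snd_una;       /* oldest unacknowledged sequence number */
        long snd_nxt;       /* next sequence number to be sent       */
        int  queued_bytes;  /* user data buffered but not yet sent   */
    };

    /* Called when the user writes new data to the connection.      */
    int should_send_now(const struct tcp_conn *tp)
    {
        if (tp->snd_nxt != tp->snd_una) /* data in flight: hold back */
            return 0;
        return tp->queued_bytes > 0;    /* connection idle: send now */
    }

    int main(void)
    {
        struct tcp_conn c = { 100, 100, 1 }; /* idle, one byte queued */
        printf("idle connection:  send? %d\n", should_send_now(&c));
        c.snd_nxt = 101;                     /* that byte now unacked */
        printf("data outstanding: send? %d\n", should_send_now(&c));
        return 0;
    }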
`
`At first glance, this solution seems to imply drastic changes in
`the behavior of TCP. This is not so. It all works out right in
`the end. Let us see why this is so.
`
`When a user process writes to a TCP connection, TCP receives some
`data. It may hold that data for future sending or may send a
`packet immediately. If it refrains from sending now, it will
`typically send the data later when an incoming packet arrives and
`changes the state of the system. The state changes in one of two
ways: the incoming packet acknowledges old data the distant host
`has received, or announces the availability of buffer space in
`the distant host for new data. (This last is referred to as
`"updating the window"). Each time data arrives on a connec-
`tion, TCP must reexamine its current state and perhaps send some
`packets out. Thus, when we omit sending data on arrival from the
`user, we are simply deferring its transmission until the next
`message arrives from the distant host. A message must always
`arrive soon unless the connection was previously idle or communi-
`cations with the other end have been lost. In the first case,
`the idle connection, our scheme will result in a packet being
`sent whenever the user writes to the TCP connection. Thus we do
`not deadlock in the idle condition. In the second case, where
`the distant host has failed, sending more data is futile anyway.
`Note that we have done nothing to inhibit normal TCP retransmis-
`sion logic, so lost messages are not a problem.
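
The other half of the mechanism is the ordinary packet-arrival path,
sketched below under the same illustrative assumptions: an incoming
acknowledgment or window update changes the connection state, after
which TCP sends whatever queued data the window now permits.

    /* Sketch of the receive-side trigger.  Names are illustrative. */
    #include <stdio.h>

    struct tcp_conn {
        long snd_una;       /* oldest unacknowledged sequence number */
        long snd_nxt;       /* next sequence number to be sent       */
        long snd_wnd;       /* window offered by the distant host    */
        int  queued_bytes;  /* user data buffered but not yet sent   */
    };

    static void tcp_output(struct tcp_conn *tp)  /* stand-in sender  */
    {
        if (tp->snd_nxt == tp->snd_una && tp->queued_bytes > 0) {
            long n = tp->queued_bytes < tp->snd_wnd
                         ? tp->queued_bytes : tp->snd_wnd;
            printf("send %ld queued byte(s) in one segment\n", n);
            tp->snd_nxt += n;
            tp->queued_bytes -= (int)n;
        }
    }

    /* Called whenever an ACK or window update arrives.             */
    void on_ack_or_window_update(struct tcp_conn *tp, long ack, long wnd)
    {
        if (ack > tp->snd_una)
            tp->snd_una = ack;        /* old data now acknowledged   */
        tp->snd_wnd = wnd;            /* window possibly updated     */
        tcp_output(tp);               /* re-examine state, then send */
    }

    int main(void)
    {
        /* One byte in flight, 24 characters queued behind it.      */
        struct tcp_conn c = { 100, 101, 2048, 24 };
        on_ack_or_window_update(&c, 101, 2048); /* ACK releases them */
        return 0;
    }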
`
`Examination of the behavior of this scheme under various condi-
`tions demonstrates that the scheme does work in all cases. The
`first case to examine is the one we wanted to solve, that of the
`character-oriented Telnet connection. Let us suppose that the
`user is sending TCP a new character every 200ms, and that the
`connection is via an Ethernet with a round-trip time including
`software processing of 50ms. Without any mechanism to prevent
`small-packet congestion, one packet will be sent for each charac-
`ter, and response will be optimal. Overhead will be 4000%, but
`this is acceptable on an Ethernet. The classic timer scheme,
`with a limit of 2 packets per second, will cause two or three
`characters to be sent per packet. Response will thus be degraded
`even though on a high-bandwidth Ethernet this is unnecessary.
`Overhead will drop to 1500%, but on an Ethernet this is a bad
`tradeoff. With our scheme, every character the user types will
`find TCP with an idle connection, and the character will be sent
`at once, just as in the no-control case. The user will see no
`visible delay. Thus, our scheme performs as well as the no-
`control scheme and provides better responsiveness than the timer
`scheme.
`
`The second case to examine is the same Telnet test but over a
`long-haul link with a 5-second round trip time. Without any
`mechanism to prevent small-packet congestion, 25 new packets
`would be sent in 5 seconds.* Overhead here is 4000%. With the
`classic timer scheme, and the same limit of 2 packets per second,
`there would still be 10 packets outstanding and contributing to
`congestion. Round-trip time will not be improved by sending many
`packets, of course; in general it will be worse since the packets
`will contend for line time. Overhead now drops to 1500%. With
`our scheme, however, the first character from the user would find
`an idle TCP connection and would be sent immediately. The next
`24 characters, arriving from the user at 200ms intervals, would
`be held pending a message from the distant host. When an ACK
`arrived for the first packet at the end of 5 seconds, a single
`packet with the 24 queued characters would be sent. Our scheme
`thus results in an overhead reduction to 320% with no penalty in
`response time. Response time will usually be improved with our
`scheme because packet overhead is reduced, here by a factor of
`4.7 over the classic timer scheme. Congestion will be reduced by
this factor and round-trip delay will decrease sharply. For this
case, our scheme has a striking advantage over either of the other
approaches.
`________
` * This problem is not seen in the pure ARPANET case because the
` IMPs will block the host when the count of packets
` outstanding becomes excessive, but in the case where a pure
` datagram local net (such as an Ethernet) or a pure datagram
` gateway (such as an ARPANET / MILNET gateway) is involved, it
` is possible to have large numbers of tiny packets
` outstanding.
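
The overhead figures for this case can be checked with a little
arithmetic, sketched below in C. The 1500% figure for the timer
scheme is taken from the discussion above rather than rederived.

    /* Back-of-the-envelope check of the overhead figures for the
     * 5-second round-trip Telnet case: 25 one-byte characters and
     * 40 bytes of TCP/IP header per packet.
     */
    #include <stdio.h>

    int main(void)
    {
        const double header = 40.0, chars = 25.0;

        /* No control: one packet per character.                    */
        double naive = 100.0 * (chars * header) / chars;  /* 4000%  */

        /* Our scheme: one packet carrying the first character and
         * a second carrying the other 24 after the first ACK.      */
        double ours  = 100.0 * (2.0 * header) / chars;    /*  320%  */

        printf("no control: %4.0f%% overhead\n", naive);
        printf("our scheme: %4.0f%% overhead\n", ours);
        printf("improvement over timer scheme: %.1fx\n", 1500.0 / ours);
        return 0;
    }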
`
`We use our scheme for all TCP connections, not just Telnet con-
`nections. Let us see what happens for a file transfer data con-
`nection using our technique. The two extreme cases will again be
`considered.
`
`As before, we first consider the Ethernet case. The user is now
`writing data to TCP in 512 byte blocks as fast as TCP will accept
`them. The user's first write to TCP will start things going; our
`first datagram will be 512+40 bytes or 552 bytes long. The
`user's second write to TCP will not cause a send but will cause
`the block to be buffered. Assume that the user fills up TCP's
`outgoing buffer area before the first ACK comes back. Then when
`the ACK comes in, all queued data up to the window size will be
`sent. From then on, the window will be kept full, as each ACK
`initiates a sending cycle and queued data is sent out. Thus,
`after a one round-trip time initial period when only one block is
`sent, our scheme settles down into a maximum-throughput condi-
`tion. The delay in startup is only 50ms on the Ethernet, so the
`startup transient is insignificant. All three schemes provide
`equivalent performance for this case.
`
`Finally, let us look at a file transfer over the 5-second round
`trip time connection. Again, only one packet will be sent until
`the first ACK comes back; the window will then be filled and kept
`full. Since the round-trip time is 5 seconds, only 512 bytes of
`data are transmitted in the first 5 seconds. Assuming a 2K win-
`dow, once the first ACK comes in, 2K of data will be sent and a
`steady rate of 2K per 5 seconds will be maintained thereafter.
`Only for this case is our scheme inferior to the timer scheme,
`and the difference is only in the startup transient; steady-state
`throughput is identical. The naive scheme and the timer scheme
`would both take 250 seconds to transmit a 100K byte file under
`the above conditions and our scheme would take 254 seconds, a
`difference of 1.6%.
`
`Thus, for all cases examined, our scheme provides at least 98% of
`the performance of both other schemes, and provides a dramatic
`improvement in Telnet performance over paths with long round trip
`times. We use our scheme in the Ford Aerospace Software
`Engineering Network, and are able to run screen editors over Eth-
`ernet and talk to distant TOPS-20 hosts with improved performance
`in both cases.
`
` Congestion control with ICMP
`
`Having solved the small-packet congestion problem and with it the
`problem of excessive small-packet congestion within our own net-
`work, we turned our attention to the problem of general conges-
`tion control. Since our own network is pure datagram with no
`node-to-node flow control, the only mechanism available to us
`under the IP standard was the ICMP Source Quench message. With
`careful handling, we find this adequate to prevent serious
`congestion problems. We do find it necessary to be careful about
`the behavior of our hosts and switching nodes regarding Source
`Quench messages.
`
` When to send an ICMP Source Quench
`
`The present ICMP standard* specifies that an ICMP Source Quench
`message should be sent whenever a packet is dropped, and addi-
`tionally may be sent when a gateway finds itself becoming short
`of resources. There is some ambiguity here but clearly it is a
`violation of the standard to drop a packet without sending an
ICMP message.
________
 * ARPANET RFC 792 is the present standard.  We are advised by
   the Defense Communications Agency that the description of
   ICMP in MIL-STD-1777 is incomplete and will be deleted from
   future revision of that standard.
`
`Our basic assumption is that packets ought not to be dropped dur-
`ing normal network operation. We therefore want to throttle
`senders back before they overload switching nodes and gateways.
`All our switching nodes send ICMP Source Quench messages well
`before buffer space is exhausted; they do not wait until it is
`necessary to drop a message before sending an ICMP Source Quench.
`As demonstrated in our analysis of the small-packet problem,
`merely providing large amounts of buffering is not a solution.
`In general, our experience is that Source Quench should be sent
`when about half the buffering space is exhausted; this is not
`based on extensive experimentation but appears to be a reasonable
`engineering decision. One could argue for an adaptive scheme
`that adjusted the quench generation threshold based on recent
`experience; we have not found this necessary as yet.
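
A C sketch of this rule follows. The buffer accounting structure,
the threshold test, and the stand-in quench routine are assumptions
made for the sketch; they illustrate the policy rather than our
actual switching node code.

    /* Sketch of the quench-generation policy: warn the originator
     * once about half the buffer pool is in use, well before
     * anything must be dropped.  Names are illustrative.
     */
    #include <stdio.h>

    struct buffer_pool {
        int total;       /* buffers in the switching node            */
        int in_use;      /* buffers holding packets awaiting output  */
    };

    static void icmp_source_quench(unsigned long src)  /* stand-in   */
    {
        printf("send Source Quench to host %#lx\n", src);
    }

    /* Called as each arriving packet is queued for forwarding.      */
    void on_packet_queued(struct buffer_pool *bp, unsigned long src)
    {
        bp->in_use++;
        if (2 * bp->in_use >= bp->total)   /* about half exhausted   */
            icmp_source_quench(src);
    }

    int main(void)
    {
        struct buffer_pool bp = { 40, 19 };    /* nearly half full   */
        on_packet_queued(&bp, 0x0A000001UL);   /* 20 of 40: quench   */
        return 0;
    }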
`
`There exist other gateway implementations that generate Source
`Quenches only after more than one packet has been discarded. We
`consider this approach undesirable since any system for control-
`ling congestion based on the discarding of packets is wasteful of
`bandwidth and may be susceptible to congestion collapse under
`heavy load. Our understanding is that the decision to generate
`Source Quenches with great reluctance stems from a fear that ack-
`nowledge traffic will be quenched and that this will result in
`connection failure. As will be shown below, appropriate handling
`of Source Quench in host implementations eliminates this possi-
`bility.
`
` What to do when an ICMP Source Quench is received
`
`We inform TCP or any other protocol at that layer when ICMP
`receives a Source Quench. The basic action of our TCP implemen-
`tations is to reduce the amount of data outstanding on connec-
`tions to the host mentioned in the Source Quench. This control is
applied by causing the sending TCP to behave as if the distant
host's window size had been reduced. Our first implementation
was simplistic but effective; once a Source Quench has been
received, our TCP behaves as if the window size is zero whenever
the window isn't empty. This behavior continues until some
number (at present 10) of ACKs have been received, at which time
`TCP returns to normal operation.* David Mills of Linkabit Cor-
`poration has since implemented a similar but more elaborate
`throttle on the count of outstanding packets in his DCN systems.
`The additional sophistication seems to produce a modest gain in
`throughput, but we have not made formal tests. Both implementa-
tions effectively prevent congestion collapse in switching nodes.
________
 * This follows the control engineering dictum "Never bother
   with proportional control unless bang-bang doesn't work".
`
Source Quench thus has the effect of restricting the connection to
a small number (perhaps one) of outstanding messages. Communication
can continue, but at a reduced rate; that is exactly the effect
desired.
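
A rough C sketch of the simple behavior described above follows.
The count of ten ACKs comes from the description; the connection
record and routine names are assumptions made for the sketch.

    /* Sketch of the simple quench response: act as if the offered
     * window were zero until ten ACKs have arrived, then resume.
     */
    #include <stdio.h>

    struct tcp_conn {
        long snd_wnd;          /* window offered by the distant host */
        int  quench_acks_left; /* >0 while throttled by a quench     */
    };

    void on_source_quench(struct tcp_conn *tp)
    {
        tp->quench_acks_left = 10;  /* throttle for the next 10 ACKs */
    }

    void on_ack_received(struct tcp_conn *tp)
    {
        if (tp->quench_acks_left > 0)
            tp->quench_acks_left--; /* count down toward recovery    */
    }

    /* Window used when deciding how much NEW data may be sent.
     * Acknowledgments and retransmissions are not limited by it.    */
    long usable_window(const struct tcp_conn *tp)
    {
        return tp->quench_acks_left > 0 ? 0 : tp->snd_wnd;
    }

    int main(void)
    {
        struct tcp_conn c = { 2048, 0 };
        on_source_quench(&c);
        printf("window right after quench: %ld\n", usable_window(&c));
        for (int i = 0; i < 10; i++)
            on_ack_received(&c);
        printf("window after ten ACKs:     %ld\n", usable_window(&c));
        return 0;
    }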
`
`This scheme has the important property that Source Quench doesn't
`inhibit the sending of acknowledges or retransmissions. Imple-
`mentations of Source Quench entirely within the IP layer are usu-
`ally unsuccessful because IP lacks enough information to throttle
`a connection properly. Holding back acknowledges tends to pro-
`duce retransmissions and thus unnecessary traffic. Holding back
`retransmissions may cause loss of a connection by a retransmis-
`sion timeout. Our scheme will keep connections alive under
`severe overload but at reduced bandwidth per connection.
`
`Other protocols at the same layer as TCP should also be respon-
`sive to Source Quench. In each case we would suggest that new
`traffic should be throttled but acknowledges should be treated
`normally. The only serious problem comes from the User Datagram
`Protocol, not normally a major traffic generator. We have not
`implemented any throttling in these protocols as yet; all are
`passed Source Quench messages by ICMP but ignore them.
`
` Self-defense for gateways
`
`As we have shown, gateways are vulnerable to host mismanagement
of congestion. A host that misbehaves by generating excessive
traffic can not only prevent its own traffic from getting through,
but can also interfere with other, unrelated traffic. The problem can
`be dealt with at the host level but since one malfunctioning host
`can interfere with others, future gateways should be capable of
`defending themselves against such behavior by obnoxious or mali-
`cious hosts. We offer some basic self-defense techniques.
`
`On one occasion in late 1983, a TCP bug in an ARPANET host caused
`the host to frantically generate retransmissions of the same
`datagram as fast as the ARPANET would accept them. The gateway
`that connected our net with the ARPANET was saturated and little
`useful traffic could get through, since the gateway had more
`bandwidth to the ARPANET than to our net. The gateway busily
`sent ICMP Source Quench messages but the malfunctioning host
`ignored them. This continued for several hours, until the mal-
`functioning host crashed. During this period, our network was
`effectively disconnected from the ARPANET.
`
`When a gateway is forced to discard a packet, the packet is
`selected at the discretion of the gateway. Classic techniques
`for making this decision are to discard the most recently
`received packet, or the packet at the end of the longest outgoing
`queue. We suggest that a worthwhile practical measure is to dis-
`card the latest packet from the host that originated the most
`packets currently queued within the gateway. This strategy will
`tend to balance throughput amongst the hosts using the gateway.
`We have not yet tried this strategy, but it seems a reasonable
`starting point for gateway self-protection.
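
A C sketch of the selection rule follows; the per-host accounting
structure and the addresses shown are assumptions made for the
sketch.

    /* Sketch of the suggested discard policy: when a packet must
     * be dropped, drop the newest packet from whichever host has
     * the most packets queued in the gateway.
     */
    #include <stdio.h>

    struct host_count {
        unsigned long src;   /* originating host address             */
        int queued;          /* packets from this host now queued    */
    };

    /* Choose the host whose newest queued packet is discarded.      */
    unsigned long pick_discard_victim(const struct host_count *h, int n)
    {
        int worst = 0;
        for (int i = 1; i < n; i++)
            if (h[i].queued > h[worst].queued)
                worst = i;
        return h[worst].src;
    }

    int main(void)
    {
        struct host_count hosts[] = { { 0x0A000001UL,  3 },
                                      { 0x0A000002UL, 17 },
                                      { 0x0A000003UL,  5 } };
        printf("discard newest packet from host %#lx\n",
               pick_discard_victim(hosts, 3));
        return 0;
    }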
`
`Another strategy is to discard a newly arrived packet if the
`packet duplicates a packet already in the queue. The computa-
`tional load for this check is not a problem if hashing techniques
`are used. This check will not protect against malicious hosts
`but will provide some protection against TCP implementations with
`poor retransmission control. Gateways between fast local net-
`works and slower long-haul networks may find this check valuable
`if the local hosts are tuned to work well with the local network.
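
One way to perform the check is sketched below in C. The fields
chosen to identify a retransmission, the table size, and the hash
are all assumptions made for the sketch; a real gateway would also
remove entries as packets leave the queue, which the sketch omits.

    /* Sketch of duplicate suppression by hashing: a newly arrived
     * packet whose identifying fields match an entry recorded for
     * the queue is discarded.
     */
    #include <stdbool.h>
    #include <stdio.h>

    #define TABLE_SIZE 256

    struct pkt_key {
        unsigned long src, dst;  /* source and destination addresses */
        unsigned int  proto, id; /* protocol and IP identification   */
    };

    static struct pkt_key table[TABLE_SIZE];
    static bool           used[TABLE_SIZE];

    static unsigned hash_key(const struct pkt_key *k)
    {
        return (unsigned)((k->src ^ k->dst ^ k->proto ^ k->id)
                          % TABLE_SIZE);
    }

    /* Returns true if the packet duplicates one already queued.     */
    bool is_duplicate(const struct pkt_key *k)
    {
        unsigned h = hash_key(k);
        if (used[h] &&
            table[h].src == k->src && table[h].dst == k->dst &&
            table[h].proto == k->proto && table[h].id == k->id)
            return true;
        table[h] = *k;            /* remember this packet             */
        used[h] = true;
        return false;
    }

    int main(void)
    {
        struct pkt_key k = { 0x0A000001UL, 0x0A000002UL, 6, 1234 };
        printf("first copy duplicate?  %d\n", is_duplicate(&k));
        printf("retransmitted copy?    %d\n", is_duplicate(&k));
        return 0;
    }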
`
`Ideally the gateway should detect malfunctioning hosts and
`squelch them; such detection is difficult in a pure datagram sys-
`tem. Failure to respond to an ICMP Source Quench message,
`though, should be regarded as grounds for action by a gateway to
`disconnect a host. Detecting such failure is non-trivial but is
`a worthwhile area for further research.
`
` Conclusion
`
`The congestion control problems associated with pure datagram
`networks are difficult, but effective solutions exist. If IP /
`TCP networks are to be operated under heavy load, TCP implementa-
`tions must address several key issues in ways at least as effec-
`tive as the ones described here.