`
`Christopher A. Kent
Jeffrey C. Mogul
`
`Digital Equipment Corporation
`Western Research Lab
`
(Originally published in Proc. SIGCOMM '87, Vol. 17, No. 5, October 1987)
`
`Abstract
`
`Internetworks can be built from many different kinds of networks, with varying limits on
`maximum packet size. Throughput is usually maximized when the largest possible packet is
`sent; unfortunately, some routes can carry only very small packets. The IP protocol allows a
`gateway to fragment a packet if it is too large to be transmitted. Fragmentation is at best a
`necessary evil; it can lead to poor performance or complete communication failure. There are a
`variety of ways to reduce the likelihood of fragmentation; some can be incorporated into exist-
`ing IP implementations without changes in protocol specifications. Others require new
`protocols, or modifications to existing protocols.
`
`1. Introduction
`
Internetworks built of heterogeneous networks are
valuable because they insulate higher-level protocols
from changes in network technology, because they allow
universal communication without the expense of
constructing a homogeneous universal infrastructure,
and because they allow the use of different network
technologies as appropriate to both local-area and
long-haul links. Most datagram networks set a maximum
limit on the size of packets they carry, to simplify
packet buffering in the nodes and to limit how long one
packet can tie up the link. In a heterogeneous internet
such as the DARPA IP Internet, these packet-size limits,
known as MTUs (for maximum transmission unit), vary
widely, from 254 bytes for Packet Radio networks to
2000 bytes for the Wideband Satellite Network [22];
since nobody knows exactly what is connected to the
Internet, the range in MTUs may be even broader.
`
In general, it is better to use a few large packets instead
of many small packets to carry a given amount of data,
because much of the cost of packetized communication
is per-packet rather than per-byte. On a high-speed
LAN, throughput can increase almost linearly with
packet size over a wide range of sizes. Therefore, we
prefer to make our packets as large as possible.
`
This desire for large packets conflicts with the variation
in MTUs across an internet. We want to send large
packets, but some network along the packets' path may
not be able to carry them. One approach to this dilemma
is fragmentation: when a node must transmit a packet
that is larger than the MTU of the network, it breaks the
packet into several smaller fragments and sends them
instead. If the fragments are all sent along the same data
link and are immediately reassembled at the next node,
this is called transparent or intra-network
fragmentation. If the fragments are allowed to follow
independent routes, and are reassembled only upon
reaching their ultimate destination, this is called
inter-network fragmentation. A good discussion of both
methods, in more detail, may be found in Shoch [23].
`
In this paper, drawing on experience with a large
heterogeneous internetwork, we examine fragmentation
in the context of the IP protocol [18]. IP supports the
use of inter-network fragmentation. (Transparent
fragmentation may also be used, as long as it is
invisible to the IP layer.) Fragmentation appears at first
to be an elegant solution to the problem, but subtle
complications arise in real networks that can result in
poor performance or even total communication failure.
`
Experience with inter-network fragmentation in the IP
Internet has convinced us that it is something to avoid.
In section 2 we compare the advantages and
disadvantages of fragmentation, in order to justify this
assertion. We then discuss, in section 3, a variety of
schemes for avoiding or recovering from fragmentation.
`
`ACM SIGCOMM
`
`-75-
`
`Computer Communication Review
`
`INTEL EX. 1420.001
`
`
`2. What is wrong with fragmentation?
`
The arguments in favor of fragmentation are
straightforward. Fragmentation allows higher level
protocols to be unconcerned with the characteristics of
the transmission channel, and to send data in conveniently
sized pieces. Sending larger quantities of data in each IP
datagram minimizes the bookkeeping overhead
associated with managing the data. (See section 3.5 for a
specific example.)
`
Fragmentation allows the source host to deal with routes
having different MTUs without having to know what
path packets are taking. The safest strategy is for the
source to send very small datagrams, at a great loss of
efficiency. Fragmentation allows the source to choose a
size that is “reasonable” and, when that size proves to
be too large, provides a mechanism that allows data to
continue to get through.
`
Finally, fragmentation allows protocols to optimize
performance for high bandwidth connections. Emerging
network technologies have larger and larger MTUs.
Most local networks have MTUs large enough to send
1024 bytes of user data plus associated overhead in a
single packet; new technologies will allow ten times
that. Fragmentation provides a mechanism for deciding
the actual packet size as late as possible. It especially
allows protocols to avoid choosing to send small
datagrams until absolutely necessary. Protocols can
choose large segment sizes to take advantage of the
large MTU in a local network, and rely on
fragmentation at gateways to send the segments through
networks with small MTUs when needed. If datagrams must
traverse a route consisting of several high-MTU links
followed by a low-MTU link, by delaying the use of
small packets until the low-MTU link is reached,
fragmentation allows the use of large packets on the
initial high-MTU links, and thus uses those links more
efficiently.
`
The arguments against fragmentation fall into three
categories:

• Fragmentation causes inefficient use of
  resources: Poor choice of fragment sizes can
  greatly increase the cost of delivering a datagram.
  Additional bandwidth is used for the additional
  header information, intermediate gateways must
  expend computational resources to make
  additional routing decisions, and the receiving host
  must reassemble the fragments.

• Loss of fragments leads to degraded
  performance: Reassembly of IP fragments is not
  very robust. Loss of a single fragment requires
  the higher level protocol to retransmit all of the
  data in the original datagram, even if most of the
  fragments were received correctly.

• Efficient reassembly is hard: Given the
  likelihood of lost fragments and the information
  present in the IP header, there are many situations
  in which the reassembly process, though
  straightforward, yields lower than desired performance.
`
`2.1. An overview of fragmentation in IP
`
`IP is a protocol providing unreliable delivery of
`datagrams. IP datagrams are encapsulated in network-
`specific packets. Gateways may fragment an incoming
`packet if it will not fit in a single outgoing packet; in
`this case, each fragment is sent as a separate packet.
The IP header contains several fields that are used to
`manage fragmentation [18]:
`
• Identification: A 16-bit field assigned by the
  sender to aid in assembling the fragments of a
  datagram. The tuple (source, destination,
  protocol, identification) for a given datagram must be
  unique over all existing datagrams. When a
  packet is fragmented, the value of the
  Identification field of the original packet is copied into
  each fragment.

• Time to live (TTL): An 8-bit field that specifies
  the maximum time, measured in seconds, that the
  packet may remain in the Internet system. If TTL
  contains the value zero, the packet must be
  discarded. The TTL must be decreased by at least
  one every time the packet passes through a
  gateway, even if the time required to process the
  packet is less than a second. Thus, the TTL field
  is an upper bound on packet lifetime.

• Fragment offset: A 13-bit field that identifies the
  fragment location, relative to the beginning of the
  original, unfragmented datagram. Fragment
  offsets are in units of 8 bytes.

• More fragments: A 1-bit field that indicates
  whether or not this is the last fragment of the
  datagram.
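
As an illustration of where these fields sit, the following Python
sketch pulls them out of a raw header. It assumes the standard
20-byte IP header layout with no options; the function name
parse_frag_fields and the dictionary keys are hypothetical names
chosen for this example, not part of any implementation.

```python
import struct

def parse_frag_fields(header: bytes) -> dict:
    """Extract the fragmentation-related fields from a 20-byte IP header."""
    if len(header) < 20:
        raise ValueError("need at least 20 header bytes")
    # Bytes 4-5 hold Identification; bytes 6-7 hold 3 flag bits
    # followed by the 13-bit Fragment Offset.
    ident, flags_and_offset = struct.unpack("!HH", header[4:8])
    # Byte 8 is Time to Live, byte 9 is Protocol.
    ttl, protocol = struct.unpack("!BB", header[8:10])
    return {
        "identification": ident,
        "more_fragments": bool(flags_and_offset & 0x2000),
        # The offset is carried in units of 8 bytes.
        "offset_bytes": (flags_and_offset & 0x1FFF) * 8,
        "ttl": ttl,
        "protocol": protocol,
        "source": header[12:16],
        "destination": header[16:20],
    }
```

The tuple (source, destination, protocol, identification) from the
returned dictionary is what a receiver would use to group fragments
of one datagram.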
`
The reassembly process consists of matching the
protocol and identification fields of incoming fragments
with those of fragments already held, and coalescing the
data into complete datagrams. Fragments must be
discarded if their TTL expires while they are held for
reassembly. (For more details of the reassembly
algorithm, see [5].)
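
The matching-and-coalescing step can be sketched in a few lines of
Python. This is a toy model, not the algorithm of [5]: the class name
Reassembler is hypothetical, fragments are assumed not to overlap, and
a fixed timeout (an arbitrary 15 seconds here) stands in for the
TTL-based discard described above.

```python
class Reassembler:
    """Toy IP reassembly: collect fragments until a datagram is complete."""

    def __init__(self):
        # key -> {"parts": {offset: data}, "total": int or None, "deadline": float}
        self.pending = {}

    def add(self, key, offset, data, more_fragments, now, timeout=15.0):
        """Store one fragment; return the whole datagram once it is complete.

        key is the (source, destination, protocol, identification) tuple.
        """
        self.expire(now)
        entry = self.pending.setdefault(
            key, {"parts": {}, "total": None, "deadline": now + timeout})
        entry["parts"][offset] = data
        if not more_fragments:
            # Only the last fragment reveals the total length.
            entry["total"] = offset + len(data)
        if entry["total"] is None:
            return None
        # Walk the fragments from offset 0, looking for a hole.
        pos, chunks = 0, []
        while pos < entry["total"]:
            chunk = entry["parts"].get(pos)
            if not chunk:
                return None  # a fragment is still missing: keep waiting
            chunks.append(chunk)
            pos += len(chunk)
        del self.pending[key]
        return b"".join(chunks)

    def expire(self, now):
        """Discard every partial datagram whose timer has run out."""
        for key in [k for k, e in self.pending.items() if e["deadline"] <= now]:
            del self.pending[key]
```

Note that nothing can be handed to the higher layer until every
fragment has arrived; a single missing fragment keeps all the others
in buffers until the timer expires.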
`
Higher level protocols such as TCP (Transmission
Control Protocol) [19] use IP as a basis to implement a
reliable connection between two client processes.
Portions of the data stream known as segments are sent
in individual IP datagrams, along with control
information used by the cooperating TCP processes to ensure
reliable communication. In particular, TCP uses a
sequence number that covers individual bytes in the
data stream, and an acknowledgment mechanism that
allows the receiving process to tell the sender “I have
correctly received all data up to and including sequence
number n.”
`
2.2. Fragmentation causes inefficient resource usage
`
Consider the costs associated with sending a packet.
Each time it passes through a gateway, there is some
constant computational overhead to make routing
decisions, modify the packet header, compute the new
checksum, and move the packet between the appropriate
incoming and outgoing queues. In addition, a portion of
the available bandwidth on the incoming and outgoing
interfaces is consumed. In many cases, the constant
computational overhead dominates the cost. Input and
output may be overlapped using DMA devices; in a
typical uniprocessor gateway, there is no way to
parallelize the computational overhead.
`
Fragmenting at an IP gateway, rather than having the
host choose the appropriate segment size to avoid
fragmentation, may lead to suboptimal use of gateway
resources and network bandwidth. Consider a TCP
process that tries to send 1024 data bytes across a route
that includes the ARPAnet, which has an MTU of 1006
bytes. The IP and TCP headers are at least 40 bytes
long, leading to a total unfragmented IP datagram 1064
bytes in length. To cross the ARPAnet, this will be
broken into a 1006 byte fragment, followed by a 78 byte
fragment. These short fragments amortize the fixed
overhead per ARPAnet packet over very few bytes of
data, and the total packet count is much higher than
needed. If the sending TCP instead chooses segments
that fit in a 1006 byte ARPAnet packet, the total packet
count is minimized, and the total overhead is as low as
possible.
`
For example, consider sending 10 Kbytes of data.
Sending 1024-byte TCP segments generates 10 IP
datagrams, each 1064 bytes long. Each datagram is
fragmented into two ARPAnet packets, one 1006 bytes
long and the other 78 bytes, for a total of 20 packets. If
the originating TCP instead sends 966 byte segments
(the largest that will fit in a single ARPAnet packet),
only 11 packets are sent.
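
The arithmetic in this example can be checked with a short Python
sketch. The function names and the align parameter are hypothetical.
With align=1 the sketch reproduces the 1006 and 78 byte figures quoted
in the text; with the default align=8, which reflects fragment offsets
being expressed in units of 8 bytes, the split would instead come out
as 1004 and 80 bytes. The packet counts are the same either way.

```python
def fragment_sizes(datagram_len, mtu, header_len=20, align=8):
    """Sizes of the IP packets produced when a datagram crosses a link."""
    if datagram_len <= mtu:
        return [datagram_len]
    # Every fragment but the last carries a multiple of `align` data bytes.
    per_fragment = (mtu - header_len) // align * align
    if per_fragment <= 0:
        raise ValueError("MTU too small to carry any data")
    payload = datagram_len - header_len
    sizes = []
    while payload > 0:
        chunk = min(per_fragment, payload)
        sizes.append(header_len + chunk)  # each fragment repeats the header
        payload -= chunk
    return sizes

def packets_needed(data_len, segment_len, mtu, tcp_ip_headers=40):
    """Total link packets needed to carry data_len bytes of TCP data."""
    full, rest = divmod(data_len, segment_len)
    segments = [segment_len] * full + ([rest] if rest else [])
    return sum(len(fragment_sizes(s + tcp_ip_headers, mtu)) for s in segments)

print(fragment_sizes(1064, 1006, align=1))   # the split quoted in the text
print(packets_needed(10 * 1024, 1024, 1006)) # 1024-byte segments
print(packets_needed(10 * 1024, 966, 1006))  # 966-byte segments
```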
`
Another limit to utilizing available bandwidth lies in the
interaction of the TTL and Identification fields. Assume
that a reasonable initial value for the TTL field is 32
(the maximum hop count from edge to edge of the
DARPA Internet is currently estimated to be between
15 and 20). If we allow fragmentation, we must ensure
that all datagrams in flight have unique values for the
Identification field. Thus, the maximum datagram rate is
$2^{16}/32$, or 2048 datagrams per second. Current
gateways can forward nearly 1000 packets per second;
high performance workstations and interfaces can
generate packets much more rapidly, and can probably
forward 4000 packets per second. We are certainly
within five years of having commonly available
processor and network technology that pushes against
the limit imposed by the 16-bit Identification field.
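
The bound follows directly from the field widths. The Python sketch
below restates it; the function names are hypothetical, and the
576-byte datagram size used in the throughput line is just an example
value. Since uniqueness is required per (source, destination,
protocol) tuple, the bound is read here as applying to each such
tuple.

```python
def max_datagram_rate(id_bits=16, max_lifetime_s=32):
    """Datagrams per second before Identification values must be reused.

    A value may not be reused while an earlier datagram carrying it could
    still be in flight, i.e. for max_lifetime_s seconds (the initial TTL).
    """
    return 2 ** id_bits / max_lifetime_s

def max_throughput_bps(datagram_bytes, id_bits=16, max_lifetime_s=32):
    """Bits per second achievable at that rate with datagrams of a given size."""
    return max_datagram_rate(id_bits, max_lifetime_s) * datagram_bytes * 8

print(max_datagram_rate())       # datagrams per second
print(max_throughput_bps(576))   # bits per second with 576-byte datagrams
```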
`
This limit implies that, to increase bandwidth in the
presence of fragmentation, hosts should send larger
datagrams, so as to carry more data per value of the
Identification field. This is a bad idea, because large
datagrams lead to more fragments, and we shall show
that this increases the likelihood of a severe decrease in
performance. If we simply avoid fragmented datagrams,
values of the Identification field need not be unique,
and there is no bandwidth limit imposed by its size.
`
`2.3. Poor performance when fragments are lost
`
When segments are sent that are large enough to require
fragmentation, the loss of any fragment requires the
entire segment to be retransmitted. This can lead to
poorer performance than would have been achieved by
originally sending segments that didn't require
fragmentation.
`
Gateways in the Internet must drop packets when
congested. If the gateways are congested, dropping
fragments only makes the situation worse. Dropped
fragments mean increased retransmissions, which leads
to more fragments. As the loss rate goes up due to
heavy congestion, the total throughput drops
dramatically, since the loss of any one fragment means
that the resources expended in sending the other
fragments of that datagram are entirely wasted.
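
A back-of-the-envelope model shows how quickly this compounds. The
Python sketch below assumes, purely for illustration, that every
packet is lost independently with the same probability and that each
retransmission is an independent trial; neither assumption comes from
the text, and the function names are hypothetical.

```python
def delivery_probability(fragments, loss_rate):
    """Chance that all fragments of one datagram arrive (independent losses)."""
    if not 0.0 <= loss_rate < 1.0:
        raise ValueError("loss_rate must be in [0, 1)")
    return (1.0 - loss_rate) ** fragments

def expected_transmissions(fragments, loss_rate):
    """Expected number of times the whole datagram must be sent."""
    return 1.0 / delivery_probability(fragments, loss_rate)

# Compare an unfragmented datagram with ones split into 2 and 8 fragments
# at an assumed 10% per-packet loss rate.
for n in (1, 2, 8):
    print(n, delivery_probability(n, 0.10), expected_transmissions(n, 0.10))
```

Under these assumptions the delivery probability falls geometrically
with the number of fragments, so each additional fragment raises the
expected number of whole-datagram retransmissions.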
`
Even when congestion is not the problem,
retransmission does not necessarily increase the likelihood that all
the fragments that make up the segment will arrive
unscathed. In particular, network idiosyncrasies may
conspire to cause the same fragment or fragments to be
lost on successive retransmissions. We call this
deterministic fragment loss.
`
`An example of deterministic fragment loss occurs in the
4.2BSD Unix implementation of TCP when datagrams
`pass between a local network (typically an Ethernet or a
`Proteon ring, with MTUs of 1500 or 2046 bytes,
`respectively) and the ARPAnet. The TCP prefers to
`send 1024 byte data segments, which are transmitted in
`1064 byte IP datagrams. As seen earlier, this results in
`two fragments, 1006 and 78 bytes long.
`
The receiving gateway receives both fragments and
sends them out over the local Proteon ring. The Proteon
ring interface does not have sufficient buffering to
receive back-to-back packets, so it consistently drops
the second fragment. The sending TCP times out, and
retransmits the 1024 byte segment, which will again be
fragmented. The second fragment is again lost, the
segment times out, and eventually the connection is
broken.
`
In addition, many of the gateways in the Internet today
are derived from 4.2BSD Unix. This implementation of
IP does not properly fragment a previously fragmented
packet, preventing some fragments from ever reaching
their destination, which might better be called
guaranteed fragment loss.
`
`2.4. Efficient reassembly is difficult
`
`Reassembling fragments into datagrams at the IP layer
`is considerably less robust than constructing a reliable
stream at the TCP layer. The window mechanism in
`TCP allows the reassembly process to accurately gauge
`how much buffer space to allocate for the current
`stream of unacknowledged data bytes. Also, because in
`TCP the data stream is covered by a sequence number
`for each data byte, once a contiguous sequence of bytes
`at the beginning of the outstanding data stream has been
`reassembled, it can be acknowledged and handed up to
`the next layer. Thus, progress can always be made, even
`if in small amounts.
`
`At the IP layer, there is no indication in the header of a
`fragmented packet of how many other fragments follow,
`or of the length of the entire datagram. The More
Fragments bit tells only if this is the last fragment of the
`datagram, and the Fragment Offset field tells only the
`position of this fragment in the complete datagram. If
`the total size of the incoming datagram is too large to fit
`available buffer space, no progress can be made. The IP
`specification requires hosts to be able to reassemble
`datagrams at least 576 bytes in length; larger segment
`sizes must be explicitly negotiated by higher level
`protocols.
`
`Even if there is sufficient buffer space to reassemble a
`very large datagram, conflicts can occur. In the Internet,
`it is possible for fragments of the same datagram to take
`different routes to their ultimate destination. Depending
`on queue management strategies at gateways along the
`way, a fragment of a small datagram may arrive
`intermixed with the fragments of a large datagram.
More concretely, assume two datagrams, L (large) and
S (small), are fragmented as
$L_1 L_2 L_3 L_4 L_5 L_6 L_7 L_8$ and $S_1 S_2$. If there
are only eight buffers available, and the reception order is
$L_1 L_2 L_3 L_4 L_5 L_6 L_7 S_1 L_8 S_2$, reassembly of
L cannot succeed, despite adequate buffer space. Upon
reception of $S_1$, the reassembly process could discard
$L_1$ through $L_7$, which would leave six free buffers and
allow S to be reassembled when $S_2$ arrives. Or, it could
discard $L_8$ (and subsequently $S_2$), blocking reassembly
of both L and S; the buffers would be kept full until the
fragments expire. In either case, the work done to
transport all the fragments of L is entirely wasted. It is
not possible to coalesce a complete initial string of
fragments and partially acknowledge receipt of the
datagram in order to free some of the buffer space.
(Dave Mills first pointed out this behavior in [13].)
`
It is difficult to decide how long to hold on to received
fragments. The only firm limit is the TTL field; the
reassembly process must discard fragments as their
TTLs expire. Since each gateway decrements the TTL
field, it must be set high enough to traverse the longest
possible route, and thus may still be quite high when the
packet arrives at its destination. Naive use of the
received TTL as a reassembly timeout will cause some
fragments to occupy buffer space for a much longer
time than necessary. Use of too short a reassembly
timeout will cause fragments to be dropped too quickly,
leading to unnecessary retransmissions.
`
`Because IP is a datagram protocol, there is no guarantee
`that a given fragment will ever arrive. A higher level
`protocol may retransmit a lost IP datagram. If a retrans-
`mitted datagram does not have the same value for the IP
Identification field, its data will not be recognized as
`being the same as that in previously received fragments.
`The old fragments will occupy buffer space until timed
`out or forced out by incoming packets, and cannot fill
`holes left by fragments dropped from the second data-
`gram. This suggests that higher level protocols should
`attempt to use the same value for the IP Identification
`on both the original and retransmitted data. (This idea
`was proposed by John Shriver [24].)
`
`3. Avoiding fragmentation
`
We believe that, in most circumstances, the potential
disadvantages of fragmentation far outweigh the
expected advantages. Thus, hosts should avoid sending
datagrams that are so large that they will be fragmented.
The length limit can be determined by a variety of
general approaches:
`
• Always send small datagrams: There is some
  datagram size that is small enough to fit without
  fragmentation on any network; we could simply
  send no datagrams larger than this limit.

• Guess minimum MTU of path: Use a heuristic
  to guess the minimum MTU along the path the
  datagram will follow.

• Discover actual minimum MTU of path: Use a
  protocol to determine the actual minimum MTU
  along the path the datagram will follow.
`
• Guess or discover MTU and backtrack if
  wrong: Since an estimate might be wrong, and a
  discovered MTU may change if a route changes,
  sometimes we may have to adjust the length limit.
  This requires both a mechanism for detecting
  errors, and a mechanism for correcting them.
`
`Later in this section we will discuss more specific
fragmentation avoidance schemes.
`
All these strategies assume that the route the datagrams
will follow is independently determined. If multiple
routes are available between source and destination, one
might instead try to avoid fragmentation by using
source-routing to avoid data links with small MTUs.
Suitable alternate routes seldom exist, however, and
even when they do we see no efficient way for an IP
host to obtain enough information to choose a good
source-route.
`
IP is a layered protocol architecture, and fragmentation
avoidance must be done at the right layer. It makes little
sense to build redundant mechanisms into several layers
if it is possible to do it once. This implies that the right
place for fragmentation avoidance is the layer common
to all IP communication, the IP datagram layer itself
(and its partner, the ICMP protocol). It would be a poor
idea to put the entire fragmentation avoidance
mechanism in, say, the TCP layer, because both the
mechanism and any additional protocol would have to
be duplicated in parallel layers, such as UDP [17],
NETBLT [6], and VMTP [3], and because it would be
awkward for a TCP-based mechanism to share
knowledge with other layers and across connections.
`
This is not to say that layers above IP should be
uninvolved in fragmentation avoidance. Architectural
layering does not mean that higher layers must be kept
ignorant of fragmentation issues. Optimal performance
depends upon cooperation between layers: for example,
the TCP layer should not send huge segments if the IP
layer knows that they will be fragmented.
`
Most of the fragmentation-avoidance schemes we will
propose depend on keeping some knowledge about the
minimum MTU (MINMTU) on the path a datagram will
follow. A MINMTU value could be associated with a
specific destination network, a specific destination host,
a specific route (there may be several routes to one
destination, with differing MINMTUs), or a specific
connection (since for different applications, we may
want to choose between optimizing for maximum
bandwidth versus minimum delay, and thus might want
to accept different risks of fragmentation for different
connections to the same host). The MINMTU values
could be kept in the IP routing database, or in a separate
database, especially if per-connection MINMTUs are
wanted. To support per-connection MINMTUs, the IP
layer must obtain a connection identifier from
connection-oriented higher layers.
`
It is our belief that a per-connection scheme
(degenerating to a per-route-to-specific-host scheme for
connectionless protocols) is the most flexible one.
While it is true that by keeping per-destination-network
information one might be able to pool information
about several hosts, this is not necessarily safe. Because
many networks are subnetted [15], because MTUs may
vary among the subnets of a given network, and because
one cannot tell whether a remote network is subnetted
or not, it is not true that knowing the MINMTU for one
host reliably gives you the MINMTU for all other hosts
on the same network.
`
Routes in a datagram network are not necessarily
symmetric; the route a packet takes may not be the
reverse of the route taken by a packet traveling in the
opposite direction. Because of this, it is not safe for a
host to assume that it can send a datagram as large as
the one it has received from its peer. An independent
MINMTU determination must be made for each
direction, although the peer hosts may assist each other
in doing so.
`
When the IP layer has determined the MINMTU for a
connection or destination, it can make this information
available to higher layers, such as TCP, that are
generating segments to be sent as IP datagrams.
Segment-generating layers should ask the IP layer for a
MINMTU before sending a segment; connection-based
layers should either check periodically that the
MINMTU has not changed, or should be able to handle
asynchronous notification of a change.
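
One way to picture this interface is a small table kept by the IP
layer and queried by segment-generating layers. The Python sketch
below is hypothetical throughout: the class MinMtuTable, its methods,
the 576-byte fallback, and the 40-byte header allowance are
illustrative choices, not an interface defined by any specification.

```python
class MinMtuTable:
    """Hypothetical store of MINMTU estimates kept by the IP layer."""

    def __init__(self, default=576):
        self.default = default      # fallback when nothing is known
        self.by_connection = {}     # connection identifier -> MINMTU
        self.by_host = {}           # destination host -> MINMTU

    def learn(self, host, minmtu, connection=None):
        """Record an estimate for a connection, or for a host if none given."""
        if connection is not None:
            self.by_connection[connection] = minmtu
        else:
            self.by_host[host] = minmtu

    def lookup(self, host, connection=None):
        """Most specific estimate: per-connection, then per-host, then default."""
        if connection in self.by_connection:
            return self.by_connection[connection]
        return self.by_host.get(host, self.default)

def max_segment(table, host, connection=None, headers=40):
    """Largest segment a TCP-like layer should send without fragmentation."""
    return table.lookup(host, connection) - headers
```

A connection-based layer would call max_segment again from time to
time (or on notification) so that a changed MINMTU takes effect.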
`
`3.1. Fragmentation avoidance without protocol
`changes
`
In this section we describe several fragmentation
avoidance schemes that can be implemented without
changing existing protocol specifications or creating
new protocols. There are obvious advantages to such
approaches, since they can be taken immediately by
individual sites or vendors; further, we have sufficient
experience with one of them to believe that it works
fairly well. On the other hand, none of these schemes
can make use of exact knowledge of MINMTUs, and so
may not provide optimal performance.
`
`3.1.1. Always send tiny datagrams
`
If a host always sent datagrams no larger than the
minimum MTU over the entire internet, these datagrams
would never be fragmented. In the IP Internet the limit
is no higher than 254 bytes, and might be lower. Since
almost all of the Internet supports larger MTUs, and
since performance depends so strongly on packet size,
this approach can't provide reasonable performance. It
is worth invoking only as a temporary diagnostic
measure: if performance actually increases when the
datagram size is decreased, this is a clear indication that
inappropriate fragmentation is taking place for larger
datagrams.
`
Alternatively, one might assume that a 576-byte
limit is small enough to avoid fragmentation in virtually
all cases (we hope that in the future, all new IP network
links would be capable of handling packets of this size).
576 bytes is set forth in the IP specification [18] as the
maximum size a host can send without explicit
permission from the receiving host, so it is reasonable
as an arbitrary value.
`
`3.1.2. Send 576-byte datagrams if the route goes via
`a gateway
`
`The IP layer can determine if the route for a connection
`or destination goes via a gateway. If it does, then the
`size limit is set to 576 (our favorite arbitrary value);
`otherwise, any size up to the MTU of the data-link layer
`may be used.
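
The rule is simple enough to state as a function. In the Python
sketch below the names are hypothetical, and capping the limit at the
local link MTU (for a first hop whose MTU is itself below 576) is an
added assumption rather than something stated in the text.

```python
OFF_LINK_LIMIT = 576  # the arbitrary value discussed above

def datagram_size_limit(route_via_gateway: bool, link_mtu: int) -> int:
    """Largest datagram to send, given whether the route uses a gateway."""
    if route_via_gateway:
        # Non-local destination: stay at or below 576 bytes, and never
        # above what the first-hop link itself can carry.
        return min(OFF_LINK_LIMIT, link_mtu)
    # Directly connected destination: use the full data-link MTU.
    return link_mtu

print(datagram_size_limit(False, 1500))  # local Ethernet peer
print(datagram_size_limit(True, 1500))   # destination behind a gateway
```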
`
`This approach provides maximum performance for
`local connections, and reasonable assurance that on
`most non-local connections, datagrams will not be
`fragmented. It is not perfect, since
`
1. It does not avoid fragmentation on every path.

2. It may unnecessarily limit packet size, especially
   on subnetted collections of high-speed LANs that
   all support large packets.

3. If proxy ARP is used [14] then the IP layer may
   be fooled into believing that a non-local path is
   local, and thus use large datagrams when they are
   not necessarily safe.
`
`However, it is quite easy to implement and in general
`provides good performance. A variant of this scheme,
`implemented in the TCP layer, has been used for
`several years at many sites and is now incorporated in
4.3BSD Unix [12]. This is the method we recommend
`in the absence of protocol changes.
`
3.1.3. Send 576-byte datagrams if the route goes off-
`net
`
Instead of checking whether a destination is behind a
gateway, the IP layer can examine the destination's
network number to decide if it is local or non-local. In a
subnetted environment, this trades a higher risk of
guessing too high a MINMTU for higher performance
within the local collection of subnets.
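
A sketch of the network-number test follows, in Python. It assumes
the class A/B/C addressing convention, under which the network number
is the first one, two, or three octets depending on the leading bits
of the address; the function names and the example addresses are
hypothetical.

```python
import ipaddress

def classful_network(address: str) -> ipaddress.IPv4Network:
    """Network number of an address under class A/B/C rules (an assumption)."""
    first_octet = int(ipaddress.IPv4Address(address)) >> 24
    if first_octet < 128:
        prefix = 8       # class A: one octet of network number
    elif first_octet < 192:
        prefix = 16      # class B: two octets
    elif first_octet < 224:
        prefix = 24      # class C: three octets
    else:
        raise ValueError("not a class A, B, or C address")
    # strict=False masks off the host bits for us.
    return ipaddress.ip_network(f"{address}/{prefix}", strict=False)

def size_limit_by_network(local: str, destination: str, link_mtu: int) -> int:
    """576 bytes for off-net destinations, the link MTU otherwise."""
    if classful_network(local) != classful_network(destination):
        return min(576, link_mtu)
    return link_mtu
```

Two hosts on different subnets of the same network compare equal
here, which is exactly the trade described above: large datagrams are
used across the local subnets even though a small-MTU subnet might lie
between them.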
`
`3.2. Fragmentation avoidance with protocol changes
`
In this section we describe several fragmentation
avoidance schemes that require changes to existing
protocol specifications or the creation of new protocols.
Mostly, these involve changes to gateways and some
minor changes to IP-layer software; all are designed so
as to coexist with unmodified gateways and hosts.
`
`3.2.1. Probe mechanisms
`
Ideally, for a host to be able to send the largest possible
datagrams that will not be fragmented, it must have
perfect information as to the MINMTU along the path
the datagrams will follow. Since most IP hosts do not
even know what that route is, much less what the MTUs
along the route are, we need a mechanism for
discovering MINMTU.
`
The most straightforward kind of mechanism is to send
a packet along the route, collecting MTU information as
it goes; we call these probe mechanisms. Probe
mechanisms require support from gateways: each
gateway along the route must update the probe according to
the MTU of the hop it is about to take. Probe
mechanisms also require support from peer hosts: since paths
are asymmetric, once a probe reaches the end of its
route, the information it has collected must be returned
to the source host.
`
A probe may either gather a list of all the MTUs along
the path (somewhat analogous to the IP “Record Route”
option), with which the host can determine the
MINMTU, or the probe may simply carry only the
lowest MTU value seen along the route. The former
method provides a little more information; the latter
method is easier to implement and results in shorter
packets.
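
The two variants differ only in what each gateway writes into the
probe. The Python sketch below models a path as a list of per-hop
MTUs; the function names are hypothetical, and the initial value of
65535 (the largest length an IP datagram can express) is an assumed
starting point for the minimum-only variant.

```python
def probe_recording_list(hop_mtus):
    """Variant 1: every gateway appends the MTU of the hop it will take."""
    recorded = []
    for mtu in hop_mtus:
        recorded.append(mtu)
    return recorded  # the source host computes min(recorded) itself

def probe_carrying_minimum(hop_mtus, initial=65535):
    """Variant 2: every gateway keeps only the lowest MTU seen so far."""
    lowest = initial
    for mtu in hop_mtus:
        lowest = min(lowest, mtu)
    return lowest

# Example path: an Ethernet, the ARPAnet, then a Proteon ring.
path = [1500, 1006, 2046]
print(min(probe_recording_list(path)), probe_carrying_minimum(path))
```

Both variants yield the same MINMTU; the list form additionally shows
where along the path the bottleneck lies, at the cost of a probe that
grows with each hop.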
`
A probe may be made only once, at the beginning of a
connection or the use of a route, or it may be made
periodically. Periodic probes are preferable if the
MINMTU is kept per-destination or per-connection,
since the route may change. If MINMTU information is
kept per-route, then it will not change and consequently
probes need not be repeated.
`
`Probe mechanisms are useful for discovering other path
`characteristics besides MINMTU. As long as one is
`processing a probe, it makes sense to collect a variety of
`information, since it comes at little additional cost. This
`information could include:
`
Minimum bandwidth
Useful for determining appropriate transmission
rates; if a host knows that a 9600-baud link is part
of the path, it should behave differently than if the
path is entirely via 100 Mbit fiber networks.
`
Maximum delay
Useful for determining realistic round-trip times;
if a satellite channel is in use, with a delay of
several hundred milliseconds, a host should not
retransmit as quickly as if the end-to-end delay
were several milliseconds.
`
Maximum queue length
A high value implies congestion; if measured
using the “fair-queueing” algorithm [16] it
indicates to a host whether it is sending too much.
Alternatively, a “congestion-encountered” flag
could be set if any gateway along the path is
experiencing congestion.
`
Maximum error rate
When a link along the path is experiencing a high
error rate, a host might choose to send shorter
packets (so as to reduce the likelihood that an
entire datagram is dropped because of a single
error) or use error-correcting codes.
`
`Hop Count
`The total number of links traversed along the
`route may be of interest, for example, in choosing
a value for the “Time To Live” field. (Collection
`of hop counts was proposed by Mike Karels [10].)
`
`It is not necessary for every gateway along the path to
`support probing, providing they all forward the probe.
`Gaps in the probe information are not fatal; at worst,
`host behavior is the same as if no probing is done. A
`gateway that does support probing can cover up for an
occasional uncooperative gateway by looking at the
incoming link as well as the outgoing link when
`determining the MINMTU.
`
`Since route choices may depend on the IP “Type of
`Service” and perhaps the IP “Security” option, probes
`should carry the same Type of Service and Security as
`the data packets will [4]; gateways should observe Type
`of Service and Security when updating values in probes.
`
`3.2.2. Probing with ICMP messages
`
`A probe can be done using a separate packet; in the IP
`architecture, we would do this using a new ICMP
“Probe Path” message. This is described in detail in
appendix I.
`
Briefly, a host wishing to probe a path sets initial values
for the fields of the Probe Path message, then sends it to
the destination host. Each gateway along the route
updates various fields of the message. When the
destination host receives the message, it copies the
recorded information into a different area of the
message, reinitializes the recording fields, and returns
the message to the original host. If the second host
requests, the message may make one more trip, after
which both hosts will have the path information,
including MINMTU.
`
3.2.3. Probes piggybacked on IP headers
`
`It is not necessary to send a separate packet to probe the
`path. Instead, the probe information can be piggybacked
on the actual data packets, as part of the IP header. In
appendix II we describe new IP header options for
recording and returning MINMTU information.
`(Additional options could be defined for recording other
`path characteristics.)
`
`In this case, a host wishing to probe a path sets initial
values for the “Probe MTU” option in the IP header of
`a datagram it is sending. Each gateway along the route
`may update the value carried in this option. When the
`destination host receives the datagram,
`it copies the
`recorded information into a “MTU Reply” option and
`attaches it to the next datagram going back to the source
`host. When this reply is received, the f