TCP: Transmission Control Protocol

24.1 Introduction
`
`The Transmission Control Protocol, or TCP, provides a connection-oriented, reliable,
`byte-stream service between the two end points of an application. This is completely
`different from UDP’s connectionless, unreliable, datagram service.
`The implementation of UDP presented in Chapter 23 comprised 9 functions and
`about 800 lines of C code. The TCP implementation we’re about to describe comprises
`28 functions and almost 4,500 lines of C code. Therefore we divide the presentation of
`TCP into multiple chapters.
`These chapters are not an introduction to TCP. We assume the reader is familiar
`with the operation of TCP from Chapters 17-24 of Volume 1.
`
`24.2
`
`Code Introduction
`
`The TCP functions appear in six C files and numerous TCP definitions are in seven
`headers, as shown in Figure 24.1.
Figure 24.2 shows the relationship of the various TCP functions to other kernel
functions. The shaded ellipses are the nine main TCP functions that we cover. Eight of
these functions appear in the TCP protosw structure (Figure 24.8) and the ninth is
tcp_output.
`
File                    Description
netinet/tcp.h           tcphdr structure definition
netinet/tcp_debug.h     tcp_debug structure definition
netinet/tcp_fsm.h       definitions for TCP’s finite state machine
netinet/tcp_seq.h       macros for comparing sequence numbers
netinet/tcp_timer.h     definitions for TCP timers
netinet/tcp_var.h       tcpcb (control block) and tcpstat (statistics) structure definitions
netinet/tcpip.h         TCP plus IP header definition
netinet/tcp_debug.c     support for SO_DEBUG socket debugging (Section 27.10)
netinet/tcp_input.c     tcp_input and ancillary functions (Chapters 28 and 29)
netinet/tcp_output.c    tcp_output and ancillary functions (Chapter 26)
netinet/tcp_subr.c      miscellaneous TCP subroutines (Chapter 27)
netinet/tcp_timer.c     TCP timer handling (Chapter 25)
netinet/tcp_usrreq.c    PRU_xxx request handling (Chapter 30)

Figure 24.1 Files discussed in the TCP chapters.
`
Figure 24.2 Relationship of TCP functions to rest of the kernel.
`
Global Variables
`
`Figure 24.3 shows the global variables we encounter throughout the TCP functions.
`
Variable          Datatype          Description
tcb               struct inpcb      head of the TCP Internet PCB list
tcp_last_inpcb    struct inpcb *    pointer to PCB for last received segment: one-behind cache
tcpstat           struct tcpstat    TCP statistics (Figure 24.4)
tcp_outflags      u_char            array of output flags, indexed by connection state (Figure 24.16)
tcp_recvspace     u_long            default size of socket receive buffer (8192 bytes)
tcp_sendspace     u_long            default size of socket send buffer (8192 bytes)
tcp_iss           tcp_seq           initial send sequence number (ISS)
tcprexmtthresh    int               number of duplicate ACKs to trigger fast retransmit (3)
tcp_mssdflt       int               default MSS (512 bytes)
tcp_rttdflt       int               default RTT if no data (3 seconds)
tcp_do_rfc1323    int               if true (default), request window scale and timestamp options
tcp_now           u_long            500 ms counter for RFC 1323 timestamps
tcp_keepidle      int               keepalive: idle time before first probe (2 hours)
tcp_keepintvl     int               keepalive: interval between probes when no response (75 sec)
                                    (also used as timeout for connect)
tcp_maxidle       int               keepalive: time after probing before giving up (10 min)

Figure 24.3 Global variables introduced in the following chapters.
`
`Statistics
`
`Various TCP statistics are maintained in the global structure tcpstat, described in Fig-
`ure 24.4. We’ll see where these counters are incremented as we proceed through the
`code.
Figure 24.5 shows some sample output of these statistics, from the netstat -s
command. These statistics were collected after the host had been up for 30 days. Since
some counters come in pairs--one counts the number of packets and the other the
number of bytes--we abbreviate these in the figure. For example, the two counters for
the second line of the table are tcps_sndpack and tcps_sndbyte.
`
The counter for tcps_sndbyte should be 3,722,884,824, not -22,194,928 bytes. This is an
average of about 405 bytes per segment, which makes sense. Similarly, the counter for
tcps_rcvackbyte should be 3,738,811,552, not -21,264,360 bytes (for an average of about 565
bytes per segment). These numbers are incorrectly printed as negative numbers because the
printf calls in the netstat program use %d (signed decimal) instead of %lu (long integer,
unsigned decimal). All the counters are unsigned long integers, and these two counters are
near the maximum value of an unsigned 32-bit long integer (2^32 - 1 = 4,294,967,295).
`
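The effect described in the note can be reproduced with a minimal sketch. It assumes a platform with 32-bit two's complement integers, and the helper names counter_as_signed and print_counter are hypothetical; they are not taken from the netstat source. The value used is made up for illustration and is not one of the counters in Figure 24.5.

```c
#include <assert.h>
#include <stdint.h>
#include <stdio.h>

/* Hypothetical helper: view a 32-bit unsigned counter the way a signed
 * conversion would show it on a machine with 32-bit two's complement ints. */
int32_t counter_as_signed(uint32_t counter)
{
    return (int32_t)counter;    /* values above 2^31 - 1 come out negative */
}

/* Hypothetical helper: print both views of the same counter. */
void print_counter(const char *name, uint32_t counter)
{
    printf("%s: signed view %ld, unsigned view %lu\n",
           name, (long)counter_as_signed(counter), (unsigned long)counter);
}

/* Self-check with a made-up counter value. */
void counter_demo(void)
{
    assert(counter_as_signed(100u) == 100);
    /* 4,000,000,000 - 2^32 = -294,967,296 */
    assert(counter_as_signed(4000000000u) == -294967296);
    print_counter("example", 4000000000u);
}
```

Small counters print the same either way; only counters that have passed 2^31 - 1 show up as negative, which matches the two byte counters singled out in the note.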
tcpstat member          Description
tcps_accepts            #SYNs received in LISTEN state
tcps_closed             #connections closed (includes drops)
tcps_connattempt        #connections initiated (calls to connect)
tcps_conndrops          #embryonic connections dropped (before SYN received)
tcps_connects           #connections established actively or passively
tcps_delack             #delayed ACKs sent
tcps_drops              #connections dropped (after SYN received)
tcps_keepdrops          #connections dropped in keepalive (established or awaiting SYN)
tcps_keepprobe          #keepalive probes sent
tcps_keeptimeo          #times keepalive timer or connection-establishment timer expires
tcps_pawsdrop           #segments dropped due to PAWS
tcps_pcbcachemiss       #times PCB cache comparison fails
tcps_persisttimeo       #times persist timer expires
tcps_predack            #times header prediction correct for ACKs
tcps_preddat            #times header prediction correct for data packets
tcps_rcvackbyte         #bytes ACKed by received ACKs
tcps_rcvackpack         #received ACK packets
tcps_rcvacktoomuch      #received ACKs for unsent data
tcps_rcvafterclose      #packets received after connection closed
tcps_rcvbadoff          #packets received with invalid header length
tcps_rcvbadsum          #packets received with checksum errors
tcps_rcvbyte            #bytes received in sequence
tcps_rcvbyteafterwin    #bytes received beyond advertised window
tcps_rcvdupack          #duplicate ACKs received
tcps_rcvdupbyte         #bytes received in completely duplicate packets
tcps_rcvduppack         #packets received with completely duplicate bytes
tcps_rcvoobyte          #out-of-order bytes received
tcps_rcvoopack          #out-of-order packets received
tcps_rcvpack            #packets received in sequence
tcps_rcvpackafterwin    #packets with some data beyond advertised window
tcps_rcvpartdupbyte     #duplicate bytes in part-duplicate packets
tcps_rcvpartduppack     #packets with some duplicate data
tcps_rcvshort           #packets received too short
tcps_rcvtotal           total #packets received
tcps_rcvwinprobe        #window probe packets received
tcps_rcvwinupd          #received window update packets
tcps_rexmttimeo         #retransmit timeouts
tcps_rttupdated         #times RTT estimators updated
tcps_segstimed          #segments for which TCP tried to measure RTT
tcps_sndacks            #ACK-only packets sent (data length = 0)
tcps_sndbyte            #data bytes sent
tcps_sndctrl            #control (SYN, FIN, RST) packets sent (data length = 0)
tcps_sndpack            #data packets sent (data length > 0)
tcps_sndprobe           #window probes sent (1 byte of data forced by persist timer)
tcps_sndrexmitbyte      #data bytes retransmitted
tcps_sndrexmitpack      #data packets retransmitted
tcps_sndtotal           total #packets sent
tcps_sndurg             #packets sent with URG-only (data length = 0)
tcps_sndwinup           #window update-only packets sent (data length = 0)
tcps_timeoutdrop        #connections dropped in retransmission timeout

Figure 24.4 TCP statistics maintained in the tcpstat structure.
`
netstat -s output                                          tcpstat members
10,655,999 packets sent                                    tcps_sndtotal
9,177,823 data packets (-22,194,928 bytes)                 tcps_snd{pack,byte}
257,295 data packets (81,075,086 bytes) retransmitted      tcps_sndrexmit{pack,byte}
862,900 ack-only packets (531,285 delayed)                 tcps_sndacks, tcps_delack
229 URG-only packets                                       tcps_sndurg
3,453 window probe packets                                 tcps_sndprobe
74,925 window update packets                               tcps_sndwinup
279,387 control packets                                    tcps_sndctrl
8,801,953 packets received                                 tcps_rcvtotal
6,617,079 acks (for -21,264,360 bytes)                     tcps_rcvack{pack,byte}
235,311 duplicate acks                                     tcps_rcvdupack
0 acks for unsent data                                     tcps_rcvacktoomuch
4,670,615 packets (324,965,351 bytes) rcvd in-sequence     tcps_rcv{pack,byte}
46,953 completely duplicate packets (1,549,785 bytes)      tcps_rcvdup{pack,byte}
22 old duplicate packets                                   tcps_pawsdrop
3,442 packets with some dup. data (54,483 bytes duped)     tcps_rcvpartdup{pack,byte}
77,114 out-of-order packets (13,938,456 bytes)             tcps_rcvoo{pack,byte}
1,892 packets (1,755 bytes) of data after window           tcps_rcv{pack,byte}afterwin
1,755 window probes                                        tcps_rcvwinprobe
175,476 window update packets                              tcps_rcvwinupd
1,017 packets received after close                         tcps_rcvafterclose
60,370 discarded for bad checksums                         tcps_rcvbadsum
279 discarded for bad header offset fields                 tcps_rcvbadoff
0 discarded because packet too short                       tcps_rcvshort
144,020 connection requests                                tcps_connattempt
92,595 connection accepts                                  tcps_accepts
126,820 connections established (including accepts)        tcps_connects
237,743 connections closed (including 1,061 drops)         tcps_closed, tcps_drops
110,016 embryonic connections dropped                      tcps_conndrops
6,363,546 segments updated rtt (of 6,444,667 attempts)     tcps_{rttupdated,segstimed}
114,797 retransmit timeouts                                tcps_rexmttimeo
86 connection dropped by rexmit timeout                    tcps_timeoutdrop
1,173 persist timeouts                                     tcps_persisttimeo
16,419 keepalive timeouts                                  tcps_keeptimeo
6,899 keepalive probes sent                                tcps_keepprobe
3,219 connections dropped by keepalive                     tcps_keepdrops
733,130 correct ACK header predictions                     tcps_predack
1,266,889 correct data packet header predictions           tcps_preddat
1,851,557 cache misses                                     tcps_pcbcachemiss

Figure 24.5 Sample TCP statistics.
`
`SNMP Variables
`
Figure 24.6 shows the 14 simple SNMP variables in the TCP group and the counters
from the tcpstat structure that implement each variable. The constant values shown
for the first four entries are fixed by the Net/3 implementation. The counter
tcpCurrEstab is computed as the number of Internet PCBs on the TCP PCB list.
Figure 24.7 shows tcpTable, the TCP listener table.
`
SNMP variable      tcpstat members or constant     Description
tcpRtoAlgorithm    4                               algorithm used to calculate retransmission timeout value:
                                                   1 = none of the following,
                                                   2 = a constant RTO,
                                                   3 = MIL-STD-1778 Appendix B,
                                                   4 = Van Jacobson’s algorithm.
tcpRtoMin          1000                            minimum retransmission timeout value, in milliseconds
tcpRtoMax          64000                           maximum retransmission timeout value, in milliseconds
tcpMaxConn         -1                              maximum #TCP connections (-1 if dynamic)
tcpActiveOpens     tcps_connattempt                #transitions from CLOSED to SYN_SENT states
tcpPassiveOpens    tcps_accepts                    #transitions from LISTEN to SYN_RCVD states
tcpAttemptFails    tcps_conndrops                  #transitions from SYN_SENT or SYN_RCVD to CLOSED,
                                                   plus #transitions from SYN_RCVD to LISTEN
tcpEstabResets     tcps_drops                      #transitions from ESTABLISHED or CLOSE_WAIT states to
                                                   CLOSED
tcpCurrEstab       (see text)                      #connections currently in ESTABLISHED or CLOSE_WAIT
                                                   states
tcpInSegs          tcps_rcvtotal                   total #segments received
tcpOutSegs         tcps_sndtotal -                 total #segments sent, excluding those containing only
                   tcps_sndrexmitpack              retransmitted bytes
tcpRetransSegs     tcps_sndrexmitpack              total #retransmitted segments
tcpInErrs          tcps_rcvbadsum +                total #segments received with an error
                   tcps_rcvbadoff +
                   tcps_rcvshort
tcpOutRsts         (not implemented)               total #segments sent with RST flag set

Figure 24.6 Simple SNMP variables in tcp group.
`
`
`
`
`
`
index = <tcpConnLocalAddress>.<tcpConnLocalPort>.<tcpConnRemAddress>.<tcpConnRemPort>

SNMP variable          PCB variable    Description
tcpConnState           t_state         state of connection: 1 = CLOSED, 2 = LISTEN,
                                       3 = SYN_SENT, 4 = SYN_RCVD, 5 = ESTABLISHED,
                                       6 = FIN_WAIT_1, 7 = FIN_WAIT_2, 8 = CLOSE_WAIT,
                                       9 = LAST_ACK, 10 = CLOSING, 11 = TIME_WAIT,
                                       12 = delete TCP control block.
tcpConnLocalAddress    inp_laddr       local IP address
tcpConnLocalPort       inp_lport       local port number
tcpConnRemAddress      inp_faddr       foreign IP address
tcpConnRemPort         inp_fport       foreign port number

Figure 24.7 Variables in TCP listener table: tcpTable.
`
The first PCB variable (t_state) is from the TCP control block (Figure 24.13) and
the remaining four are from the Internet PCB (Figure 22.4).
`
24.3 TCP protosw Structure

Figure 24.8 lists the TCP protosw structure, the protocol switch entry for TCP.

Member          inetsw[2]                      Description
pr_type         SOCK_STREAM                    TCP provides a byte-stream service
pr_domain       &inetdomain                    TCP is part of the Internet domain
pr_protocol     IPPROTO_TCP (6)                appears in the ip_p field of the IP header
pr_flags        PR_CONNREQUIRED|PR_WANTRCVD    socket layer flags, not used by protocol processing
pr_input        tcp_input                      receives messages from IP layer
pr_output       0                              not used by TCP
pr_ctlinput     tcp_ctlinput                   control input function for ICMP errors
pr_ctloutput    tcp_ctloutput                  respond to administrative requests from a process
pr_usrreq       tcp_usrreq                     respond to communication requests from a process
pr_init         tcp_init                       initialization for TCP
pr_fasttimo     tcp_fasttimo                   fast timeout function, called every 200 ms
pr_slowtimo     tcp_slowtimo                   slow timeout function, called every 500 ms
pr_drain        tcp_drain                      called when kernel runs out of mbufs
pr_sysctl       0                              not used by TCP

Figure 24.8 The TCP protosw structure.
`
`24.4
`
`TCP Header
`
The TCP header is defined as a tcphdr structure. Figure 24.9 shows the C structure
and Figure 24.10 shows a picture of the TCP header.
`
tcp.h

    struct tcphdr {
        u_short th_sport;           /* source port */
        u_short th_dport;           /* destination port */
        tcp_seq th_seq;             /* sequence number */
        tcp_seq th_ack;             /* acknowledgement number */
    #if BYTE_ORDER == LITTLE_ENDIAN
        u_char  th_x2:4,            /* (unused) */
                th_off:4;           /* data offset */
    #endif
    #if BYTE_ORDER == BIG_ENDIAN
        u_char  th_off:4,           /* data offset */
                th_x2:4;            /* (unused) */
    #endif
        u_char  th_flags;           /* ACK, FIN, PUSH, RST, SYN, URG */
        u_short th_win;             /* advertised window */
        u_short th_sum;             /* checksum */
        u_short th_urp;             /* urgent offset */
    };

Figure 24.9 tcphdr structure.
`
Field       Size       Contents
th_sport    16 bits    source port number
th_dport    16 bits    destination port number
th_seq      32 bits    sequence number
th_ack      32 bits    acknowledgment number
th_off      4 bits     header length
th_x2       6 bits     reserved
(flags)     6 bits     URG, ACK, PSH, RST, SYN, FIN
th_win      16 bits    window size
th_sum      16 bits    TCP checksum
th_urp      16 bits    urgent offset
options (if any)
data (if any)

The fields from th_sport through th_urp occupy 20 bytes.

Figure 24.10 TCP header and optional data.
`
Most RFCs, most books (including Volume 1), and the code we’ll examine call th_urp the
urgent pointer. A better term is the urgent offset, since this field is a 16-bit unsigned offset that
must be added to the sequence number field (th_seq) to give the 32-bit sequence number of
the last byte of urgent data. (There is a continuing debate over whether this sequence number
points to the last byte of urgent data or to the byte that follows. This is immaterial for the
present discussion.) We’ll see in Figure 24.13 that TCP correctly calls the 32-bit sequence number
of the last byte of urgent data snd_up, the send urgent pointer. But using the term pointer for
the 16-bit offset in the TCP header is misleading. In Exercise 26.6 we’ll reiterate the distinction
between the urgent pointer and the urgent offset.
`
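The relationship between the 16-bit urgent offset and the 32-bit sequence number can be sketched as follows. This is an illustration only: the helper name urgent_seq is hypothetical, uint32_t stands in for tcp_seq, and the sketch takes no position on whether the result names the last byte of urgent data or the byte that follows.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical helper: add the 16-bit urgent offset (th_urp) to the 32-bit
 * sequence number (th_seq). Unsigned arithmetic wraps modulo 2^32, which is
 * the behavior sequence numbers need. */
uint32_t urgent_seq(uint32_t th_seq, uint16_t th_urp)
{
    return th_seq + (uint32_t)th_urp;
}

/* Self-check with made-up values. */
void urgent_seq_demo(void)
{
    assert(urgent_seq(1000u, 5) == 1005u);
    assert(urgent_seq(0xFFFFFFFEu, 4) == 2u);   /* wraps past 2^32 - 1 */
}
```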
The 4-bit header length, the 6 reserved bits that follow, and the 6 flag bits are
defined in C as two 4-bit bit-fields, followed by 8 bits of flags. To handle the difference
in the order of these 4-bit fields within an 8-bit byte, the code contains an #ifdef based
on the byte order of the system.
Also notice that we call the 4-bit th_off the header length, while the C code calls it
the data offset. Both are correct since it is the length of the TCP header, including
options, in 32-bit words, which is the offset of the first byte of data.
`The th_flags member contains 6 flag bits, accessed using the names in Fig-
`ure 24.11.
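As a minimal sketch of the header length and flag fields, the following reads them from the raw bytes of a TCP header instead of through the bit-fields of Figure 24.9, which sidesteps the byte-order #ifdef. The function names are hypothetical. The byte positions follow from the layout in Figure 24.10: the two ports, the sequence number, and the acknowledgment number fill bytes 0 through 11, so the header length is in the high 4 bits of byte 12 and the flags are in the low 6 bits of byte 13.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper: TCP header length in bytes, including options.
 * The 4-bit field counts 32-bit words, so multiply by 4. */
size_t tcp_header_bytes(const uint8_t *tcp_hdr)
{
    unsigned words = tcp_hdr[12] >> 4;      /* th_off */
    return (size_t)words * 4;
}

/* Hypothetical helper: the 6 flag bits (URG ACK PSH RST SYN FIN). */
uint8_t tcp_flag_bits(const uint8_t *tcp_hdr)
{
    return tcp_hdr[13] & 0x3f;
}

/* Self-check with a made-up header: th_off = 5, SYN and ACK set. */
void tcp_header_demo(void)
{
    uint8_t hdr[20] = {0};
    hdr[12] = 0x50;                         /* 5 words = 20 bytes, no options */
    hdr[13] = 0x12;                         /* ACK (0x10) | SYN (0x02) */
    assert(tcp_header_bytes(hdr) == 20);
    assert(tcp_flag_bits(hdr) == 0x12);
}
```

With 12 bytes of options the field would hold 8, giving a 32-byte header, which is also the offset of the first byte of data.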
In Net/3 the TCP header is normally referenced as an IP header immediately
followed by a TCP header. This is how tcp_input processes received IP datagrams and
how tcp_output builds outgoing IP datagrams. This combined IP/TCP header is a
tcpiphdr structure, shown in Figure 24.12.
`
th_flags    Description
TH_ACK      the acknowledgment number (th_ack) is valid
TH_FIN      the sender is finished sending data
TH_PUSH     receiver should pass the data to application without delay
TH_RST      reset the connection
TH_SYN      synchronize sequence numbers (establish connection)
TH_URG      the urgent offset (th_urp) is valid

Figure 24.11 th_flags values.
`
tcpip.h

    38 struct tcpiphdr {
    39     struct ipovly ti_i;         /* overlaid ip structure */
    40     struct tcphdr ti_t;         /* tcp header */
    41 };
    42 #define ti_next     ti_i.ih_next
    43 #define ti_prev     ti_i.ih_prev
    44 #define ti_x1       ti_i.ih_x1
    45 #define ti_pr       ti_i.ih_pr
    46 #define ti_len      ti_i.ih_len
    47 #define ti_src      ti_i.ih_src
    48 #define ti_dst      ti_i.ih_dst
    49 #define ti_sport    ti_t.th_sport
    50 #define ti_dport    ti_t.th_dport
    51 #define ti_seq      ti_t.th_seq
    52 #define ti_ack      ti_t.th_ack
    53 #define ti_x2       ti_t.th_x2
    54 #define ti_off      ti_t.th_off
    55 #define ti_flags    ti_t.th_flags
    56 #define ti_win      ti_t.th_win
    57 #define ti_sum      ti_t.th_sum
    58 #define ti_urp      ti_t.th_urp

Figure 24.12 tcpiphdr structure: combined IP/TCP header.
`
38-58  The 20-byte IP header is defined as an ipovly structure, which we showed earlier
in Figure 23.12. As we discussed with Figure 23.19, this structure is not a real IP header,
although the lengths are the same (20 bytes).
`
`24.5
`
`TCP Control Block
`
`In Figure 22.1 we showed that TCP maintains its own control block, a tcpcb structure,
`in addition to the standard Internet PCB. In contrast, UDP has everything it needs in
`the Internet PCB--it doesn’t need its own control block.
`The TCP control block is a large structure, occupying 140 bytes. As shown in Fig-
`ure 22.1 there is a one-to-one relationship between the Internet PCB and the TCP control
`block, and each points to the other. Figure 24.13 shows the definition of the TCP control
`block.
`
tcp_var.h

    struct tcpcb {
        struct tcpiphdr *seg_next;  /* reassembly queue of received segments */
        struct tcpiphdr *seg_prev;  /* reassembly queue of received segments */
        short   t_state;            /* connection state (Figure 24.16) */
        short   t_timer[TCPT_NTIMERS];  /* tcp timers (Chapter 25) */
        short   t_rxtshift;         /* log(2) of rexmt exp. backoff */
        short   t_rxtcur;           /* current retransmission timeout (#ticks) */
        short   t_dupacks;          /* #consecutive duplicate ACKs received */
        u_short t_maxseg;           /* maximum segment size to send */
        char    t_force;            /* 1 if forcing out a byte (persist/OOB) */
        u_short t_flags;            /* (Figure 24.14) */
        struct tcpiphdr *t_template;    /* skeletal packet for transmit */
        struct inpcb *t_inpcb;      /* back pointer to internet PCB */
    /*
     * The following fields are used as in the protocol specification.
     * See RFC783, Dec. 1981, page 21.
     */
    /* send sequence variables */
        tcp_seq snd_una;            /* send unacknowledged */
        tcp_seq snd_nxt;            /* send next */
        tcp_seq snd_up;             /* send urgent pointer */
        tcp_seq snd_wl1;            /* window update seg seq number */
        tcp_seq snd_wl2;            /* window update seg ack number */
        tcp_seq iss;                /* initial send sequence number */
        u_long  snd_wnd;            /* send window */
    /* receive sequence variables */
        u_long  rcv_wnd;            /* receive window */
        tcp_seq rcv_nxt;            /* receive next */
        tcp_seq rcv_up;             /* receive urgent pointer */
        tcp_seq irs;                /* initial receive sequence number */
    /*
     * Additional variables for this implementation.
     */
    /* receive variables */
        tcp_seq rcv_adv;            /* advertised window by other end */
    /* retransmit variables */
        tcp_seq snd_max;            /* highest sequence number sent;
                                     * used to recognize retransmits */
    /* congestion control (slow start, source quench, retransmit after loss) */
        u_long  snd_cwnd;           /* congestion-controlled window */
        u_long  snd_ssthresh;       /* snd_cwnd size threshhold for slow start
                                     * exponential to linear switch */
    /*
     * transmit timing stuff. See below for scale of srtt and rttvar.
     * "Variance" is actually smoothed difference.
     */
        short   t_idle;             /* inactivity time */
        short   t_rtt;              /* round-trip time */
        tcp_seq t_rtseq;            /* sequence number being timed */
        short   t_srtt;             /* smoothed round-trip time */
        short   t_rttvar;           /* variance in round-trip time */
        u_short t_rttmin;           /* minimum rtt allowed */
        u_long  max_sndwnd;         /* largest window peer has offered */
    /* out-of-band data */
        char    t_oobflags;         /* TCPOOB_HAVEDATA, TCPOOB_HADDATA */
        char    t_iobc;             /* input character, if not SO_OOBINLINE */
        short   t_softerror;        /* possible error not yet reported */
    /* RFC 1323 variables */
        u_char  snd_scale;          /* scaling for send window (0-14) */
        u_char  rcv_scale;          /* scaling for receive window (0-14) */
        u_char  request_r_scale;    /* our pending window scale */
        u_char  requested_s_scale;  /* peer’s pending window scale */
        u_long  ts_recent;          /* timestamp echo data */
        u_long  ts_recent_age;      /* when last updated */
        tcp_seq last_ack_sent;      /* sequence number of last ack field */
    };
    #define intotcpcb(ip)   ((struct tcpcb *)(ip)->inp_ppcb)
    #define sototcpcb(so)   (intotcpcb(sotoinpcb(so)))

Figure 24.13 tcpcb structure: TCP control block.
`
`We’ll save the discussion of these variables until we encounter them in the code.
`Figure 24.14 shows the values for the t_flags member.
`
t_flags          Description
TF_ACKNOW        send ACK immediately
TF_DELACK        send ACK, but try to delay it
TF_NODELAY       don’t delay packets to coalesce (disable Nagle algorithm)
TF_NOOPT         don’t use TCP options (never set)
TF_SENTFIN       have sent FIN
TF_RCVD_SCALE    set when other side sends window scale option in SYN
TF_RCVD_TSTMP    set when other side sends timestamp option in SYN
TF_REQ_SCALE     have/will request window scale option in SYN
TF_REQ_TSTMP     have/will request timestamp option in SYN

Figure 24.14 t_flags values.
`
`24.6
`
`TCP State Transition Diagram
`
`Many of TCP’s actions, in response to different types of segments arriving on a connec-
`tion, can be summarized in a state transition diagram, shown in Figure 24.15. We also
`duplicate this diagram on one of the front end papers, for easy reference while reading
`the TCP chapters.
These state transitions define the TCP finite state machine. Although the transition
from LISTEN to SYN_SENT is allowed by TCP, there is no way to do this using the
sockets API (i.e., a connect is not allowed after a listen).
The t_state member of the control block holds the current state of a connection,
with the values shown in Figure 24.16.
This figure also shows the tcp_outflags array, which contains the outgoing flags
for tcp_output to use when the connection is in that state.
`
Figure 24.15 TCP state transition diagram.

Legend: appl: marks state transitions taken when the application issues an operation,
recv: marks state transitions taken when a segment is received, and send: marks what is
sent for the transition. The diagram distinguishes the normal transitions for a client
from the normal transitions for a server.
`
t_state              value   Description                                    tcp_outflags[]
TCPS_CLOSED          0       closed                                         TH_RST | TH_ACK
TCPS_LISTEN          1       listening for connection (passive open)        0
TCPS_SYN_SENT        2       have sent SYN (active open)                    TH_SYN
TCPS_SYN_RECEIVED    3       have sent and received SYN; awaiting ACK       TH_SYN | TH_ACK
TCPS_ESTABLISHED     4       established (data transfer)                    TH_ACK
TCPS_CLOSE_WAIT      5       received FIN, waiting for application close    TH_ACK
TCPS_FIN_WAIT_1      6       have closed, sent FIN; awaiting ACK and FIN    TH_FIN | TH_ACK
TCPS_CLOSING         7       simultaneous close; awaiting ACK               TH_FIN | TH_ACK
TCPS_LAST_ACK        8       received FIN, have closed; awaiting ACK        TH_FIN | TH_ACK
TCPS_FIN_WAIT_2      9       have closed; awaiting FIN                      TH_ACK
TCPS_TIME_WAIT       10      2MSL wait state after active close             TH_ACK

Figure 24.16 t_state values.
`
Figure 24.16 also shows the numerical values of these constants since the code uses
their numerical relationships. For example, the following two macros are defined:

    #define TCPS_HAVERCVDSYN(s)    ((s) >= TCPS_SYN_RECEIVED)
    #define TCPS_HAVERCVDFIN(s)    ((s) >= TCPS_TIME_WAIT)

Similarly, we’ll see that tcp_notify handles ICMP errors differently when the connection
is not yet established, that is, when t_state is less than TCPS_ESTABLISHED.
`
`The name TCPS_HAVERCVDSYN is correct, but the name TCPS_HAVERCVDFIN is misleading.
`A FIN has also been received in the CLOSE_WAIT, CLOSING, and LAST_ACK states. We
`encounter this macro in Chapter 29.
`
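The point about the misleading macro name can be checked with a short sketch. The state values and the two macros are those given in the text and Figure 24.16. The helper fin_was_received is hypothetical and does not appear in Net/3; it lists the states in which a FIN has in fact arrived (the three named in the note plus TIME_WAIT).

```c
#include <assert.h>

/* t_state values from Figure 24.16 */
enum {
    TCPS_CLOSED = 0, TCPS_LISTEN = 1, TCPS_SYN_SENT = 2,
    TCPS_SYN_RECEIVED = 3, TCPS_ESTABLISHED = 4, TCPS_CLOSE_WAIT = 5,
    TCPS_FIN_WAIT_1 = 6, TCPS_CLOSING = 7, TCPS_LAST_ACK = 8,
    TCPS_FIN_WAIT_2 = 9, TCPS_TIME_WAIT = 10
};

#define TCPS_HAVERCVDSYN(s) ((s) >= TCPS_SYN_RECEIVED)
#define TCPS_HAVERCVDFIN(s) ((s) >= TCPS_TIME_WAIT)

/* Hypothetical helper (not in Net/3): true in every state in which
 * a FIN has been received from the other end. */
int fin_was_received(int s)
{
    return s == TCPS_CLOSE_WAIT || s == TCPS_CLOSING ||
           s == TCPS_LAST_ACK   || s == TCPS_TIME_WAIT;
}

void state_macro_demo(void)
{
    assert(!TCPS_HAVERCVDSYN(TCPS_SYN_SENT));
    assert(TCPS_HAVERCVDSYN(TCPS_ESTABLISHED));
    assert(TCPS_HAVERCVDFIN(TCPS_TIME_WAIT));
    /* The misleading case: a FIN has arrived, yet the macro is false. */
    assert(fin_was_received(TCPS_CLOSE_WAIT));
    assert(!TCPS_HAVERCVDFIN(TCPS_CLOSE_WAIT));
}
```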
`Half-Close
`
`When a process calls shutdown with a second argument of 1, it is called a half-close.
`TCP sends a FIN but allows the process to continue receiving on the socket. (Sec-
`tion 18.5 of Volume 1 contains examples of TCP’s half-close.)
For example, even though we label the ESTABLISHED state "data transfer," if the
process does a half-close, moving the connection to the FIN_WAIT_1 and then the
FIN_WAIT_2 states, data can continue to be received by the process in these two states.
`
`24.7 TCP Sequence Numbers
`
`Every byte of data exchanged across a TCP connection, along with the SYN and FIN
`flags, is assigned a 32-bit sequence number. The sequence number field in the TCP
`header (Figure 24.10) contains the sequence number of the first byte of data in the seg-
`ment. The acknowledgment number field in the TCP header contains the next sequence
`number that the sender of the ACK expects to receive, which acknowledges all data
`bytes through the acknowledgment number minus 1. In other words, the acknowledg-
`ment number is the next sequence number expected by the sender of the ACK. The
`acknowledgment number is valid only if the ACK flag is set in the header. We’ll see
`
that TCP always sets the ACK flag except for the first SYN sent by an active open (the
SYN_SENT state; see tcp_outflags[2] in Figure 24.16) and in some RST segments.
Since a TCP connection is full-duplex, each end must maintain a set of sequence
numbers for both directions of data flow. In the TCP control block (Figure 24.13) there
are 13 sequence numbers: eight for the send direction (the send sequence space) and five
for the receive direction (the receive sequence space).
Figure 24.17 shows the relationship of four of the variables in the send sequence
space: snd_wnd, snd_una, snd_nxt, and snd_max. In this example we number the
bytes 1 through 11.
`
snd_wnd = 6: offered window (advertised by receiver), sequence numbers 4 through 9

    sequence numbers    status
    1  2  3             sent and acknowledged
    4  5  6             sent, not ACKed
    7  8  9             can send ASAP (usable window)
    10 11 ...           can’t send until window moves

    snd_una = 4    oldest unacknowledged sequence number
    snd_nxt = 7    next send sequence number
    snd_max = 7    maximum send sequence number

Figure 24.17 Example of send sequence space.
`
An acceptable ACK is one for which the following inequality holds:

    snd_una < acknowledgment field <= snd_max

In Figure 24.17 an acceptable ACK has an acknowledgment field of 5, 6, or 7. An
acknowledgment field less than or equal to snd_una is a duplicate ACK--it acknowledges
data that has already been ACKed, or else snd_una would not have incremented
past those bytes.
We encounter the following test a few times in tcp_output, which is true if a segment
is being retransmitted:

    snd_nxt < snd_max
`
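The two tests above can be written out as a short sketch. It assumes uint32_t in place of the u_long used for tcp_seq, uses wraparound-safe comparison macros of the kind defined in tcp_seq.h, and the three function names are hypothetical. The values in the self-check come from Figure 24.17.

```c
#include <assert.h>
#include <stdint.h>

typedef uint32_t tcp_seq;   /* assumption: fixed 32-bit type instead of u_long */

#define SEQ_LT(a,b)  ((int32_t)((a) - (b)) < 0)
#define SEQ_LEQ(a,b) ((int32_t)((a) - (b)) <= 0)

/* Hypothetical helper: snd_una < ack <= snd_max */
int ack_is_acceptable(tcp_seq snd_una, tcp_seq ack, tcp_seq snd_max)
{
    return SEQ_LT(snd_una, ack) && SEQ_LEQ(ack, snd_max);
}

/* Hypothetical helper: ack <= snd_una acknowledges nothing new */
int ack_is_duplicate(tcp_seq snd_una, tcp_seq ack)
{
    return SEQ_LEQ(ack, snd_una);
}

/* Hypothetical helper: snd_nxt < snd_max means a retransmission */
int is_retransmitting(tcp_seq snd_nxt, tcp_seq snd_max)
{
    return SEQ_LT(snd_nxt, snd_max);
}

/* Values from Figure 24.17: snd_una = 4, snd_nxt = 7, snd_max = 7 */
void ack_demo(void)
{
    assert(ack_is_acceptable(4, 5, 7));
    assert(ack_is_acceptable(4, 7, 7));
    assert(!ack_is_acceptable(4, 8, 7));    /* ACK for unsent data */
    assert(ack_is_duplicate(4, 4));
    assert(!is_retransmitting(7, 7));
    assert(is_retransmitting(4, 7));        /* snd_nxt pulled back to snd_una */
}
```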
`Figure 24.18 shows the other end of the connection in Figure 24.17: the receive
`sequence space, assuming the segment containing sequence numbers 4, 5, and 6 has not
`been received yet. We show the three variables rcv_nxt, rcv_wnd, and rcv_adv.
`
rcv_wnd = 6: receive window (advertised to sender), sequence numbers 4 through 9

    sequence numbers    status
    1  2  3             old sequence numbers that TCP has acknowledged
    4  5  6  7  8  9    receive window
    10 11 ...           future sequence numbers not yet allowed

    rcv_nxt = 4     next receive sequence number
    rcv_adv = 10    highest advertised sequence number plus 1

Figure 24.18 Example of receive sequence space.
`
The receiver considers a received segment valid if it contains data within the window,
that is, if either of the following two inequalities is true:

    rcv_nxt <= beginning sequence number of segment < rcv_nxt + rcv_wnd
    rcv_nxt <= ending sequence number of segment < rcv_nxt + rcv_wnd

The beginning sequence number of a segment is just the sequence number field in the
TCP header, ti_seq. The ending sequence number is the sequence number field plus
the number of bytes of TCP data, minus 1.
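
The window test can be sketched as follows. This is an illustration of the two inequalities only, not the logic of tcp_input: it assumes the segment carries at least 1 byte of data, uses uint32_t in place of u_long, and the function names are hypothetical. The self-check uses the values of Figure 24.18 (rcv_nxt = 4, rcv_wnd = 6).

```c
#include <assert.h>
#include <stdint.h>

typedef uint32_t tcp_seq;   /* assumption: fixed 32-bit type instead of u_long */

#define SEQ_LT(a,b)  ((int32_t)((a) - (b)) < 0)
#define SEQ_GEQ(a,b) ((int32_t)((a) - (b)) >= 0)

/* Hypothetical helper: rcv_nxt <= seq < rcv_nxt + rcv_wnd */
int seq_in_window(tcp_seq seq, tcp_seq rcv_nxt, uint32_t rcv_wnd)
{
    return SEQ_GEQ(seq, rcv_nxt) && SEQ_LT(seq, rcv_nxt + rcv_wnd);
}

/* Hypothetical helper: valid if the first or the last data byte is in the
 * window. Assumes len >= 1. */
int segment_is_valid(tcp_seq first, uint32_t len,
                     tcp_seq rcv_nxt, uint32_t rcv_wnd)
{
    tcp_seq last = first + len - 1;     /* ending sequence number */
    return seq_in_window(first, rcv_nxt, rcv_wnd) ||
           seq_in_window(last, rcv_nxt, rcv_wnd);
}

void window_demo(void)
{
    assert(segment_is_valid(4, 3, 4, 6));       /* bytes 4-6: in window */
    assert(!segment_is_valid(1, 3, 4, 6));      /* bytes 1-3: old data */
    assert(segment_is_valid(2, 4, 4, 6));       /* bytes 2-5: end is in window */
    assert(!segment_is_valid(10, 3, 4, 6));     /* bytes 10-12: beyond window */
    assert(segment_is_valid(1, 1, 0xFFFFFFFEu, 6)); /* window wraps past 0 */
}
```
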
`For example, Figure 24.19 could represent the TCP segment containing the 3 bytes
`with sequence numbers 4, 5, and 6 in Figure 24.17.
`
    63-byte IP datagram:
        IP header      20 bytes
        IP options      8 bytes
        TCP header     20 bytes
        TCP options    12 bytes
        TCP data        3 bytes (1 byte each for sequence numbers 4, 5, and 6)

Figure 24.19 TCP segment transmitted as an IP datagram.
`
`We assume that there are 8 bytes of IP options and 12 bytes of TCP options. Fig-
`ure 24.20 shows the values of the relevant variables.
`
Variable    Value    Description
ip_hl       7        length of IP header + options in 32-bit words (= 28 bytes)
ip_len      63       length of IP datagram in bytes (20 + 8 + 20 + 12 + 3)
ti_off      8        length of TCP header + options in 32-bit words (= 32 bytes)
ti_seq      4        sequence number of first byte of data
ti_len      3        #bytes of TCP data: ip_len - (ip_hl x 4) - (ti_off x 4)
            6        sequence number of last byte of data: ti_seq + ti_len - 1

Figure 24.20 Values of variables corresponding to Figure 24.19.
`
ti_len is not a field that is transmitted in the TCP header. Instead, it is computed as
shown in Figure 24.20 and stored in the overlaid IP structure (Figure 24.12) once the
received header fields have been checksummed and verified. The last value in this figure
is not stored in the header, but is computed from the other values when needed.
`
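The arithmetic of Figure 24.20 can be worked through in a few lines. The function names below are hypothetical; only the formulas and the sample values come from the figure.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical helper: #bytes of TCP data,
 * ip_len - (ip_hl x 4) - (ti_off x 4) */
uint16_t tcp_data_len(uint16_t ip_len, unsigned ip_hl, unsigned ti_off)
{
    return (uint16_t)(ip_len - ip_hl * 4 - ti_off * 4);
}

/* Hypothetical helper: sequence number of the last byte of data,
 * ti_seq + ti_len - 1 (wraps modulo 2^32) */
uint32_t last_data_seq(uint32_t ti_seq, uint16_t ti_len)
{
    return ti_seq + ti_len - 1;
}

/* Values from Figure 24.20 */
void ti_len_demo(void)
{
    uint16_t ti_len = tcp_data_len(63, 7, 8);   /* 63 - 28 - 32 */
    assert(ti_len == 3);
    assert(last_data_seq(4, ti_len) == 6);
}
```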
`Modular Arithmetic with Sequence Numbers
`
A problem that TCP must deal with is that the sequence numbers are from a finite 32-bit
number space: 0 through 4,294,967,295. If more than 2^32 bytes of data are exchanged
across a TCP connection, the sequence numbers will be reused. Sequence numbers
wrap around from 4,294,967,295 to 0.
Even if less than 2^32 bytes of data are exchanged, wrap around is still a problem
because the sequence numbers for a connection don’t necessarily start at 0. The initial
sequence number for each direction of data flow across a connection can start anywhere
between 0 and 4,294,967,295. This complicates the comparison of sequence numbers.
For example, sequence number 1 is "greater than" 4,294,967,295, as we discuss below.
TCP sequence numbers are defined as unsigned longs in tcp.h:

    typedef u_long tcp_seq;
`
`The four macros shown in Figure 24.21 compare sequence numbers.
`
tcp_seq.h

    #define SEQ_LT(a,b)     ((int)((a)-(b)) < 0)
    #define SEQ_LEQ(a,b)    ((int)((a)-(b)) <= 0)
    #define SEQ_GT(a,b)     ((int)((a)-(b)) > 0)
    #define SEQ_GEQ(a,b)    ((int)((a)-(b)) >= 0)

Figure 24.21 Macros for TCP sequence number comparison.
`
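The claim that sequence number 1 is "greater than" 4,294,967,295 can be checked directly. The sketch below is a variant of the macros in Figure 24.21 that uses the fixed-width types uint32_t and int32_t in place of u_long and int; that substitution is an assumption made so the example behaves the same on systems where long is 64 bits.

```c
#include <assert.h>
#include <stdint.h>

typedef uint32_t tcp_seq;   /* assumption: fixed 32-bit type instead of u_long */

#define SEQ_LT(a,b)  ((int32_t)((a) - (b)) < 0)
#define SEQ_LEQ(a,b) ((int32_t)((a) - (b)) <= 0)
#define SEQ_GT(a,b)  ((int32_t)((a) - (b)) > 0)
#define SEQ_GEQ(a,b) ((int32_t)((a) - (b)) >= 0)

void seq_compare_demo(void)
{
    tcp_seq lo = 1, hi = 4294967295u;

    /* 1 - 4294967295 is 2 modulo 2^32, which is positive as a signed value */
    assert(SEQ_GT(lo, hi));
    /* 4294967295 - 1 is 4294967294, which is -2 as a signed value */
    assert(SEQ_LT(hi, lo));
    /* a plain unsigned comparison gives the opposite answer */
    assert(hi > lo);
    /* equal values satisfy both <= and >= */
    assert(SEQ_LEQ(lo, lo) && SEQ_GEQ(lo, lo));
}
```

The subtraction is done in unsigned arithmetic, so it wraps modulo 2^32, and the signed cast then reads the difference as a distance forward or backward of less than 2^31.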
`Example--Sequence Number Comparisons
`
`Let’s look at an example to see how TCP’s sequence numbers operate. Assume 3-bit
`sequence numbers, 0 through 7. Figure 24.22 shows these eight sequence numbers,
`their 3-bit binary representation, and their two’s complement representation. (To
`the two’s complement take the binary num