Transmission Control Protocol: reliable, ordered, byte-stream transport over IP. Connection-oriented, congestion-aware. Compare with UDP (connectionless, no ordering, no retransmission).
Header (20 bytes minimum)
 0                   1                   2                   3
 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
|          Source Port          |       Destination Port        |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
|                        Sequence Number                        |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
|                     Acknowledgment Number                     |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
| Data  |           |U|A|P|R|S|F|                               |
| Offset| Reserved  |R|C|S|S|Y|I|            Window             |
|       |           |G|K|H|T|N|N|                               |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
|           Checksum            |        Urgent Pointer         |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
|                      Options (variable)                       |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
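To make the layout concrete, here is a minimal Python sketch that unpacks the fixed 20-byte part of the header with the standard struct module. The names TcpHeader and parse_tcp_header are illustrative, not from any library, and any options after the first 20 bytes are ignored.

```python
import struct
from typing import NamedTuple


class TcpHeader(NamedTuple):
    src_port: int
    dst_port: int
    seq: int
    ack: int
    data_offset: int  # header length in 32-bit words
    flags: int        # low 6 bits: URG ACK PSH RST SYN FIN
    window: int
    checksum: int
    urgent: int


def parse_tcp_header(segment: bytes) -> TcpHeader:
    """Parse the fixed part of a TCP header (all fields are big-endian)."""
    if len(segment) < 20:
        raise ValueError("a TCP header needs at least 20 bytes")
    src, dst, seq, ack, off_flags, window, checksum, urgent = struct.unpack(
        "!HHIIHHHH", segment[:20]
    )
    data_offset = off_flags >> 12  # top 4 bits of the 16-bit word
    flags = off_flags & 0x3F       # bottom 6 bits of the 16-bit word
    return TcpHeader(src, dst, seq, ack, data_offset, flags,
                     window, checksum, urgent)


# Hand-built SYN segment: data offset 5 (no options), SYN bit (0x02) set.
raw = struct.pack("!HHIIHHHH", 51234, 443, 1000, 0, (5 << 12) | 0x02, 64240, 0, 0)
hdr = parse_tcp_header(raw)
print(hdr)
```

Data offset counts 32-bit words, so a value of 5 means a 20-byte header with no options.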
Flags
| Flag | Name | Meaning |
|---|---|---|
| SYN | Synchronize | Initiate connection (synchronize sequence numbers) |
| ACK | Acknowledgment | ACK field valid |
| FIN | Finish | I have no more data; half-close my side |
| RST | Reset | Abort connection (error / unexpected) |
| PSH | Push | Deliver to app immediately, don't buffer |
| URG | Urgent | Urgent pointer valid (rarely used in practice) |
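A small sketch of how the six flag bits map to names, assuming the bit order of the header layout above (FIN is the lowest bit, URG the highest of the six). FLAG_BITS and decode_flags are hypothetical names used only for illustration.

```python
from typing import List

# Bit values follow the header layout: URG ACK PSH RST SYN FIN, with FIN lowest.
FLAG_BITS = {
    "FIN": 0x01,
    "SYN": 0x02,
    "RST": 0x04,
    "PSH": 0x08,
    "ACK": 0x10,
    "URG": 0x20,
}


def decode_flags(flags: int) -> List[str]:
    """Return the names of all flags set in the 6-bit flags field."""
    return [name for name, bit in FLAG_BITS.items() if flags & bit]


print(decode_flags(0x12))  # second segment of the handshake
```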
Three-way handshake
client                                              server
  │                                                    │
  │ ──── SYN, seq=x ─────────────────────────────────► │  LISTEN → SYN_RECV
  │                                                    │
  │ ◄──── SYN+ACK, seq=y, ack=x+1 ──────────────────── │
  │                                                    │
  │ ──── ACK, seq=x+1, ack=y+1 ──────────────────────► │  SYN_RECV → ESTABLISHED
ESTABLISHED                                       ESTABLISHED
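The sequence and acknowledgment arithmetic of the handshake can be sketched as follows. This is only a model of the numbers, not real networking; the function name and the tuple layout are assumptions made for illustration. The one detail it highlights is that a SYN consumes one sequence number and that sequence numbers wrap modulo 2^32.

```python
SEQ_MOD = 2 ** 32  # sequence numbers are 32-bit and wrap around


def three_way_handshake(x: int, y: int):
    """Return the three segments as (sender, flags, seq, ack) tuples.

    x and y are the initial sequence numbers chosen by client and server.
    """
    syn = ("client", "SYN", x, None)
    syn_ack = ("server", "SYN+ACK", y, (x + 1) % SEQ_MOD)
    ack = ("client", "ACK", (x + 1) % SEQ_MOD, (y + 1) % SEQ_MOD)
    return [syn, syn_ack, ack]


for segment in three_way_handshake(x=1000, y=5000):
    print(segment)
```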
Four-way close (graceful)
A → B: FIN    A: FIN_WAIT_1
A ← B: ACK    A: FIN_WAIT_2    B: CLOSE_WAIT
A ← B: FIN                     B: LAST_ACK
A → B: ACK    A: TIME_WAIT (2×MSL) → CLOSED
TIME_WAIT (60 s on Linux, a compile-time constant) lets late duplicate segments expire and lets the final ACK be resent if the peer retransmits its FIN. High TIME_WAIT counts on busy servers are normal, not a leak.
State machine (highlights)
CLOSED → LISTEN (server bind + listen)
CLOSED → SYN_SENT (client connect)
SYN_SENT → ESTABLISHED
LISTEN → SYN_RECV → ESTABLISHED
ESTABLISHED → FIN_WAIT_1 / CLOSE_WAIT (depending on who closes first)
… → TIME_WAIT → CLOSED
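The highlighted transitions can be written as a lookup table. The sketch below is deliberately incomplete: it leaves out simultaneous open and close (the CLOSING state) and resets, and the event names are hypothetical labels rather than anything defined by the protocol.

```python
# (current state, event) -> next state; event names are illustrative only.
TRANSITIONS = {
    ("CLOSED", "passive_open"): "LISTEN",
    ("CLOSED", "active_open"): "SYN_SENT",
    ("LISTEN", "recv_syn"): "SYN_RECV",
    ("SYN_SENT", "recv_syn_ack"): "ESTABLISHED",
    ("SYN_RECV", "recv_ack"): "ESTABLISHED",
    ("ESTABLISHED", "close"): "FIN_WAIT_1",
    ("ESTABLISHED", "recv_fin"): "CLOSE_WAIT",
    ("FIN_WAIT_1", "recv_ack"): "FIN_WAIT_2",
    ("FIN_WAIT_2", "recv_fin"): "TIME_WAIT",
    ("CLOSE_WAIT", "close"): "LAST_ACK",
    ("LAST_ACK", "recv_ack"): "CLOSED",
    ("TIME_WAIT", "timeout_2msl"): "CLOSED",
}


def run(state, events):
    """Apply events in order and return the final state."""
    for event in events:
        key = (state, event)
        if key not in TRANSITIONS:
            raise ValueError(f"no transition for {event!r} in state {state}")
        state = TRANSITIONS[key]
    return state


# Client that connects and then closes first.
print(run("CLOSED", ["active_open", "recv_syn_ack", "close",
                     "recv_ack", "recv_fin", "timeout_2msl"]))
```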
Reliability mechanisms
- Sequence numbers count every payload byte, which gives ordering and duplicate detection.
- Cumulative ACKs: ACK n means "I have everything up to byte n-1".
- Selective ACK (SACK) option lets the receiver report non-contiguous received ranges.
- Retransmission: on RTO expiry (timer derived from the smoothed RTT and its variance, SRTT + 4×RTTVAR per RFC 6298) or fast retransmit on 3 duplicate ACKs.
- Checksum covers header + data + pseudo-header (source and destination IP addresses, protocol, TCP length).
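As an illustration of cumulative ACKs and SACK, the sketch below computes what a receiver could report given the byte ranges that have arrived. It is a simplified model: ranges are half-open (start, end) pairs, sequence-number wraparound is ignored, and ack_and_sack is a hypothetical helper name.

```python
from typing import List, Tuple

Range = Tuple[int, int]  # half-open byte range [start, end)


def ack_and_sack(next_expected: int, received: List[Range]):
    """Return (cumulative ACK number, list of SACK blocks).

    next_expected is the first byte not yet acknowledged.
    """
    ack = next_expected
    blocks: List[Range] = []
    for start, end in sorted(received):
        if start <= ack:
            # Contiguous with in-order data: the cumulative ACK advances.
            ack = max(ack, end)
        elif blocks and start <= blocks[-1][1]:
            # Touches or overlaps the previous out-of-order block: merge.
            blocks[-1] = (blocks[-1][0], max(blocks[-1][1], end))
        else:
            # A new hole precedes this range: report it as a SACK block.
            blocks.append((start, end))
    return ack, blocks


# Bytes 1800..1999 are missing, so the ACK stops at 1800.
print(ack_and_sack(1000, [(1000, 1500), (2000, 2500), (2500, 3000), (1500, 1800)]))
```

An ACK of 1800 here means every byte up to 1799 has arrived, while the SACK block tells the sender that 2000 to 2999 need not be retransmitted.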
Flow control
The receiver advertises a window (bytes the sender may have in flight without ACK).
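- Window = 0: the receiver's buffer is full; the sender pauses and sends zero-window probes until the window reopens.
- Window scaling option (RFC 7323, which obsoletes RFC 1323) left-shifts the 16-bit field by up to 14 bits, giving windows of up to 30 bits (about 1 GiB) for fat pipes.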
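The window-scaling arithmetic is easy to check by hand. The sketch below assumes the usual limits (16-bit field, shift of at most 14) and uses two hypothetical helpers: one computes the effective window, the other finds the smallest shift that covers a bandwidth-delay product.

```python
from typing import Optional

MAX_FIELD = 0xFFFF  # the Window field is 16 bits
MAX_SHIFT = 14      # largest shift the Window Scale option allows


def effective_window(window_field: int, shift: int) -> int:
    """Bytes the sender may have in flight for a given field and shift."""
    if not 0 <= window_field <= MAX_FIELD:
        raise ValueError("window field must fit in 16 bits")
    if not 0 <= shift <= MAX_SHIFT:
        raise ValueError("shift must be between 0 and 14")
    return window_field << shift


def min_shift_for(bandwidth_bps: float, rtt_s: float) -> Optional[int]:
    """Smallest shift whose maximum window covers bandwidth x RTT."""
    bdp_bytes = bandwidth_bps / 8 * rtt_s
    for shift in range(MAX_SHIFT + 1):
        if (MAX_FIELD << shift) >= bdp_bytes:
            return shift
    return None  # the path needs more than TCP can advertise


print(effective_window(MAX_FIELD, MAX_SHIFT))  # largest possible window
print(min_shift_for(1e9, 0.1))                 # 1 Gbit/s path with 100 ms RTT
```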
Congestion control
Distinct from flow control: it protects the network, not the receiver.
cwnd                 : congestion window (bytes the sender may have in flight)
ssthresh             : slow-start threshold
slow start           : cwnd doubles each RTT until ssthresh or loss
congestion avoidance : cwnd += MSS per RTT
fast recovery        : on 3 dup-ACKs, halve cwnd, retransmit, continue
RTO timeout          : ssthresh → half the in-flight data, cwnd → 1 MSS
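The rules above can be turned into a toy simulation. This is a simplified Reno-style sketch, not how any real stack is implemented: it works at one event per RTT, caps slow start at ssthresh, and the event names ("ack", "dupack3", "rto") and default parameters are assumptions chosen for illustration.

```python
def simulate_cwnd(events, mss=1460, init_cwnd_segments=10, ssthresh=64 * 1024):
    """Return the cwnd (in bytes) after each per-RTT event.

    events: iterable of "ack" (a loss-free RTT), "dupack3" or "rto".
    """
    cwnd = init_cwnd_segments * mss
    history = [cwnd]
    for event in events:
        if event == "ack":
            if cwnd < ssthresh:
                cwnd = min(cwnd * 2, ssthresh)  # slow start: double per RTT
            else:
                cwnd += mss                     # congestion avoidance
        elif event == "dupack3":
            ssthresh = max(cwnd // 2, 2 * mss)
            cwnd = ssthresh                     # fast recovery: halve, continue
        elif event == "rto":
            ssthresh = max(cwnd // 2, 2 * mss)
            cwnd = mss                          # timeout: restart from 1 MSS
        else:
            raise ValueError(f"unknown event {event!r}")
        history.append(cwnd)
    return history


trace = simulate_cwnd(["ack"] * 4 + ["dupack3", "ack", "rto", "ack"],
                      mss=1000, init_cwnd_segments=1, ssthresh=8000)
print(trace)
```

Plotting the returned list gives the familiar sawtooth: exponential growth, linear growth, a halving on duplicate ACKs, and a collapse to one segment on timeout.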
Algorithms: Reno, CUBIC (Linux default), BBR (Google; models bottleneck bandwidth and RTT rather than treating loss as the primary signal).
sysctl net.ipv4.tcp_congestion_control # algorithm currently in use
sysctl -w net.ipv4.tcp_congestion_control=bbr # switch to BBR (needs root)
ss -ti # per-socket: cwnd, rtt, retrans
MSS, MTU, PMTU
- MTU: max IP packet size on a link (Ethernet: 1500 bytes).
- MSS: max TCP payload = MTU - IP header - TCP header (1500 - 20 - 20 = 1460 for IPv4 over Ethernet). Each side announces its MSS in the SYN options.
- Path MTU Discovery finds the smallest MTU on the path via ICMP "fragmentation needed" messages. Black holes happen when ICMP is filtered: connections hang on large transfers.
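A quick worked version of the MSS arithmetic, assuming minimum-size IP headers (20 bytes for IPv4, 40 for IPv6, no IPv4 options or IPv6 extension headers) and a 20-byte TCP header. The function names are illustrative.

```python
IP_HEADER_BYTES = {4: 20, 6: 40}  # minimum header sizes
TCP_HEADER_BYTES = 20             # without options


def mss_for(mtu: int, ip_version: int = 4) -> int:
    """MSS that fits in one packet of the given MTU."""
    return mtu - IP_HEADER_BYTES[ip_version] - TCP_HEADER_BYTES


def payload_per_segment(mtu: int, ip_version: int = 4, tcp_option_bytes: int = 0) -> int:
    """Payload left once per-segment TCP options are subtracted."""
    return mss_for(mtu, ip_version) - tcp_option_bytes


print(mss_for(1500))                    # IPv4 over Ethernet
print(mss_for(1500, ip_version=6))      # IPv6 over Ethernet
print(payload_per_segment(1500, 4, 12)) # with the Timestamps option (10 bytes padded to 12)
```

The last figure, 1448, is the mss value that commonly shows up in ss output on Ethernet paths with timestamps enabled.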
Options (common)
| Option | Purpose |
|---|---|
| MSS | Announce max segment size |
| Window Scale | Shift the 16-bit Window field |
| SACK Permitted | Both sides will use selective ACK |
| Timestamps | Better RTT measurement, PAWS (wrap protection) |
| TCP Fast Open | Carry data in SYN on repeat connections |
Nagle, delayed ACK, cork
- Nagle's algorithm coalesces small writes: a small segment is sent only when no previously sent data is still unACKed. Reduces packets, adds latency. Disable on interactive protocols:
  int one = 1; setsockopt(fd, IPPROTO_TCP, TCP_NODELAY, &one, sizeof(one));
- Delayed ACK holds the ACK (~40 ms on Linux, up to 200 ms on some stacks) hoping to piggyback on a reply.
- TCP_CORK (Linux): hold sends until the cork is released; efficient for headers + body. The BSD equivalent is TCP_NOPUSH.
Nagle + delayed-ACK interaction is a classic source of mysterious 40-200 ms latency.
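The same option is available from Python's standard socket module. A minimal sketch follows; make_low_latency_socket is a hypothetical helper name, and the socket is never connected, so this only demonstrates setting and reading back the option.

```python
import socket


def make_low_latency_socket() -> socket.socket:
    """Create a TCP socket with Nagle's algorithm disabled."""
    sock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    sock.setsockopt(socket.IPPROTO_TCP, socket.TCP_NODELAY, 1)
    return sock


sock = make_low_latency_socket()
# getsockopt returns a non-zero value when the option is enabled.
nodelay = sock.getsockopt(socket.IPPROTO_TCP, socket.TCP_NODELAY)
sock.close()
print("TCP_NODELAY enabled:", bool(nodelay))
```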
Ports
| Range | Class |
|---|---|
| 0-1023 | Well-known (HTTP 80, HTTPS 443, SSH 22, SMTP 25) |
| 1024-49151 | Registered |
| 49152-65535 | Ephemeral (client source ports) |
Linux ephemeral range (default 32768-60999, which differs from the IANA range above): cat /proc/sys/net/ipv4/ip_local_port_range.
Diagnostic toolbox
ss -tunlp # listening sockets (TCP/UDP, numeric, processes)
ss -tan state established # all established TCP sockets
ss -ti # per-socket TCP info (cwnd, rtt, retrans)
ss -s # summary by state
netstat -anp # older equivalent of ss
# Reachability
nc -vz host 443 # is port open?
nc -l 9000 # listen on 9000 (server)
nc host 9000 # connect (client)
# Path / latency
ping host
traceroute host
tracepath host # similar, no root needed; also reports path MTU
mtr host # continuous traceroute + loss
# Packet capture
sudo tcpdump -ni eth0 'tcp port 443 and host 1.2.3.4'
sudo tcpdump -w cap.pcap … # write for Wireshark
tshark -r cap.pcap -Y 'tcp.flags.syn==1 and tcp.flags.ack==0'
# Throughput
iperf3 -s # server
iperf3 -c server -P 4 -t 30 # 4 parallel streams, 30s
# Kernel knobs (Linux)
sysctl net.ipv4.tcp_rmem net.ipv4.tcp_wmem # buffer sizes
sysctl net.ipv4.tcp_window_scaling
sysctl net.core.somaxconn # listen() backlog cap
Reading ss -ti
ESTAB 0 0 10.0.0.1:443 10.0.0.2:51234
cubic wscale:7,7 rto:204 rtt:3.2/1.5 mss:1448 cwnd:10 ssthresh:7
bytes_acked:12345 segs_out:12 segs_in:10
- rtt: smoothed RTT / mean deviation (ms)
- cwnd: congestion window in segments (multiples of MSS)
- ssthresh: slow-start threshold
- retrans: currently outstanding / total retransmits
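For scripting, the key:value tokens of the info line can be split into a dictionary. This is a rough sketch, not a full parser: real ss output also contains fields separated by a space (for example a send rate), which this code simply collects as bare tokens. parse_ss_info, rtt_ms and cwnd_bytes are hypothetical names.

```python
def parse_ss_info(line: str) -> dict:
    """Split an `ss -ti` info line into a dict of key -> raw string value.

    Tokens without a colon (such as the congestion algorithm name)
    are collected under the "bare" key.
    """
    info = {"bare": []}
    for token in line.split():
        if ":" in token:
            key, value = token.split(":", 1)
            info[key] = value
        else:
            info["bare"].append(token)
    return info


def rtt_ms(info: dict):
    """Return (smoothed RTT, mean deviation) in milliseconds."""
    srtt, deviation = info["rtt"].split("/")
    return float(srtt), float(deviation)


def cwnd_bytes(info: dict) -> int:
    """Convert cwnd from segments to bytes using the reported mss."""
    return int(info["cwnd"]) * int(info["mss"])


sample = "cubic wscale:7,7 rto:204 rtt:3.2/1.5 mss:1448 cwnd:10 ssthresh:7"
info = parse_ss_info(sample)
print(info["bare"], rtt_ms(info), cwnd_bytes(info))
```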
TCP vs UDP
| Aspect | TCP | UDP |
|---|---|---|
| Connection | Yes (handshake) | None |
| Delivery | Reliable, ordered | Best-effort, unordered |
| Congestion ctrl | Built-in | App-level (or none) |
| Header | 20+ bytes | 8 bytes |
| Use cases | HTTP/1.1, HTTP/2, SSH, SMTP, DBs | DNS, NTP, VoIP, gaming, QUIC base |
QUIC (the basis of HTTP/3) is over UDP but implements TCP-like reliability + congestion control in userspace.
Gotchas
- A RST is not graceful: the peer's pending reads/writes error out. Use FIN for clean closes.
- EADDRINUSE after restart: there's a TIME_WAIT socket on that port. Solutions: SO_REUSEADDR, change port, wait it out.
- Bare-socket apps must call setsockopt(TCP_NODELAY) for low-latency RPC; Nagle is on by default.
- A "stuck" connection is often a dropped NAT mapping: idle longer than the middlebox timer. Mitigate with TCP keepalive (SO_KEEPALIVE + tuned intervals) or app-level pings.
- High Recv-Q in ss means the app isn't reading fast enough; it is not a network problem.
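Two of these mitigations, SO_REUSEADDR and tuned keepalive, can be sketched with Python's socket module. The timer values below are arbitrary examples, the helper names are hypothetical, and the three TCP_KEEP* options are platform-specific (available on Linux), so the code only sets them when the socket module exposes them.

```python
import socket


def enable_keepalive(sock, idle_s=60, interval_s=10, probes=5):
    """Turn on TCP keepalive and, where supported, tune its timers."""
    sock.setsockopt(socket.SOL_SOCKET, socket.SO_KEEPALIVE, 1)
    tuning = (
        ("TCP_KEEPIDLE", idle_s),       # idle seconds before the first probe
        ("TCP_KEEPINTVL", interval_s),  # seconds between probes
        ("TCP_KEEPCNT", probes),        # failed probes before giving up
    )
    for name, value in tuning:
        option = getattr(socket, name, None)  # missing on some platforms
        if option is not None:
            sock.setsockopt(socket.IPPROTO_TCP, option, value)


def make_listener(port=0):
    """Listening socket that can rebind while old sockets sit in TIME_WAIT."""
    srv = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    srv.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEADDR, 1)
    srv.bind(("127.0.0.1", port))  # port 0 lets the kernel pick a free port
    srv.listen()
    return srv


client = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
enable_keepalive(client)
print("keepalive on:", bool(client.getsockopt(socket.SOL_SOCKET, socket.SO_KEEPALIVE)))
client.close()
```

Keepalive intervals should be shorter than the idle timeout of whatever NAT or firewall sits on the path; that timeout is deployment-specific and has to be measured or looked up.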