Skip to content

Commit

Permalink
Merge branch 'tcp-rack'
Browse files Browse the repository at this point in the history
Yuchung Cheng says:

====================
RACK loss detection

RACK (Recent ACK) loss recovery uses the notion of time instead of
packet sequence (FACK) or counts (dupthresh).

It's inspired by the FACK heuristic in tcp_mark_lost_retrans(): when a
limited transmit (new data packet) is sacked in recovery, then any
retransmission sent before that newly sacked packet was sent must have
been lost, since at least one round trip time has elapsed.

But that existing heuristic from tcp_mark_lost_retrans()
has several limitations:
  1) it can't detect tail drops since it depends on limited transmit
  2) it's disabled upon reordering (assumes no reordering)
  3) it's only enabled in fast recovery but not timeout recovery

RACK addresses these limitations with a core idea: an unacknowledged
packet P1 is deemed lost if a packet P2 that was sent later is is
s/acked, since at least one round trip has passed.

Since RACK cares about the time sequence instead of the data sequence
of packets, it can detect tail drops when a later retransmission is
s/acked, while FACK or dupthresh can't. For reordering RACK uses a
dynamically adjusted reordering window ("reo_wnd") to reduce false
positives on ever (small) degree of reordering, similar to the delayed
Early Retransmit.

In the current patch set RACK is only a supplemental loss detection
and does not trigger fast recovery. However we are developing RACK
to replace or consolidate FACK/dupthresh, early retransmit, and
thin-dupack. These heuristics all implicitly bear the time notion.
For example, the delayed Early Retransmit is simply applying RACK
to trigger the fast recovery with small inflight.

RACK requires measuring the minimum RTT. Tracking a global min is less
robust due to traffic engineering pathing changes. Therefore it uses a
windowed filter by Kathleen Nichols. The min RTT can also be useful
for various other purposes like congestion control or stat monitoring.

This patch has been used on Google servers for well over 1 year. RACK
has also been implemented in the QUIC protocol. We are submitting an
IETF draft as well.
====================

Signed-off-by: David S. Miller <davem@davemloft.net>
  • Loading branch information
David S. Miller committed Oct 21, 2015
2 parents c8fdc32 + 4f41b1c commit eb9fae3
Show file tree
Hide file tree
Showing 11 changed files with 286 additions and 86 deletions.
17 changes: 17 additions & 0 deletions Documentation/networking/ip-sysctl.txt
Original file line number Diff line number Diff line change
Expand Up @@ -384,6 +384,14 @@ tcp_mem - vector of 3 INTEGERs: min, pressure, max
Defaults are calculated at boot time from amount of available
memory.

tcp_min_rtt_wlen - INTEGER
The window length of the windowed min filter to track the minimum RTT.
A shorter window lets a flow more quickly pick up new (higher)
minimum RTT when it is moved to a longer path (e.g., due to traffic
engineering). A longer window makes the filter more resistant to RTT
inflations such as transient congestion. The unit is seconds.
Default: 300

tcp_moderate_rcvbuf - BOOLEAN
If set, TCP performs receive buffer auto-tuning, attempting to
automatically size the buffer (no greater than tcp_rmem[2]) to
Expand Down Expand Up @@ -425,6 +433,15 @@ tcp_orphan_retries - INTEGER
you should think about lowering this value, such sockets
may consume significant resources. Cf. tcp_max_orphans.

tcp_recovery - INTEGER
This value is a bitmap to enable various experimental loss recovery
features.

RACK: 0x1 enables the RACK loss detection for fast detection of lost
retransmissions and tail drops.

Default: 0x1

tcp_reordering - INTEGER
Initial reordering level of packets in a TCP stream.
TCP stack can then dynamically adjust flow reordering level
Expand Down
9 changes: 9 additions & 0 deletions include/linux/skbuff.h
Original file line number Diff line number Diff line change
Expand Up @@ -463,6 +463,15 @@ static inline u32 skb_mstamp_us_delta(const struct skb_mstamp *t1,
return delta_us;
}

static inline bool skb_mstamp_after(const struct skb_mstamp *t1,
const struct skb_mstamp *t0)
{
s32 diff = t1->stamp_jiffies - t0->stamp_jiffies;

if (!diff)
diff = t1->stamp_us - t0->stamp_us;
return diff > 0;
}

/**
* struct sk_buff - socket buffer
Expand Down
11 changes: 9 additions & 2 deletions include/linux/tcp.h
Original file line number Diff line number Diff line change
Expand Up @@ -194,6 +194,12 @@ struct tcp_sock {
u32 window_clamp; /* Maximal window to advertise */
u32 rcv_ssthresh; /* Current window clamp */

/* Information of the most recently (s)acked skb */
struct tcp_rack {
struct skb_mstamp mstamp; /* (Re)sent time of the skb */
u8 advanced; /* mstamp advanced since last lost marking */
u8 reord; /* reordering detected */
} rack;
u16 advmss; /* Advertised MSS */
u8 unused;
u8 nonagle : 4,/* Disable Nagle algorithm? */
Expand All @@ -217,6 +223,9 @@ struct tcp_sock {
u32 mdev_max_us; /* maximal mdev for the last rtt period */
u32 rttvar_us; /* smoothed mdev_max */
u32 rtt_seq; /* sequence number to update rttvar */
struct rtt_meas {
u32 rtt, ts; /* RTT in usec and sampling time in jiffies. */
} rtt_min[3];

u32 packets_out; /* Packets which are "in flight" */
u32 retrans_out; /* Retransmitted packets out */
Expand Down Expand Up @@ -280,8 +289,6 @@ struct tcp_sock {
int lost_cnt_hint;
u32 retransmit_high; /* L-bits may be on up to this seqno */

u32 lost_retrans_low; /* Sent seq after any rxmit (lowest) */

u32 prior_ssthresh; /* ssthresh saved at recovery start */
u32 high_seq; /* snd_nxt at onset of congestion */

Expand Down
21 changes: 21 additions & 0 deletions include/net/tcp.h
Original file line number Diff line number Diff line change
Expand Up @@ -279,6 +279,7 @@ extern int sysctl_tcp_limit_output_bytes;
extern int sysctl_tcp_challenge_ack_limit;
extern unsigned int sysctl_tcp_notsent_lowat;
extern int sysctl_tcp_min_tso_segs;
extern int sysctl_tcp_min_rtt_wlen;
extern int sysctl_tcp_autocorking;
extern int sysctl_tcp_invalid_ratelimit;
extern int sysctl_tcp_pacing_ss_ratio;
Expand Down Expand Up @@ -566,6 +567,7 @@ void tcp_resume_early_retransmit(struct sock *sk);
void tcp_rearm_rto(struct sock *sk);
void tcp_synack_rtt_meas(struct sock *sk, struct request_sock *req);
void tcp_reset(struct sock *sk);
void tcp_skb_mark_lost_uncond_verify(struct tcp_sock *tp, struct sk_buff *skb);

/* tcp_timer.c */
void tcp_init_xmit_timers(struct sock *);
Expand Down Expand Up @@ -671,6 +673,12 @@ static inline bool tcp_ca_dst_locked(const struct dst_entry *dst)
return dst_metric_locked(dst, RTAX_CC_ALGO);
}

/* Minimum RTT in usec. ~0 means not available. */
static inline u32 tcp_min_rtt(const struct tcp_sock *tp)
{
return tp->rtt_min[0].rtt;
}

/* Compute the actual receive window we are currently advertising.
* Rcv_nxt can be after the window if our peer push more data
* than the offered window.
Expand Down Expand Up @@ -1743,6 +1751,19 @@ int tcpv4_offload_init(void);
void tcp_v4_init(void);
void tcp_init(void);

/* tcp_recovery.c */

/* Flags to enable various loss recovery features. See below */
extern int sysctl_tcp_recovery;

/* Use TCP RACK to detect (some) tail and retransmit losses */
#define TCP_RACK_LOST_RETRANS 0x1

extern int tcp_rack_mark_lost(struct sock *sk);

extern void tcp_rack_advance(struct tcp_sock *tp,
const struct skb_mstamp *xmit_time, u8 sacked);

/*
* Save and compile IPv4 options, return a pointer to it
*/
Expand Down
1 change: 1 addition & 0 deletions net/ipv4/Makefile
Original file line number Diff line number Diff line change
Expand Up @@ -8,6 +8,7 @@ obj-y := route.o inetpeer.o protocol.o \
inet_timewait_sock.o inet_connection_sock.o \
tcp.o tcp_input.o tcp_output.o tcp_timer.o tcp_ipv4.o \
tcp_minisocks.o tcp_cong.o tcp_metrics.o tcp_fastopen.o \
tcp_recovery.o \
tcp_offload.o datagram.o raw.o udp.o udplite.o \
udp_offload.o arp.o icmp.o devinet.o af_inet.o igmp.o \
fib_frontend.o fib_semantics.o fib_trie.o \
Expand Down
14 changes: 14 additions & 0 deletions net/ipv4/sysctl_net_ipv4.c
Original file line number Diff line number Diff line change
Expand Up @@ -495,6 +495,13 @@ static struct ctl_table ipv4_table[] = {
.mode = 0644,
.proc_handler = proc_dointvec
},
{
.procname = "tcp_recovery",
.data = &sysctl_tcp_recovery,
.maxlen = sizeof(int),
.mode = 0644,
.proc_handler = proc_dointvec,
},
{
.procname = "tcp_reordering",
.data = &sysctl_tcp_reordering,
Expand Down Expand Up @@ -576,6 +583,13 @@ static struct ctl_table ipv4_table[] = {
.mode = 0644,
.proc_handler = proc_dointvec
},
{
.procname = "tcp_min_rtt_wlen",
.data = &sysctl_tcp_min_rtt_wlen,
.maxlen = sizeof(int),
.mode = 0644,
.proc_handler = proc_dointvec
},
{
.procname = "tcp_low_latency",
.data = &sysctl_tcp_low_latency,
Expand Down
1 change: 1 addition & 0 deletions net/ipv4/tcp.c
Original file line number Diff line number Diff line change
Expand Up @@ -388,6 +388,7 @@ void tcp_init_sock(struct sock *sk)

icsk->icsk_rto = TCP_TIMEOUT_INIT;
tp->mdev_us = jiffies_to_usecs(TCP_TIMEOUT_INIT);
tp->rtt_min[0].rtt = ~0U;

/* So many TCP implementations out there (incorrectly) count the
* initial SYN frame in their delayed-ACK and congestion control
Expand Down
Loading

0 comments on commit eb9fae3

Please sign in to comment.