Skip to content

Commit

Permalink
Merge branch 'tcp_ack_loops'
Browse files Browse the repository at this point in the history
Neal Cardwellsays:

====================
tcp: mitigate TCP ACK loops due to out-of-window validation dupacks

This patch series mitigates "ack loop" DoS scenarios by rate-limiting
outgoing duplicate ACKs sent in response to incoming "out of window"
segments.

Background
-----------

There are several cases in which the TCP RFCs specify that a TCP
endpoint should send a pure duplicate ACK in response to a pure
duplicate ACK that appears to be invalid due to being "out of window":

(1) RFC 793 (section 3.9, page 69) specifies that endpoints should
    send a duplicate ACK in response to an ACK when the incoming
    sequence number is invalid due to being outside the receive
    window: "If an incoming segment is not acceptable, an
    acknowledgment should be sent in reply".

(2) RFC 793 (section 3.9, page 72) says: "If the ACK acknowledges
    something not yet sent (SEG.ACK > SND.NXT) then send an ACK".

(3) RFC 1323 (section 4.2.1, page 18) specifies that endpoints should
    send a duplicate ACK in response to an ACK when the PAWS check for
    the incoming timestamp value fails: "If .... SEG.TSval < TS.Recent
    and if TS.Recent is valid ... Send an acknowledgement in reply"

The problem
------------

Normally, this is not a problem. However, a buggy middlebox or
malicious man-in-the-middle can inject a few packets into the
conversation that advance each endpoint's notion of the current window
(sequence, ACK, or timestamp), without either side noticing. In this
case, from then on each side can think the other is sending invalid
segments. Thus an infinite feedback loop of duplicate ACKs can ensue,
as each endpoint receives a duplicate ACK, decides that it is invalid
(due to sequence number, ACK number, or timestamp), and then sends a
dupack in reply, which the other side decides is invalid, responding
with a dupack... ad infinitum. This ping-pong feedback loop can happen
at a very high rate.

This phenomenon can and does happen in practice. It has been seen in
datacenter and Internet contexts at Google, and has been documented by
Anil Agarwal in the Nov 2013 tcpm thread "TCP mismatched sequence
numbers issue", and Avery Fay in the Feb 2015 Linux netdev thread
"Invalid timestamp? causing tight ack loop (hundreds of thousands of
packets / sec)".

This patch series
------------------

This patch series mitigates such ack loops by rate-limiting outgoing
duplicate ACKs sent in response to incoming TCP packets that are for
an existing connection but that are invalid due to any of the reasons
mentioned above: sequence number (1), ACK field (2), or timestamp
value (3). The rate limit for such duplicate ACKs is specified by a
new sysctl, tcp_invalid_ratelimit, which specifies the minimal space
between such outbound duplicate ACKs, in milliseconds. The default is
500 (500ms), and 0 disables the mechanism.

We rate-limit these duplicate ACK responses rather than blocking them
entirely or resetting the connection, because legitimate connections
can rely on dupacks in response to some out-of-window segments. For
example, zero window probes are typically sent with a sequence number
that is below the current window, and ZWPs thus expect to thus elicit
a dupack in response.

Testing: this approach has been in use at Google for a while.
====================

Signed-off-by: David S. Miller <davem@davemloft.net>
  • Loading branch information
David S. Miller committed Feb 8, 2015
2 parents ca53934 + 4fb17a6 commit f06535c
Show file tree
Hide file tree
Showing 8 changed files with 133 additions and 13 deletions.
22 changes: 22 additions & 0 deletions Documentation/networking/ip-sysctl.txt
Original file line number Diff line number Diff line change
Expand Up @@ -290,6 +290,28 @@ tcp_frto - INTEGER

By default it's enabled with a non-zero value. 0 disables F-RTO.

tcp_invalid_ratelimit - INTEGER
Limit the maximal rate for sending duplicate acknowledgments
in response to incoming TCP packets that are for an existing
connection but that are invalid due to any of these reasons:

(a) out-of-window sequence number,
(b) out-of-window acknowledgment number, or
(c) PAWS (Protection Against Wrapped Sequence numbers) check failure

This can help mitigate simple "ack loop" DoS attacks, wherein
a buggy or malicious middlebox or man-in-the-middle can
rewrite TCP header fields in manner that causes each endpoint
to think that the other is sending invalid TCP segments, thus
causing each side to send an unterminating stream of duplicate
acknowledgments for invalid segments.

Using 0 disables rate-limiting of dupacks in response to
invalid segments; otherwise this value specifies the minimal
space between sending such dupacks, in milliseconds.

Default: 500 (milliseconds).

tcp_keepalive_time - INTEGER
How often TCP sends out keepalive messages when keepalive is enabled.
Default: 2hours.
Expand Down
6 changes: 6 additions & 0 deletions include/linux/tcp.h
Original file line number Diff line number Diff line change
Expand Up @@ -115,6 +115,7 @@ struct tcp_request_sock {
u32 rcv_isn;
u32 snt_isn;
u32 snt_synack; /* synack sent time */
u32 last_oow_ack_time; /* last SYNACK */
u32 rcv_nxt; /* the ack # by SYNACK. For
* FastOpen it's the seq#
* after data-in-SYN.
Expand Down Expand Up @@ -152,6 +153,7 @@ struct tcp_sock {
u32 snd_sml; /* Last byte of the most recently transmitted small packet */
u32 rcv_tstamp; /* timestamp of last received ACK (for keepalives) */
u32 lsndtime; /* timestamp of last sent data packet (for restart window) */
u32 last_oow_ack_time; /* timestamp of last out-of-window ACK */

u32 tsoffset; /* timestamp offset */

Expand Down Expand Up @@ -340,6 +342,10 @@ struct tcp_timewait_sock {
u32 tw_rcv_wnd;
u32 tw_ts_offset;
u32 tw_ts_recent;

/* The time we sent the last out-of-window ACK: */
u32 tw_last_oow_ack_time;

long tw_ts_recent_stamp;
#ifdef CONFIG_TCP_MD5SIG
struct tcp_md5sig_key *tw_md5_key;
Expand Down
33 changes: 33 additions & 0 deletions include/net/tcp.h
Original file line number Diff line number Diff line change
Expand Up @@ -274,6 +274,7 @@ extern int sysctl_tcp_challenge_ack_limit;
extern unsigned int sysctl_tcp_notsent_lowat;
extern int sysctl_tcp_min_tso_segs;
extern int sysctl_tcp_autocorking;
extern int sysctl_tcp_invalid_ratelimit;

extern atomic_long_t tcp_memory_allocated;
extern struct percpu_counter tcp_sockets_allocated;
Expand Down Expand Up @@ -1144,6 +1145,7 @@ static inline void tcp_openreq_init(struct request_sock *req,
tcp_rsk(req)->rcv_isn = TCP_SKB_CB(skb)->seq;
tcp_rsk(req)->rcv_nxt = TCP_SKB_CB(skb)->seq + 1;
tcp_rsk(req)->snt_synack = tcp_time_stamp;
tcp_rsk(req)->last_oow_ack_time = 0;
req->mss = rx_opt->mss_clamp;
req->ts_recent = rx_opt->saw_tstamp ? rx_opt->rcv_tsval : 0;
ireq->tstamp_ok = rx_opt->tstamp_ok;
Expand Down Expand Up @@ -1236,6 +1238,37 @@ static inline bool tcp_paws_reject(const struct tcp_options_received *rx_opt,
return true;
}

/* Return true if we're currently rate-limiting out-of-window ACKs and
* thus shouldn't send a dupack right now. We rate-limit dupacks in
* response to out-of-window SYNs or ACKs to mitigate ACK loops or DoS
* attacks that send repeated SYNs or ACKs for the same connection. To
* do this, we do not send a duplicate SYNACK or ACK if the remote
* endpoint is sending out-of-window SYNs or pure ACKs at a high rate.
*/
static inline bool tcp_oow_rate_limited(struct net *net,
const struct sk_buff *skb,
int mib_idx, u32 *last_oow_ack_time)
{
/* Data packets without SYNs are not likely part of an ACK loop. */
if ((TCP_SKB_CB(skb)->seq != TCP_SKB_CB(skb)->end_seq) &&
!tcp_hdr(skb)->syn)
goto not_rate_limited;

if (*last_oow_ack_time) {
s32 elapsed = (s32)(tcp_time_stamp - *last_oow_ack_time);

if (0 <= elapsed && elapsed < sysctl_tcp_invalid_ratelimit) {
NET_INC_STATS_BH(net, mib_idx);
return true; /* rate-limited: don't send yet! */
}
}

*last_oow_ack_time = tcp_time_stamp;

not_rate_limited:
return false; /* not rate-limited: go ahead, send dupack now! */
}

static inline void tcp_mib_init(struct net *net)
{
/* See RFC 2012 */
Expand Down
6 changes: 6 additions & 0 deletions include/uapi/linux/snmp.h
Original file line number Diff line number Diff line change
Expand Up @@ -270,6 +270,12 @@ enum
LINUX_MIB_TCPHYSTARTTRAINCWND, /* TCPHystartTrainCwnd */
LINUX_MIB_TCPHYSTARTDELAYDETECT, /* TCPHystartDelayDetect */
LINUX_MIB_TCPHYSTARTDELAYCWND, /* TCPHystartDelayCwnd */
LINUX_MIB_TCPACKSKIPPEDSYNRECV, /* TCPACKSkippedSynRecv */
LINUX_MIB_TCPACKSKIPPEDPAWS, /* TCPACKSkippedPAWS */
LINUX_MIB_TCPACKSKIPPEDSEQ, /* TCPACKSkippedSeq */
LINUX_MIB_TCPACKSKIPPEDFINWAIT2, /* TCPACKSkippedFinWait2 */
LINUX_MIB_TCPACKSKIPPEDTIMEWAIT, /* TCPACKSkippedTimeWait */
LINUX_MIB_TCPACKSKIPPEDCHALLENGE, /* TCPACKSkippedChallenge */
__LINUX_MIB_MAX
};

Expand Down
6 changes: 6 additions & 0 deletions net/ipv4/proc.c
Original file line number Diff line number Diff line change
Expand Up @@ -292,6 +292,12 @@ static const struct snmp_mib snmp4_net_list[] = {
SNMP_MIB_ITEM("TCPHystartTrainCwnd", LINUX_MIB_TCPHYSTARTTRAINCWND),
SNMP_MIB_ITEM("TCPHystartDelayDetect", LINUX_MIB_TCPHYSTARTDELAYDETECT),
SNMP_MIB_ITEM("TCPHystartDelayCwnd", LINUX_MIB_TCPHYSTARTDELAYCWND),
SNMP_MIB_ITEM("TCPACKSkippedSynRecv", LINUX_MIB_TCPACKSKIPPEDSYNRECV),
SNMP_MIB_ITEM("TCPACKSkippedPAWS", LINUX_MIB_TCPACKSKIPPEDPAWS),
SNMP_MIB_ITEM("TCPACKSkippedSeq", LINUX_MIB_TCPACKSKIPPEDSEQ),
SNMP_MIB_ITEM("TCPACKSkippedFinWait2", LINUX_MIB_TCPACKSKIPPEDFINWAIT2),
SNMP_MIB_ITEM("TCPACKSkippedTimeWait", LINUX_MIB_TCPACKSKIPPEDTIMEWAIT),
SNMP_MIB_ITEM("TCPACKSkippedChallenge", LINUX_MIB_TCPACKSKIPPEDCHALLENGE),
SNMP_MIB_SENTINEL
};

Expand Down
7 changes: 7 additions & 0 deletions net/ipv4/sysctl_net_ipv4.c
Original file line number Diff line number Diff line change
Expand Up @@ -728,6 +728,13 @@ static struct ctl_table ipv4_table[] = {
.extra1 = &zero,
.extra2 = &one,
},
{
.procname = "tcp_invalid_ratelimit",
.data = &sysctl_tcp_invalid_ratelimit,
.maxlen = sizeof(int),
.mode = 0644,
.proc_handler = proc_dointvec_ms_jiffies,
},
{
.procname = "icmp_msgs_per_sec",
.data = &sysctl_icmp_msgs_per_sec,
Expand Down
30 changes: 23 additions & 7 deletions net/ipv4/tcp_input.c
Original file line number Diff line number Diff line change
Expand Up @@ -100,6 +100,7 @@ int sysctl_tcp_thin_dupack __read_mostly;

int sysctl_tcp_moderate_rcvbuf __read_mostly = 1;
int sysctl_tcp_early_retrans __read_mostly = 3;
int sysctl_tcp_invalid_ratelimit __read_mostly = HZ/2;

#define FLAG_DATA 0x01 /* Incoming frame contained data. */
#define FLAG_WIN_UPDATE 0x02 /* Incoming ACK was a window update. */
Expand Down Expand Up @@ -3321,13 +3322,22 @@ static int tcp_ack_update_window(struct sock *sk, const struct sk_buff *skb, u32
}

/* RFC 5961 7 [ACK Throttling] */
static void tcp_send_challenge_ack(struct sock *sk)
static void tcp_send_challenge_ack(struct sock *sk, const struct sk_buff *skb)
{
/* unprotected vars, we dont care of overwrites */
static u32 challenge_timestamp;
static unsigned int challenge_count;
u32 now = jiffies / HZ;
struct tcp_sock *tp = tcp_sk(sk);
u32 now;

/* First check our per-socket dupack rate limit. */
if (tcp_oow_rate_limited(sock_net(sk), skb,
LINUX_MIB_TCPACKSKIPPEDCHALLENGE,
&tp->last_oow_ack_time))
return;

/* Then check the check host-wide RFC 5961 rate limit. */
now = jiffies / HZ;
if (now != challenge_timestamp) {
challenge_timestamp = now;
challenge_count = 0;
Expand Down Expand Up @@ -3423,7 +3433,7 @@ static int tcp_ack(struct sock *sk, const struct sk_buff *skb, int flag)
if (before(ack, prior_snd_una)) {
/* RFC 5961 5.2 [Blind Data Injection Attack].[Mitigation] */
if (before(ack, prior_snd_una - tp->max_window)) {
tcp_send_challenge_ack(sk);
tcp_send_challenge_ack(sk, skb);
return -1;
}
goto old_ack;
Expand Down Expand Up @@ -4992,7 +5002,10 @@ static bool tcp_validate_incoming(struct sock *sk, struct sk_buff *skb,
tcp_paws_discard(sk, skb)) {
if (!th->rst) {
NET_INC_STATS_BH(sock_net(sk), LINUX_MIB_PAWSESTABREJECTED);
tcp_send_dupack(sk, skb);
if (!tcp_oow_rate_limited(sock_net(sk), skb,
LINUX_MIB_TCPACKSKIPPEDPAWS,
&tp->last_oow_ack_time))
tcp_send_dupack(sk, skb);
goto discard;
}
/* Reset is accepted even if it did not pass PAWS. */
Expand All @@ -5009,7 +5022,10 @@ static bool tcp_validate_incoming(struct sock *sk, struct sk_buff *skb,
if (!th->rst) {
if (th->syn)
goto syn_challenge;
tcp_send_dupack(sk, skb);
if (!tcp_oow_rate_limited(sock_net(sk), skb,
LINUX_MIB_TCPACKSKIPPEDSEQ,
&tp->last_oow_ack_time))
tcp_send_dupack(sk, skb);
}
goto discard;
}
Expand All @@ -5025,7 +5041,7 @@ static bool tcp_validate_incoming(struct sock *sk, struct sk_buff *skb,
if (TCP_SKB_CB(skb)->seq == tp->rcv_nxt)
tcp_reset(sk);
else
tcp_send_challenge_ack(sk);
tcp_send_challenge_ack(sk, skb);
goto discard;
}

Expand All @@ -5039,7 +5055,7 @@ static bool tcp_validate_incoming(struct sock *sk, struct sk_buff *skb,
if (syn_inerr)
TCP_INC_STATS_BH(sock_net(sk), TCP_MIB_INERRS);
NET_INC_STATS_BH(sock_net(sk), LINUX_MIB_TCPSYNCHALLENGE);
tcp_send_challenge_ack(sk);
tcp_send_challenge_ack(sk, skb);
goto discard;
}

Expand Down
36 changes: 30 additions & 6 deletions net/ipv4/tcp_minisocks.c
Original file line number Diff line number Diff line change
Expand Up @@ -58,6 +58,25 @@ static bool tcp_in_window(u32 seq, u32 end_seq, u32 s_win, u32 e_win)
return seq == e_win && seq == end_seq;
}

static enum tcp_tw_status
tcp_timewait_check_oow_rate_limit(struct inet_timewait_sock *tw,
const struct sk_buff *skb, int mib_idx)
{
struct tcp_timewait_sock *tcptw = tcp_twsk((struct sock *)tw);

if (!tcp_oow_rate_limited(twsk_net(tw), skb, mib_idx,
&tcptw->tw_last_oow_ack_time)) {
/* Send ACK. Note, we do not put the bucket,
* it will be released by caller.
*/
return TCP_TW_ACK;
}

/* We are rate-limiting, so just release the tw sock and drop skb. */
inet_twsk_put(tw);
return TCP_TW_SUCCESS;
}

/*
* * Main purpose of TIME-WAIT state is to close connection gracefully,
* when one of ends sits in LAST-ACK or CLOSING retransmitting FIN
Expand Down Expand Up @@ -116,7 +135,8 @@ tcp_timewait_state_process(struct inet_timewait_sock *tw, struct sk_buff *skb,
!tcp_in_window(TCP_SKB_CB(skb)->seq, TCP_SKB_CB(skb)->end_seq,
tcptw->tw_rcv_nxt,
tcptw->tw_rcv_nxt + tcptw->tw_rcv_wnd))
return TCP_TW_ACK;
return tcp_timewait_check_oow_rate_limit(
tw, skb, LINUX_MIB_TCPACKSKIPPEDFINWAIT2);

if (th->rst)
goto kill;
Expand Down Expand Up @@ -250,10 +270,8 @@ tcp_timewait_state_process(struct inet_timewait_sock *tw, struct sk_buff *skb,
inet_twsk_schedule(tw, &tcp_death_row, TCP_TIMEWAIT_LEN,
TCP_TIMEWAIT_LEN);

/* Send ACK. Note, we do not put the bucket,
* it will be released by caller.
*/
return TCP_TW_ACK;
return tcp_timewait_check_oow_rate_limit(
tw, skb, LINUX_MIB_TCPACKSKIPPEDTIMEWAIT);
}
inet_twsk_put(tw);
return TCP_TW_SUCCESS;
Expand Down Expand Up @@ -289,6 +307,7 @@ void tcp_time_wait(struct sock *sk, int state, int timeo)
tcptw->tw_ts_recent = tp->rx_opt.ts_recent;
tcptw->tw_ts_recent_stamp = tp->rx_opt.ts_recent_stamp;
tcptw->tw_ts_offset = tp->tsoffset;
tcptw->tw_last_oow_ack_time = 0;

#if IS_ENABLED(CONFIG_IPV6)
if (tw->tw_family == PF_INET6) {
Expand Down Expand Up @@ -467,6 +486,7 @@ struct sock *tcp_create_openreq_child(struct sock *sk, struct request_sock *req,
tcp_enable_early_retrans(newtp);
newtp->tlp_high_seq = 0;
newtp->lsndtime = treq->snt_synack;
newtp->last_oow_ack_time = 0;
newtp->total_retrans = req->num_retrans;

/* So many TCP implementations out there (incorrectly) count the
Expand Down Expand Up @@ -605,7 +625,11 @@ struct sock *tcp_check_req(struct sock *sk, struct sk_buff *skb,
* Reset timer after retransmitting SYNACK, similar to
* the idea of fast retransmit in recovery.
*/
if (!inet_rtx_syn_ack(sk, req))
if (!tcp_oow_rate_limited(sock_net(sk), skb,
LINUX_MIB_TCPACKSKIPPEDSYNRECV,
&tcp_rsk(req)->last_oow_ack_time) &&

!inet_rtx_syn_ack(sk, req))
req->expires = min(TCP_TIMEOUT_INIT << req->num_timeout,
TCP_RTO_MAX) + jiffies;
return NULL;
Expand Down

0 comments on commit f06535c

Please sign in to comment.