Skip to content

Commit

Permalink
Merge branch 'vrf-allow-simultaneous-service-instances-in-default-and…
Browse files Browse the repository at this point in the history
…-other-VRFs'

Mike Manning says:

====================
vrf: allow simultaneous service instances in default and other VRFs

Services currently have to be VRF-aware if they are using an unbound
socket. One cannot have multiple service instances running in the
default and other VRFs for services that are not VRF-aware and listen
on an unbound socket. This is because there is no easy way of isolating
packets received in the default VRF from those arriving in other VRFs.

This series provides this isolation for stream sockets subject to the
existing kernel parameter net.ipv4.tcp_l3mdev_accept not being set,
given that this is documented as allowing a single service instance to
work across all VRF domains. Similarly, net.ipv4.udp_l3mdev_accept is
checked for datagram sockets, and net.ipv4.raw_l3mdev_accept is
introduced for raw sockets. The functionality applies to UDP & TCP
services as well as those using raw sockets, and is for IPv4 and IPv6.

Example of running ssh instances in default and blue VRF:

$ /usr/sbin/sshd -D
$ ip vrf exec vrf-blue /usr/sbin/sshd
$ ss -ta | egrep 'State|ssh'
State   Recv-Q   Send-Q           Local Address:Port       Peer Address:Port
LISTEN  0        128           0.0.0.0%vrf-blue:ssh             0.0.0.0:*
LISTEN  0        128                    0.0.0.0:ssh             0.0.0.0:*
ESTAB   0        0              192.168.122.220:ssh       192.168.122.1:50282
LISTEN  0        128              [::]%vrf-blue:ssh                [::]:*
LISTEN  0        128                       [::]:ssh                [::]:*
ESTAB   0        0           [3000::2]%vrf-blue:ssh           [3000::9]:45896
ESTAB   0        0                    [2000::2]:ssh           [2000::9]:46398

v1:
   - Address Paolo Abeni's comments (patch 4/5)
   - Fix build when CONFIG_NET_L3_MASTER_DEV not defined (patch 1/5)
v2:
   - Address David Aherns' comments (patches 4/5 and 5/5)
   - Remove patches 3/5 and 5/5 from series for individual submissions
   - Include a sysctl for raw sockets as recommended by David Ahern
   - Expand series into 10 patches and provide improved descriptions
v3:
   - Update description for patch 1/10 and remove patch 6/10
v4:
   - Set default to enabled for raw socket sysctl as recommended by David Ahern
v5:
   - Address review comments from David Ahern in patches 2-5
====================

Signed-off-by: David S. Miller <davem@davemloft.net>
  • Loading branch information
David S. Miller committed Nov 8, 2018
2 parents f601a85 + 7bd2db4 commit 7e22561
Show file tree
Hide file tree
Showing 22 changed files with 243 additions and 84 deletions.
12 changes: 12 additions & 0 deletions Documentation/networking/ip-sysctl.txt
Original file line number Diff line number Diff line change
Expand Up @@ -370,6 +370,7 @@ tcp_l3mdev_accept - BOOLEAN
derived from the listen socket to be bound to the L3 domain in
which the packets originated. Only valid when the kernel was
compiled with CONFIG_NET_L3_MASTER_DEV.
Default: 0 (disabled)

tcp_low_latency - BOOLEAN
This is a legacy option, it has no effect anymore.
Expand Down Expand Up @@ -773,6 +774,7 @@ udp_l3mdev_accept - BOOLEAN
being received regardless of the L3 domain in which they
originated. Only valid when the kernel was compiled with
CONFIG_NET_L3_MASTER_DEV.
Default: 0 (disabled)

udp_mem - vector of 3 INTEGERs: min, pressure, max
Number of pages allowed for queueing by all UDP sockets.
Expand All @@ -799,6 +801,16 @@ udp_wmem_min - INTEGER
total pages of UDP sockets exceed udp_mem pressure. The unit is byte.
Default: 4K

RAW variables:

raw_l3mdev_accept - BOOLEAN
Enabling this option allows a "global" bound socket to work
across L3 master domains (e.g., VRFs) with packets capable of
being received regardless of the L3 domain in which they
originated. Only valid when the kernel was compiled with
CONFIG_NET_L3_MASTER_DEV.
Default: 1 (enabled)

CIPSOv4 Variables:

cipso_cache_enable - BOOLEAN
Expand Down
22 changes: 18 additions & 4 deletions Documentation/networking/vrf.txt
Original file line number Diff line number Diff line change
Expand Up @@ -103,19 +103,33 @@ VRF device:

or to specify the output device using cmsg and IP_PKTINFO.

By default the scope of the port bindings for unbound sockets is
limited to the default VRF. That is, it will not be matched by packets
arriving on interfaces enslaved to an l3mdev and processes may bind to
the same port if they bind to an l3mdev.

TCP & UDP services running in the default VRF context (ie., not bound
to any VRF device) can work across all VRF domains by enabling the
tcp_l3mdev_accept and udp_l3mdev_accept sysctl options:

sysctl -w net.ipv4.tcp_l3mdev_accept=1
sysctl -w net.ipv4.udp_l3mdev_accept=1

These options are disabled by default so that a socket in a VRF is only
selected for packets in that VRF. There is a similar option for RAW
sockets, which is enabled by default for reasons of backwards compatibility.
This is so as to specify the output device with cmsg and IP_PKTINFO, but
using a socket not bound to the corresponding VRF. This allows e.g. older ping
implementations to be run with specifying the device but without executing it
in the VRF. This option can be disabled so that packets received in a VRF
context are only handled by a raw socket bound to the VRF, and packets in the
default VRF are only handled by a socket not bound to any VRF:

sysctl -w net.ipv4.raw_l3mdev_accept=0

netfilter rules on the VRF device can be used to limit access to services
running in the default VRF context as well.

The default VRF does not have limited scope with respect to port bindings.
That is, if a process does a wildcard bind to a port in the default VRF it
owns the port across all VRF domains within the network namespace.

################################################################################

Using iproute2 for VRFs
Expand Down
19 changes: 9 additions & 10 deletions drivers/net/vrf.c
Original file line number Diff line number Diff line change
Expand Up @@ -981,24 +981,23 @@ static struct sk_buff *vrf_ip6_rcv(struct net_device *vrf_dev,
struct sk_buff *skb)
{
int orig_iif = skb->skb_iif;
bool need_strict;
bool need_strict = rt6_need_strict(&ipv6_hdr(skb)->daddr);
bool is_ndisc = ipv6_ndisc_frame(skb);

/* loopback traffic; do not push through packet taps again.
* Reset pkt_type for upper layers to process skb
/* loopback, multicast & non-ND link-local traffic; do not push through
* packet taps again. Reset pkt_type for upper layers to process skb
*/
if (skb->pkt_type == PACKET_LOOPBACK) {
if (skb->pkt_type == PACKET_LOOPBACK || (need_strict && !is_ndisc)) {
skb->dev = vrf_dev;
skb->skb_iif = vrf_dev->ifindex;
IP6CB(skb)->flags |= IP6SKB_L3SLAVE;
skb->pkt_type = PACKET_HOST;
if (skb->pkt_type == PACKET_LOOPBACK)
skb->pkt_type = PACKET_HOST;
goto out;
}

/* if packet is NDISC or addressed to multicast or link-local
* then keep the ingress interface
*/
need_strict = rt6_need_strict(&ipv6_hdr(skb)->daddr);
if (!ipv6_ndisc_frame(skb) && !need_strict) {
/* if packet is NDISC then keep the ingress interface */
if (!is_ndisc) {
vrf_rx_stats(vrf_dev, skb->len);
skb->dev = vrf_dev;
skb->skb_iif = vrf_dev->ifindex;
Expand Down
5 changes: 2 additions & 3 deletions include/net/inet6_hashtables.h
Original file line number Diff line number Diff line change
Expand Up @@ -115,9 +115,8 @@ int inet6_hash(struct sock *sk);
((__sk)->sk_family == AF_INET6) && \
ipv6_addr_equal(&(__sk)->sk_v6_daddr, (__saddr)) && \
ipv6_addr_equal(&(__sk)->sk_v6_rcv_saddr, (__daddr)) && \
(!(__sk)->sk_bound_dev_if || \
((__sk)->sk_bound_dev_if == (__dif)) || \
((__sk)->sk_bound_dev_if == (__sdif))) && \
(((__sk)->sk_bound_dev_if == (__dif)) || \
((__sk)->sk_bound_dev_if == (__sdif))) && \
net_eq(sock_net(__sk), (__net)))

#endif /* _INET6_HASHTABLES_H */
24 changes: 17 additions & 7 deletions include/net/inet_hashtables.h
Original file line number Diff line number Diff line change
Expand Up @@ -79,6 +79,7 @@ struct inet_ehash_bucket {

struct inet_bind_bucket {
possible_net_t ib_net;
int l3mdev;
unsigned short port;
signed char fastreuse;
signed char fastreuseport;
Expand Down Expand Up @@ -188,10 +189,21 @@ static inline void inet_ehash_locks_free(struct inet_hashinfo *hashinfo)
hashinfo->ehash_locks = NULL;
}

static inline bool inet_sk_bound_dev_eq(struct net *net, int bound_dev_if,
int dif, int sdif)
{
#if IS_ENABLED(CONFIG_NET_L3_MASTER_DEV)
return inet_bound_dev_eq(!!net->ipv4.sysctl_tcp_l3mdev_accept,
bound_dev_if, dif, sdif);
#else
return inet_bound_dev_eq(true, bound_dev_if, dif, sdif);
#endif
}

struct inet_bind_bucket *
inet_bind_bucket_create(struct kmem_cache *cachep, struct net *net,
struct inet_bind_hashbucket *head,
const unsigned short snum);
const unsigned short snum, int l3mdev);
void inet_bind_bucket_destroy(struct kmem_cache *cachep,
struct inet_bind_bucket *tb);

Expand Down Expand Up @@ -282,9 +294,8 @@ static inline struct sock *inet_lookup_listener(struct net *net,
#define INET_MATCH(__sk, __net, __cookie, __saddr, __daddr, __ports, __dif, __sdif) \
(((__sk)->sk_portpair == (__ports)) && \
((__sk)->sk_addrpair == (__cookie)) && \
(!(__sk)->sk_bound_dev_if || \
((__sk)->sk_bound_dev_if == (__dif)) || \
((__sk)->sk_bound_dev_if == (__sdif))) && \
(((__sk)->sk_bound_dev_if == (__dif)) || \
((__sk)->sk_bound_dev_if == (__sdif))) && \
net_eq(sock_net(__sk), (__net)))
#else /* 32-bit arch */
#define INET_ADDR_COOKIE(__name, __saddr, __daddr) \
Expand All @@ -294,9 +305,8 @@ static inline struct sock *inet_lookup_listener(struct net *net,
(((__sk)->sk_portpair == (__ports)) && \
((__sk)->sk_daddr == (__saddr)) && \
((__sk)->sk_rcv_saddr == (__daddr)) && \
(!(__sk)->sk_bound_dev_if || \
((__sk)->sk_bound_dev_if == (__dif)) || \
((__sk)->sk_bound_dev_if == (__sdif))) && \
(((__sk)->sk_bound_dev_if == (__dif)) || \
((__sk)->sk_bound_dev_if == (__sdif))) && \
net_eq(sock_net(__sk), (__net)))
#endif /* 64-bit arch */

Expand Down
21 changes: 21 additions & 0 deletions include/net/inet_sock.h
Original file line number Diff line number Diff line change
Expand Up @@ -130,6 +130,27 @@ static inline int inet_request_bound_dev_if(const struct sock *sk,
return sk->sk_bound_dev_if;
}

static inline int inet_sk_bound_l3mdev(const struct sock *sk)
{
#ifdef CONFIG_NET_L3_MASTER_DEV
struct net *net = sock_net(sk);

if (!net->ipv4.sysctl_tcp_l3mdev_accept)
return l3mdev_master_ifindex_by_index(net,
sk->sk_bound_dev_if);
#endif

return 0;
}

static inline bool inet_bound_dev_eq(bool l3mdev_accept, int bound_dev_if,
int dif, int sdif)
{
if (!bound_dev_if)
return !sdif || l3mdev_accept;
return bound_dev_if == dif || bound_dev_if == sdif;
}

struct inet_cork {
unsigned int flags;
__be32 addr;
Expand Down
3 changes: 3 additions & 0 deletions include/net/netns/ipv4.h
Original file line number Diff line number Diff line change
Expand Up @@ -103,6 +103,9 @@ struct netns_ipv4 {
/* Shall we try to damage output packets if routing dev changes? */
int sysctl_ip_dynaddr;
int sysctl_ip_early_demux;
#ifdef CONFIG_NET_L3_MASTER_DEV
int sysctl_raw_l3mdev_accept;
#endif
int sysctl_tcp_early_demux;
int sysctl_udp_early_demux;

Expand Down
14 changes: 13 additions & 1 deletion include/net/raw.h
Original file line number Diff line number Diff line change
Expand Up @@ -17,7 +17,7 @@
#ifndef _RAW_H
#define _RAW_H


#include <net/inet_sock.h>
#include <net/protocol.h>
#include <linux/icmp.h>

Expand Down Expand Up @@ -61,6 +61,7 @@ void raw_seq_stop(struct seq_file *seq, void *v);

int raw_hash_sk(struct sock *sk);
void raw_unhash_sk(struct sock *sk);
void raw_init(void);

struct raw_sock {
/* inet_sock has to be the first member */
Expand All @@ -74,4 +75,15 @@ static inline struct raw_sock *raw_sk(const struct sock *sk)
return (struct raw_sock *)sk;
}

static inline bool raw_sk_bound_dev_eq(struct net *net, int bound_dev_if,
int dif, int sdif)
{
#if IS_ENABLED(CONFIG_NET_L3_MASTER_DEV)
return inet_bound_dev_eq(!!net->ipv4.sysctl_raw_l3mdev_accept,
bound_dev_if, dif, sdif);
#else
return inet_bound_dev_eq(true, bound_dev_if, dif, sdif);
#endif
}

#endif /* _RAW_H */
11 changes: 11 additions & 0 deletions include/net/udp.h
Original file line number Diff line number Diff line change
Expand Up @@ -252,6 +252,17 @@ static inline int udp_rqueue_get(struct sock *sk)
return sk_rmem_alloc_get(sk) - READ_ONCE(udp_sk(sk)->forward_deficit);
}

static inline bool udp_sk_bound_dev_eq(struct net *net, int bound_dev_if,
int dif, int sdif)
{
#if IS_ENABLED(CONFIG_NET_L3_MASTER_DEV)
return inet_bound_dev_eq(!!net->ipv4.sysctl_udp_l3mdev_accept,
bound_dev_if, dif, sdif);
#else
return inet_bound_dev_eq(true, bound_dev_if, dif, sdif);
#endif
}

/* net/ipv4/udp.c */
void udp_destruct_sock(struct sock *sk);
void skb_consume_udp(struct sock *sk, struct sk_buff *skb, int len);
Expand Down
2 changes: 2 additions & 0 deletions net/core/sock.c
Original file line number Diff line number Diff line change
Expand Up @@ -567,6 +567,8 @@ static int sock_setbindtodevice(struct sock *sk, char __user *optval,

lock_sock(sk);
sk->sk_bound_dev_if = index;
if (sk->sk_prot->rehash)
sk->sk_prot->rehash(sk);
sk_dst_reset(sk);
release_sock(sk);

Expand Down
2 changes: 2 additions & 0 deletions net/ipv4/af_inet.c
Original file line number Diff line number Diff line change
Expand Up @@ -1964,6 +1964,8 @@ static int __init inet_init(void)
/* Add UDP-Lite (RFC 3828) */
udplite4_register();

raw_init();

ping_init();

/*
Expand Down
13 changes: 10 additions & 3 deletions net/ipv4/inet_connection_sock.c
Original file line number Diff line number Diff line change
Expand Up @@ -183,7 +183,9 @@ inet_csk_find_open_port(struct sock *sk, struct inet_bind_bucket **tb_ret, int *
int i, low, high, attempt_half;
struct inet_bind_bucket *tb;
u32 remaining, offset;
int l3mdev;

l3mdev = inet_sk_bound_l3mdev(sk);
attempt_half = (sk->sk_reuse == SK_CAN_REUSE) ? 1 : 0;
other_half_scan:
inet_get_local_port_range(net, &low, &high);
Expand Down Expand Up @@ -219,7 +221,8 @@ inet_csk_find_open_port(struct sock *sk, struct inet_bind_bucket **tb_ret, int *
hinfo->bhash_size)];
spin_lock_bh(&head->lock);
inet_bind_bucket_for_each(tb, &head->chain)
if (net_eq(ib_net(tb), net) && tb->port == port) {
if (net_eq(ib_net(tb), net) && tb->l3mdev == l3mdev &&
tb->port == port) {
if (!inet_csk_bind_conflict(sk, tb, false, false))
goto success;
goto next_port;
Expand Down Expand Up @@ -293,6 +296,9 @@ int inet_csk_get_port(struct sock *sk, unsigned short snum)
struct net *net = sock_net(sk);
struct inet_bind_bucket *tb = NULL;
kuid_t uid = sock_i_uid(sk);
int l3mdev;

l3mdev = inet_sk_bound_l3mdev(sk);

if (!port) {
head = inet_csk_find_open_port(sk, &tb, &port);
Expand All @@ -306,11 +312,12 @@ int inet_csk_get_port(struct sock *sk, unsigned short snum)
hinfo->bhash_size)];
spin_lock_bh(&head->lock);
inet_bind_bucket_for_each(tb, &head->chain)
if (net_eq(ib_net(tb), net) && tb->port == port)
if (net_eq(ib_net(tb), net) && tb->l3mdev == l3mdev &&
tb->port == port)
goto tb_found;
tb_not_found:
tb = inet_bind_bucket_create(hinfo->bind_bucket_cachep,
net, head, port);
net, head, port, l3mdev);
if (!tb)
goto fail_unlock;
tb_found:
Expand Down
Loading

0 comments on commit 7e22561

Please sign in to comment.