Enable TSO for VXLAN devices and use UDP_TUNNEL to offload vxlan
segmentation.
Signed-off-by: Pravin B Shelar <pshelar@nicira.com>
Acked-by: Stephen Hemminger <stephen@networkplumber.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
The following script will produce a kernel oops:
sudo ip netns add v
sudo ip netns exec v ip ad add 127.0.0.1/8 dev lo
sudo ip netns exec v ip link set lo up
sudo ip netns exec v ip ro add 224.0.0.0/4 dev lo
sudo ip netns exec v ip li add vxlan0 type vxlan id 42 group 239.1.1.1 dev lo
sudo ip netns exec v ip link set vxlan0 up
sudo ip netns del v
where inspect by gdb:
Program received signal SIGSEGV, Segmentation fault.
[Switching to Thread 107]
0xffffffffa0289e33 in ?? ()
(gdb) bt
#0 vxlan_leave_group (dev=0xffff88001bafa000) at drivers/net/vxlan.c:533
#1 vxlan_stop (dev=0xffff88001bafa000) at drivers/net/vxlan.c:1087
#2 0xffffffff812cc498 in __dev_close_many (head=head@entry=0xffff88001f2e7dc8) at net/core/dev.c:1299
#3 0xffffffff812cd920 in dev_close_many (head=head@entry=0xffff88001f2e7dc8) at net/core/dev.c:1335
#4 0xffffffff812cef31 in rollback_registered_many (head=head@entry=0xffff88001f2e7dc8) at net/core/dev.c:4851
#5 0xffffffff812cf040 in unregister_netdevice_many (head=head@entry=0xffff88001f2e7dc8) at net/core/dev.c:5752
#6 0xffffffff812cf1ba in default_device_exit_batch (net_list=0xffff88001f2e7e18) at net/core/dev.c:6170
#7 0xffffffff812cab27 in cleanup_net (work=<optimized out>) at net/core/net_namespace.c:302
#8 0xffffffff810540ef in process_one_work (worker=0xffff88001ba9ed40, work=0xffffffff8167d020) at kernel/workqueue.c:2157
#9 0xffffffff810549d0 in worker_thread (__worker=__worker@entry=0xffff88001ba9ed40) at kernel/workqueue.c:2276
#10 0xffffffff8105870c in kthread (_create=0xffff88001f2e5d68) at kernel/kthread.c:168
#11 <signal handler called>
#12 0x0000000000000000 in ?? ()
#13 0x0000000000000000 in ?? ()
(gdb) fr 0
#0 vxlan_leave_group (dev=0xffff88001bafa000) at drivers/net/vxlan.c:533
533 struct sock *sk = vn->sock->sk;
(gdb) l
528 static int vxlan_leave_group(struct net_device *dev)
529 {
530 struct vxlan_dev *vxlan = netdev_priv(dev);
531 struct vxlan_net *vn = net_generic(dev_net(dev), vxlan_net_id);
532 int err = 0;
533 struct sock *sk = vn->sock->sk;
534 struct ip_mreqn mreq = {
535 .imr_multiaddr.s_addr = vxlan->gaddr,
536 .imr_ifindex = vxlan->link,
537 };
(gdb) p vn->sock
$4 = (struct socket *) 0x0
The kernel calls `vxlan_exit_net` when deleting the netns before shutting down
vxlan interfaces. Later the removal of all vxlan interfaces, where `vn->sock`
is already gone causes the oops. so we should manually shutdown all interfaces
before deleting `vn->sock` as the patch does.
Signed-off-by: Zang MingJie <zealot0630@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
We should reset nf settings bond to the skb as ipip/ipgre do.
If not, the conntrack/nat info bond to the origin packet may continually
redirect the packet to vxlan interface causing a routing loop.
this is the scenario:
VETP VXLAN Gateway
/----\ /---------------\
| | | |
| vx+--+vx --NAT-> eth0+--> Internet
| | | |
\----/ \---------------/
when there are any packet coming from internet to the vetp, there will be lots
of garbage packets coming out the gateway's vxlan interface, but none actually
sent to the physical interface, because they are redirected back to the vxlan
interface in the postrouting chain of NAT rule, and dmesg complains:
Mar 1 21:52:53 debian kernel: [ 8802.997699] Dead loop on virtual device vxlan0, fix it urgently!
Mar 1 21:52:54 debian kernel: [ 8804.004907] Dead loop on virtual device vxlan0, fix it urgently!
Mar 1 21:52:55 debian kernel: [ 8805.012189] Dead loop on virtual device vxlan0, fix it urgently!
Mar 1 21:52:56 debian kernel: [ 8806.020593] Dead loop on virtual device vxlan0, fix it urgently!
the patch should fix the problem
Signed-off-by: Zang MingJie <zealot0630@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
I'm not sure why, but the hlist for each entry iterators were conceived
list_for_each_entry(pos, head, member)
The hlist ones were greedy and wanted an extra parameter:
hlist_for_each_entry(tpos, pos, head, member)
Why did they need an extra pos parameter? I'm not quite sure. Not only
they don't really need it, it also prevents the iterator from looking
exactly like the list iterator, which is unfortunate.
Besides the semantic patch, there was some manual work required:
- Fix up the actual hlist iterators in linux/list.h
- Fix up the declaration of other iterators based on the hlist ones.
- A very small amount of places were using the 'node' parameter, this
was modified to use 'obj->member' instead.
- Coccinelle didn't handle the hlist_for_each_entry_safe iterator
properly, so those had to be fixed up manually.
The semantic patch which is mostly the work of Peter Senna Tschudin is here:
@@
iterator name hlist_for_each_entry, hlist_for_each_entry_continue, hlist_for_each_entry_from, hlist_for_each_entry_rcu, hlist_for_each_entry_rcu_bh, hlist_for_each_entry_continue_rcu_bh, for_each_busy_worker, ax25_uid_for_each, ax25_for_each, inet_bind_bucket_for_each, sctp_for_each_hentry, sk_for_each, sk_for_each_rcu, sk_for_each_from, sk_for_each_safe, sk_for_each_bound, hlist_for_each_entry_safe, hlist_for_each_entry_continue_rcu, nr_neigh_for_each, nr_neigh_for_each_safe, nr_node_for_each, nr_node_for_each_safe, for_each_gfn_indirect_valid_sp, for_each_gfn_sp, for_each_host;
type T;
expression a,c,d,e;
identifier b;
statement S;
@@
-T b;
<+... when != b
(
hlist_for_each_entry(a,
- b,
c, d) S
|
hlist_for_each_entry_continue(a,
- b,
c) S
|
hlist_for_each_entry_from(a,
- b,
c) S
|
hlist_for_each_entry_rcu(a,
- b,
c, d) S
|
hlist_for_each_entry_rcu_bh(a,
- b,
c, d) S
|
hlist_for_each_entry_continue_rcu_bh(a,
- b,
c) S
|
for_each_busy_worker(a, c,
- b,
d) S
|
ax25_uid_for_each(a,
- b,
c) S
|
ax25_for_each(a,
- b,
c) S
|
inet_bind_bucket_for_each(a,
- b,
c) S
|
sctp_for_each_hentry(a,
- b,
c) S
|
sk_for_each(a,
- b,
c) S
|
sk_for_each_rcu(a,
- b,
c) S
|
sk_for_each_from
-(a, b)
+(a)
S
+ sk_for_each_from(a) S
|
sk_for_each_safe(a,
- b,
c, d) S
|
sk_for_each_bound(a,
- b,
c) S
|
hlist_for_each_entry_safe(a,
- b,
c, d, e) S
|
hlist_for_each_entry_continue_rcu(a,
- b,
c) S
|
nr_neigh_for_each(a,
- b,
c) S
|
nr_neigh_for_each_safe(a,
- b,
c, d) S
|
nr_node_for_each(a,
- b,
c) S
|
nr_node_for_each_safe(a,
- b,
c, d) S
|
- for_each_gfn_sp(a, c, d, b) S
+ for_each_gfn_sp(a, c, d) S
|
- for_each_gfn_indirect_valid_sp(a, c, d, b) S
+ for_each_gfn_indirect_valid_sp(a, c, d) S
|
for_each_host(a,
- b,
c) S
|
for_each_host_safe(a,
- b,
c, d) S
|
for_each_mesh_entry(a,
- b,
c, d) S
)
...+>
[akpm@linux-foundation.org: drop bogus change from net/ipv4/raw.c]
[akpm@linux-foundation.org: drop bogus hunk from net/ipv6/raw.c]
[akpm@linux-foundation.org: checkpatch fixes]
[akpm@linux-foundation.org: fix warnings]
[akpm@linux-foudnation.org: redo intrusive kvm changes]
Tested-by: Peter Senna Tschudin <peter.senna@gmail.com>
Acked-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Signed-off-by: Sasha Levin <sasha.levin@oracle.com>
Cc: Wu Fengguang <fengguang.wu@intel.com>
Cc: Marcelo Tosatti <mtosatti@redhat.com>
Cc: Gleb Natapov <gleb@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
tunnel_ip_select_ident() is more efficient when generating ip-header
id given inner packet is of ipv4 type.
Signed-off-by: Pravin B Shelar <pshelar@nicira.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
When a user adds bridge neighbors, allow him to specify VLAN id.
If the VLAN id is not specified, the neighbor will be added
for VLANs currently in the ports filter list. If no VLANs are
configured on the port, we use vlan 0 and only add 1 entry.
Signed-off-by: Vlad Yasevich <vyasevic@redhat.com>
Acked-by: Jitendra Kalsaria <jitendra.kalsaria@qlogic.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
The VXLAN pseudo-device doesn't care if the mac address changes
when device is up.
Signed-off-by: Stephen Hemminger <shemminger@vyatta.com>
Reviewed-by: Jiri Pirko <jiri@resnulli.us>
Signed-off-by: David S. Miller <davem@davemloft.net>
The socket calls from vxlan to join/leave multicast group aren't
using the index of the underlying device, as a result the stack uses
the first interface that is up. This results in vxlan being non functional
over a device which isn't the 1st to be up.
Fix this by providing the iflink field to the vxlan instance
to the multicast calls.
Signed-off-by: Yan Burman <yanb@mellanox.com>
Acked-by: Stephen Hemminger <shemminger@vyatta.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
This patch adds capability in vxlan to identify received
checksummed inner packets and signal them to the upper layers of
the stack. The driver needs to set the skb->encapsulation bit
and also set the skb->ip_summed to CHECKSUM_UNNECESSARY.
Signed-off-by: Joseph Gasparakis <joseph.gasparakis@intel.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Allow VXLAN to make use of Tx checksum offloading and Tx scatter-gather.
The advantage to these two changes is that it also allows the VXLAN to
make use of GSO.
Signed-off-by: Joseph Gasparakis <joseph.gasparakis@intel.com>
Signed-off-by: Peter P Waskiewicz Jr <peter.p.waskiewicz.jr@intel.com>
Signed-off-by: Alexander Duyck <alexander.h.duyck@intel.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
This patch provides extensions to VXLAN for supporting Distributed
Overlay Virtual Ethernet (DOVE) networks. The patch includes:
+ a dove flag per VXLAN device to enable DOVE extensions
+ ARP reduction, whereby a bridge-connected VXLAN tunnel endpoint
answers ARP requests from the local bridge on behalf of
remote DOVE clients
+ route short-circuiting (aka L3 switching). Known destination IP
addresses use the corresponding destination MAC address for
switching rather than going to a (possibly remote) router first.
+ netlink notification messages for forwarding table and L3 switching
misses
Changes since v2
- combined bools into "u32 flags"
- replaced loop with !is_zero_ether_addr()
Signed-off-by: David L Stevens <dlstevens@us.ibm.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
This patch removes addrexceeded member from vxlan_dev struct as it is unused.
Signed-off-by: Rami Rosen <ramirose@gmail.com>
Acked-by: Stephen Hemminger <shemminger@vyatta.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
__IPTUNNEL_XMIT() is an ugly macro, convert it to a static
inline function, so make it more readable.
IPTUNNEL_XMIT() is unused, just remove it.
Cc: David S. Miller <davem@davemloft.net>
Signed-off-by: Cong Wang <amwang@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
In the event of a VXLAN device being linked to a device that has a
hard_header_len greater than that of standard ethernet we could end up with
the hard_header_len not being large enough for outgoing frames. In order to
prevent this we should update the length when a lowerdev is provided.
Signed-off-by: Alexander Duyck <alexander.h.duyck@intel.com>
Acked-by: Stephen Hemminger <shemminger@vyatta.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Use eXtensible and not eXtensiable in the comment on top.
Signed-off-by: Rami Rosen <ramirose@gmail.com>
Acked-by: Stephen Hemminger <shemminger@vyatta.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
This change fixes an issue I found where VXLAN frames were fragmented when
they were up to the VXLAN MTU size. I root caused the issue to the fact that
the headroom was 4 + 20 + 8 + 8. This math doesn't appear to be correct
because we are not inserting a VLAN header, but instead a 2nd Ethernet header.
As such the math for the overhead should be 20 + 8 + 8 + 14 to account for the
extra headers that are inserted for VXLAN.
Signed-off-by: Alexander Duyck <alexander.h.duyck@intel.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Conflicts:
drivers/net/ethernet/broadcom/bnx2x/bnx2x_main.c
Minor conflict between the BCM_CNIC define removal in net-next
and a bug fix added to net. Based upon a conflict resolution
patch posted by Stephen Rothwell.
Signed-off-by: David S. Miller <davem@davemloft.net>
"ip link add ... type vxlan ... ttl X" allows a user to set the TTL
used by a VXLAN for encapsulation. The provided value was ignored by
vxlan module and the default value of 1 was used when encapsulating
multicast packets.
Signed-off-by: Vincent Bernat <bernat@luffy.cx>
Acked-by: Stephen Hemminger <shemminger@vyatta.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
VXLAN confused flag versus bitmap on state.
Based on part of a earlier patch by David Stevens.
Signed-off-by: Stephen Hemminger <shemminger@vyatta.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
If vxlan is created and the ifindex is passed; there are two cases which
are incorrectly handled by the existing code. The ifindex could be zero
(i.e. no device) or there could be no device with that ifindex.
Signed-off-by: Stephen Hemminger <shemminger@vyatta.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Vxlan was trying to use postpull_rcsum to allow receive checksum
offload to work on drivers using CHECKSUM_COMPLETE method. But this
doesn't work correctly. Just force full receive checksum on received
packet.
Signed-off-by: Stephen Hemminger <shemminger@vyatta.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Tell upper layer protocols to allocate skb with additional headroom.
This avoids allocation and copy in local packet sends.
Signed-off-by: Stephen Hemminger <shemminger@vyatta.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
VXLAN bases source UDP port based on flow to help the
receiver to be able to load balance based on outer header flow.
This patch restricts the port range to the normal UDP local
ports, and allows overriding via configuration.
It also uses jhash of Ethernet header when looking at flows
with out know L3 header.
Signed-off-by: Stephen Hemminger <shemminger@vyatta.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
When tunnelling a skb, associate it with the tunnel socket.
This allows parameters set on tunnel socket (like multicast loop
flag), to be picked up by ip_output.
Signed-off-by: Stephen Hemminger <shemminger@vyatta.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Select source address for VXLAN packet based on route destination
and don't lie to route code. VXLAN is not GRE.
Signed-off-by: Stephen Hemminger <shemminger@vyatta.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Shift was wrong direction causing packets to hash based on
other parts of the ethernet header, not the address.
Signed-off-by: Stephen Hemminger <shemminger@vyatta.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Move code to find destination to a small function.
Signed-off-by: Stephen Hemminger <shemminger@vyatta.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Fix a couple harmless sparse warnings reported by Fengguang Wu.
Signed-off-by: Stephen Hemminger <shemminger@vyatta.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Remove including <linux/version.h> that don't need it.
dpatch engine is used to auto generate this patch.
(https://github.com/weiyj/dpatch)
Signed-off-by: Wei Yongjun <yongjun_wei@trendmicro.com.cn>
Signed-off-by: David S. Miller <davem@davemloft.net>
Move vxlan UDP socket to correct network namespace
Signed-off-by: Stephen Hemminger <shemminger@vyatta.com>
Acked-by: "Eric W. Biederman" <ebiederm@xmission.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
This is an implementation of Virtual eXtensible Local Area Network
as described in draft RFC:
http://tools.ietf.org/html/draft-mahalingam-dutt-dcops-vxlan-02
The driver integrates a Virtual Tunnel Endpoint (VTEP) functionality
that learns MAC to IP address mapping.
This implementation has not been tested only against the Linux
userspace implementation using TAP, not against other vendor's
equipment.
Signed-off-by: Stephen Hemminger <shemminger@vyatta.com>
Signed-off-by: David S. Miller <davem@davemloft.net>