linux

Commit Graph

Author	SHA1	Message	Date
David Ahern	0ae8133586	net: ipv6: Allow shorthand delete of all nexthops in multipath route IPv4 allows multipath routes to be deleted using just the prefix and length. For example: $ ip ro ls vrf red unreachable default metric 8192 1.1.1.0/24 nexthop via 10.100.1.254 dev eth1 weight 1 nexthop via 10.11.200.2 dev eth11.200 weight 1 10.11.200.0/24 dev eth11.200 proto kernel scope link src 10.11.200.3 10.100.1.0/24 dev eth1 proto kernel scope link src 10.100.1.3 $ ip ro del 1.1.1.0/24 vrf red $ ip ro ls vrf red unreachable default metric 8192 10.11.200.0/24 dev eth11.200 proto kernel scope link src 10.11.200.3 10.100.1.0/24 dev eth1 proto kernel scope link src 10.100.1.3 The same notation does not work with IPv6 because of how multipath routes are implemented for IPv6. For IPv6 only the first nexthop of a multipath route is deleted if the request contains only a prefix and length. This leads to unnecessary complexity in userspace dealing with IPv6 multipath routes. This patch allows all nexthops to be deleted without specifying each one in the delete request. Internally, this is done by walking the sibling list of the route matching the specifications given (prefix, length, metric, protocol, etc). $ ip -6 ro ls vrf red 2001:db8:1::/120 dev eth1 proto kernel metric 256 pref medium 2001:db8:2::/120 dev eth2 proto kernel metric 256 pref medium 2001:db8:200::/120 via 2001:db8:1::2 dev eth1 metric 1024 pref medium 2001:db8:200::/120 via 2001:db8:2::2 dev eth2 metric 1024 pref medium ... $ ip -6 ro del vrf red 2001:db8:200::/120 $ ip -6 ro ls vrf red 2001:db8:1::/120 dev eth1 proto kernel metric 256 pref medium 2001:db8:2::/120 dev eth2 proto kernel metric 256 pref medium ... Because IPv6 allows individual nexthops to be deleted without deleting the entire route, the ip6_route_multipath_del and non-multipath code path (ip6_route_del) have to be discriminated so that all nexthops are only deleted for the latter case. This is done by making the existing fc_type in fib6_config a u16 and then adding a new u16 field with fc_delete_all_nh as the first bit. Suggested-by: Dinesh Dutt <ddutt@cumulusnetworks.com> Signed-off-by: David Ahern <dsa@cumulusnetworks.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2017-02-04 19:58:14 -05:00
Eric Dumazet	4d6308aac3	virtio_net: exploit napi_complete_done() return value Since commit `364b605573` ("net: busy-poll: return busypolling status to drivers"), napi_complete_done() returns a boolean that can be used by drivers to conditionally rearm interrupts. This patch changes virtio_net to use this boolean to avoid a bit of overhead for busy-poll users. Jason reports about 1.1% improvement for 1 byte TCP_RR (burst 100). Signed-off-by: Eric Dumazet <edumazet@google.com> Acked-by: Jason Wang <jasowang@redhat.com> Cc: Michael S. Tsirkin <mst@redhat.com> Cc: Willem de Bruijn <willemb@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2017-02-04 19:38:28 -05:00
David S. Miller	a076d1bdc6	Merge branch '40GbE' of git://git.kernel.org/pub/scm/linux/kernel/git/jkirsher/next-queue Jeff Kirsher says: ==================== 40GbE Intel Wired LAN Driver Updates 2017-02-03 This series contains updates to i40e/i40evf only. Jake fixes up the driver to not call i40e_vsi_kill_vlan() or i40e_vsi_add_vlan() when the PVID is set or when the VID is less than 1. Cleaned up a check which really is not needed since there is no real reason why we cannot just call i40e_del_mac_all_vlan() directly. Renamed functions to better reflect their actual purpose and how they function in a more clear manner. Bimmy cleans up unused/deprecated macros. Mitch cleans up unused device ids which were intended for use when running Linux VF drivers under Hyper-V, but found to be not needed. Then cleaned up a function that is no longer needed since the client open and close functions were refactored. Adds a sleep without timeout until the reply from the PF driver has been received since the iWARP client cannot continue until the operation has been completed. Tushar Dave fixes an issue seen on SPARC where the use of the 'packed' directive was causing kernel unaligned errors. Alex does a refactor to pull some data off of the stack and store it in the transmit buffer info section of the transmit ring. Alan fixes a bug which was caused by passing a bad register value to the firmware, by refactoring the macro INTRL_USEC_TO_REG into a static inline function. Also added feedback to the user as to the actual interrupt rate limit being used when it differs from the requested limit. ==================== Signed-off-by: David S. Miller <davem@davemloft.net>	2017-02-04 12:13:27 -05:00
Eric Dumazet	6e7bc478c9	net: skb_needs_check() accepts CHECKSUM_NONE for tx My recent change missed fact that UFO would perform a complete UDP checksum before segmenting in frags. In this case skb->ip_summed is set to CHECKSUM_NONE. We need to add this valid case to skb_needs_check() Fixes: `b2504a5dbe` ("net: reduce skb_warn_bad_offload() noise") Signed-off-by: Eric Dumazet <edumazet@google.com> Cc: Willem de Bruijn <willemb@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2017-02-03 17:33:01 -05:00
Eric Dumazet	79e7fff47b	net: remove support for per driver ndo_busy_poll() We added generic support for busy polling in NAPI layer in linux-4.5 No network driver uses ndo_busy_poll() anymore, we can get rid of the pointer in struct net_device_ops, and its use in sk_busy_loop() Saves NETIF_F_BUSY_POLL features bit. Signed-off-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2017-02-03 17:28:29 -05:00
David S. Miller	7a655c6324	enic: Remove local ndo_busy_poll() implementation. We do polling generically these days. Signed-off-by: David S. Miller <davem@davemloft.net>	2017-02-03 17:28:21 -05:00
Eric Dumazet	508aac6dee	ixgbevf: get rid of custom busy polling code In linux-4.5, busy polling was implemented in core NAPI stack, meaning that all custom implementation can be removed from drivers. Not only we remove lot's of code, we also remove one lock operation in fast path, and allow GRO to do its job. Signed-off-by: Eric Dumazet <edumazet@google.com> Cc: Jeff Kirsher <jeffrey.t.kirsher@intel.com> Acked-by: Alexander Duyck <alexander.h.duyck@intel.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2017-02-03 17:17:53 -05:00
Eric Dumazet	3ffc1af576	ixgbe: get rid of custom busy polling code In linux-4.5, busy polling was implemented in core NAPI stack, meaning that all custom implementation can be removed from drivers. Not only we remove lot's of code, we also remove one lock operation in fast path, and allow GRO to do its job. Signed-off-by: Eric Dumazet <edumazet@google.com> Cc: Jeff Kirsher <jeffrey.t.kirsher@intel.com> Acked-by: Alexander Duyck <alexander.h.duyck@intel.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2017-02-03 17:17:52 -05:00
David S. Miller	52e01b84a2	Merge git://git.kernel.org/pub/scm/linux/kernel/git/pablo/nf-next Pablo Neira Ayuso says: ==================== Netfilter updates for net-next The following patchset contains Netfilter updates for your net-next tree, they are: 1) Stash ctinfo 3-bit field into pointer to nf_conntrack object from sk_buff so we only access one single cacheline in the conntrack hotpath. Patchset from Florian Westphal. 2) Don't leak pointer to internal structures when exporting x_tables ruleset back to userspace, from Willem DeBruijn. This includes new helper functions to copy data to userspace such as xt_data_to_user() as well as conversions of our ip_tables, ip6_tables and arp_tables clients to use it. Not surprinsingly, ebtables requires an ad-hoc update. There is also a new field in x_tables extensions to indicate the amount of bytes that we copy to userspace. 3) Add nf_log_all_netns sysctl: This new knob allows you to enable logging via nf_log infrastructure for all existing netnamespaces. Given the effort to provide pernet syslog has been discontinued, let's provide a way to restore logging using netfilter kernel logging facilities in trusted environments. Patch from Michal Kubecek. 4) Validate SCTP checksum from conntrack helper, from Davide Caratti. 5) Merge UDPlite conntrack and NAT helpers into UDP, this was mostly a copy&paste from the original helper, from Florian Westphal. 6) Reset netfilter state when duplicating packets, also from Florian. 7) Remove unnecessary check for broadcast in IPv6 in pkttype match and nft_meta, from Liping Zhang. 8) Add missing code to deal with loopback packets from nft_meta when used by the netdev family, also from Liping. 9) Several cleanups on nf_tables, one to remove unnecessary check from the netlink control plane path to add table, set and stateful objects and code consolidation when unregister chain hooks, from Gao Feng. 10) Fix harmless reference counter underflow in IPVS that, however, results in problems with the introduction of the new refcount_t type, from David Windsor. 11) Enable LIBCRC32C from nf_ct_sctp instead of nf_nat_sctp, from Davide Caratti. 12) Missing documentation on nf_tables uapi header, from Liping Zhang. 13) Use rb_entry() helper in xt_connlimit, from Geliang Tang. ==================== Signed-off-by: David S. Miller <davem@davemloft.net>	2017-02-03 16:58:20 -05:00
David S. Miller	e60df62492	Merge branch 'mlxsw-Introduce-TC-Flower-offload-using-TCAM' Jiri Pirko says: ==================== mlxsw: Introduce TC Flower offload using TCAM This patchset introduces support for offloading TC cls_flower and actions to Spectrum TCAM-base policy engine. The patchset contains patches to allow work with flexible keys and actions which are used in Spectrum TCAM. It also contains in-driver infrastructure for offloading TC rules to TCAM HW. The TCAM management code is simple and limited for now. It is going to be extended as a follow-up work. The last patch uses the previously introduced infra to allow to implement cls_flower offloading. Initially, only limited set of match-keys and only a drop and forward actions are supported. As a dependency, this patchset introduces parman - priority array area manager - as a library. v1->v2: - patch11: - use __set_bit and __test_and_clear_bit as suggested by DaveM - patch16: - Added documentation to the API functions as suggested by Tom Herbert - patch17: - use __set_bit and __clear_bit as suggested by DaveM ==================== Signed-off-by: David S. Miller <davem@davemloft.net>	2017-02-03 16:35:44 -05:00
Jiri Pirko	7aa0f5aa90	mlxsw: spectrum: Implement TC flower offload Extend the existing setup_tc ndo call and allow to offload cls_flower rules. Only limited set of dissector keys and actions are supported now. Use previously introduced ACL infrastructure to offload cls_flower rules to be processed in the HW. Signed-off-by: Jiri Pirko <jiri@mellanox.com> Reviewed-by: Ido Schimmel <idosch@mellanox.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2017-02-03 16:35:43 -05:00
Jiri Pirko	69ca05ce9d	sched: cls_flower: expose priority to offloading netdevice The driver that offloads flower rules needs to know with which priority user inserted the rules. So add this information into offload struct. Signed-off-by: Jiri Pirko <jiri@mellanox.com> Acked-by: Ido Schimmel <idosch@mellanox.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2017-02-03 16:35:43 -05:00
Jiri Pirko	22a677661f	mlxsw: spectrum: Introduce ACL core with simple TCAM implementation Add ACL core infrastructure for Spectrum ASIC. This infra provides an abstraction layer over specific HW implementations. There are two basic objects used. One is "rule" and the second is "ruleset" which serves as a container of multiple rules. In general, within one ruleset the rules are allowed to have multiple priorities and masks. Each ruleset is bound to either ingress or egress a of port netdevice. The initial TCAM implementation is very simple and limited. It utilizes parman lsort manager to take care of TCAM region layout. Signed-off-by: Jiri Pirko <jiri@mellanox.com> Reviewed-by: Ido Schimmel <idosch@mellanox.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2017-02-03 16:35:43 -05:00
Jiri Pirko	44091d29f2	lib: Introduce priority array area manager This introduces a infrastructure for management of linear priority areas. Priority order in an array matters, however order of items inside a priority group does not matter. As an initial implementation, L-sort algorithm is used. It is quite trivial. More advanced algorithm called P-sort will be introduced as a follow-up. The infrastructure is prepared for other algos. Alongside this, a testing module is introduced as well. Signed-off-by: Jiri Pirko <jiri@mellanox.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2017-02-03 16:35:42 -05:00
Jiri Pirko	b862815c3e	list: introduce list_for_each_entry_from_reverse helper Similar to list_for_each_entry_continue and its reverse variant list_for_each_entry_continue_reverse, introduce reverse helper for list_for_each_entry_from. Signed-off-by: Jiri Pirko <jiri@mellanox.com> Acked-by: Ido Schimmel <idosch@mellanox.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2017-02-03 16:35:42 -05:00
Jiri Pirko	8708ecf01d	mlxsw: resources: Add ACL related resources Add couple of resource limits related to ACL. Signed-off-by: Jiri Pirko <jiri@mellanox.com> Reviewed-by: Ido Schimmel <idosch@mellanox.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2017-02-03 16:35:42 -05:00
Jiri Pirko	b876b9aaad	mlxsw: spectrum: Introduce basic set of flexible key blocks Introduce basic set of Spectrum flexible key blocks. It contains blocks needed to carry all elements defined so far. Signed-off-by: Jiri Pirko <jiri@mellanox.com> Reviewed-by: Ido Schimmel <idosch@mellanox.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2017-02-03 16:35:41 -05:00
Jiri Pirko	4cda7d8d70	mlxsw: core: Introduce flexible actions support Each entry which is matched during ACL lookup points to an action set. This action set contains up to three separate actions. If more actions are needed to be chained, the extended set is created to hold them in KVD linear area. This patch implements handling of sets and encoding of actions. Currectly, only two actions are supported. Drop and forward. Forward action uses PBS pointer to KVD linear area, so the action code needs to take care of this as well. Signed-off-by: Jiri Pirko <jiri@mellanox.com> Reviewed-by: Ido Schimmel <idosch@mellanox.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2017-02-03 16:35:41 -05:00
Jiri Pirko	3f1a84e696	mlxsw: core: Introduce flexible keys support Hardware supports matching on so called "flexible keys". The idea is to assemble an optimal key to use for matching according to the fields in packet (elements) requested by user. Certain sets of elements are combined into pre-defined blocks. There is a picker to find needed blocks. Keys consist of 1..n blocks. Alongside with that, an initial portion of elements is introduced in order to be able to offload basic cls_flower rules. Picked keys are cached so multiple rules could share them. There is an encode function provided that takes care of encoding key and mask values according to given key. Signed-off-by: Jiri Pirko <jiri@mellanox.com> Reviewed-by: Ido Schimmel <idosch@mellanox.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2017-02-03 16:35:41 -05:00
Jiri Pirko	e3426e12fe	mlxsw: reg: Add Policy-Engine Extended Flexible Action Register PEFA register is used for accessing an extended flexible action entry in the central KVD Linear Database. Signed-off-by: Jiri Pirko <jiri@mellanox.com> Reviewed-by: Ido Schimmel <idosch@mellanox.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2017-02-03 16:35:40 -05:00
Jiri Pirko	d120649d86	mlxsw: reg: Add Policy-Engine Policy Based Switching Register The PPBS register retrieves and sets Policy Based Switching Table entries. Signed-off-by: Jiri Pirko <jiri@mellanox.com> Reviewed-by: Ido Schimmel <idosch@mellanox.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2017-02-03 16:35:40 -05:00
Jiri Pirko	937b682cc0	mlxsw: reg: Add Policy-Engine Rules Copy Register The PRCR register is used for accessing rules within a TCAM region. Signed-off-by: Jiri Pirko <jiri@mellanox.com> Reviewed-by: Ido Schimmel <idosch@mellanox.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2017-02-03 16:35:40 -05:00
Jiri Pirko	af7170eee6	mlxsw: reg: Add Policy-Engine Port Binding Table The PPBT is used for configuration of the Port Binding Table. Signed-off-by: Jiri Pirko <jiri@mellanox.com> Reviewed-by: Ido Schimmel <idosch@mellanox.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2017-02-03 16:35:39 -05:00
Jiri Pirko	0171cdec03	mlxsw: reg: Add Policy-Engine TCAM Entry Register Version 2 The PTCE-V2 register is used for accessing rules within a TCAM region. It is a new version of PTCE in order to support wider key, mask and action within a TCAM region. Signed-off-by: Jiri Pirko <jiri@mellanox.com> Reviewed-by: Ido Schimmel <idosch@mellanox.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2017-02-03 16:35:39 -05:00
Jiri Pirko	d9c2661e1c	mlxsw: reg: Add Policy-Engine TCAM Allocation Register The PTAR register is used for allocation of regions in the TCAM. Signed-off-by: Jiri Pirko <jiri@mellanox.com> Reviewed-by: Ido Schimmel <idosch@mellanox.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2017-02-03 16:35:39 -05:00
Jiri Pirko	10fabef513	mlxsw: reg: Add Policy-Engine ACL Group Table register The PAGT register is used for configuration of the ACL Group Table. Signed-off-by: Jiri Pirko <jiri@mellanox.com> Reviewed-by: Ido Schimmel <idosch@mellanox.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2017-02-03 16:35:38 -05:00
Jiri Pirko	3279da4c88	mlxsw: reg: Add Policy-Engine ACL Register The PACL register is used for configuration of the ACL. Signed-off-by: Jiri Pirko <jiri@mellanox.com> Reviewed-by: Ido Schimmel <idosch@mellanox.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2017-02-03 16:35:38 -05:00
Jiri Pirko	d5e556c6a1	mlxsw: item: Add helpers for getting pointer into payload for char buffer item Sometimes it is handy to get a pointer to a char buffer item and use it direcly to write/read data. So add these helpers. Signed-off-by: Jiri Pirko <jiri@mellanox.com> Reviewed-by: Ido Schimmel <idosch@mellanox.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2017-02-03 16:35:38 -05:00
Jiri Pirko	2946fde9fd	mlxsw: item: Add 8bit item helpers Item heplers for 8bit values are needed, let's add them. Signed-off-by: Jiri Pirko <jiri@mellanox.com> Reviewed-by: Ido Schimmel <idosch@mellanox.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2017-02-03 16:35:37 -05:00
Zhu Yanjun	3d67576da1	bonding: Remove unnecessary returned value check The function bond_info_query alwarys returns 0. As such, in the function bond_do_ioctl, it is not necessary to check the returned value. So the interface type of the function bond_info_query is changed to void. The redundant check is removed. Signed-off-by: Zhu Yanjun <yanjun.zhu@oracle.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2017-02-03 16:25:17 -05:00
Eric Dumazet	38ab52e8e1	tcp: clear pfmemalloc on outgoing skb Josef Bacik diagnosed following problem : I was seeing random disconnects while testing NBD over loopback. This turned out to be because NBD sets pfmemalloc on it's socket, however the receiving side is a user space application so does not have pfmemalloc set on its socket. This means that sk_filter_trim_cap will simply drop this packet, under the assumption that the other side will simply retransmit. Well we do retransmit, and then the packet is just dropped again for the same reason. It seems the better way to address this problem is to clear pfmemalloc in the TCP transmit path. pfmemalloc strict control really makes sense on the receive path. Signed-off-by: Eric Dumazet <edumazet@google.com> Acked-by: Josef Bacik <jbacik@fb.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2017-02-03 16:23:57 -05:00
Eric Dumazet	5226b79196	cxgb4: get rid of custom busy poll code In linux-4.5, busy polling was implemented in core NAPI stack, meaning that all custom implementation can be removed from drivers. Not only we remove lot of code, we also remove one spin_lock() from driver fast path. Signed-off-by: Eric Dumazet <edumazet@google.com> Cc: Ganesh Goudar <ganeshgr@chelsio.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2017-02-03 16:18:01 -05:00
Eric Dumazet	362108b5ad	myri10ge: get rid of custom busy poll code Compared to custom busy_poll, the generic NAPI one is simpler and removes a lot of code. It removes one atomic in the fast path (when busy poll is not in action) since we do not have to use an extra spinlock. Signed-off-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2017-02-03 16:18:00 -05:00
Eric Dumazet	fb6113e688	be2net: get rid of custom busy poll code Compared to custom busy_poll, the generic NAPI one is better, since it allows to use GRO, and it removes a lot of code and extra locked operations in fast path. Signed-off-by: Eric Dumazet <edumazet@google.com> Cc: Sathya Perla <sathya.perla@broadcom.com> Cc: Ajit Khaparde <ajit.khaparde@broadcom.com> Cc: Sriharsha Basavapatna <sriharsha.basavapatna@broadcom.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2017-02-03 16:06:55 -05:00
David Ahern	94b5e0f970	net: ipv6: Set protocol to kernel for local routes IPv6 stack does not set the protocol for local routes, so those routes show up with proto "none": $ ip -6 ro ls table local local ::1 dev lo proto none metric 0 pref medium local 2100:3:: dev lo proto none metric 0 pref medium local 2100:3::4 dev lo proto none metric 0 pref medium local fe80:: dev lo proto none metric 0 pref medium ... Set rt6i_protocol to RTPROT_KERNEL for consistency with IPv4. Now routes show up with proto "kernel": $ ip -6 ro ls table local local ::1 dev lo proto kernel metric 0 pref medium local 2100:3:: dev lo proto kernel metric 0 pref medium local 2100:3::4 dev lo proto kernel metric 0 pref medium local fe80:: dev lo proto kernel metric 0 pref medium ... Signed-off-by: David Ahern <dsa@cumulusnetworks.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2017-02-03 16:01:44 -05:00
Daniel Borkmann	3898fac1f4	trace: rename trace_print_hex_seq arg and add kdoc Steven suggested to improve trace_print_hex_seq() a bit after commit `2acae0d5b0` ("trace: add variant without spacing in trace_print_hex_seq") in two ways: i) by adding a kdoc comment for the helper function itself and ii) by renaming 'spacing' argument into 'concatenate' to better denote that we don't add spaces between each hex bytes. Suggested-by: Steven Rostedt <rostedt@goodmis.org> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net> Signed-off-by: David S. Miller <davem@davemloft.net>	2017-02-03 15:50:18 -05:00
Jiri Pirko	f38c5ad79f	MAINTAINERS: add Ivan as a switchdev maintainer Ivan will be taking care of switchdev code from now on. Signed-off-by: Jiri Pirko <jiri@mellanox.com> Acked-by: Ivan Vecera <ivecera@redhat.com> Acked-by: Florian Fainelli <f.fainelli@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2017-02-03 15:27:23 -05:00
David S. Miller	3b19860c7f	Merge branch 'bridge-per-vlan-dst_metadata-support' Roopa Prabhu says: ==================== bridge: per vlan dst_metadata support High level summary: lwt and dst_metadata have enabled vxlan l3 deployments to use a single vxlan netdev for multiple vnis eliminating the scalability problem with using a single vxlan netdev per vni. This series tries to do the same for vxlan netdevs in pure l2 bridged networks. Use-case/deployment and details are below. Deployment scerario details: As we know VXLAN is used to build layer 2 virtual networks across the underlay layer3 infrastructure. A VXLAN tunnel endpoint (VTEP) originates and terminates VXLAN tunnels. And a VTEP can be a TOR switch or a vswitch in the hypervisor. This patch series mainly focuses on the TOR switch configured as a Vtep. Vxlan segment ID (vni) along with vlan id is used to identify layer 2 segments in a vxlan overlay network. Vxlan bridging is the function provided by Vteps to terminate vxlan tunnels and map the vxlan vni to traditional end host vlan. This is covered in the "VXLAN Deployment Scenarios" in sections 6 and 6.1 in RFC 7348. To provide vxlan bridging function, a vtep has to map vlan to a vni. The rfc says that the ingress VTEP device shall remove the IEEE 802.1Q VLAN tag in the original Layer 2 packet if there is one before encapsulating the packet into the VXLAN format to transmit it through the underlay network. The remote VTEP devices have information about the VLAN in which the packet will be placed based on their own VLAN-to-VXLAN VNI mapping configurations. Existing solution: Without this patch series one can deploy such a vtep configuration by adding the local ports and vxlan netdevs into a vlan filtering bridge. The local ports are configured as trunk ports carrying all vlans. A vxlan netdev per vni is added to the bridge. Vlan mapping to vni is achieved by configuring the vlan as pvid on the corresponding vxlan netdev. The vxlan netdev only receives traffic corresponding to the vlan it is mapped to. This configuration maps traffic belonging to a vlan to the corresponding vxlan segment. ----------------------------------- \| bridge \| \| \| ----------------------------------- \|100,200 \|100 (pvid) \|200 (pvid) \| \| \| swp1 vxlan1000 vxlan2000 This provides the required vxlan bridging function but poses a scalability problem with using a separate vxlan netdev for each vni. Solution in this patch series: The Goal is to use a single vxlan device to carry all vnis similar to the vxlan collect metadata mode but additionally allowing the bridge and vxlan driver to carry all the forwarding information and also learn. This implementation uses the existing dst_metadata infrastructure to map vlan to a tunnel id. - vxlan driver changes: - enable collect metadata mode to be used with learning, replication and fdb - A single fdb table hashed by (mac, vni) - rx path already has the vni - tx path expects a vni in the packet with dst_metadata and relies on learnt or static forwarding information table to forward the packet - Bridge driver changes: per vlan dst_metadata support: - Our use case is vxlan and 1-1 mapping between vlan and vni, but I have kept the api generic for any tunnel info - Uapi to configure/unconfigure/dump per vlan tunnel data - new bridge port flag to turn this feature on/off. off by default - ingress hook: - if port is a tunnel port, use tunnel info in attached dst_metadata to map it to a local vlan - egress hook: - if port is a tunnel port, use tunnel info attached to vlan to set dst_metadata on the skb Other approaches tried and vetoed: - tc vlan push/pop and tunnel metadata dst: - though tc can be used to do part of this, these patches address a deployment case where bridge driver vlan filtering and forwarding information database along with vxlan driver forwarding information table and learning are required. - making vxlan driver understand vlan-vni mapping: - I had a series almost ready with this one but soon realized it duplicated a lot of vlan handling code in the vxlan driver ==================== Signed-off-by: David S. Miller <davem@davemloft.net>	2017-02-03 15:21:23 -05:00
Roopa Prabhu	11538d039a	bridge: vlan dst_metadata hooks in ingress and egress paths - ingress hook: - if port is a tunnel port, use tunnel info in attached dst_metadata to map it to a local vlan - egress hook: - if port is a tunnel port, use tunnel info attached to vlan to set dst_metadata on the skb CC: Nikolay Aleksandrov <nikolay@cumulusnetworks.com> Signed-off-by: Roopa Prabhu <roopa@cumulusnetworks.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2017-02-03 15:21:22 -05:00
Roopa Prabhu	efa5356b0d	bridge: per vlan dst_metadata netlink support This patch adds support to attach per vlan tunnel info dst metadata. This enables bridge driver to map vlan to tunnel_info at ingress and egress. It uses the kernel dst_metadata infrastructure. The initial use case is vlan to vni bridging, but the api is generic to extend to any tunnel_info in the future: - Uapi to configure/unconfigure/dump per vlan tunnel data - netlink functions to configure vlan and tunnel_info mapping - Introduces bridge port flag BR_LWT_VLAN to enable attach/detach dst_metadata to bridged packets on ports. off by default. - changes to existing code is mainly refactor some existing vlan handling netlink code + hooks for new vlan tunnel code - I have kept the vlan tunnel code isolated in separate files. - most of the netlink vlan tunnel code is handling of vlan-tunid ranges (follows the vlan range handling code). To conserve space vlan-tunid by default are always dumped in ranges if applicable. Use case: example use for this is a vxlan bridging gateway or vtep which maps vlans to vn-segments (or vnis). iproute2 example (patched and pruned iproute2 output to just show relevant fdb entries): example shows same host mac learnt on two vni's and vlan 100 maps to vni 1000, vlan 101 maps to vni 1001 before (netdev per vni): $bridge fdb show \| grep "00:02:00:00:00:03" 00:02:00:00:00:03 dev vxlan1001 vlan 101 master bridge 00:02:00:00:00:03 dev vxlan1001 dst 12.0.0.8 self 00:02:00:00:00:03 dev vxlan1000 vlan 100 master bridge 00:02:00:00:00:03 dev vxlan1000 dst 12.0.0.8 self after this patch with collect metdata in bridged mode (single netdev): $bridge fdb show \| grep "00:02:00:00:00:03" 00:02:00:00:00:03 dev vxlan0 vlan 101 master bridge 00:02:00:00:00:03 dev vxlan0 src_vni 1001 dst 12.0.0.8 self 00:02:00:00:00:03 dev vxlan0 vlan 100 master bridge 00:02:00:00:00:03 dev vxlan0 src_vni 1000 dst 12.0.0.8 self CC: Nikolay Aleksandrov <nikolay@cumulusnetworks.com> Signed-off-by: Roopa Prabhu <roopa@cumulusnetworks.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2017-02-03 15:21:22 -05:00
Roopa Prabhu	b3c7ef0ada	bridge: uapi: add per vlan tunnel info New nested netlink attribute to associate tunnel info per vlan. This is used by bridge driver to send tunnel metadata to bridge ports in vlan tunnel mode. This patch also adds new per port flag IFLA_BRPORT_VLAN_TUNNEL to enable vlan tunnel mode. off by default. One example use for this is a vxlan bridging gateway or vtep which maps vlans to vn-segments (or vnis). User can configure per-vlan tunnel information which the bridge driver can use to bridge vlan into the corresponding vn-segment. Signed-off-by: Roopa Prabhu <roopa@cumulusnetworks.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2017-02-03 15:21:21 -05:00
Roopa Prabhu	3ad7a4b141	vxlan: support fdb and learning in COLLECT_METADATA mode Vxlan COLLECT_METADATA mode today solves the per-vni netdev scalability problem in l3 networks. It expects all forwarding information to be present in dst_metadata. This patch series enhances collect metadata mode to include the case where only vni is present in dst_metadata, and the vxlan driver can then use the rest of the forwarding information datbase to make forwarding decisions. There is no change to default COLLECT_METADATA behaviour. These changes only apply to COLLECT_METADATA when used with the bridging use-case with a special dst_metadata tunnel info flag (eg: where vxlan device is part of a bridge). For all this to work, the vxlan driver will need to now support a single fdb table hashed by mac + vni. This series essentially makes this happen. use-case and workflow: vxlan collect metadata device participates in bridging vlan to vn-segments. Bridge driver above the vxlan device, sends the vni corresponding to the vlan in the dst_metadata. vxlan driver will lookup forwarding database with (mac + vni) for the required remote destination information to forward the packet. Changes introduced by this patch: - allow learning and forwarding database state in vxlan netdev in COLLECT_METADATA mode. Current behaviour is not changed by default. tunnel info flag IP_TUNNEL_INFO_BRIDGE is used to support the new bridge friendly mode. - A single fdb table hashed by (mac, vni) to allow fdb entries with multiple vnis in the same fdb table - rx path already has the vni - tx path expects a vni in the packet with dst_metadata - prior to this series, fdb remote_dsts carried remote vni and the vxlan device carrying the fdb table represented the source vni. With the vxlan device now representing multiple vnis, this patch adds a src vni attribute to the fdb entry. The remote vni already uses NDA_VNI attribute. This patch introduces NDA_SRC_VNI netlink attribute to represent the src vni in a multi vni fdb table. iproute2 example (patched and pruned iproute2 output to just show relevant fdb entries): example shows same host mac learnt on two vni's. before (netdev per vni): $bridge fdb show \| grep "00:02:00:00:00:03" 00:02:00:00:00:03 dev vxlan1001 dst 12.0.0.8 self 00:02:00:00:00:03 dev vxlan1000 dst 12.0.0.8 self after this patch with collect metadata in bridged mode (single netdev): $bridge fdb show \| grep "00:02:00:00:00:03" 00:02:00:00:00:03 dev vxlan0 src_vni 1001 dst 12.0.0.8 self 00:02:00:00:00:03 dev vxlan0 src_vni 1000 dst 12.0.0.8 self Signed-off-by: Roopa Prabhu <roopa@cumulusnetworks.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2017-02-03 15:21:21 -05:00
Roopa Prabhu	f35581d64e	ip_tunnels: new IP_TUNNEL_INFO_BRIDGE flag for ip_tunnel_info mode New ip_tunnel_info flag to represent bridged tunnel metadata. Used by bridge driver later in the series to pass per vlan dst metadata to bridge ports. Signed-off-by: Roopa Prabhu <roopa@cumulusnetworks.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2017-02-03 15:21:21 -05:00
David S. Miller	5a0fd98b7b	Merge branch 'ife-to-module' Yotam Gigi says: ==================== Extract IFE logic to module Extract ife logic from the tc_ife action into an independent module, and make the tc_ife action use it. This way, the ife encapsulation can be used by other modules other than tc_ife action. v1->v2: Fix duplicate symbol error by introducing a new patch that makes the original symbol static. The symbol ife_tlv_meta_extract is exported in act_ife, though not being used by any other module. As the symbol is being moved to the new ife module, introducing the new module creates duplicate symbol. To fix it, add a new patch (1/3) that makes the ife_tlv_meta_extract symbol static in act_ife, thus the symbol does not collide. ==================== Signed-off-by: David S. Miller <davem@davemloft.net>	2017-02-03 15:16:46 -05:00
Yotam Gigi	295a6e06d2	net/sched: act_ife: Change to use ife module Use the encode/decode functionality from the ife module instead of using implementation inside the act_ife. Reviewed-by: Jiri Pirko <jiri@mellanox.com> Signed-off-by: Yotam Gigi <yotamg@mellanox.com> Signed-off-by: Jamal Hadi Salim <jhs@mojatatu.com> Signed-off-by: Roman Mashak <mrv@mojatatu.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2017-02-03 15:16:46 -05:00
Yotam Gigi	1ce8460496	net: Introduce ife encapsulation module This module is responsible for the ife encapsulation protocol encode/decode logics. That module can: - ife_encode: encode skb and reserve space for the ife meta header - ife_decode: decode skb and extract the meta header size - ife_tlv_meta_encode - encodes one tlv entry into the reserved ife header space. - ife_tlv_meta_decode - decodes one tlv entry from the packet - ife_tlv_meta_next - advance to the next tlv Reviewed-by: Jiri Pirko <jiri@mellanox.com> Signed-off-by: Yotam Gigi <yotamg@mellanox.com> Signed-off-by: Jamal Hadi Salim <jhs@mojatatu.com> Signed-off-by: Roman Mashak <mrv@mojatatu.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2017-02-03 15:16:45 -05:00
Yotam Gigi	1d5e7c859e	net/sched: act_ife: Unexport ife_tlv_meta_encode As the function ife_tlv_meta_encode is not used by any other module, unexport it and make it static for the act_ife module. Signed-off-by: Yotam Gigi <yotamg@mellanox.com> Signed-off-by: Jamal Hadi Salim <jhs@mojatatu.com> Signed-off-by: Roman Mashak <mrv@mojatatu.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2017-02-03 15:16:45 -05:00
Eric Dumazet	3541f9e8bd	tcp: add tcp_mss_clamp() helper Small cleanup factorizing code doing the TCP_MAXSEG clamping. Signed-off-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2017-02-03 11:19:34 -05:00
Arnd Bergmann	ff3edc9b8e	hns_enet: use cpumask_var_t for on-stack mask On large SMP builds, we can run into a build warning: drivers/net/ethernet/hisilicon/hns/hns_enet.c: In function 'hns_set_irq_affinity.isra.27': drivers/net/ethernet/hisilicon/hns/hns_enet.c:1242:1: warning: the frame size of 1032 bytes is larger than 1024 bytes [-Wframe-larger-than=] The solution here is to use cpumask_var_t, which can use dynamic allocation when CONFIG_CPUMASK_OFFSTACK is enabled. Signed-off-by: Arnd Bergmann <arnd@arndb.de> Signed-off-by: David S. Miller <davem@davemloft.net>	2017-02-03 11:14:50 -05:00
Eric Dumazet	ceef438d61	virtio_net: remove custom busy_poll Generic NAPI busy polling allows us to remove custom implementations found in drivers. It is possible further optimization could be done by testing napi_complete_done() return value. Signed-off-by: Eric Dumazet <edumazet@google.com> Acked-by: Michael S. Tsirkin <mst@redhat.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2017-02-03 11:13:55 -05:00

1 2 3 4 5 ...

650529 Commits All Branches Search

650529 Commits

All Branches