Commit Graph

665860 Commits

Author SHA1 Message Date
Jesper Dangaard Brouer 5010e94842 samples/bpf: bpf_load.c detect and abort if ELF maps section size is wrong
The struct bpf_map_def was extended in commit fb30d4b712 ("bpf: Add tests
for map-in-map") with member unsigned int inner_map_idx.  This changed the size
of the maps section in the generated ELF _kern.o files.

Unfortunately the loader in bpf_load.c does not detect or handle this.  Thus,
older _kern.o files became incompatible, and caused hard-to-debug errors
where the syscall validation rejected BPF_MAP_CREATE request.

This patch only detects the situation and aborts load_bpf_file(). It also
adds code comments warning people who read this loader for inspiration
about these pitfalls.

Fixes: fb30d4b712 ("bpf: Add tests for map-in-map")
Signed-off-by: Jesper Dangaard Brouer <brouer@redhat.com>
Acked-by: Alexei Starovoitov <ast@kernel.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-04-30 22:41:59 -04:00
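
As a rough illustration of the kind of check commit 5010e94842 describes, the sketch below rejects a maps section whose size is not a whole multiple of the map definition struct. The struct layout and the name count_maps_sketch() are assumptions made for this example and are not taken from bpf_load.c.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-in for the loader's map definition. The last member
 * mirrors the inner_map_idx field named in the commit message; the rest of
 * the layout is illustrative only. */
struct map_def_sketch {
	unsigned int type;
	unsigned int key_size;
	unsigned int value_size;
	unsigned int max_entries;
	unsigned int map_flags;
	unsigned int inner_map_idx;
};

/* Returns 0 and stores the number of maps when the section size is a
 * non-zero whole multiple of the struct size, -1 otherwise. */
static int count_maps_sketch(size_t section_size, size_t *nr_maps)
{
	assert(nr_maps != NULL);

	if (section_size == 0 ||
	    section_size % sizeof(struct map_def_sketch) != 0)
		return -1;	/* likely built against a different struct */

	*nr_maps = section_size / sizeof(struct map_def_sketch);
	return 0;
}
```

A size check like this is only a heuristic: a section holding several old-layout entries can still add up to a multiple of the new struct size, so it cannot catch every mismatch.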
Dan Carpenter 39f3709599 lwtunnel: fix error path in lwtunnel_fill_encap()
We recently added a check to see if nla_nest_start() fails.  There are
two issues with that.  First, if it fails then I don't think we should
call nla_nest_cancel().  Second, it's slightly convoluted but the
current code returns success but we should return -EMSGSIZE instead.

Fixes: a50fe0ffd7 ("lwtunnel: check return value of nla_nest_start")
Signed-off-by: Dan Carpenter <dan.carpenter@oracle.com>
Acked-by: David Ahern <dsa@cumulusnetworks.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-04-30 22:41:29 -04:00
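
The two points made in commit 39f3709599 can be sketched in user space as follows. The message type and the nest_start_sketch()/nest_cancel_sketch() helpers are hypothetical stand-ins for the netlink attribute API, not the kernel functions themselves.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>

/* Toy message buffer: only tracks how many bytes are used. */
struct msg_sketch {
	size_t used;
	size_t cap;
	int cancel_calls;
};

/* Opens a nest; returns its start offset, or -1 when there is no room. */
static int nest_start_sketch(struct msg_sketch *m, size_t hdr)
{
	if (m->used + hdr > m->cap)
		return -1;
	m->used += hdr;
	return (int)(m->used - hdr);
}

/* Rolls the message back to where the nest began. */
static void nest_cancel_sketch(struct msg_sketch *m, int start)
{
	m->used = (size_t)start;
	m->cancel_calls++;
}

static int fill_encap_sketch(struct msg_sketch *m, size_t hdr, size_t payload)
{
	int nest;

	assert(m != NULL);

	nest = nest_start_sketch(m, hdr);
	if (nest < 0)
		return -EMSGSIZE;	/* nothing was opened, nothing to cancel */

	if (m->used + payload > m->cap) {
		nest_cancel_sketch(m, nest);	/* undo the nest we did open */
		return -EMSGSIZE;
	}

	m->used += payload;
	return 0;
}
```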
Dan Carpenter 77041e89ce liquidio: silence a locking static checker warning
Presumably we never hit this return, but static checkers complain that
we need to unlock so we may as well fix that.

Signed-off-by: Dan Carpenter <dan.carpenter@oracle.com>
Acked-by: Felix Manlunas <felix.manlunas@cavium.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-04-30 22:41:07 -04:00
Dan Carpenter 66117a9d9a qed: Unlock on error in qed_vf_pf_acquire()
My static checker complains that we're holding a mutex on this error
path.  Let's goto exit instead of returning directly.

Fixes: b0bccb69eb ("qed: Change locking scheme for VF channel")
Signed-off-by: Dan Carpenter <dan.carpenter@oracle.com>
Acked-by: Yuval Mintz <Yuval.Mintz@cavium.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-04-30 22:40:45 -04:00
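
The "goto exit instead of returning directly" pattern from commit 66117a9d9a (and the similar unlock fix in 77041e89ce) looks roughly like the sketch below. The fake mutex and acquire_sketch() are invented for illustration; the real driver code differs.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>

/* Toy lock that only records whether it is held. */
struct fake_mutex {
	int held;
};

static void fake_lock(struct fake_mutex *m)   { m->held = 1; }
static void fake_unlock(struct fake_mutex *m) { m->held = 0; }

static int acquire_sketch(struct fake_mutex *m, int request_ok)
{
	int rc = 0;

	assert(m != NULL);
	fake_lock(m);

	if (!request_ok) {
		rc = -EINVAL;
		goto exit;	/* a bare "return rc" here would leak the lock */
	}

	/* ... normal processing would go here ... */

exit:
	fake_unlock(m);		/* single unlock shared by every path */
	return rc;
}
```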
David S. Miller 1c942c941d Merge branch 'hns-deferred-probe'
lipeng says:

====================
net: hns: bug fix for HNS driver

This patchset adds support for deferred dsaf probe when the mdio and
mbigen modules have not been loaded with insmod.

For more details, please refer to individual patch.

change log:
V4 -> V5:
1. Float on net-next;
2. Delete patch "net: hns: fixed bug that skb used after kfree"
   from this patchset;

V3 -> V4:
1. Delete redundant commit message;
2. Add Reviewed-by: Matthias Brugger <mbrugger@suse.com>;

V2 -> V3:
1. Check return value of platform_get_irq in hns_rcb_get_cfg;

V1 -> V2:
1. Return appropriate errno in hns_mac_register_phy;
====================

Signed-off-by: David S. Miller <davem@davemloft.net>
2017-04-30 22:39:25 -04:00
lipeng 804ffe5c61 net: hns: support deferred probe when no mdio
In the hip06 and hip07 SoCs, the phy connects to the mdio bus. The mdio
module is probed with module_init, and, as such,
is not guaranteed to probe before the HNS driver. So we need
to support deferred probe.

We check for probe deferral in the mac init, so that we do not init DSAF
when there is no mdio and then free all resources, only to learn later
that we need to defer the probe.

Signed-off-by: lipeng <lipeng321@huawei.com>
Reviewed-by: Yisen Zhuang <yisen.zhuang@huawei.com>
Reviewed-by: Matthias Brugger <mbrugger@suse.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-04-30 22:39:24 -04:00
lipeng 2fdd6bafe3 net: hns: support deferred probe when can not obtain irq
In the hip06 and hip07 SoCs, the interrupt lines from the
DSAF controllers are connected to mbigen hw module.
The mbigen module is probed with module_init, and, as such,
is not guaranteed to probe before the HNS driver. So we need
to support deferred probe.

Signed-off-by: lipeng <lipeng321@huawei.com>
Reviewed-by: Yisen Zhuang <yisen.zhuang@huawei.com>
Reviewed-by: Matthias Brugger <mbrugger@suse.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-04-30 22:39:24 -04:00
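
A minimal user-space sketch of the deferred-probe idea in the two hns commits above: when a resource provider has not probed yet, the lookup fails with a "defer" error and the probe routine passes it up unchanged so the probe can be retried later. EPROBE_DEFER_SKETCH copies the kernel's internal EPROBE_DEFER value (517); get_irq_sketch() and probe_sketch() are hypothetical names, not HNS driver functions.

```c
#include <assert.h>
#include <stddef.h>

/* The kernel's EPROBE_DEFER is not exported to user space, so the value is
 * mirrored here for the sketch. */
#define EPROBE_DEFER_SKETCH 517

/* Stand-in for an IRQ lookup that fails until its provider is ready. */
static int get_irq_sketch(int provider_ready)
{
	return provider_ready ? 42 : -EPROBE_DEFER_SKETCH;
}

static int probe_sketch(int provider_ready, int *irq_out)
{
	int irq;

	assert(irq_out != NULL);

	irq = get_irq_sketch(provider_ready);
	if (irq < 0)
		return irq;	/* propagate the error, including the defer code */

	*irq_out = irq;
	return 0;
}
```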
David S. Miller ba1d82e681 Merge branch 'nfp-XDP_TX-optimizations'
Jakub Kicinski says:

====================
nfp: optimize XDP TX and small fixes

This series optimizes the nfp XDP TX performance a little bit.
I ran quick tests on an Intel(R) Xeon(R) CPU E5-2630 v4 @ 2.20GHz.
Single core/queue performance for both touch and drop and touch and
forward is above 20Mpps @64B packets, drop being 2Mpps faster.
I think this is max for a single queue on the low power NFPs.

There are also a few minor fixes included for code in net-next.
====================

Signed-off-by: David S. Miller <davem@davemloft.net>
2017-04-30 22:37:01 -04:00
Jakub Kicinski dbf637ff39 nfp: provide 256 bytes of XDP headroom in all configurations
For legacy reasons NFP FW may be compiled to DMA packets to a constant
offset into the buffer and use the space before it for metadata.  This
ensures that packet data always starts at a certain offset regardless of
the amount of preceding metadata.

If rx offset is set to 0 there may still be up to 64 bytes of metadata
but metadata will start at the beginning of the buffer, instead of:

    data_start_offset = rx_offset - meta_len

Even though we make the buffers larger to accommodate up to 64 bytes of
metadata, if there are only N bytes of metadata, we will end up with
N bytes of headroom and 64 - N bytes of tailroom.  Therefore we can't
rely on that space for XDP headroom.  Make sure we always allocate
the full 256 bytes.  This, unfortunately, means we can't fit the headroom
in a u8 any more.

Signed-off-by: Jakub Kicinski <jakub.kicinski@netronome.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-04-30 22:37:00 -04:00
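
The arithmetic in commit dbf637ff39 can be checked with a small sketch. The constants 64 and 256 come from the commit message; the macro and function names are made up for this example.

```c
#include <assert.h>
#include <stdint.h>

#define META_MAX_SKETCH     64u		/* max metadata bytes, per the message */
#define XDP_HEADROOM_SKETCH 256u	/* headroom the commit always allocates */

/* Fixed-offset mode: data_start_offset = rx_offset - meta_len. */
static unsigned int data_start_offset(unsigned int rx_offset,
				      unsigned int meta_len)
{
	assert(meta_len <= rx_offset);
	return rx_offset - meta_len;
}

/* rx offset 0: N bytes of metadata leave N bytes of headroom ... */
static unsigned int headroom_at_zero_offset(unsigned int meta_len)
{
	assert(meta_len <= META_MAX_SKETCH);
	return meta_len;
}

/* ... and 64 - N bytes of tailroom. */
static unsigned int tailroom_at_zero_offset(unsigned int meta_len)
{
	assert(meta_len <= META_MAX_SKETCH);
	return META_MAX_SKETCH - meta_len;
}

/* A u8 tops out at 255, so a 256-byte headroom no longer fits. */
static int fits_in_u8(unsigned int value)
{
	return value <= UINT8_MAX;
}
```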
Jakub Kicinski 85cb207ee3 nfp: don't completely refuse to work with old flashes
Right now the required Service Process ABI version is still tied
to the max ID of known commands.  For the new NSP commands we are adding,
we check whether the NSP version is recent enough on a command-by-command
basis.  The driver doesn't have to force the device to have the
very latest flash; anything newer than 0.8 should do.

Signed-off-by: Jakub Kicinski <jakub.kicinski@netronome.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-04-30 22:37:00 -04:00
Jakub Kicinski d38df0d364 nfp: avoid reading TX queue indexes from the device
Reading TX queue indexes from the device memory on each interrupt
is expensive.  It's doubly expensive with XDP running since we have
two TX rings to check there.  If the software indexes indicate that
the TX queue is completely empty, however, we don't need to look at
the device completion index at all.

The queuing CPU is doing a wmb() before kicking the device TX so
it should be safe to assume that the CPU handling the completions will
never see an old value of the software copy of the index.

Signed-off-by: Jakub Kicinski <jakub.kicinski@netronome.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-04-30 22:37:00 -04:00
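
The shortcut described in commit d38df0d364 can be sketched as below: if the software producer and consumer indexes match, the ring is empty and the (expensive) device read is skipped. The ring layout and names are assumptions for illustration, and the memory barriers the commit message discusses are deliberately left out of this single-threaded sketch.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

struct tx_ring_sketch {
	uint32_t wr_p;			/* software producer index */
	uint32_t rd_p;			/* software consumer index */
	uint32_t hw_rd_p;		/* stand-in for the device register */
	unsigned int device_reads;	/* counts simulated device accesses */
};

/* Simulates the costly read of the device completion index. */
static uint32_t read_device_index(struct tx_ring_sketch *r)
{
	r->device_reads++;
	return r->hw_rd_p;
}

/* Returns how many descriptors were completed. */
static uint32_t tx_complete_sketch(struct tx_ring_sketch *r)
{
	uint32_t done;

	assert(r != NULL);

	if (r->wr_p == r->rd_p)
		return 0;	/* software view says empty: skip the device */

	done = read_device_index(r) - r->rd_p;
	r->rd_p += done;
	return done;
}
```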
Jakub Kicinski 92e68195eb nfp: do simple XDP TX buffer recycling
On the RX path we follow the "drop if allocation of replacement
buffer fails" rule.  With XDP we extended that to the TX action,
so if XDP prog returned TX but allocation of replacement RX buffer
failed, we will drop the packet.

To improve our XDP TX performance extend the idea of rings being
always full to XDP TX rings.  Pre-fill the XDP TX rings with RX
buffers, and when XDP prog returns TX action swap the RX buffer
with the next buffer from the TX ring.

XDP TX complete will no longer free the buffers but let them
sit on the TX ring and wait for swap with RX buffer, instead.

Signed-off-by: Jakub Kicinski <jakub.kicinski@netronome.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-04-30 22:37:00 -04:00
Jakub Kicinski d78005a50f nfp: drop rx_ring param from buffer allocation
We will soon allocate RX buffers for caching on XDP TX rings.
The rx_ring parameter passed to nfp_net_rx_alloc_one() is not
actually used, remove it.

Signed-off-by: Jakub Kicinski <jakub.kicinski@netronome.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-04-30 22:37:00 -04:00
Jakub Kicinski 46c505188b nfp: replace -ENOTSUPP with -EOPNOTSUPP
As Or points out in commit 423b3aecf2 ("net/mlx4: Change ENOTSUPP
to EOPNOTSUPP"), ENOTSUPP is an NFS-specific error.  Replace it with
EOPNOTSUPP.

Signed-off-by: Jakub Kicinski <jakub.kicinski@netronome.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-04-30 22:37:00 -04:00
Willem de Bruijn 1d11e732e7 virtio-net: use netif_tx_napi_add for tx napi
Avoid hashing the tx napi struct into napi_hash[], which is used for
busy polling receive queues.

Signed-off-by: Willem de Bruijn <willemb@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-04-30 22:33:04 -04:00
David Howells b5082df801 net: Initialise init_net.count to 1
Initialise init_net.count to 1 for its pointer from init_nsproxy lest
someone tries to do a get_net() and a put_net() in a process in which
current->ns_proxy->net_ns points to the initial network namespace.

Signed-off-by: David Howells <dhowells@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-04-30 22:32:16 -04:00
Girish Moodalbail 5e0740c445 geneve: fix incorrect setting of UDP checksum flag
Creating a geneve link with 'udpcsum' set results in a creation of link
for which UDP checksum will NOT be computed on outbound packets, as can
be seen below.

11: gen0: <BROADCAST,MULTICAST> mtu 1500 qdisc noop state DOWN
    link/ether c2:85:27:b6:b4:15 brd ff:ff:ff:ff:ff:ff promiscuity 0
    geneve id 200 remote 192.168.13.1 dstport 6081 noudpcsum

Similarly, creating a link with 'noudpcsum' set results in a creation
of link for which UDP checksum will be computed on outbound packets.

Fixes: 9b4437a5b8 ("geneve: Unify LWT and netdev handling.")
Signed-off-by: Girish Moodalbail <girish.moodalbail@oracle.com>
Acked-by: Pravin B Shelar <pshelar@ovn.org>
Acked-by: Lance Richardson <lrichard@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-04-30 22:31:16 -04:00
David S. Miller c4879789b1 Merge branch 'vxlan-disabled-ipv6'
Jiri Benc says:

====================
vxlan: do not error out on disabled IPv6

This patchset fixes a bug with metadata based tunnels when booted with
ipv6.disable=1.
====================

Signed-off-by: David S. Miller <davem@davemloft.net>
2017-04-30 22:30:13 -04:00
Jiri Benc baf4d78607 vxlan: do not output confusing error message
The message "Cannot bind port X, err=Y" creates only confusion. In metadata
based mode, failure of IPv6 socket creation is okay if IPv6 is disabled and
no error message should be printed. But when IPv6 tunnel was requested, such
failure is fatal. The vxlan_socket_create does not know when the error is
harmless and when it's not.

Instead of passing such information down to vxlan_socket_create, remove the
message completely. It's not useful. We propagate the error code up to the
user space and the port number comes from the user space. There's nothing in
the message that the process creating vxlan interface does not know.

Signed-off-by: Jiri Benc <jbenc@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-04-30 22:30:13 -04:00
Jiri Benc d074bf9600 vxlan: correctly handle ipv6.disable module parameter
When IPv6 is compiled but disabled at runtime, __vxlan_sock_add returns
-EAFNOSUPPORT. For metadata based tunnels, this causes failure of the whole
operation of bringing up the tunnel.

Ignore failure of IPv6 socket creation for metadata based tunnels caused by
IPv6 not being available.

Fixes: b1be00a6c3 ("vxlan: support both IPv4 and IPv6 sockets in a single vxlan device")
Signed-off-by: Jiri Benc <jbenc@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-04-30 22:30:13 -04:00
Andy Shevchenko 90a1bb9816 bnx2x: Get rid of useless temporary variable
Replace pattern

 int status;
 ...
 status = func(...);
 return status;

by

 return func(...);

No functional change intended.

Signed-off-by: Andy Shevchenko <andy.shevchenko@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-04-30 22:28:00 -04:00
Andy Shevchenko b77f016726 bnx2x: Reuse bnx2x_null_format_ver()
Reuse bnx2x_null_format_ver() in functions where it's appropriate
instead of the open-coded variant.

Signed-off-by: Andy Shevchenko <andy.shevchenko@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-04-30 22:28:00 -04:00
Andy Shevchenko 55b218c12c bnx2x: Replace custom scnprintf()
Use scnprintf() when printing version instead of custom open coded variants.

Signed-off-by: Andy Shevchenko <andy.shevchenko@gmail.com>
Acked-by: Yuval Mintz <Yuval.Mintz@cavium.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-04-30 22:28:00 -04:00
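
For readers unfamiliar with scnprintf(), which commit 55b218c12c switches to: unlike snprintf(), it returns the number of characters actually stored in the buffer (excluding the terminating NUL), and 0 when the size is 0. The function below is a user-space approximation written for this note on top of vsnprintf(); it is not the kernel implementation.

```c
#include <assert.h>
#include <stdarg.h>
#include <stddef.h>
#include <stdio.h>

static int scnprintf_sketch(char *buf, size_t size, const char *fmt, ...)
{
	va_list args;
	int n;

	if (size == 0)
		return 0;	/* nothing can be stored */

	assert(buf != NULL && fmt != NULL);

	va_start(args, fmt);
	n = vsnprintf(buf, size, fmt, args);
	va_end(args);

	if (n < 0)
		return 0;	/* encoding error: report nothing written */

	/* vsnprintf reports the untruncated length; clamp it to what fits. */
	return (size_t)n >= size ? (int)(size - 1) : n;
}
```

Because the return value never exceeds the space used, it can be added to a running offset when several fields are appended to one buffer.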
David S. Miller 4750c7be5b linux-can-next-for-4.12-20170427
-----BEGIN PGP SIGNATURE-----
 
 iQFHBAABCgAxFiEE4bay/IylYqM/npjQHv7KIOw4HPYFAlkBofATHG1rbEBwZW5n
 dXRyb25peC5kZQAKCRAe/sog7Dgc9qY7CACnNQLrwm8fIV72DD9+pZ1QsNwvLHCr
 8ZiHSd7m/VI/FESje4Uc3D9vKi7K6MGX9/lWi70QY+2IPrTfRTrYEO5I4U3+1czc
 zQ/NnQGsKBoAJdPLoXp0LkQVAXRSXUoKiaiU0VdktaAzGewixz9YxGHAcCqcu6oM
 1+MftSR0iCP4mkcOyVMcnipcI9L8TzEOz2f3XB/RtMiRGcEsUekAHfdpH6NSHbAq
 mG5pOwENqNn4i1ZwGg16u3p+U4YfWAF9Y+/wuev6Ob4Kla2tYnuZNEAqf9ZubY93
 NBsT3pGjcbzbbihlKvu64NUlGiloQ8AGAYvttinrfx1tnsDgvP2m6nQP
 =mHEY
 -----END PGP SIGNATURE-----

Merge tag 'linux-can-next-for-4.12-20170427' of git://git.kernel.org/pub/scm/linux/kernel/git/mkl/linux-can-next

Marc Kleine-Budde says:

====================
pull-request: can-next 2017-04-25

this is a pull request of 1 patch for net-next/master.

This patch by Oliver Hartkopp fixes the build of the broadcast manager
with CONFIG_PROC_FS disabled.
====================

Signed-off-by: David S. Miller <davem@davemloft.net>
2017-04-30 22:25:59 -04:00
Chenbo Feng 5d4e344328 bpf: Fix inaccurate helper function description
The description inside uapi/linux/bpf.h of the bpf_get_socket_uid
helper function is no longer valid. It returns overflowuid rather
than 0 on failure.

Signed-off-by: Chenbo Feng <fengc@google.com>
Acked-by: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: Alexei Starovoitov <ast@kernel.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-04-30 22:24:58 -04:00
Davide Caratti d68be71ea1 tcp: fix access to sk->sk_state in tcp_poll()
Avoid direct access to sk->sk_state when tcp_poll() is called on a socket
using active TCP fastopen with deferred connect. Use local variable
'state', which stores the result of sk_state_load(), like it was done in
commit 00fd38d938 ("tcp: ensure proper barriers in lockless contexts").

Fixes: 19f6d3f3c8 ("net/tcp-fastopen: Add new API support")
Signed-off-by: Davide Caratti <dcaratti@redhat.com>
Acked-by: Wei Wang <weiwan@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-04-30 22:24:16 -04:00
Eric Dumazet d1f496fd8f bpf: restore skb->sk before pskb_trim() call
While testing a fix [1] in ___pskb_trim(), addressing the WARN_ON_ONCE()
in skb_try_coalesce() reported by Andrey, I found that we had an skb
with skb->sk set but no skb->destructor.

This invalidated the heuristic found in commit 158f323b98 ("net: adjust
skb->truesize in pskb_expand_head()") and in the cited patch.

Considering the BUG_ON(skb->sk) we have in skb_orphan(), we should
restrain the temporary setting to a minimal section.

[1] https://patchwork.ozlabs.org/patch/755570/
    net: adjust skb->truesize in ___pskb_trim()

Fixes: 8f917bba00 ("bpf: pass sk to helper functions")
Signed-off-by: Eric Dumazet <edumazet@google.com>
Cc: Willem de Bruijn <willemb@google.com>
Cc: Andrey Konovalov <andreyknvl@google.com>
Acked-by: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: Alexei Starovoitov <ast@kernel.org>
Acked-by: Willem de Bruijn <willemb@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-04-30 22:23:16 -04:00
Alexandre Belloni ae3696c167 net: macb: fix phy interrupt parsing
Since 83a77e9ec4, the phydev irq is explicitly set to PHY_POLL when
there is no pdata. It doesn't work on DT enabled platforms because the
phydev irq is already set by libphy before.

Fixes: 83a77e9ec4 ("net: macb: Added PCI wrapper for Platform Driver.")
Signed-off-by: Alexandre Belloni <alexandre.belloni@free-electrons.com>
Acked-by: Nicolas Ferre <nicolas.ferre@microchip.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-04-30 22:21:49 -04:00
David S. Miller 8dd5b698c2 Merge branch '40GbE' of git://git.kernel.org/pub/scm/linux/kernel/git/jkirsher/next-queue
Jeff Kirsher says:

====================
40GbE Intel Wired LAN Driver Updates 2017-04-30

This series contains updates to i40e and i40evf only.

Jake provides the majority of the changes in this series, starting with the
renaming of a flag to avoid confusion.  Then renamed a variable to a
more meaningful name to clarify what is actually being done and to
reduce confusion.  Amortizes the wait time when initializing or disabling
lots of VFs by using i40e_reset_all_vfs() and
i40e_vsi_stop_rings_no_wait().  Cleaned up an unnecessary delay since
pci_disable_sriov() already has its own delay, so there is no need to add an
additional delay when removing VFs.  Avoid using the same name flags for both
vsi->state and pf->state, to make code review easier and assist future
work to use the correct state field when checking bits.  Use
DECLARE_BITMAP() to ensure that we always allocate enough space for flags.
Replace hw_disabled_flags with the new _AUTO_DISABLED flags, which are
more readable because we are not setting an *_ENABLED flag to
disable the feature.

Alex corrects an oversight where we were not reprogramming the ports
after a reset, which was causing us to lose all of the receive tunnel
offloads.

Arnd Bergmann moves the declaration of a local variable to avoid a
warning seen on architectures with larger pages about an unused variable.
====================

Signed-off-by: David S. Miller <davem@davemloft.net>
2017-04-30 11:33:37 -04:00
David S. Miller 5b36d8f5e5 Merge branch '1GbE' of git://git.kernel.org/pub/scm/linux/kernel/git/jkirsher/next-queue
Jeff Kirsher says:

====================
1GbE Intel Wired LAN Driver Updates 2017-04-30

This series contains updates to e1000e only.

Jarod Wilson fixes an issue where the workaround for 82574 & 82583
is needed for i218 as well, so set the appropriate flags.

Sasha adds support for the upcoming new i219 devices for the client
platform (CannonLake), which includes the support for 38.4MHz frequency
to support PTP on CannonLake.
====================

Signed-off-by: David S. Miller <davem@davemloft.net>
2017-04-30 11:30:13 -04:00
Marcel Holtmann 71653eb64b Bluetooth: Add selftest for ECDH key generation
Since the ECDH key generation takes a different path, it needs to be
tested as well. For this generate the public debug key from the private
debug key and compare both.

This also moves the seeding of the private key into the SMP calling code
to allow for easier re-use of the ECDH key generation helper.

Signed-off-by: Marcel Holtmann <marcel@holtmann.org>
Signed-off-by: Johan Hedberg <johan.hedberg@intel.com>
2017-04-30 16:52:43 +03:00
Marcel Holtmann f958315358 Bluetooth: zero kpp input for key generation
When generating new ECDH keys with kpp, the shared secret input needs to
be set to NULL. Fix this by including kpp_request_set_input call.

Fixes: 58771c1c ("Bluetooth: convert smp and selftest to crypto kpp API")
Signed-off-by: Marcel Holtmann <marcel@holtmann.org>
Signed-off-by: Johan Hedberg <johan.hedberg@intel.com>
2017-04-30 16:52:39 +03:00
Eli Cohen 0a0ab1d2cc net/mlx5: E-Switch, Avoid redundant memory allocation
struct esw_mc_addr is a small struct that can be part of struct
mlx5_eswitch. Define it as a field and not as a pointer and save the
kzalloc call and the error flow handling that goes with it.

Signed-off-by: Eli Cohen <eli@mellanox.com>
Signed-off-by: Saeed Mahameed <saeedm@mellanox.com>
2017-04-30 16:03:21 +03:00
Eran Ben Elisha 0f6e4cf674 net/mlx5e: Disable HW LRO when PCI is slower than link on striding RQ
We will activate the HW LRO only on servers with PCI BW > MAX LINK BW,
or when PCI BW > 16Gbps. In other cases we do not want LRO by default as
LRO sessions might time out and add redundant software overhead.

Tested:
	ethtool -k <ifs-name> | grep large-receive-offload
	On systems with and without the limitations.

Signed-off-by: Eran Ben Elisha <eranbe@mellanox.com>
Cc: kernel-team@fb.com
Signed-off-by: Saeed Mahameed <saeedm@mellanox.com>
2017-04-30 16:03:20 +03:00
Tariq Toukan b1b03bded1 net/mlx5e: Use u8 as ownership type in mlx5e_get_cqe()
CQE ownership indication is as small as a single bit.
Use u8 to speed up the comparison.

Signed-off-by: Tariq Toukan <tariqt@mellanox.com>
Cc: kernel-team@fb.com
Signed-off-by: Saeed Mahameed <saeedm@mellanox.com>
2017-04-30 16:03:19 +03:00
Tariq Toukan ad78af9b88 net/mlx5e: Use prefetchw when a write is to follow
"prefetchw()" prefetches the cacheline for write. Use it for
skb->data, as we'll soon be copying the packet header there.

Performance:
Single-stream packet-rate tested with pktgen.
Packets are dropped in tc level to zoom into driver data-path.
Larger gain is expected for smaller packets, as less time
is spent on handling SKB fragments, making the path shorter
and the improvement more significant.

---------------------------------------------
packet size | before    | after     | gain  |
64B         | 4,113,306 | 4,778,720 |  16%  |
1024B       | 3,633,819 | 3,950,593 | 8.7%  |

Signed-off-by: Tariq Toukan <tariqt@mellanox.com>
Cc: kernel-team@fb.com
Signed-off-by: Saeed Mahameed <saeedm@mellanox.com>
2017-04-30 16:03:17 +03:00
Tariq Toukan 1f5b1e47ee net/mlx5e: Optimize poll ICOSQ completion queue
UMR operations are more frequent and important.
Check them first, and add a compiler branch predictor hint.

According to current design, ICOSQ CQ can contain at most one
pending CQE per napi. Poll function is optimized accordingly.

Performance:
Single-stream packet-rate tested with pktgen.
Packets are dropped in tc level to zoom into driver data-path.
Larger gain is expected for larger packet sizes, as BW is higher
and UMR posts are more frequent.

---------------------------------------------
packet size | before    | after     | gain  |
64B         | 4,092,370 | 4,113,306 |  0.5% |
1024B       | 3,421,435 | 3,633,819 |  6.2% |

Signed-off-by: Tariq Toukan <tariqt@mellanox.com>
Cc: kernel-team@fb.com
Signed-off-by: Saeed Mahameed <saeedm@mellanox.com>
2017-04-30 16:03:16 +03:00
Hadar Hen Zion a2fa1fe5ad net/mlx5e: Act on delay probe time updates
The user can change delay_first_probe_time parameter through sysctl.
Listen to NETEVENT_DELAY_PROBE_TIME_UPDATE notifications and update the
intervals for updating the neighbours 'used' value periodic task and
for flow HW counters query periodic task.
Both intervals will be updated only if the new delay probe
time value is lower than the current interval.

Since the driver saves only one min interval value and not per device,
the users will be able to set lower interval value for updating
neighbour 'used' value periodic task but they won't be able to schedule
a higher interval for this periodic task.
The interval used for scheduling the neighbour 'used' value periodic task is
the minimal delay probe time parameter ever seen by the driver.

Signed-off-by: Hadar Hen Zion <hadarh@mellanox.com>
Reviewed-by: Or Gerlitz <ogerlitz@mellanox.com>
Signed-off-by: Saeed Mahameed <saeedm@mellanox.com>
2017-04-30 16:03:15 +03:00
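
The "lower values only" rule from commit a2fa1fe5ad amounts to keeping a running minimum. A minimal sketch, with a hypothetical state struct and function name that do not come from the mlx5e driver:

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical driver-wide state: one minimum, not one value per device. */
struct interval_state_sketch {
	unsigned long min_interval;
};

/* Applies a new delay probe time; returns 1 if the interval changed. */
static int on_delay_probe_time_update(struct interval_state_sketch *s,
				      unsigned long new_delay)
{
	assert(s != NULL);

	if (new_delay >= s->min_interval)
		return 0;	/* higher or equal values are ignored */

	s->min_interval = new_delay;
	return 1;
}
```

This makes the consequence stated in the message visible: once a low value has been seen, a later higher value cannot raise the interval again.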
Hadar Hen Zion f6dfb4c3f2 net/mlx5e: Update neighbour 'used' state using HW flow rules counters
When IP tunnel encapsulation rules are offloaded, the kernel can't see
the traffic of the offloaded flow. The neighbour for the IP tunnel
destination of the offloaded flow can mistakenly become STALE and
be deleted by the kernel since its 'used' value wasn't changed.

To make sure that a neighbour which is used by the HW won't become
STALE, we proactively update the neighbour 'used' value every
DELAY_PROBE_TIME period, when packets were matched and counted by the HW
for one of the tunnel encap flows related to this neighbour.

The periodic task that updates the used neighbours is scheduled when a
tunnel encap rule is successfully offloaded into HW and keeps re-scheduling
itself as long as the representor's neighbours list isn't empty.

Add, remove, lookup and status change operations done over the
representor's neighbours list or the neighbour hash entry encaps list
are all serialized by RTNL lock.

Signed-off-by: Hadar Hen Zion <hadarh@mellanox.com>
Reviewed-by: Or Gerlitz <ogerlitz@mellanox.com>
Signed-off-by: Saeed Mahameed <saeedm@mellanox.com>
2017-04-30 16:03:14 +03:00
Hadar Hen Zion 232c001398 net/mlx5e: Add support to neighbour update flow
In order to offload TC encap rules, the driver does a lookup for the IP
tunnel neighbour according to the output device and the destination IP
given by the user.

To keep track of the validity state of such neighbours, we keep
the neighbours' information (pair of device pointer and destination IP)
in a hash table maintained at the relevant egress representor and
register to get NETEVENT_NEIGH_UPDATE events. When getting a neighbour update
netevent, we search for a match among the cached neighbour entries used for
encapsulation.

In case the neighbour isn't valid, we can't offload the flow into the
HW. We cache the flow (requested matching and actions) in the driver and
offload the rule later, when the neighbour is resolved and becomes
valid.

When a flow is only cached in the driver and not offloaded into HW
yet, we use the EAGAIN return value to mark it internally; the TC ndo still
returns success.

Listen to kernel neighbour update netevents to track the validity state
of the relevant neighbours:

1. If a neighbour becomes valid, offload the related rules to HW.

2. If the neighbour becomes invalid, remove the related rules from HW.

3. If the neighbour MAC address was changed, update the encap header.
   Remove all the offloaded rules using the old encap header from the HW
   and insert new rules to HW with updated encap header.

Access to the neighbors hash table is protected by RTNL lock of its
caller or by the table's spinlock.

Details of the locking/synchronization among the different actions
applied on the neighbour table:

Add/remove operations - protected by RTNL lock of its caller (all TC
commands are protected by RTNL lock). Add and remove operations are
initiated only when the user inserts/removes a TC rule into/from the driver.

Lookup/remove operations - since the lookup operation is done from
netevent notifier block, RTNL lock can't be used (atomic context).
Use the table's spin lock to protect lookups from TC user removal operation.
bh is used since netevent can be called from a softirq context.

Lookup/add operations - The hash table access functions are taking
care of the protection between lookup and add operations.

When adding/removing encap headers and rules to/from the HW, RTNL lock
is used. It can happen when:

1. The user inserts/removes a TC rule into/from the driver (TC commands
are protected by RTNL lock of its caller).

2. The driver gets neighbour notification event, which reports about
neighbour validity status change. Before adding/removing encap headers
and rules to/from the HW, RTNL lock is taken.

A neighbour hash table entry should be freed when its encap list is empty.
Since the neighbour update netevent notification schedules a neighbour
update work that uses the neighbour hash entry, it can't be freed
unconditionally when the encap list becomes empty during TC delete rule flow.
Use reference count to protect from freeing neighbour hash table entry
while it's still in use.

When the user asks to unregister a netdevice used by one of the neighbours,
neighbour removal notification is received. Then we take a reference on the
neighbour and don't free it until the relevant encap entries (and flows) are
marked as invalid (not offloaded) and removed from HW.
As long as the encap entry is still valid (checked under RTNL lock) we
can safely access the neighbour device saved on mlx5e_neigh struct.

Signed-off-by: Hadar Hen Zion <hadarh@mellanox.com>
Reviewed-by: Or Gerlitz <ogerlitz@mellanox.com>
Signed-off-by: Saeed Mahameed <saeedm@mellanox.com>
2017-04-30 16:03:13 +03:00
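
The reference-counting rule in commit 232c001398 (do not free a neighbour hash entry while scheduled work still uses it) can be sketched as follows. This is a single-threaded toy with a plain integer counter and invented names; kernel code would use an atomic reference count and the locking described in the message.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

struct neigh_entry_sketch {
	int refcnt;
};

static int freed_entries;	/* lets the example observe when a free happens */

/* Creates an entry holding one reference for the hash table itself. */
static struct neigh_entry_sketch *entry_create(void)
{
	struct neigh_entry_sketch *e = calloc(1, sizeof(*e));

	if (e)
		e->refcnt = 1;
	return e;
}

/* Taken by anything that will use the entry later, e.g. queued work. */
static void entry_hold(struct neigh_entry_sketch *e)
{
	assert(e != NULL);
	e->refcnt++;
}

/* Drops one reference; the last one out frees the entry. */
static void entry_put(struct neigh_entry_sketch *e)
{
	assert(e != NULL && e->refcnt > 0);
	if (--e->refcnt == 0) {
		free(e);
		freed_entries++;
	}
}
```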
Hadar Hen Zion 37b498ff23 net/mlx5e: Add neighbour hash table to the representors
Add hash table to the representors which is to be used by the next patch
to save neighbours information in the driver.

In order to offload IP tunnel encapsulation rules, the driver must find
the tunnel dst neighbour according to the output device and the
destination address given by the user. The next patch will cache the
neighbors information in the driver to allow support in neigh update
flow for tunnel encap rules.

The neighbour entries are also saved in a list so we can easily iterate over
them when querying statistics in order to provide 'used' feedback to the
kernel neighbour NUD core.

Signed-off-by: Hadar Hen Zion <hadarh@mellanox.com>
Reviewed-by: Or Gerlitz <ogerlitz@mellanox.com>
Signed-off-by: Saeed Mahameed <saeedm@mellanox.com>
2017-04-30 16:03:12 +03:00
Hadar Hen Zion 033354d501 net/mlx5e: Read neigh parameters with proper locking
The nud_state and hardware address fields are protected by the neighbour
lock, we should acquire it before accessing those parameters.

Use this lock to avoid inconsistency between the neighbour validity state
and its hardware address.

Signed-off-by: Hadar Hen Zion <hadarh@mellanox.com>
Reviewed-by: Or Gerlitz <ogerlitz@mellanox.com>
Signed-off-by: Saeed Mahameed <saeedm@mellanox.com>
2017-04-30 16:03:10 +03:00
Hadar Hen Zion 0b67a38fe6 net/mlx5e: Use flag to properly monitor a flow rule offloading state
Instead of relying on the 'flow->rule' pointer value which can be
valid or invalid (in case the FW returns an error while trying to offload
the rule), monitor the rule state using a flag.

In a downstream patch which adds support for the IP tunneling neigh update
flow, a TC rule could be cached in the driver and not offloaded into the
HW. In this case, the flow handle pointer stays NULL.

Check the offloaded flag to properly deal with rules which are currently
not offloaded when querying rule statistics.

This patch doesn't add any new functionality.

Signed-off-by: Hadar Hen Zion <hadarh@mellanox.com>
Reviewed-by: Or Gerlitz <ogerlitz@mellanox.com>
Signed-off-by: Saeed Mahameed <saeedm@mellanox.com>
2017-04-30 16:03:09 +03:00
Hadar Hen Zion 1a8552bd81 net/mlx5e: Remove output device parameter from create encap header helpers definition
Passing output device parameter to the helper functions that deal with
creation of encapsulation headers is redundant. Output device parameter
can be defined inside those helpers, no need to pass it. Refactor the code by
removing the parameter from the function signature.

This patch doesn't change any functionality.

Signed-off-by: Hadar Hen Zion <hadarh@mellanox.com>
Reviewed-by: Or Gerlitz <ogerlitz@mellanox.com>
Signed-off-by: Saeed Mahameed <saeedm@mellanox.com>
2017-04-30 16:03:08 +03:00
Or Gerlitz c1ae11521b net/mlx5e: Move the encap entry structure from the eswitch header
The encap entry structure isn't manipulated by the eswitch code,
hence it can/needs to be removed from the eswitch header.

Do that, and change it to have mlx5e_ prefix.

This patch doesn't change any functionality.

Signed-off-by: Or Gerlitz <ogerlitz@mellanox.com>
Signed-off-by: Saeed Mahameed <saeedm@mellanox.com>
2017-04-30 16:03:07 +03:00
Or Gerlitz 45247bf298 net/mlx5: Remove encap entry pointer from the eswitch flow attributes
Encap wise, the tc eswitch flow attribute struct needs to have
only the encap ID which is programmed later to the HW and none
of the higher level encap params, fix that.

This patch doesn't change any functionality.

Signed-off-by: Or Gerlitz <ogerlitz@mellanox.com>
Signed-off-by: Saeed Mahameed <saeedm@mellanox.com>
2017-04-30 16:03:06 +03:00
Saeed Mahameed 1d447a3914 net/mlx5e: Extendable vport representor netdev private data
Make representor netdev private data extendable by adding new struct
"mlx5e_rep_priv" and use it as the rep netdev private data struct
instead of directly pointing to mlx5_eswitch_rep.

Added new en_rep.h header file to contain all representor related
definitions and prototypes, and moved all representor specific logic
into en_rep.c.

Needed for downstream patches to extend representor functionality to
support neighbour update.

Signed-off-by: Saeed Mahameed <saeedm@mellanox.com>
Reviewed-by: Or Gerlitz <ogerlitz@mellanox.com>
2017-04-30 16:02:48 +03:00
Sasha Neftin 68fe1d5da5 e1000e: Add Support for 38.4MHZ frequency
Support for the 38.4MHz frequency is required for PTP
on CannonLake. SYSTIM frequency adjustment attributes for TIMINCA are
get/set depending on the hardware clock frequency for the different
types of adapters. The 38.4MHz frequency is supported by CannonLake
and is active once the time synchronisation mechanism is enabled.
Changed abbreviation from Hz to HZ to comply with checkpatch code style.

Signed-off-by: Sasha Neftin <sasha.neftin@intel.com>
Reviewed-by: Raanan Avargil <raanan.avargil@intel.com>
Reviewed-by: Dima Ruinskiy <dima.ruinskiy@intel.com>
Tested-by: Aaron Brown <aaron.f.brown@intel.com>
Signed-off-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com>
2017-04-30 05:22:32 -07:00
Sasha Neftin c8744f44ae e1000e: Add Support for CannonLake
Propagate the CannonLake mac type to driver functionality.

Signed-off-by: Sasha Neftin <sasha.neftin@intel.com>
Reviewed-by: Raanan Avargil <raanan.avargil@intel.com>
Reviewed-by: Dima Ruinskiy <dima.ruinskiy@intel.com>
Tested-by: Aaron Brown <aaron.f.brown@intel.com>
Signed-off-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com>
2017-04-30 05:18:30 -07:00
Sasha Neftin 3a3173b9c3 e1000e: Initial Support for CannonLake
i219 (6) and i219 (7) are the next LOM generations that will be
available on the next Intel Client platform (CannonLake).
This patch provides the initial support for these devices.

Signed-off-by: Sasha Neftin <sasha.neftin@intel.com>
Reviewed-by: Raanan Avargil <raanan.avargil@intel.com>
Reviewed-by: Dima Ruinskiy <dima.ruinskiy@intel.com>
Tested-by: Aaron Brown <aaron.f.brown@intel.com>
Signed-off-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com>
2017-04-30 05:15:08 -07:00