Commit Graph

707990 Commits

Author SHA1 Message Date
Jakub Kicinski a2a7d57010 bpf: write back the verifier log buffer as it gets filled
Verifier log buffer can be quite large (up to 16MB currently).
As Eric Dumazet points out if we allow multiple verification
requests to proceed simultaneously, malicious user may use the
verifier as a way of allocating large amounts of unswappable
memory to OOM the host.

Switch to a strategy of allocating a smaller buffer (1024B)
and writing it out into the user buffer after every print.

While at it remove the old BUG_ON().

This is in preparation of the global verifier lock removal.

Signed-off-by: Jakub Kicinski <jakub.kicinski@netronome.com>
Reviewed-by: Simon Horman <simon.horman@netronome.com>
Acked-by: Alexei Starovoitov <ast@kernel.org>
Acked-by: Daniel Borkmann <daniel@iogearbox.net>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-10-10 12:30:16 -07:00
Jakub Kicinski d66f2b91f9 bpf: don't rely on the verifier lock for metadata_dst allocation
bpf_skb_set_tunnel_*() functions require allocation of per-cpu
metadata_dst.  The allocation happens upon verification of the
first program using those helpers.  In preparation for removing
the verifier lock, use cmpxchg() to make sure we only allocate
the metadata_dsts once.

Signed-off-by: Jakub Kicinski <jakub.kicinski@netronome.com>
Reviewed-by: Simon Horman <simon.horman@netronome.com>
Acked-by: Alexei Starovoitov <ast@kernel.org>
Acked-by: Daniel Borkmann <daniel@iogearbox.net>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-10-10 12:30:16 -07:00
Jakub Kicinski c9c35995bc tools: bpftool: use the kernel's instruction printer
Compile the instruction printer from kernel/bpf and use it
for disassembling "translated" eBPF code.

Signed-off-by: Jakub Kicinski <jakub.kicinski@netronome.com>
Reviewed-by: Simon Horman <simon.horman@netronome.com>
Acked-by: Alexei Starovoitov <ast@kernel.org>
Acked-by: Daniel Borkmann <daniel@iogearbox.net>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-10-10 12:30:16 -07:00
Jakub Kicinski f4ac7e0b5c bpf: move instruction printing into a separate file
Separate the instruction printing into a standalone source file.
This way sneaky code from tools/ can compile it in directly.

Signed-off-by: Jakub Kicinski <jakub.kicinski@netronome.com>
Reviewed-by: Simon Horman <simon.horman@netronome.com>
Acked-by: Alexei Starovoitov <ast@kernel.org>
Acked-by: Daniel Borkmann <daniel@iogearbox.net>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-10-10 12:30:16 -07:00
Jakub Kicinski 61bd5218ee bpf: move global verifier log into verifier environment
The biggest piece of global state protected by the verifier lock
is the verifier_log.  Move that log to struct bpf_verifier_env.
struct bpf_verifier_env has to be passed now to all invocations
of verbose().

Signed-off-by: Jakub Kicinski <jakub.kicinski@netronome.com>
Reviewed-by: Simon Horman <simon.horman@netronome.com>
Acked-by: Alexei Starovoitov <ast@kernel.org>
Acked-by: Daniel Borkmann <daniel@iogearbox.net>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-10-10 12:30:16 -07:00
Jakub Kicinski e7bf8249e8 bpf: encapsulate verifier log state into a structure
Put the loose log_* variables into a structure.  This will make
it simpler to remove the global verifier state in following patches.

Signed-off-by: Jakub Kicinski <jakub.kicinski@netronome.com>
Reviewed-by: Simon Horman <simon.horman@netronome.com>
Acked-by: Alexei Starovoitov <ast@kernel.org>
Acked-by: Daniel Borkmann <daniel@iogearbox.net>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-10-10 12:30:16 -07:00
Jakub Kicinski a99ca6dbf4 selftests/bpf: add a test for verifier logs
Add a test for verifier log handling.  Check bad attr combinations
but focus on cases when log is truncated.

Signed-off-by: Jakub Kicinski <jakub.kicinski@netronome.com>
Reviewed-by: Simon Horman <simon.horman@netronome.com>
Acked-by: Alexei Starovoitov <ast@kernel.org>
Acked-by: Daniel Borkmann <daniel@iogearbox.net>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-10-10 12:30:16 -07:00
Colin Ian King 442d713baa ipv6: fix incorrect bitwise operator used on rt6i_flags
The use of the | operator always leads to true which looks rather
suspect to me. Fix this by using & instead to just check the
RTF_CACHE entry bit.

Detected by CoverityScan, CID#1457734, #1457747 ("Wrong operator used")

Fixes: 35732d01fe ("ipv6: introduce a hash table to store dst cache")
Signed-off-by: Colin Ian King <colin.king@canonical.com>
Acked-by: Wei Wang <weiwan@google.com>
Acked-by: Martin KaFai Lau <kafai@fb.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-10-10 12:24:15 -07:00
Colin Ian King b2427e6717 ipv6: fix dereference of rt6_ex before null check error
Currently rt6_ex is being dereferenced before it is null checked
hence there is a possible null dereference bug. Fix this by only
dereferencing rt6_ex after it has been null checked.

Detected by CoverityScan, CID#1457749 ("Dereference before null check")

Fixes: 81eb8447da ("ipv6: take care of rt6_stats")
Signed-off-by: Colin Ian King <colin.king@canonical.com>
Reviewed-by: Eric Dumazet <edumazet@google.com>
Acked-by: Martin KaFai Lau <kafai@fb.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-10-10 10:54:17 -07:00
Christophe JAILLET 18eb86362a igb: check memory allocation failure
Check memory allocation failures and return -ENOMEM in such cases, as
already done for other memory allocations in this function.

This avoids NULL pointers dereference.

Signed-off-by: Christophe JAILLET <christophe.jaillet@wanadoo.fr>
Tested-by: Aaron Brown <aaron.f.brown@intel.com
Acked-by: PJ Waskiewicz <peter.waskiewicz.jr@intel.com>
Signed-off-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com>
2017-10-10 09:01:11 -07:00
Florian Fainelli 377b62736c e1000e: Be drop monitor friendly
e1000e_put_txbuf() can be called from normal reclamation path as well as
when a DMA mapping failure, so we need to differentiate these two cases
when freeing SKBs to be drop monitor friendly. e1000e_tx_hwtstamp_work()
and e1000_remove() are processing TX timestamped SKBs and those should
not be accounted as drops either.

Signed-off-by: Florian Fainelli <f.fainelli@gmail.com>
Tested-by: Aaron Brown <aaron.f.brown@intel.com>
Signed-off-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com>
2017-10-10 09:00:56 -07:00
Willem de Bruijn 48072ae1ec e1000e: apply burst mode settings only on default
Devices that support FLAG2_DMA_BURST have different default values
for RDTR and RADV. Apply burst mode default settings only when no
explicit value was passed at module load.

The RDTR default is zero. If the module is loaded for low latency
operation with RxIntDelay=0, do not override this value with a burst
default of 32.

Move the decision to apply burst values earlier, where explicitly
initialized module variables can be distinguished from defaults.

Signed-off-by: Willem de Bruijn <willemb@google.com>
Acked-by: Alexander Duyck <alexander.h.duyck@intel.com>
Tested-by: Aaron Brown <aaron.f.brown@intel.com>
Signed-off-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com>
2017-10-10 09:00:48 -07:00
Sasha Neftin b10effb92e e1000e: fix buffer overrun while the I219 is processing DMA transactions
Intel® 100/200 Series Chipset platforms reduced the round-trip
latency for the LAN Controller DMA accesses, causing in some high
performance cases a buffer overrun while the I219 LAN Connected
Device is processing the DMA transactions. I219LM and I219V devices
can fall into unrecovered Tx hang under very stressfully UDP traffic
and multiple reconnection of Ethernet cable. This Tx hang of the LAN
Controller is only recovered if the system is rebooted. Slightly slow
down DMA access by reducing the number of outstanding requests.
This workaround could have an impact on TCP traffic performance
on the platform. Disabling TSO eliminates performance loss for TCP
traffic without a noticeable impact on CPU performance.

Please, refer to I218/I219 specification update:
https://www.intel.com/content/www/us/en/embedded/products/networking/
ethernet-connection-i218-family-documentation.html

Signed-off-by: Sasha Neftin <sasha.neftin@intel.com>
Reviewed-by: Dima Ruinskiy <dima.ruinskiy@intel.com>
Reviewed-by: Raanan Avargil <raanan.avargil@intel.com>
Tested-by: Aaron Brown <aaron.f.brown@intel.com>
Signed-off-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com>
2017-10-10 09:00:38 -07:00
Benjamin Poirier 4aea7a5c5e e1000e: Avoid receiver overrun interrupt bursts
When e1000e_poll() is not fast enough to keep up with incoming traffic, the
adapter (when operating in msix mode) raises the Other interrupt to signal
Receiver Overrun.

This is a double problem because 1) at the moment e1000_msix_other()
assumes that it is only called in case of Link Status Change and 2) if the
condition persists, the interrupt is repeatedly raised again in quick
succession.

Ideally we would configure the Other interrupt to not be raised in case of
receiver overrun but this doesn't seem possible on this adapter. Instead,
we handle the first part of the problem by reverting to the practice of
reading ICR in the other interrupt handler, like before commit 16ecba59bc
("e1000e: Do not read ICR in Other interrupt"). Thanks to commit
0a8047ac68 ("e1000e: Fix msi-x interrupt automask") which cleared IAME
from CTRL_EXT, reading ICR doesn't interfere with RxQ0, TxQ0 interrupts
anymore. We handle the second part of the problem by not re-enabling the
Other interrupt right away when there is overrun. Instead, we wait until
traffic subsides, napi polling mode is exited and interrupts are
re-enabled.

Reported-by: Lennart Sorensen <lsorense@csclub.uwaterloo.ca>
Fixes: 16ecba59bc ("e1000e: Do not read ICR in Other interrupt")
Signed-off-by: Benjamin Poirier <bpoirier@suse.com>
Tested-by: Aaron Brown <aaron.f.brown@intel.com>
Signed-off-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com>
2017-10-10 08:59:22 -07:00
Benjamin Poirier 19110cfbb3 e1000e: Separate signaling for link check/link up
Lennart reported the following race condition:

\ e1000_watchdog_task
    \ e1000e_has_link
        \ hw->mac.ops.check_for_link() === e1000e_check_for_copper_link
            /* link is up */
            mac->get_link_status = false;

                            /* interrupt */
                            \ e1000_msix_other
                                hw->mac.get_link_status = true;

        link_active = !hw->mac.get_link_status
        /* link_active is false, wrongly */

This problem arises because the single flag get_link_status is used to
signal two different states: link status needs checking and link status is
down.

Avoid the problem by using the return value of .check_for_link to signal
the link status to e1000e_has_link().

Reported-by: Lennart Sorensen <lsorense@csclub.uwaterloo.ca>
Signed-off-by: Benjamin Poirier <bpoirier@suse.com>
Tested-by: Aaron Brown <aaron.f.brown@intel.com>
Signed-off-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com>
2017-10-10 08:35:01 -07:00
Benjamin Poirier d3509f8bc7 e1000e: Fix return value test
All the helpers return -E1000_ERR_PHY.

Signed-off-by: Benjamin Poirier <bpoirier@suse.com>
Tested-by: Aaron Brown <aaron.f.brown@intel.com>
Signed-off-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com>
2017-10-10 08:33:24 -07:00
Benjamin Poirier 65a29da1f5 e1000e: Fix wrong comment related to link detection
Reading e1000e_check_for_copper_link() shows that get_link_status is set to
false after link has been detected. Therefore, it stays TRUE until then.

Signed-off-by: Benjamin Poirier <bpoirier@suse.com>
Tested-by: Aaron Brown <aaron.f.brown@intel.com>
Signed-off-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com>
2017-10-10 08:27:07 -07:00
Benjamin Poirier c4c40e51f9 e1000e: Fix error path in link detection
In case of error from e1e_rphy(), the loop will exit early and "success"
will be set to true erroneously.

Signed-off-by: Benjamin Poirier <bpoirier@suse.com>
Tested-by: Aaron Brown <aaron.f.brown@intel.com>
Signed-off-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com>
2017-10-10 08:17:00 -07:00
Bernd Edlinger 812b5ca7d3 Add a driver for Renesas uPD60620 and uPD60620A PHYs
Signed-off-by: Bernd Edlinger <bernd.edlinger@hotmail.de>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-10-09 20:49:36 -07:00
Willem de Bruijn 1e6f74536d vhost_net: do not stall on zerocopy depletion
Vhost-net has a hard limit on the number of zerocopy skbs in flight.
When reached, transmission stalls. Stalls cause latency, as well as
head-of-line blocking of other flows that do not use zerocopy.

Instead of stalling, revert to copy-based transmission.

Tested by sending two udp flows from guest to host, one with payload
of VHOST_GOODCOPY_LEN, the other too small for zerocopy (1B). The
large flow is redirected to a netem instance with 1MBps rate limit
and deep 1000 entry queue.

  modprobe ifb
  ip link set dev ifb0 up
  tc qdisc add dev ifb0 root netem limit 1000 rate 1MBit

  tc qdisc add dev tap0 ingress
  tc filter add dev tap0 parent ffff: protocol ip \
      u32 match ip dport 8000 0xffff \
      action mirred egress redirect dev ifb0

Before the delay, both flows process around 80K pps. With the delay,
before this patch, both process around 400. After this patch, the
large flow is still rate limited, while the small reverts to its
original rate. See also discussion in the first link, below.

Without rate limiting, {1, 10, 100}x TCP_STREAM tests continued to
send at 100% zerocopy.

The limit in vhost_exceeds_maxpend must be carefully chosen. With
vq->num >> 1, the flows remain correlated. This value happens to
correspond to VHOST_MAX_PENDING for vq->num == 256. Allow smaller
fractions and ensure correctness also for much smaller values of
vq->num, by testing the min() of both explicitly. See also the
discussion in the second link below.

Changes
  v1 -> v2
    - replaced min with typed min_t
    - avoid unnecessary whitespace change

Link:http://lkml.kernel.org/r/CAF=yD-+Wk9sc9dXMUq1+x_hh=3ThTXa6BnZkygP3tgVpjbp93g@mail.gmail.com
Link:http://lkml.kernel.org/r/20170819064129.27272-1-den@klaipeden.com
Signed-off-by: Willem de Bruijn <willemb@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-10-09 20:46:39 -07:00
William Tu ceaa001a17 openvswitch: Add erspan tunnel support.
Add erspan netlink interface for OVS.

Signed-off-by: William Tu <u9012063@gmail.com>
Cc: Pravin B Shelar <pshelar@ovn.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-10-09 20:45:50 -07:00
Eric Biggers cf4c950b87 once: switch to new jump label API
Switch the DO_ONCE() macro from the deprecated jump label API to the new
one.  The new one is more readable, and for DO_ONCE() it also makes the
generated code more icache-friendly: now the one-time initialization
code is placed out-of-line at the jump target, rather than at the inline
fallthrough case.

Acked-by: Hannes Frederic Sowa <hannes@stressinduktion.org>
Signed-off-by: Eric Biggers <ebiggers@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-10-09 20:26:23 -07:00
David S. Miller d93fa2ba64 Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net 2017-10-09 20:11:09 -07:00
Wei Wang d0e60206be ipv6: use rcu_dereference_bh() in ipv6_route_seq_next()
This patch replaces rcu_deference() with rcu_dereference_bh() in
ipv6_route_seq_next() to avoid the following warning:

[   19.431685] WARNING: suspicious RCU usage
[   19.433451] 4.14.0-rc3-00914-g66f5d6c #118 Not tainted
[   19.435509] -----------------------------
[   19.437267] net/ipv6/ip6_fib.c:2259 suspicious
rcu_dereference_check() usage!
[   19.440790]
[   19.440790] other info that might help us debug this:
[   19.440790]
[   19.444734]
[   19.444734] rcu_scheduler_active = 2, debug_locks = 1
[   19.447757] 2 locks held by odhcpd/3720:
[   19.449480]  #0:  (&p->lock){+.+.}, at: [<ffffffffb1231f7d>]
seq_read+0x3c/0x333
[   19.452720]  #1:  (rcu_read_lock_bh){....}, at: [<ffffffffb1d2b984>]
ipv6_route_seq_start+0x5/0xfd
[   19.456323]
[   19.456323] stack backtrace:
[   19.458812] CPU: 0 PID: 3720 Comm: odhcpd Not tainted
4.14.0-rc3-00914-g66f5d6c #118
[   19.462042] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996),
BIOS 1.10.2-1 04/01/2014
[   19.465414] Call Trace:
[   19.466788]  dump_stack+0x86/0xc0
[   19.468358]  lockdep_rcu_suspicious+0xea/0xf3
[   19.470183]  ipv6_route_seq_next+0x71/0x164
[   19.471963]  seq_read+0x244/0x333
[   19.473522]  proc_reg_read+0x48/0x67
[   19.475152]  ? proc_reg_write+0x67/0x67
[   19.476862]  __vfs_read+0x26/0x10b
[   19.478463]  ? __might_fault+0x37/0x84
[   19.480148]  vfs_read+0xba/0x146
[   19.481690]  SyS_read+0x51/0x8e
[   19.483197]  do_int80_syscall_32+0x66/0x15a
[   19.484969]  entry_INT80_compat+0x32/0x50
[   19.486707] RIP: 0023:0xf7f0be8e
[   19.488244] RSP: 002b:00000000ffa75d04 EFLAGS: 00000246 ORIG_RAX:
0000000000000003
[   19.491431] RAX: ffffffffffffffda RBX: 0000000000000009 RCX:
0000000008056068
[   19.493886] RDX: 0000000000001000 RSI: 0000000008056008 RDI:
0000000000001000
[   19.496331] RBP: 00000000000001ff R08: 0000000000000000 R09:
0000000000000000
[   19.498768] R10: 0000000000000000 R11: 0000000000000000 R12:
0000000000000000
[   19.501217] R13: 0000000000000000 R14: 0000000000000000 R15:
0000000000000000

Fixes: 66f5d6ce53 ("ipv6: replace rwlock with rcu and spinlock in fib6_table")
Reported-by: Xiaolong Ye <xiaolong.ye@intel.com>
Signed-off-by: Wei Wang <weiwan@google.com>
Acked-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-10-09 19:59:42 -07:00
Linus Torvalds 529a86e063 Merge branch 'ppc-bundle' (bundle from Michael Ellerman)
Merge powerpc transactional memory fixes from Michael Ellerman:
 "I figured I'd still send you the commits using a bundle to make sure
  it works in case I need to do it again in future"

This fixes transactional memory state restore for powerpc.

* bundle'd patches from Michael Ellerman:
  powerpc/tm: Fix illegal TM state in signal handler
  powerpc/64s: Use emergency stack for kernel TM Bad Thing program checks
2017-10-09 19:08:32 -07:00
David S. Miller 9f7be893ab Merge branch '40GbE' of git://git.kernel.org/pub/scm/linux/kernel/git/jkirsher/next-queue
Jeff Kirsher says:

====================
40GbE Intel Wired LAN Driver Updates 2017-10-09

This series contains updates to i40e and i40evf only.

Jake fixes missed flag conversion from u64 to u32.  Fixes a deafult ITR
value issue where the driver defaults to an ITR value of half the
expected value (in terms of minimum microseconds between interrupts).  So
fix this by changing the default values to be calculated using the
ITR_REG_TO_USEC() macro which indicates that we are converting from the
register units into microseconds. Updates the drivers to bump the tail in
increments of 8 and double the number of descriptors we will bundle into
one tail bump when receiving.  With the recent kernel support for
enabling XPS and QoS at the same time, we no longer need to worry about
the number of traffic classes when enabling XPS.

Lihong converts the use of hash_for_each() to hash_for_each_safe() to
safely remove a hash entry.  Adds a check for the return value for
find_first_bit() in the case that it returns the size passed to search.

Alan fixes a bug in which filters are erroneously removed if they are
removed and then added again.  So make sure that when adding a filter, if
we find it already existed in our list, make sure it is not marked to be
removed.

Jayaprakash adds the retrying of PHY reads when the I2C is busy for a
maximum period of 500ms.

Rami fixes code comment typo.

Stefano Brivio simplifies the code by removing the use of a local
return code variable and simply return the results of the read function.
====================

Signed-off-by: David S. Miller <davem@davemloft.net>
2017-10-09 18:12:03 -07:00
David S. Miller 0349a86c85 Merge branch '10GbE' of git://git.kernel.org/pub/scm/linux/kernel/git/jkirsher/next-queue
Jeff Kirsher says:

====================
10GbE Intel Wired LAN Driver Updates 2017-10-09

This series contains updates to ixgbe only.

Emil fixes an issue where the semaphore bits could be stuck after a reset
or a crash, by adding the clearing of software resource bits in the
software/firmware synchronization register.  Added error checks when we
attempt to identify and initialize the PHY to prevent a crash.  Fixed a
few issues in the logic of ixgbe_clean_test_rings() which was exposed by
a previous commit that was causing a crash in ethtool diagnostics.

Bhumika Goyal fixes a couple of instances which were overlooked when we
made ixgbe_mac_operations constant.

Shannon Nelson fixes an issue to restore normal operations after the
last MACVLAN offload is removed, otherwise we get stuck in a single queue
operations.

The infamous Jesper Dangaard Brouer adds a counter which counts the
number of times the recycle fails and the real page allocator is invoked.

Alex updates the adaptive ITR algorithm to better support the needs of the
network.  This attempt to make it so that our ITR algorithm will try to
prevent either starving a socket buffer for memory in the case of
transmit, or overrunning an receive socket buffer on receive.  We should
function better with new features like XDP which can handle small packets
at high rates without needing to lock us into NAPI polling mode.
====================

Signed-off-by: David S. Miller <davem@davemloft.net>
2017-10-09 16:38:52 -07:00
Linus Torvalds ff33952e4d Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net
Pull networking fixes from David Miller:

 1) Fix object leak on IPSEC offload failure, from Steffen Klassert.

 2) Fix range checks in ipset address range addition operations, from
    Jozsef Kadlecsik.

 3) Fix pernet ops unregistration order in ipset, from Florian Westphal.

 4) Add missing netlink attribute policy for nl80211 packet pattern
    attrs, from Peng Xu.

 5) Fix PPP device destruction race, from Guillaume Nault.

 6) Write marks get lost when BPF verifier processes R1=R2 register
    assignments, causing incorrect liveness information and less state
    pruning. Fix from Alexei Starovoitov.

 7) Fix blockhole routes so that they are marked dead and therefore not
    cached in sockets, otherwise IPSEC stops working. From Steffen
    Klassert.

 8) Fix broadcast handling of UDP socket early demux, from Paolo Abeni.

* git://git.kernel.org/pub/scm/linux/kernel/git/davem/net: (37 commits)
  cdc_ether: flag the u-blox TOBY-L2 and SARA-U2 as wwan
  net: thunderx: mark expected switch fall-throughs in nicvf_main()
  udp: fix bcast packet reception
  netlink: do not set cb_running if dump's start() errs
  ipv4: Fix traffic triggered IPsec connections.
  ipv6: Fix traffic triggered IPsec connections.
  ixgbe: incorrect XDP ring accounting in ethtool tx_frame param
  net: ixgbe: Use new PCI_DEV_FLAGS_NO_RELAXED_ORDERING flag
  Revert commit 1a8b6d76dc ("net:add one common config...")
  ixgbe: fix masking of bits read from IXGBE_VXLANCTRL register
  ixgbe: Return error when getting PHY address if PHY access is not supported
  netfilter: xt_bpf: Fix XT_BPF_MODE_FD_PINNED mode of 'xt_bpf_info_v1'
  netfilter: SYNPROXY: skip non-tcp packet in {ipv4, ipv6}_synproxy_hook
  tipc: Unclone message at secondary destination lookup
  tipc: correct initialization of skb list
  gso: fix payload length when gso_size is zero
  mlxsw: spectrum_router: Avoid expensive lookup during route removal
  bpf: fix liveness marking
  doc: Fix typo "8023.ad" in bonding documentation
  ipv6: fix net.ipv6.conf.all.accept_dad behaviour for real
  ...
2017-10-09 16:25:00 -07:00
Aleksander Morgado fdfbad3256 cdc_ether: flag the u-blox TOBY-L2 and SARA-U2 as wwan
The u-blox TOBY-L2 is a LTE Cat 4 module with HSPA+ and 2G fallback.
This module allows switching to different USB profiles with the
'AT+UUSBCONF' command, and provides a ECM network interface when the
'AT+UUSBCONF=2' profile is selected.

The u-blox SARA-U2 is a HSPA module with 2G fallback. The default USB
configuration includes a ECM network interface.

Both these modules are controlled via AT commands through one of the
TTYs exposed. Connecting these modules may be done just by activating
the desired PDP context with 'AT+CGACT=1,<cid>' and then running DHCP
on the ECM interface.

Signed-off-by: Aleksander Morgado <aleksander@aleksander.es>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-10-09 16:03:32 -07:00
Stefano Brivio 2c4d36b708 i40e: Avoid some useless variables and initializers in NVM functions
Fixes: 09f79fd49d ("i40e: avoid NVM acquire deadlock during NVM update")
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
Tested-by: Andrew Bowers <andrewx.bowers@intel.com>
Signed-off-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com>
2017-10-09 14:42:17 -07:00
Rami Rosen 3d7d7a86ec i40e: fix a typo
This patch fixes a typo in i40e_vsi_alloc_arrays() documentation.
The first parameter name should be "vsi" instead of "type".

Signed-off-by: Rami Rosen <rami.rosen@intel.com>
Tested-by: Andrew Bowers <andrewx.bowers@intel.com>
Signed-off-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com>
2017-10-09 14:39:49 -07:00
Lihong Yang 9bcc07f065 i40e: use a local variable instead of calculating multiple times
The computed result of I40E_MAX_VSI_QP * I40E_VIRTCHNL_SUPPORTED_QTYPES
is used more than three times in function i40e_config_irq_link_list.
Simply declare a local variable to store it to improve readability.

Signed-off-by: Lihong Yang <lihong.yang@intel.com>
Tested-by: Andrew Bowers <andrewx.bowers@intel.com>
Signed-off-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com>
2017-10-09 14:38:04 -07:00
Jayaprakash Shanmugam 4988410f8d i40e: Retry AQC GetPhyAbilities to overcome I2CRead hangs
- When the I2C is busy, the PHY reads are delayed.  The firmware will
  return EGAIN in these cases with an expectation that the SW will
  trigger the reads again
- This patch retries the operation for a maximum period of 500ms

Signed-off-by: Jayaprakash Shanmugam <jayaprakash.shanmugam@intel.com>
Tested-by: Andrew Bowers <andrewx.bowers@intel.com>
Signed-off-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com>
2017-10-09 14:32:18 -07:00
Lihong Yang b861fb762a i40e: add check for return from find_first_bit call
The find_first_bit function will return the size passed to search
if the first set bit is not found. This patch adds the check in case
that happens as the return value would be used as the index in an array
and that would have caused the out-of-bounds access.

Detected by CoverityScan, CID 1295969 Out-of-bounds access

Signed-off-by: Lihong Yang <lihong.yang@intel.com>
Tested-by: Andrew Bowers <andrewx.bowers@intel.com>
Signed-off-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com>
2017-10-09 14:30:41 -07:00
Jacob Keller 6f853d4f8e i40e: allow XPS with QoS enabled
Recently, the kernel gained support for enabling XPS and QoS at the
same time. Thus, we no longer need to worry about the number of
traffic classes when enabling XPS.

Signed-off-by: Jacob Keller <jacob.e.keller@intel.com>
Tested-by: Andrew Bowers <andrewx.bowers@intel.com>
Signed-off-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com>
2017-10-09 14:28:56 -07:00
Jacob Keller 95bc2fb4c6 i40e/i40evf: bundle more descriptors when allocating buffers
Double the number of descriptors we'll bundle into one tail bump when
receiving. Empirical testing has shown that we reduce CPU utilization
and don't appear to reduce throughput or packet rate. 32 seems to be the
sweet spot, as it's half the default polling budget, so we'd essentially
reduce from 4 tail writes when polling down to 2. Increasing this up to
64 appears to have negative impacts as it may become possible that we
don't bump the tail each time we get polled, which could cause a long
delay between returning descriptors to the hardware.

Signed-off-by: Jacob Keller <jacob.e.keller@intel.com>
Tested-by: Andrew Bowers <andrewx.bowers@intel.com>
Signed-off-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com>
2017-10-09 14:27:42 -07:00
Jacob Keller 11f29003d6 i40e/i40evf: bump tail only in multiples of 8
Hardware only fetches descriptors on cachelines of 8, essentially
ignoring the lower 3 bits of the tail register. Thus, it is pointless to
bump tail by an unaligned access as the hardware will ignore some of the
new descriptors we allocated. Thus, it's ideal if we can ensure tail
writes are always aligned to 8.

At first, it seems like we'd already do this, since we allocate
descriptors in batches which are a multiple of 8. Since we'd always
increment by a multiple of 8, it seems like the value should always be
aligned.

However, this ignores allocation failures. If we fail to allocate
a buffer, our tail register will become unaligned. Once it has become
unaligned it will essentially be stuck unaligned until a buffer
allocation happens to fail at the exact amount necessary to re-align it.

We can do better, by simply rounding down the number of buffers we're
about to allocate (cleaned_count) such that "next_to_clean
+ cleaned_count" is rounded to the nearest multiple of 8.

We do this by calculating how far off that value is and subtracting it
from the cleaned_count. This essentially defers allocation of buffers if
they're going to be ignored by hardware anyways, and re-aligns our
next_to_use and tail values after a failure to allocate a descriptor.

This calculation ensures that we always align the tail writes in a way
the hardware expects and don't unnecessarily allocate buffers which
won't be fetched immediately.

Signed-off-by: Jacob Keller <jacob.e.keller@intel.com>
Tested-by: Andrew Bowers <andrewx.bowers@intel.com>
Signed-off-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com>
2017-10-09 14:26:29 -07:00
Jacob Keller 7362be9eee i40e: reduce lrxqthresh from 2 to 1
The lrxq thresh value tells hardware to immediately interrupt when there
are fewer than N*64 packets left in the ring.

Counter intuitively, empirical testing has shown that decreasing this
value from 2 to 1, and thus changing from an immediate interrupt at
fewer than 128 descriptors down to 64 descriptors causes a small
increase in the maximum total packets per second we can receive. This
increase occurs even when we're polling with interrupts masked, as the
hardware must still handle interrupts internally even if we've disabled
them in software.

Also reduce the value for any VFs we allocate.

Signed-off-by: Jacob Keller <jacob.e.keller@intel.com>
Tested-by: Andrew Bowers <andrewx.bowers@intel.com>
Signed-off-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com>
2017-10-09 14:24:46 -07:00
Jacob Keller dbadbbe235 i40e/i40evf: always set the CLEARPBA flag when re-enabling interrupts
In the past we changed driver behavior to not clear the PBA when
re-enabling interrupts. This change was motivated by the flawed belief
that clearing the PBA would cause a lost interrupt if a receive
interrupt occurred while interrupts were disabled.

According to empirical testing this isn't the case. Additionally, the
data sheet specifically says that we should set the CLEARPBA bit when
re-enabling interrupts in a polling setup.

This reverts commit 40d72a5098 ("i40e/i40evf: don't lose interrupts")

Signed-off-by: Jacob Keller <jacob.e.keller@intel.com>
Tested-by: Andrew Bowers <andrewx.bowers@intel.com>
Signed-off-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com>
2017-10-09 14:21:26 -07:00
Jacob Keller 4270255929 i40e/i40evf: fix incorrect default ITR values on driver load
The ITR register expects to be programmed in units of 2 microseconds.
Because of this, all of the drivers I40E_ITR_* constants are in terms of
this 2 microsecond register.

Unfortunately, the rx_itr_default value is expected to be programmed in
microseconds.

Effectively the driver defaults to an ITR value of half the expected
value (in terms of minimum microseconds between interrupts).

Fix this by changing the default values to be calculated using
ITR_REG_TO_USEC macro which indicates that we're converting from the
register units into microseconds.

Signed-off-by: Jacob Keller <jacob.e.keller@intel.com>
Tested-by: Andrew Bowers <andrewx.bowers@intel.com>
Signed-off-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com>
2017-10-09 14:19:46 -07:00
Alan Brady c766b9af9a i40evf: fix mac filter removal timing issue
Due to the asynchronous nature in which mac filters are added and
deleted, there exists a bug in which filters are erroneously removed if
removed then added again quickly.

The events are as such:
    - filter marked for removal
    - same filter is re-added before watchdog that cleans up filters
    - we skip re-adding the filter because we have it already in the
list
    - watchdog filter cleanup kicks off and filter is removed

So when we were re-adding the same filter, it didn't actually get added
because it already existed in the list, but was marked for removal and
had yet to actually be removed.

This patch fixes the issue by making sure that when adding a filter, if
we find it already existing in our list, make sure it is not marked to
be removed.

Signed-off-by: Alan Brady <alan.brady@intel.com>
Tested-by: Andrew Bowers <andrewx.bowers@intel.com>
Signed-off-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com>
2017-10-09 14:17:03 -07:00
Lihong Yang 784548c40d i40e: use the safe hash table iterator when deleting mac filters
This patch replaces hash_for_each function with hash_for_each_safe
when calling  __i40e_del_filter. The hash_for_each_safe function is
the right one to use when iterating over a hash table to safely remove
a hash entry. Otherwise, incorrect values may be read from freed memory.

Detected by CoverityScan, CID 1402048 Read from pointer after free

Signed-off-by: Lihong Yang <lihong.yang@intel.com>
Tested-by: Andrew Bowers <andrewx.bowers@intel.com>
Signed-off-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com>
2017-10-09 14:12:54 -07:00
Jacob Keller b48be9978e i40e: fix flags declaration
Since we don't yet have more than 32 flags, we'll use a u32 for both the
hw_features and flag field. Should we gain more flags in the future, we
may need to convert to a u64 or separate flags out into two fields.

This was overlooked in the previous commit 2781de2134c4 ("i40e/i40evf:
organize and re-number feature flags"), where the feature flag was not
converted form u64 to u32.

Signed-off-by: Jacob Keller <jacob.e.keller@intel.com>
Reviewed-by: Mitch Williams <mitch.a.williams@intel.com>
Tested-by: Andrew Bowers <andrewx.bowers@intel.com>
Signed-off-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com>
2017-10-09 12:37:28 -07:00
Linus Torvalds 68ebe3cbe7 NFS client bugfixes for Linux 4.14
Hightlights include:
 
 stable fixes:
 - nfs/filelayout: fix oops when freeing filelayout segment
 - NFS: Fix uninitialized rpc_wait_queue
 
 bugfixes:
 - NFSv4/pnfs: Fix an infinite layoutget loop
 - nfs: RPC_MAX_AUTH_SIZE is in bytes
 -----BEGIN PGP SIGNATURE-----
 
 iQIcBAABAgAGBQJZ27KKAAoJEGcL54qWCgDybIIP/Ai9g9AQ52B7Id0VhcB40fZM
 Bn8I8nYbSzkOivL+w5DHW5eTg0spJ2+iEBjOucPkDWuK0hmeu7kDaIIfauaBTmcM
 dg2eQMVEaU8PnB0Bf9xMF1hR4Jf3laPVaW3Dnpl01+eJu0feQVf3EDJOzwDll5e6
 GDt8wuKXjfXZmHEVuvMvD/YSbzlLgKIyp62VRWXWMM73VUHL9YNc0VDaX6LTHzkM
 fYK+jWEgoq93/xuC2cP98+PyoziL82AYl7em0mcHTeffHm6FlB2KXrQq6dsW3UqI
 QMHQdqn6j+CWAv/PyJP+AifT/pTlvnor9ia4TVXlleWwrMSllUDCEttWi0jaBJxv
 OhaQgaQQEIGb6TLo7qbmHIX/VXxC1UMfjkx1Eqr4vu/Ps8y9t1Wy6V+pd86+QbzG
 qo/+jtFVHTMWIU9JBlowKoAJkeyeMfhL4cfSqcgdsSj9JJ2O/F/a/BFNh3bgui69
 TeSFLMoS0FCw9T2h2QeMCSwXvETmFDZR2pUXdsoULxYH0jZ4oPr7Fr9GflsSITwA
 oCITgkpt1oOoB5V/PrLPWfjq0JzcA69VAgmD1WJn5eNz1AvQErYYNU+VDf51T4rm
 zEAxk26WB7+KBBYMEyRCBeatnAAx0a28MFyYI7ittwovOkXIXOv/dw2bFZbSNyoc
 vpe4ZMGP442znvyy5Myh
 =QOH4
 -----END PGP SIGNATURE-----

Merge tag 'nfs-for-4.14-3' of git://git.linux-nfs.org/projects/trondmy/linux-nfs

Pull NFS client bugfixes from Trond Myklebust:
 "Hightlights include:

  stable fixes:
   - nfs/filelayout: fix oops when freeing filelayout segment
   - NFS: Fix uninitialized rpc_wait_queue

  bugfixes:
   - NFSv4/pnfs: Fix an infinite layoutget loop
   - nfs: RPC_MAX_AUTH_SIZE is in bytes"

* tag 'nfs-for-4.14-3' of git://git.linux-nfs.org/projects/trondmy/linux-nfs:
  NFSv4/pnfs: Fix an infinite layoutget loop
  nfs/filelayout: fix oops when freeing filelayout segment
  sunrpc: remove redundant initialization of sock
  NFS: Fix uninitialized rpc_wait_queue
  NFS: Cleanup error handling in nfs_idmap_request_key()
  nfs: RPC_MAX_AUTH_SIZE is in bytes
2017-10-09 10:55:37 -07:00
David S. Miller 2e997d8b12 Merge branch 'ipv6-addrlabel-avoid-dirtying-ip6addrlbl_entry'
Eric Dumazet says:

====================
ipv6: addrlabel: avoid dirtying ip6addrlbl_entry

The refcount on ip6addrlbl_entry is only used to make sure ip6addrlbl_entry
does not disappear while ip6addrlbl_get() is allocating an skb.

We can instead allocate skb first, then use RCU, so that we no longer need
to refcount these structures.
====================

Acked-by: Martin KaFai Lau <kafai@fb.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-10-09 10:47:30 -07:00
Eric Dumazet 2809c0957d ipv6: addrlabel: remove refcounting
After previous patch ("ipv6: addrlabel: rework ip6addrlbl_get()")
we can remove the refcount from struct ip6addrlbl_entry,
since it is no longer elevated in p6addrlbl_get()

Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-10-09 10:47:30 -07:00
Eric Dumazet 66c77ff3a0 ipv6: addrlabel: rework ip6addrlbl_get()
If we allocate skb before the lookup, we can use RCU
without the need of ip6addrlbl_hold()

This means that the following patch can get rid of refcounting.

Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-10-09 10:47:30 -07:00
Gustavo A. R. Silva 1a2ace56ce net: thunderx: mark expected switch fall-throughs in nicvf_main()
In preparation to enabling -Wimplicit-fallthrough, mark switch cases
where we are expecting to fall through.

Cc: Sunil Goutham <sgoutham@cavium.com>
Cc: Robert Richter <rric@kernel.org>
Cc: linux-arm-kernel@lists.infradead.org
Cc: netdev@vger.kernel.org
Signed-off-by: Gustavo A. R. Silva <gustavo@embeddedor.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-10-09 10:43:03 -07:00
David S. Miller fb60bccc06 Merge git://git.kernel.org/pub/scm/linux/kernel/git/pablo/nf
Pablo Neira Ayuso says:

====================
Netfilter/IPVS fixes for net

The following patchset contains Netfilter/IPVS fixes for your net tree,
they are:

1) Fix packet drops due to incorrect ECN handling in IPVS, from Vadim
   Fedorenko.

2) Fix splat with mark restoration in xt_socket with non-full-sock,
   patch from Subash Abhinov Kasiviswanathan.

3) ipset bogusly bails out when adding IPv4 range containing more than
   2^31 addresses, from Jozsef Kadlecsik.

4) Incorrect pernet unregistration order in ipset, from Florian Westphal.

5) Races between dump and swap in ipset results in BUG_ON splats, from
   Ross Lagerwall.

6) Fix chain renames in nf_tables, from JingPiao Chen.

7) Fix race in pernet codepath with ebtables table registration, from
   Artem Savkov.

8) Memory leak in error path in set name allocation in nf_tables, patch
   from Arvind Yadav.

9) Don't dump chain counters if they are not available, this fixes a
   crash when listing the ruleset.

10) Fix out of bound memory read in strlcpy() in x_tables compat code,
    from Eric Dumazet.

11) Make sure we only process TCP packets in SYNPROXY hooks, patch from
    Lin Zhang.

12) Cannot load rules incrementally anymore after xt_bpf with pinned
    objects, added in revision 1. From Shmulik Ladkani.
====================

Signed-off-by: David S. Miller <davem@davemloft.net>
2017-10-09 10:39:52 -07:00
David S. Miller 5766cd68f6 Merge branch '10GbE' of git://git.kernel.org/pub/scm/linux/kernel/git/jkirsher/net-queue
Jeff Kirsher says:

====================
Intel Wired LAN Driver Updates 2017-10-09

This series contains updates to ixgbe and arch/Kconfig.

Mark fixes a case where PHY register access is not supported and we were
returning a PHY address, when we should have been returning -EOPNOTSUPP.

Sabrina Dubroca fixes the use of a logical "and" when it should have been
the bitwise "and" operator.

Ding Tianhong reverts the commit that added the Kconfig bool option
ARCH_WANT_RELAX_ORDER, since there is now a new flag
PCI_DEV_FLAGS_NO_RELAXED_ORDERING that has been added to indicate that
Relaxed Ordering Attributes should not be used for Transaction Layer
Packets.  Then follows up with making the needed changes to ixgbe to
use the new PCI_DEV_FLAGS_NO_RELAXED_ORDERING flag.

John Fastabend fixes an issue in the ring accounting when the transmit
ring parameters are changed via ethtool when an XDP program is attached.
====================

Signed-off-by: David S. Miller <davem@davemloft.net>
2017-10-09 10:36:25 -07:00