Commit Graph

1106 Commits

Author SHA1 Message Date
Jack Morgenstein 4cbe4dac82 net/mlx4_core: Avoid delays during VF driver device shutdown
Some Hypervisors detach VFs from VMs by instantly causing an FLR event
to be generated for a VF.

In the mlx4 case, this will cause that VF's comm channel to be disabled
before the VM has an opportunity to invoke the VF device's "shutdown"
method.

For such Hypervisors, there is a race condition between the VF's
shutdown method and its internal-error detection/reset thread.

The internal-error detection/reset thread (which runs every 5 seconds) also
detects a disabled comm channel. If the internal-error detection/reset
flow wins the race, we still get delays (while that flow tries repeatedly
to detect comm-channel recovery).

The cited commit fixed the command timeout problem when the
internal-error detection/reset flow loses the race.

This commit avoids the unneeded delays when the internal-error
detection/reset flow wins.

Fixes: d585df1c5c ("net/mlx4_core: Avoid command timeouts during VF driver device shutdown")
Signed-off-by: Jack Morgenstein <jackm@dev.mellanox.co.il>
Reported-by: Simon Xiao <sixiao@microsoft.com>
Signed-off-by: Tariq Toukan <tariqt@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-03-16 20:14:51 -07:00
Eric Dumazet 47d3a07528 net/mlx4_en: fix overflow in mlx4_en_init_timestamp()
The cited commit makes a great job of finding optimal shift/multiplier
values assuming a 10 seconds wrap around, but forgot to change the
overflow_period computation.

It overflows in cyclecounter_cyc2ns(), and the final result is 804 ms,
which is silly.

Lets simply use 5 seconds, no need to recompute this, given how it is
supposed to work.

Later, we will use a timer instead of a work queue, since the new RX
allocation schem will no longer need mlx4_en_recover_from_oom() and the
service_task firing every 250 ms.

Fixes: 31c128b66e ("net/mlx4_en: Choose time-stamping shift value according to HW frequency")
Signed-off-by: Eric Dumazet <edumazet@google.com>
Cc: Tariq Toukan <tariqt@mellanox.com>
Cc: Eugenia Emantayev <eugenia@mellanox.com>
Reviewed-by: Tariq Toukan <tariqt@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-02-26 15:39:43 -05:00
Eric Dumazet 7f0137e2ef net/mlx4_en: Use __skb_fill_page_desc()
Or we might miss the fact that a page was allocated from memory reserves.

Fixes: dceeab0e52 ("mlx4: support __GFP_MEMALLOC for rx")
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: Tariq Toukan <tariqt@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-02-23 10:57:57 -05:00
Jack Morgenstein 6ed63d845e net/mlx4_core: Use cq quota in SRIOV when creating completion EQs
When creating EQs to handle CQ completion events for the PF
or for VFs, we create enough EQE entries to handle completions
for the max number of CQs that can use that EQ.

When SRIOV is activated, the max number of CQs a VF (or the PF) can
obtain is its CQ quota (determined by the Hypervisor resource tracker).
Therefore, when creating an EQ, the number of EQE entries that the VF
should request for that EQ is the CQ quota value (and not the total
number of CQs available in the FW).

Under SRIOV, the PF, also must use its CQ quota, because
the resource tracker also controls how many CQs the PF can obtain.

Using the FW total CQs instead of the CQ quota when creating EQs resulted
wasting MTT entries, due to allocating more EQEs than were needed.

Fixes: 5a0d0a6161 ("mlx4: Structures and init/teardown for VF resource quotas")
Signed-off-by: Jack Morgenstein <jackm@dev.mellanox.co.il>
Reported-by: Dexuan Cui <decui@microsoft.com>
Signed-off-by: Tariq Toukan <tariqt@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-02-23 10:57:57 -05:00
Majd Dibbiny 95f1ba9a24 net/mlx4_core: Fix VF overwrite of module param which disables DMFS on new probed PFs
In the VF driver, module parameter mlx4_log_num_mgm_entry_size was
mistakenly overwritten -- and in a manner which overrode the
device-managed flow steering option encoded in the parameter.

log_num_mgm_entry_size is a global module parameter which
affects all ConnectX-3 PFs installed on that host.
If a VF changes log_num_mgm_entry_size, this will affect all PFs
which are probed subsequent to the change (by disabling DMFS for
those PFs).

Fixes: 3c439b5586 ("mlx4_core: Allow choosing flow steering mode")
Signed-off-by: Majd Dibbiny <majd@mellanox.com>
Reviewed-by: Jack Morgenstein <jackm@dev.mellanox.co.il>
Signed-off-by: Tariq Toukan <tariqt@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-02-23 10:57:57 -05:00
Eugenia Emantayev 745d8ae462 net/mlx4: Spoofcheck and zero MAC can't coexist
Spoofcheck can't be enabled if VF MAC is zero.
Vice versa, can't zero MAC if spoofcheck is on.

Fixes: 8f7ba3ca12 ('net/mlx4: Add set VF mac address support')
Signed-off-by: Eugenia Emantayev <eugenia@mellanox.com>
Signed-off-by: Tariq Toukan <tariqt@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-02-23 10:57:56 -05:00
Or Gerlitz 423b3aecf2 net/mlx4: Change ENOTSUPP to EOPNOTSUPP
As ENOTSUPP is specific to NFS, change the return error value to
EOPNOTSUPP in various places in the mlx4 driver.

Signed-off-by: Or Gerlitz <ogerlitz@mellanox.com>
Suggested-by: Yotam Gigi <yotamg@mellanox.com>
Reviewed-by: Matan Barak <matanb@mellanox.com>
Signed-off-by: Tariq Toukan <tariqt@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-02-23 10:57:56 -05:00
Eric Dumazet 3608b13ccc mlx4: reduce OOM risk on arches with large pages
Since mlx4 NIC are used on PowerPC with 64K pages, we need to adapt
MLX4_EN_ALLOC_PREFER_ORDER definition.

Otherwise, a fragment sitting in an out of order TCP queue can hold
0.5 Mbytes and it is a serious OOM risk.

Fixes: 51151a16a6 ("mlx4: allow order-0 memory allocations in RX path")
Signed-off-by: Eric Dumazet <edumazet@google.com>
Reviewed-by: Tariq Toukan <tariqt@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-02-20 10:31:50 -05:00
Eric Dumazet f5a5772337 mlx4: fix potential divide by 0 in mlx4_en_auto_moderation()
1) In the case where rate == priv->pkt_rate_low == priv->pkt_rate_high,
mlx4_en_auto_moderation() does a divide by zero.

2) We want to properly change the moderation parameters if rx_frames
was changed (like in ethtool -C eth0 rx-frames 16)

Signed-off-by: Eric Dumazet <edumazet@google.com>
Reviewed-by: Saeed Mahameed <saeedm@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-02-19 18:15:23 -05:00
Eric Dumazet 01f0f42534 mlx4: do not fire tasklet unless necessary
All rx and rx netdev interrupts are handled by respectively
by mlx4_en_rx_irq() and mlx4_en_tx_irq() which simply schedule a NAPI.

But mlx4_eq_int() also fires a tasklet to service all items that were
queued via mlx4_add_cq_to_tasklet(), but this handler was not called
unless user cqe was handled.

This is very confusing, as "mpstat -I SCPU ..." show huge number of
tasklet invocations.

This patch saves this overhead, by carefully firing the tasklet directly
from mlx4_add_cq_to_tasklet(), removing four atomic operations per IRQ.

Signed-off-by: Eric Dumazet <edumazet@google.com>
Cc: Tariq Toukan <tariqt@mellanox.com>
Cc: Saeed Mahameed <saeedm@mellanox.com>
Acked-by: Saeed Mahameed <saeedm@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-02-17 10:55:22 -05:00
Eric Dumazet 99f5711e7c mlx4: do not use rwlock in fast path
Using a reader-writer lock in fast path is silly, when we can
instead use RCU or a seqlock.

For mlx4 hwstamp clock, a seqlock is the way to go, removing
two atomic operations and false sharing.

Signed-off-by: Eric Dumazet <edumazet@google.com>
Cc: Tariq Toukan <tariqt@mellanox.com>
Reviewed-by: Tariq Toukan <tariqt@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-02-15 12:06:18 -05:00
David S. Miller 3efa70d78f Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net
The conflict was an interaction between a bug fix in the
netvsc driver in 'net' and an optimization of the RX path
in 'net-next'.

Signed-off-by: David S. Miller <davem@davemloft.net>
2017-02-07 16:29:30 -05:00
Benjamin Poirier bd4ce941c8 mlx4: Invoke softirqs after napi_reschedule
mlx4 may schedule napi from a workqueue. Afterwards, softirqs are not run
in a deterministic time frame and the following message may be logged:
NOHZ: local_softirq_pending 08

The problem is the same as what was described in commit ec13ee8014
("virtio_net: invoke softirqs after __napi_schedule") and this patch
applies the same fix to mlx4.

Fixes: 07841f9d94 ("net/mlx4_en: Schedule napi when RX buffers allocation fails")
Cc: Eric Dumazet <eric.dumazet@gmail.com>
Signed-off-by: Benjamin Poirier <bpoirier@suse.com>
Acked-by: Eric Dumazet <edumazet@google.com>
Reviewed-by: Tariq Toukan <tariqt@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-02-07 12:50:43 -05:00
Dan Carpenter 73cfb2a2e4 net/mlx4_en: fix a condition
There is a "||" vs "|" typo here so we test 0x1 instead of 0x6.

Fixes: 1f8176f735 ("net/mlx4_en: Check the enabling pptx/pprx flags in SET_PORT wrapper flow")
Signed-off-by: Dan Carpenter <dan.carpenter@oracle.com>
Reviewed-by: Tariq Toukan <tariqt@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-02-06 12:01:06 -05:00
Martin KaFai Lau 770f82253d mlx4: xdp_prog becomes inactive after ethtool '-L' or '-G'
After calling mlx4_en_try_alloc_resources (e.g. by changing the
number of rx-queues with ethtool -L), the existing xdp_prog becomes
inactive.

The bug is that the xdp_prog ptr has not been carried over from
the old rx-queues to the new rx-queues

Fixes: 47a38e1550 ("net/mlx4_en: add support for fast rx drop bpf program")
Cc: Brenden Blanco <bblanco@plumgrid.com>
Cc: Saeed Mahameed <saeedm@mellanox.com>
Cc: Tariq Toukan <tariqt@mellanox.com>
Signed-off-by: Martin KaFai Lau <kafai@fb.com>
Reviewed-by: Tariq Toukan <tariqt@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-02-02 21:27:05 -05:00
Martin KaFai Lau f32b20e89e mlx4: Fix memory leak after mlx4_en_update_priv()
In mlx4_en_update_priv(), dst->tx_ring[t] and dst->tx_cq[t]
are over-written by src->tx_ring[t] and src->tx_cq[t] without
first calling kfree.

One of the reproducible code paths is by doing 'ethtool -L'.

The fix is to do the kfree in mlx4_en_free_resources().

Here is the kmemleak report:
unreferenced object 0xffff880841211800 (size 2048):
  comm "ethtool", pid 3096, jiffies 4294716940 (age 528.353s)
  hex dump (first 32 bytes):
    00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00  ................
    00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00  ................
  backtrace:
    [<ffffffff81930718>] kmemleak_alloc+0x28/0x50
    [<ffffffff8120b213>] kmem_cache_alloc_trace+0x103/0x260
    [<ffffffff8170e0a8>] mlx4_en_try_alloc_resources+0x118/0x1a0
    [<ffffffff817065a9>] mlx4_en_set_ringparam+0x169/0x210
    [<ffffffff818040c5>] dev_ethtool+0xae5/0x2190
    [<ffffffff8181b898>] dev_ioctl+0x168/0x6f0
    [<ffffffff817d7a72>] sock_do_ioctl+0x42/0x50
    [<ffffffff817d819b>] sock_ioctl+0x21b/0x2d0
    [<ffffffff81247a73>] do_vfs_ioctl+0x93/0x6a0
    [<ffffffff812480f9>] SyS_ioctl+0x79/0x90
    [<ffffffff8193d7ea>] entry_SYSCALL_64_fastpath+0x18/0xad
    [<ffffffffffffffff>] 0xffffffffffffffff
unreferenced object 0xffff880841213000 (size 2048):
  comm "ethtool", pid 3096, jiffies 4294716940 (age 528.353s)
  hex dump (first 32 bytes):
    00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00  ................
    00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00  ................
  backtrace:
    [<ffffffff81930718>] kmemleak_alloc+0x28/0x50
    [<ffffffff8120b213>] kmem_cache_alloc_trace+0x103/0x260
    [<ffffffff8170e0cb>] mlx4_en_try_alloc_resources+0x13b/0x1a0
    [<ffffffff817065a9>] mlx4_en_set_ringparam+0x169/0x210
    [<ffffffff818040c5>] dev_ethtool+0xae5/0x2190
    [<ffffffff8181b898>] dev_ioctl+0x168/0x6f0
    [<ffffffff817d7a72>] sock_do_ioctl+0x42/0x50
    [<ffffffff817d819b>] sock_ioctl+0x21b/0x2d0
    [<ffffffff81247a73>] do_vfs_ioctl+0x93/0x6a0
    [<ffffffff812480f9>] SyS_ioctl+0x79/0x90
    [<ffffffff8193d7ea>] entry_SYSCALL_64_fastpath+0x18/0xad
    [<ffffffffffffffff>] 0xffffffffffffffff

(gdb) list *mlx4_en_try_alloc_resources+0x118
0xffffffff8170e0a8 is in mlx4_en_try_alloc_resources (drivers/net/ethernet/mellanox/mlx4/en_netdev.c:2145).
2140                    if (!dst->tx_ring_num[t])
2141                            continue;
2142
2143                    dst->tx_ring[t] = kzalloc(sizeof(struct mlx4_en_tx_ring *) *
2144                                              MAX_TX_RINGS, GFP_KERNEL);
2145                    if (!dst->tx_ring[t])
2146                            goto err_free_tx;
2147
2148                    dst->tx_cq[t] = kzalloc(sizeof(struct mlx4_en_cq *) *
2149                                            MAX_TX_RINGS, GFP_KERNEL);
(gdb) list *mlx4_en_try_alloc_resources+0x13b
0xffffffff8170e0cb is in mlx4_en_try_alloc_resources (drivers/net/ethernet/mellanox/mlx4/en_netdev.c:2150).
2145                    if (!dst->tx_ring[t])
2146                            goto err_free_tx;
2147
2148                    dst->tx_cq[t] = kzalloc(sizeof(struct mlx4_en_cq *) *
2149                                            MAX_TX_RINGS, GFP_KERNEL);
2150                    if (!dst->tx_cq[t]) {
2151                            kfree(dst->tx_ring[t]);
2152                            goto err_free_tx;
2153                    }
2154            }

Fixes: ec25bc04ed ("net/mlx4_en: Add resilience in low memory systems")
Cc: Eugenia Emantayev <eugenia@mellanox.com>
Cc: Saeed Mahameed <saeedm@mellanox.com>
Cc: Tariq Toukan <tariqt@mellanox.com>
Signed-off-by: Martin KaFai Lau <kafai@fb.com>
Reviewed-by: Tariq Toukan <tariqt@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-02-02 21:27:05 -05:00
David S. Miller e2160156bf Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net
All merge conflicts were simple overlapping changes.

Signed-off-by: David S. Miller <davem@davemloft.net>
2017-02-02 16:54:00 -05:00
Jack Morgenstein d585df1c5c net/mlx4_core: Avoid command timeouts during VF driver device shutdown
Some Hypervisors detach VFs from VMs by instantly causing an FLR event
to be generated for a VF.

In the mlx4 case, this will cause that VF's comm channel to be disabled
before the VM has an opportunity to invoke the VF device's "shutdown"
method.

The result is that the VF driver on the VM will experience a command
timeout during the shutdown process when the Hypervisor does not deliver
a command-completion event to the VM.

To avoid FW command timeouts on the VM when the driver's shutdown method
is invoked, we detect the absence of the VF's comm channel at the very
start of the shutdown process. If the comm-channel has already been
disabled, we cause all FW commands during the device shutdown process to
immediately return success (and thus avoid all command timeouts).

Signed-off-by: Jack Morgenstein <jackm@dev.mellanox.co.il>
Signed-off-by: Tariq Toukan <tariqt@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-01-30 15:45:27 -05:00
Shaker Daibes 1f8176f735 net/mlx4_en: Check the enabling pptx/pprx flags in SET_PORT wrapper flow
Make sure pptx/pprx mask flag is set using new fields upon set port
request. In addition, move this code into a helper function for better
code readability.

Signed-off-by: Shaker Daibes <shakerd@mellanox.com>
Signed-off-by: Tariq Toukan <tariqt@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-01-30 15:26:43 -05:00
Shaker Daibes bf1f939683 net/mlx4_en: Check the enabling mtu flag in SET_PORT wrapper flow
Make sure MTU mask flag is set using new field upon set port
request. In addition, move this code into a helper function for better
code readability.

Signed-off-by: Shaker Daibes <shakerd@mellanox.com>
Signed-off-by: Tariq Toukan <tariqt@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-01-30 15:26:43 -05:00
Shaker Daibes 40fb4fc1e1 net/mlx4_en: Pass user MTU value to Firmware at set port command
When starting the port, driver will inform Firmware about the actual MTU
which does not include implicit headers, such as FCS or VLAN tags.

Signed-off-by: Shaker Daibes <shakerd@mellanox.com>
Signed-off-by: Tariq Toukan <tariqt@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-01-30 15:26:43 -05:00
Ariel Levkovich 297e1cf29e net/mlx4_en: Adding support of turning off link autonegotiation via ethtool
This feature will allow the user to disable auto negotiation
on the port for mlx4 devices while setting the speed is limited
to 1GbE speeds.
Other speeds will not be accepted in autoneg off mode.

This functionality is permitted providing that the firmware
is compatible with this feature.
The above is determined by querying a new dedicated capability
bit in the device.

Signed-off-by: Ariel Levkovich <lariel@mellanox.com>
Signed-off-by: Tariq Toukan <tariqt@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-01-30 15:26:42 -05:00
Alaa Hleihel 4b5e5b7ece net/mlx4_core: Get num_tc using netdev_get_num_tc
Avoid reading num_tc directly from struct net_device, but use
the helper function netdev_get_num_tc.

Fixes: bc6a4744b8 ("net/mlx4_en: num cores tx rings for every UP")
Fixes: f5b6345ba8 ("net/mlx4_en: User prio mapping gets corrupted when changing number of channels")
Signed-off-by: Alaa Hleihel <alaa@mellanox.com>
Signed-off-by: Tariq Toukan <tariqt@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-01-30 15:26:42 -05:00
Matan Barak ae5a2e29d1 net/mlx4_core: Add resource alloc/dealloc debugging
In order to aid debugging of functions that take a resource but
don't put it, add the last function name that successfully grabbed
this resource.

Signed-off-by: Matan Barak <matanb@mellanox.com>
Signed-off-by: Tariq Toukan <tariqt@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-01-30 15:26:42 -05:00
Yishai Hadas 3835336401 net/mlx4_core: Device revision support
The device revision field returned by the NodeInfo MAD is incorrect
on ConnectX3 devices.

This patch is driver side handling to complete a FW fix added at 2.11.1172.
INIT_HCA - bit at offset 0x0C.12 is set to 1 so that FW will report
correct device revision.

Older FW versions won't be affected from turning on that bit,
no capability bit is needed.

Signed-off-by: Yishai Hadas <yishaih@mellanox.com>
Signed-off-by: Tariq Toukan <tariqt@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-01-30 15:26:42 -05:00
Tariq Toukan 72b8eaab24 net/mlx4: Replace ENOSYS with better fitting error codes
Conform the following warning:
WARNING: ENOSYS means 'invalid syscall nr' and nothing else.

Signed-off-by: Tariq Toukan <tariqt@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-01-30 15:26:41 -05:00
David S. Miller 4e8f2fc1a5 Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net
Two trivial overlapping changes conflicts in MPLS and mlx5.

Signed-off-by: David S. Miller <davem@davemloft.net>
2017-01-28 10:33:06 -05:00
Daniel Borkmann a67edbf4fb bpf: add initial bpf tracepoints
This work adds a number of tracepoints to paths that are either
considered slow-path or exception-like states, where monitoring or
inspecting them would be desirable.

For bpf(2) syscall, tracepoints have been placed for main commands
when they succeed. In XDP case, tracepoint is for exceptions, that
is, f.e. on abnormal BPF program exit such as unknown or XDP_ABORTED
return code, or when error occurs during XDP_TX action and the packet
could not be forwarded.

Both have been split into separate event headers, and can be further
extended. Worst case, if they unexpectedly should get into our way in
future, they can also removed [1]. Of course, these tracepoints (like
any other) can be analyzed by eBPF itself, etc. Example output:

  # ./perf record -a -e bpf:* sleep 10
  # ./perf script
  sock_example  6197 [005]   283.980322:      bpf:bpf_map_create: map type=ARRAY ufd=4 key=4 val=8 max=256 flags=0
  sock_example  6197 [005]   283.980721:       bpf:bpf_prog_load: prog=a5ea8fa30ea6849c type=SOCKET_FILTER ufd=5
  sock_example  6197 [005]   283.988423:   bpf:bpf_prog_get_type: prog=a5ea8fa30ea6849c type=SOCKET_FILTER
  sock_example  6197 [005]   283.988443: bpf:bpf_map_lookup_elem: map type=ARRAY ufd=4 key=[06 00 00 00] val=[00 00 00 00 00 00 00 00]
  [...]
  sock_example  6197 [005]   288.990868: bpf:bpf_map_lookup_elem: map type=ARRAY ufd=4 key=[01 00 00 00] val=[14 00 00 00 00 00 00 00]
       swapper     0 [005]   289.338243:    bpf:bpf_prog_put_rcu: prog=a5ea8fa30ea6849c type=SOCKET_FILTER

  [1] https://lwn.net/Articles/705270/

Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: Alexei Starovoitov <ast@kernel.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-01-25 13:17:47 -05:00
Geliang Tang 3704eb6f6f net/mlx4: use rb_entry()
To make the code clearer, use rb_entry() instead of container_of() to
deal with rbtree.

Signed-off-by: Geliang Tang <geliangtang@gmail.com>
Reviewed-by: Leon Romanovsky <leonro@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-01-22 16:46:13 -05:00
Eric Dumazet dceeab0e52 mlx4: support __GFP_MEMALLOC for rx
Commit 04aeb56a17 ("net/mlx4_en: allocate non 0-order pages for RX
ring with __GFP_NOMEMALLOC") added code that appears to be not needed at
that time, since mlx4 never used __GFP_MEMALLOC allocations anyway.

As using memory reserves is a must in some situations (swap over NFS or
iSCSI), this patch adds this flag.

Note that this driver does not reuse pages (yet) so we do not have to
add anything else.

Signed-off-by: Eric Dumazet <edumazet@google.com>
Cc: Konstantin Khlebnikov <khlebnikov@yandex-team.ru>
Cc: Tariq Toukan <tariqt@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-01-19 23:35:12 -05:00
Eran Ben Elisha e91ef71dfe net/mlx4_en: Remove unnecessary checks when setting num channels
Boundaries checks for the number of RX, TX, other and combined channels
should be checked by the caller and not in the driver.

In addition, remove wrong memset on get channels as it overrides the cmd
field in the requester struct.

Signed-off-by: Eran Ben Elisha <eranbe@mellanox.com>
Signed-off-by: Tariq Toukan <tariqt@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-01-18 14:58:23 -05:00
David S. Miller 580bdf5650 Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net 2017-01-17 15:19:37 -05:00
Jack Morgenstein 9577b174cd net/mlx4_core: Eliminate warning messages for SRQ_LIMIT under SRIOV
When running SRIOV, warnings for SRQ LIMIT events flood the Hypervisor's
message log when (correct, normally operating) apps use SRQ LIMIT events
as a trigger to post WQEs to SRQs.

Add more information to the existing debug printout for SRQ_LIMIT, and
output the warning messages only for the SRQ CATAS ERROR event.

Fixes: acba2420f9 ("mlx4_core: Add wrapper functions and comm channel and slave event support to EQs")
Fixes: e0debf9cb5 ("mlx4_core: Reduce warning message for SRQ_LIMIT event to debug level")
Signed-off-by: Jack Morgenstein <jackm@dev.mellanox.co.il>
Signed-off-by: Tariq Toukan <tariqt@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-01-16 15:08:29 -05:00
Jack Morgenstein 7c3945bc20 net/mlx4_core: Fix when to save some qp context flags for dynamic VST to VGT transitions
Save the qp context flags byte containing the flag disabling vlan stripping
in the RESET to INIT qp transition, rather than in the INIT to RTR
transition. Per the firmware spec, the flags in this byte are active
in the RESET to INIT transition.

As a result of saving the flags in the incorrect qp transition, when
switching dynamically from VGT to VST and back to VGT, the vlan
remained stripped (as is required for VST) and did not return to
not-stripped (as is required for VGT).

Fixes: f0f829bf42 ("net/mlx4_core: Add immediate activate for VGT->VST->VGT")
Signed-off-by: Jack Morgenstein <jackm@dev.mellanox.co.il>
Signed-off-by: Tariq Toukan <tariqt@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-01-16 15:08:28 -05:00
Jack Morgenstein 291c566a28 net/mlx4_core: Fix racy CQ (Completion Queue) free
In function mlx4_cq_completion() and mlx4_cq_event(), the
radix_tree_lookup requires a rcu_read_lock.
This is mandatory: if another core frees the CQ, it could
run the radix_tree_node_rcu_free() call_rcu() callback while
its being used by the radix tree lookup function.

Additionally, in function mlx4_cq_event(), since we are adding
the rcu lock around the radix-tree lookup, we no longer need to take
the spinlock. Also, the synchronize_irq() call for the async event
eliminates the need for incrementing the cq reference count in
mlx4_cq_event().

Other changes:
1. In function mlx4_cq_free(), replace spin_lock_irq with spin_lock:
   we no longer take this spinlock in the interrupt context.
   The spinlock here, therefore, simply protects against different
   threads simultaneously invoking mlx4_cq_free() for different cq's.

2. In function mlx4_cq_free(), we move the radix tree delete to before
   the synchronize_irq() calls. This guarantees that we will not
   access this cq during any subsequent interrupts, and therefore can
   safely free the CQ after the synchronize_irq calls. The rcu_read_lock
   in the interrupt handlers only needs to protect against corrupting the
   radix tree; the interrupt handlers may access the cq outside the
   rcu_read_lock due to the synchronize_irq calls which protect against
   premature freeing of the cq.

3. In function mlx4_cq_event(), we change the mlx_warn message to mlx4_dbg.

4. We leave the cq reference count mechanism in place, because it is
   still needed for the cq completion tasklet mechanism.

Fixes: 6d90aa5cf1 ("net/mlx4_core: Make sure there are no pending async events when freeing CQ")
Fixes: 225c7b1fee ("IB/mlx4: Add a driver Mellanox ConnectX InfiniBand adapters")
Signed-off-by: Jack Morgenstein <jackm@dev.mellanox.co.il>
Signed-off-by: Matan Barak <matanb@mellanox.com>
Signed-off-by: Tariq Toukan <tariqt@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-01-16 15:08:28 -05:00
Eric Dumazet 8cf699ec84 mlx4: do not call napi_schedule() without care
Disable BH around the call to napi_schedule() to avoid following warning

[   52.095499] NOHZ: local_softirq_pending 08
[   52.421291] NOHZ: local_softirq_pending 08
[   52.608313] NOHZ: local_softirq_pending 08

Fixes: 8d59de8f7b ("net/mlx4_en: Process all completions in RX rings after port goes up")
Signed-off-by: Eric Dumazet <edumazet@google.com>
Cc: Erez Shitrit <erezsh@mellanox.com>
Cc: Eugenia Emantayev <eugenia@mellanox.com>
Cc: Tariq Toukan <tariqt@mellanox.com>
Acked-by: Tariq Toukan <tariqt@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-01-16 11:44:10 -05:00
David S. Miller 02ac5d1487 Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net
Two AF_* families adding entries to the lockdep tables
at the same time.

Signed-off-by: David S. Miller <davem@davemloft.net>
2017-01-11 14:43:39 -05:00
Martin KaFai Lau 9f9b74ef89 mlx4: Return EOPNOTSUPP instead of ENOTSUPP
In commit b45f0674b9 ("mlx4: xdp: Allow raising MTU up to one page minus eth and vlan hdrs"),
it changed EOPNOTSUPP to ENOTSUPP by mistake.  This patch fixes it.

Fixes: b45f0674b9 ("mlx4: xdp: Allow raising MTU up to one page minus eth and vlan hdrs")
Signed-off-by: Martin KaFai Lau <kafai@fb.com>
Acked-by: Saeed Mahameed <saeedm@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-01-10 21:16:43 -05:00
stephen hemminger bc1f44709c net: make ndo_get_stats64 a void function
The network device operation for reading statistics is only called
in one place, and it ignores the return value. Having a structure
return value is potentially confusing because some future driver could
incorrectly assume that the return value was used.

Fix all drivers with ndo_get_stats64 to have a void function.

Signed-off-by: Stephen Hemminger <sthemmin@microsoft.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-01-08 17:51:44 -05:00
Jack Morgenstein 10b1c04e92 net/mlx4_core: Fix raw qp flow steering rules under SRIOV
Demoting simple flow steering rule priority (for DPDK) was achieved by
wrapping FW commands MLX4_QP_FLOW_STEERING_ATTACH/DETACH for the PF
as well, and forcing the priority to MLX4_DOMAIN_NIC in the wrapper
function for the PF and all VFs.

In function mlx4_ib_create_flow(), this change caused the main rule
creation for the PF to be wrapped, while it left the associated
tunnel steering rule creation unwrapped for the PF.

This mismatch caused rule deletion failures in mlx4_ib_destroy_flow()
for the PF when the detach wrapper function did not find the associated
tunnel-steering rule (since creation of that rule for the PF did not
go through the wrapper function).

Fix this by setting MLX4_QP_FLOW_STEERING_ATTACH/DETACH to be "native"
(so that the PF invocation does not go through the wrapper), and perform
the required priority demotion for the PF in the mlx4_ib_create_flow()
code path.

Fixes: 48564135cb ("net/mlx4_core: Demote simple multicast and broadcast flow steering rules")
Signed-off-by: Jack Morgenstein <jackm@dev.mellanox.co.il>
Signed-off-by: Tariq Toukan <tariqt@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-12-29 14:17:40 -05:00
Slava Shwartsman 61b6034c6c net/mlx4_en: Fix type mismatch for 32-bit systems
is_power_of_2 expects unsigned long and we pass u64 max_val_cycles,
this will be truncated on 32 bit systems, and the result is not what we
were expecting.
div_u64 expects u32 as a second argument and we pass
max_val_cycles_rounded which is u64 hence it will always be truncated.
Fix was tested on both 64 and 32 bit systems and got same results for
max_val_cycles and max_val_cycles_rounded.

Fixes: 4850cf4581 ("net/mlx4_en: Resolve dividing by zero in 32-bit system")
Signed-off-by: Slava Shwartsman <slavash@mellanox.com>
Signed-off-by: Tariq Toukan <tariqt@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-12-29 14:17:40 -05:00
Leon Romanovsky c1d5f8ff80 net/mlx4: Remove BUG_ON from ICM allocation routine
This patch removes BUG_ON() macro from mlx4_alloc_icm_coherent()
by checking DMA address alignment in advance and performing proper
folding in case of error.

Fixes: 5b0bf5e25e ("mlx4_core: Support ICM tables in coherent memory")
Reported-by: Ozgur Karatas <okaratas@member.fsf.org>
Signed-off-by: Leon Romanovsky <leonro@mellanox.com>
Signed-off-by: Tariq Toukan <tariqt@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-12-29 14:17:40 -05:00
Eugenia Emantayev 6496bbf0ec net/mlx4_en: Fix bad WQE issue
Single send WQE in RX buffer should be stamped with software
ownership in order to prevent the flow of QP in error in FW
once UPDATE_QP is called.

Fixes: 9f519f68cf ('mlx4_en: Not using Shared Receive Queues')
Signed-off-by: Eugenia Emantayev <eugenia@mellanox.com>
Signed-off-by: Tariq Toukan <tariqt@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-12-29 14:17:40 -05:00
Jack Morgenstein 3b01fe7f91 net/mlx4_core: Use-after-free causes a resource leak in flow-steering detach
mlx4_QP_FLOW_STEERING_DETACH_wrapper first removes the steering
rule (which results in freeing the rule structure), and then
references a field in this struct (the qp number) when releasing the
busy-status on the rule's qp.

Since this memory was freed, it could reallocated and changed.
Therefore, the qp number in the struct may be incorrect,
so that we are releasing the incorrect qp. This leaves the rule's qp
in the busy state (and could possibly release an incorrect qp as well).

Fix this by saving the qp number in a local variable, for use after
removing the steering rule.

Fixes: 2c473ae7e5 ("net/mlx4_core: Disallow releasing VF QPs which have steering rules")
Signed-off-by: Jack Morgenstein <jackm@dev.mellanox.co.il>
Signed-off-by: Tariq Toukan <tariqt@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-12-29 14:17:40 -05:00
Linus Torvalds 8f18e4d03e Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net
Pull networking fixes from David Miller:

 1) Various ipvlan fixes from Eric Dumazet and Mahesh Bandewar.

    The most important is to not assume the packet is RX just because
    the destination address matches that of the device. Such an
    assumption causes problems when an interface is put into loopback
    mode.

 2) If we retry when creating a new tc entry (because we dropped the
    RTNL mutex in order to load a module, for example) we end up with
    -EAGAIN and then loop trying to replay the request. But we didn't
    reset some state when looping back to the top like this, and if
    another thread meanwhile inserted the same tc entry we were trying
    to, we re-link it creating an enless loop in the tc chain. Fix from
    Daniel Borkmann.

 3) There are two different WRITE bits in the MDIO address register for
    the stmmac chip, depending upon the chip variant. Due to a bug we
    could set them both, fix from Hock Leong Kweh.

 4) Fix mlx4 bug in XDP_TX handling, from Tariq Toukan.

* git://git.kernel.org/pub/scm/linux/kernel/git/davem/net:
  net: stmmac: fix incorrect bit set in gmac4 mdio addr register
  r8169: add support for RTL8168 series add-on card.
  net: xdp: remove unused bfp_warn_invalid_xdp_buffer()
  openvswitch: upcall: Fix vlan handling.
  ipv4: Namespaceify tcp_tw_reuse knob
  net: korina: Fix NAPI versus resources freeing
  net, sched: fix soft lockup in tc_classify
  net/mlx4_en: Fix user prio field in XDP forward
  tipc: don't send FIN message from connectionless socket
  ipvlan: fix multicast processing
  ipvlan: fix various issues in ipvlan_process_multicast()
2016-12-27 16:04:37 -08:00
Thomas Gleixner a5a1d1c291 clocksource: Use a plain u64 instead of cycle_t
There is no point in having an extra type for extra confusion. u64 is
unambiguous.

Conversion was done with the following coccinelle script:

@rem@
@@
-typedef u64 cycle_t;

@fix@
typedef cycle_t;
@@
-cycle_t
+u64

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: John Stultz <john.stultz@linaro.org>
2016-12-25 11:04:12 +01:00
Tariq Toukan eb9def61be net/mlx4_en: Fix user prio field in XDP forward
The user prio field is wrong (and overflows) in the XDP forward
flow.
This is a result of a bad value for num_tx_rings_p_up, which should
account all XDP TX rings, as they operate for the same user prio.

Signed-off-by: Tariq Toukan <tariqt@mellanox.com>
Reported-by: Martin KaFai Lau <kafai@fb.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-12-23 17:53:47 -05:00
Linus Torvalds 0ab7b12c49 pci-v4.10-changes
-----BEGIN PGP SIGNATURE-----
 Version: GnuPG v1
 
 iQIcBAABAgAGBQJYUt1vAAoJEFmIoMA60/r8abgP/3R+5Lsk5/kfAHk5/2Mtqbvg
 mZ0eDUpY9GbUeMjSq84Nr2H8u7d+1AJCCu8KtDJYZCmjZpnSp2SuE2PS5JoGC7zC
 fintD24jlIF4/J5+HeVXXmbfr3xATxvpTuiSLEi8sLBRJ3KRIswhMSwoPwOyeTQw
 v/EclWKPGYcI5Zp0oigY9/Jd3q3lQ17KXppi/0dDoLh7PNOFvEHItXWzmf++u/NP
 iYT9R1xmzEsy0/HRd6hiwPT2xA8YsAXxgobhHooUgh1FWmZ02Tg1WjgDemOW4lVh
 kNIUcsLczh7wZCceogrrJ+pwb9+NyyIyKuHPv6OG3ieyz1IZdznaj1fAE5HJYiPo
 eVS7cP1S6DyV3Y5qFj5F2dSRS7T4GXdXG5mNhmeCpUHs0vfzSCG36jLmhTy8UIxs
 1rCf5oFa+uU9q0okfH8VtcGOXqWjGgyxTSGGfF71HUMLnPbsci2fxC2cO6svzIX7
 wDY0uxOzpyMIYMuQR6iz7VqvAwEaZ+7pfMIrWWdDcQ9/5tCNJ49cLuKaThPL4bVu
 juiGBQtnTLg8tjrhjDL9tQiJpuVIweVXyyQ1fvZoVXkMLlhVCF2ttirvwFUit2PB
 84OlevQZ+9QdE/qalrWbv4qzhesuiwu0avkzjGoqg6tWTF0epu2AHI2vqy6UBYEG
 tcfJPEcz1019PKZNSvWy
 =ut0k
 -----END PGP SIGNATURE-----

Merge tag 'pci-v4.10-changes' of git://git.kernel.org/pub/scm/linux/kernel/git/helgaas/pci

Pull PCI updates from Bjorn Helgaas:
 "PCI changes:

   - add support for PCI on ARM64 boxes with ACPI. We already had this
     for theoretical spec-compliant hardware; now we're adding quirks
     for the actual hardware (Cavium, HiSilicon, Qualcomm, X-Gene)

   - add runtime PM support for hotplug ports

   - enable runtime suspend for Intel UHCI that uses platform-specific
     wakeup signaling

   - add yet another host bridge registration interface. We hope this is
     extensible enough to subsume the others

   - expose device revision in sysfs for DRM

   - to avoid device conflicts, make sure any VF BAR updates are done
     before enabling the VF

   - avoid unnecessary link retrains for ASPM

   - allow INTx masking on Mellanox devices that support it

   - allow access to non-standard VPD for Chelsio devices

   - update Broadcom iProc support for PAXB v2, PAXC v2, inbound DMA,
     etc

   - update Rockchip support for max-link-speed

   - add NVIDIA Tegra210 support

   - add Layerscape LS1046a support

   - update R-Car compatibility strings

   - add Qualcomm MSM8996 support

   - remove some uninformative bootup messages"

* tag 'pci-v4.10-changes' of git://git.kernel.org/pub/scm/linux/kernel/git/helgaas/pci: (115 commits)
  PCI: Enable access to non-standard VPD for Chelsio devices (cxgb3)
  PCI: Expand "VPD access disabled" quirk message
  PCI: pciehp: Remove loading message
  PCI: hotplug: Remove hotplug core message
  PCI: Remove service driver load/unload messages
  PCI/AER: Log AER IRQ when claiming Root Port
  PCI/AER: Log errors with PCI device, not PCIe service device
  PCI/AER: Remove unused version macros
  PCI/PME: Log PME IRQ when claiming Root Port
  PCI/PME: Drop unused support for PMEs from Root Complex Event Collectors
  PCI: Move config space size macros to pci_regs.h
  x86/platform/intel-mid: Constify mid_pci_platform_pm
  PCI/ASPM: Don't retrain link if ASPM not possible
  PCI: iproc: Skip check for legacy IRQ on PAXC buses
  PCI: pciehp: Leave power indicator on when enabling already-enabled slot
  PCI: pciehp: Prioritize data-link event over presence detect
  PCI: rcar: Add gen3 fallback compatibility string for pcie-rcar
  PCI: rcar: Use gen2 fallback compatibility last
  PCI: rcar-gen2: Use gen2 fallback compatibility last
  PCI: rockchip: Move the deassert of pm/aclk/pclk after phy_init()
  ..
2016-12-15 12:46:48 -08:00
Linus Torvalds 9465d9cc31 Merge branch 'timers-core-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull timer updates from Thomas Gleixner:
 "The time/timekeeping/timer folks deliver with this update:

   - Fix a reintroduced signed/unsigned issue and cleanup the whole
     signed/unsigned mess in the timekeeping core so this wont happen
     accidentaly again.

   - Add a new trace clock based on boot time

   - Prevent injection of random sleep times when PM tracing abuses the
     RTC for storage

   - Make posix timers configurable for real tiny systems

   - Add tracepoints for the alarm timer subsystem so timer based
     suspend wakeups can be instrumented

   - The usual pile of fixes and updates to core and drivers"

* 'timers-core-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (23 commits)
  timekeeping: Use mul_u64_u32_shr() instead of open coding it
  timekeeping: Get rid of pointless typecasts
  timekeeping: Make the conversion call chain consistently unsigned
  timekeeping_Force_unsigned_clocksource_to_nanoseconds_conversion
  alarmtimer: Add tracepoints for alarm timers
  trace: Update documentation for mono, mono_raw and boot clock
  trace: Add an option for boot clock as trace clock
  timekeeping: Add a fast and NMI safe boot clock
  timekeeping/clocksource_cyc2ns: Document intended range limitation
  timekeeping: Ignore the bogus sleep time if pm_trace is enabled
  selftests/timers: Fix spelling mistake "Asyncrhonous" -> "Asynchronous"
  clocksource/drivers/bcm2835_timer: Unmap region obtained by of_iomap
  clocksource/drivers/arm_arch_timer: Map frame with of_io_request_and_map()
  arm64: dts: rockchip: Arch counter doesn't tick in system suspend
  clocksource/drivers/arm_arch_timer: Don't assume clock runs in suspend
  posix-timers: Make them configurable
  posix_cpu_timers: Move the add_device_randomness() call to a proper place
  timer: Move sys_alarm from timer.c to itimer.c
  ptp_clock: Allow for it to be optional
  Kconfig: Regenerate *.c_shipped files after previous changes
  ...
2016-12-12 19:56:15 -08:00
Martin KaFai Lau ea3349a035 mlx4: xdp: Reserve headroom for receiving packet when XDP prog is active
Reserve XDP_PACKET_HEADROOM for packet and enable bpf_xdp_adjust_head()
support.  This patch only affects the code path when XDP is active.

After testing, the tx_dropped counter is incremented if the xdp_prog sends
more than wire MTU.

Signed-off-by: Martin KaFai Lau <kafai@fb.com>
Acked-by: Saeed Mahameed <saeedm@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-12-08 14:25:13 -05:00