We need to use a different loop index for mlx4_counter_alloc() and for
device_create_file() iterations: the mlx4_counter_alloc() loop index
is used in the error flow to free counters.
If the same loop index is used for device_create_file() and, say, the
device_create_file() loop fails on the first iteration, the allocated
counters will not be freed.
Signed-off-by: Jack Morgenstein <jackm@dev.mellanox.co.il>
Signed-off-by: Roland Dreier <roland@purestorage.com>
Enable IB ULPs to use a larger portion of the device EQs (which map to
IRQs). The mlx4_ib driver follows the mlx4_core framework of the EQs
to be divided among the device ports. In this scheme, for each IB
port, the number of allocated EQs follows the number of cores, subject
to other system constraints, such as number available MSI-X vectors.
Signed-off-by: Shlomo Pongratz <shlomop@mellanox.com>
Signed-off-by: Roland Dreier <roland@purestorage.com>
Signed-off-by: Shlomo Pongratz <shlomop@mellanox.com>
[ Replace one more printk_once() with pr_info_once(). - Roland ]
Signed-off-by: Roland Dreier <roland@purestorage.com>
Otherwise CM packets going over MLX QP1 get fixed scheduling priority 0.
We want CM packets to get the same scheduling priority, and therefore
map to the same SQ (Schedule Queue) and eventually TC (Traffic Class),
as the application requested for the actual QP used for the connection.
Signed-off-by: Oren Duer <oren@mellanox.co.il>
Signed-off-by: Or Gerlitz <ogerlitz@mellanox.com>
Signed-off-by: Roland Dreier <roland@purestorage.com>
Implement raw packet QPs for Ethernet ports using the MLX transport (as
done by the mlx4_en Ethernet netdevice driver).
Signed-off-by: Or Gerlitz <ogerlitz@mellanox.com>
Signed-off-by: Roland Dreier <roland@purestorage.com>
- fix memory leak in mlx4
- fix two problems with new MAD response generation code
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1.4.11 (GNU/Linux)
iQIcBAABCAAGBQJPmYfMAAoJEENa44ZhAt0h06IP/jPJeH855wC2RgWXY2Fd+b6w
K/DQVE2MaX2f8EHuAB5MKoC0zzchKNu0GTMbrZMizbocDBmIU6ibF3MKx2rycSBC
FI906X7FFAeSyaywrI+k+ZGxVQYAck1QFDr44WiyJ400r8BGbx9ZECKvIyEwjaUa
RpRn4Pui4IjMV7GFqiziUsa9uo4xKYagjNfeInAZXelzIN76Coeae28dAIE2rPT/
BTcrtQ2JMG6H4AVGTImTP+Kqk1Cm7apBeHS4+SGWWTdw1WH4IcMqPmxHBUUtKTWt
UnlgpHquuC913VpkbkGeog32iAndhHgAkfs67Rs0IFwck127Il/DiPK/zir2zkZ2
YVilzYIxrSUYqPZYfKFcrKsnjfM+HJxBEPCfDjxfG0TjrVS8tnc75Uq1NvIt6e42
lcIclLDyZtdsMLmJlsTsT5cF7HlThSjpZDSgmMi2eeOI0RKoqRyyGGwiAIq7Ud2a
oRa+bheh4pBD2lvzd62uEkgMkTaToAz3I/vlbOTSwxYg2skoHUpwNPCsmstsQ7RU
BHi7Nl5DLNT5lNBTIouWvbilpXs8eXwgcNrQcWSYaPeY2aDsGAAdi9Q4BadENao9
dmxOtqjryctX7U7LGDRmNDIfiyPY5AodDUJjFux1JwqbdSCGf1Lz7uvDWu7gqL9V
TCJFAK54okexpoyV6Vgv
=iRrT
-----END PGP SIGNATURE-----
Merge tag 'ib-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/roland/infiniband
Pull infiniband fixes from Roland Dreier:
"A few fixes for regressions introduced in 3.4-rc1:
- fix memory leak in mlx4
- fix two problems with new MAD response generation code"
* tag 'ib-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/roland/infiniband:
IB/mlx4: Fix memory leaks in ib_link_query_port()
IB/mad: Don't send response for failed MADs
IB/mad: Set 'D' bit in response for unhandled MADs
If the call to mlx4_MAD_IFC() fails in ib_link_query_port() we will
currently do 'return err;' which will leak 'in_mad' and 'out_mad'. We
should instead do 'goto out;' where we'll properly free the memory we
previously allocated.
Signed-off-by: Jesper Juhl <jj@chaosbits.net>
Acked-by: Or Gerlitz <ogerlitz@mellanox.com>
Signed-off-by: Roland Dreier <roland@purestorage.com>
When the IB port is down, the active_speed value returned by the
MAD_IFC command is seven (7) which isn't among the defined IB speeds
in enum ib_port_speed, and this invalid speed value is passed up to
higher layers or applications who do port query.
Fix that by setting the speed to be SDR -- the lowest possible -- when
the port is down.
Signed-off-by: Roland Dreier <roland@purestorage.com>
To issue a port query, use the QUERY_(Ethernet)_PORT command instead
of the MAD_IFC command, since MAD_IFC attempts to query the firmware
IB SMA, which is irrelevant for IBoE ports.
This allows us to handle both 10Gb/s and 40Gb/s rates (e.g in sysfs),
using QDR speed (10Gb/s) and width of 1X or 4X.
Signed-off-by: Dotan Barak <dotanb@mellanox.com>
Signed-off-by: Or Gerlitz <ogerlitz@mellanox.com>
Signed-off-by: Roland Dreier <roland@purestorage.com>
If an erroneous CQE is polled in the first iteration (i.e. npolled ==
0), we don't update the consumer index and hence the hardware could
get a wrong notion of how many CQEs software polled. Fix this by
unconditionally updating the doorbell record. We could change the
check to be something like
if (npolled || err != -EAGAIN)
...
but it does not seem worth the effort since a posted write to memory
should not cost too much.
Signed-off-by: Eli Cohen <eli@mellanox.com>
Signed-off-by: Roland Dreier <roland@purestorage.com>
Use a bit in wc_flags rather then a whole integer to hold the
"checksum OK" flag. By itself, this change doesn't reduce the size of
struct ib_wc on 64bit machines -- it stays on 56 bytes because of
padding. However, it will allow to add more fields in the future
without enlarging the struct. Also, it will let us have a unified
approach with future libibverbs checksum offload reporting, because a
bit flag doesn't break the library ABI.
This patch was suggested during conversation with Liran Liss
<liranl@mellanox.com>.
Signed-off-by: Or Gerlitz <ogerlitz@mellanox.com>
Reviewed-by: Sean Hefty <sean.hefty@intel.com>
Signed-off-by: Roland Dreier <roland@purestorage.com>
While doing the work for commit a6f7feae6d ("IB/mlx4: pass SMP
vendor-specific attribute MADs to firmware") we realized that the
firmware would respond on all sorts of vendor-specific MADs.
Therefore commit 97285b7817 ("mlx4_core: Add extended port
capabilities support") adds redundant code into the driver, since
there's no real reaon to maintain the extended capabilities of the
port, as they can be queried on demand (e.g the FDR10 capability).
This patch reverts commit 97285b7817 and removes the check for
extended caps from the mlx4_ib driver port query flow.
Signed-off-by: Or Gerlitz <ogerlitz@mellanox.com>
Signed-off-by: Roland Dreier <roland@purestorage.com>
The kernel IB stack uses one enumeration for IB speed, which wasn't
explicitly specified in the verbs header file. Add that enum, and use
it all over the code.
The IB speed/width notation is also used by iWARP and IBoE HW drivers,
which use the convention of rate = speed * width to advertise their
port link rate.
Signed-off-by: Or Gerlitz <ogerlitz@mellanox.com>
Signed-off-by: Roland Dreier <roland@purestorage.com>
ConnectX devices have a limit on the number of mappings that can be
done on an FMR before having to call sync_tpt. The current
mlx4_ib driver reports the limit correctly in max_map_per_fmr in
.query_device(), but mlx4_core doesn't check it when actually
allocating FMRs.
Add a max_fmr_maps field to struct mlx4_caps and enforce this maximum
value on FMR allocations.
Signed-off-by: Eli Cohen <eli@mellanox.com>
Signed-off-by: Roland Dreier <roland@purestorage.com>
If the opcode of a work request exceeds the range of valid opcodes,
return the pointer to the offending work request.
Signed-off-by: Eli Cohen <eli@mellanox.com>
Signed-off-by: Roland Dreier <roland@purestorage.com>
In the current code, vendor-specific MADs (e.g with the FDR-10
attribute) are silently dropped by the driver, resulting in timeouts
at the sending side and inability to query/configure the relevant
feature. However, the ConnectX firmware is able to handle such MADs.
For unsupported attributes, the firmware returns a GET_RESPONSE MAD
containing an error status.
For example, for a FDR-10 node with LID 11:
# ibstat mlx4_0 1
CA: 'mlx4_0'
Port 1:
State: Active
Physical state: LinkUp
Rate: 40 (FDR10)
Base lid: 11
LMC: 0
SM lid: 24
Capability mask: 0x02514868
Port GUID: 0x0002c903002e65d1
Link layer: InfiniBand
Extended Port Query (EPI) vendor mad timeouts before the patch:
# smpquery MEPI 11 -d
ibwarn: [4196] smp_query_via: attr 0xff90 mod 0x0 route Lid 11
ibwarn: [4196] _do_madrpc: retry 1 (timeout 1000 ms)
ibwarn: [4196] _do_madrpc: retry 2 (timeout 1000 ms)
ibwarn: [4196] _do_madrpc: timeout after 3 retries, 3000 ms
ibwarn: [4196] mad_rpc: _do_madrpc failed; dport (Lid 11)
smpquery: iberror: [pid 4196] main: failed: operation EPI: ext port info query failed
EPI query works OK with the patch:
# smpquery MEPI 11 -d
ibwarn: [6548] smp_query_via: attr 0xff90 mod 0x0 route Lid 11
ibwarn: [6548] mad_rpc: data offs 64 sz 64
mad data
0000 0000 0000 0001 0000 0001 0000 0001
0000 0000 0000 0000 0000 0000 0000 0000
0000 0000 0000 0000 0000 0000 0000 0000
0000 0000 0000 0000 0000 0000 0000 0000
# Ext Port info: Lid 11 port 0
StateChangeEnable:...............0x00
LinkSpeedSupported:..............0x01
LinkSpeedEnabled:................0x01
LinkSpeedActive:.................0x01
Signed-off-by: Jack Morgenstein <jackm@mellanox.com>
Signed-off-by: Or Gerlitz <ogerlitz@mellanox.com>
Acked-by: Ira Weiny <weiny2@llnl.gov>
Cc: <stable@vger.kernel.org>
Signed-off-by: Roland Dreier <roland@purestorage.com>
For IBoE, SLs 0-7 are mapped to Ethernet 802.1Q user priority bits
(pbits) which are part of the VLAN tag, SLs 8-15 are reserved.
Under Ethernet, the ConnectX firmware treats (decode/encode) the four
bit SL field in various constructs such as QPC / UD WQE / CQE as PPP0
and not as 0PPP. This correlates well to the fact that within the
vlan tag the pbits are located in bits 15-13 and not 12-14.
The current code wasn't consistent around that area - the
encoding was correct for the IBoE QPC.path.schedule_queue field,
but was wrong for IBoE CQEs and when MLX header was built.
These inconsistencies resulted in wrong SL <--> wire 802.1Q pbits
mapping, which is fixed by using SL <--> PPP0 all around the place.
Signed-off-by: Oren Duer <oren@mellanox.co.il>
Signed-off-by: Or Gerlitz <ogerlitz@mellanox.com>
Signed-off-by: Roland Dreier <roland@purestorage.com>
Conflicts:
net/bluetooth/l2cap_core.c
Just two overlapping changes, one added an initialization of
a local variable, and another change added a new local variable.
Signed-off-by: David S. Miller <davem@davemloft.net>
For SRIOV, some Hypervisor commands can be executed directly (native = 1).
Others should go through the command wrapper flow (for tracking resource
usage, for example, or for changing some HCA configurations that slaves
need to be notified of).
This patch sets the groundwork for this capability -- adding the correct
value of "native" in each case.
Note that if SRIOV is not activated, this parameter has no effect.
Signed-off-by: Jack Morgenstein <jackm@dev.mellanox.co.il>
Signed-off-by: David S. Miller <davem@davemloft.net>
Port mask now has additional state.
Port can be set as "none". In this case neither the mlx4_en or mlx4_ib
drivers take ownership of the port.
In multifunction mode there is an option to set the vfs as single ported devices.
(in single function mode, both physical ports belong to same function)
Signed-off-by: Jack Morgenstein <jackm@dev.mellanox.co.il>
Signed-off-by: Yevgeny Petrilin <yevgenyp@mellanox.co.il>
Signed-off-by: David S. Miller <davem@davemloft.net>
Commit cfcde11c3d ("IB/mlx4: Use flow counters on IBoE ports") added
code that sets elements of counters[] to -1 if no counter is allocated,
but then goes ahead and passes every entry to mlx4_counter_free() on
shutdown. This is a bad idea, especially if MLX4_DEV_CAP_FLAG_COUNTERS
isn't set so there isn't even an underlying bitmap to free from.
Tested-by: Sean Hefty <sean.hefty@intel.com>
Cc: <stable@kernel.org>
Signed-off-by: Roland Dreier <roland@purestorage.com>
* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/roland/infiniband: (62 commits)
mlx4_core: Deprecate log_num_vlan module param
IB/mlx4: Don't set VLAN in IBoE WQEs' control segment
IB/mlx4: Enable 4K mtu for IBoE
RDMA/cxgb4: Mark QP in error before disabling the queue in firmware
RDMA/cxgb4: Serialize calls to CQ's comp_handler
RDMA/cxgb3: Serialize calls to CQ's comp_handler
IB/qib: Fix issue with link states and QSFP cables
IB/mlx4: Configure extended active speeds
mlx4_core: Add extended port capabilities support
IB/qib: Hold links until tuning data is available
IB/qib: Clean up checkpatch issue
IB/qib: Remove s_lock around header validation
IB/qib: Precompute timeout jiffies to optimize latency
IB/qib: Use RCU for qpn lookup
IB/qib: Eliminate divide/mod in converting idx to egr buf pointer
IB/qib: Decode path MTU optimization
IB/qib: Optimize RC/UC code by IB operation
IPoIB: Use the right function to do DMA unmap pages
RDMA/cxgb4: Use correct QID in insert_recv_cqe()
RDMA/cxgb4: Make sure flush CQ entries are collected on connection close
...
There's no need to set the vlan-related fields in an IBoE send WQE
control segment:
- the vlan to be used by a UD QP is set in the datagram segment.
- for GSI (CM) QP, all the headers down to 8021q and MAC are built by
the software anyway.
Signed-off-by: Or Gerlitz <ogerlitz@mellanox.com>
Signed-off-by: Roland Dreier <roland@purestorage.com>
The IBoE port MTU is derived from the corresponding Ethernet netdevice
MTU, which can support jumbo frames of 9K, and hence surely supports
the max IB mtu of 4K.
Signed-off-by: Or Gerlitz <ogerlitz@mellanox.com>
Signed-off-by: Roland Dreier <roland@purestorage.com>
Set the extended active speeds based on the hardware configuration.
Signed-off-by: Marcel Apfelbaum <marcela@dev.mellanox.co.il>
Reviewed-by: Hal Rosenstock <hal@mellanox.com>
[ Move FDR-10 handling into ib_link_query_port(). - Roland ]
Signed-off-by: Roland Dreier <roland@purestorage.com>
Allow processes that share the same XRC domain to open an existing
shareable QP. This permits those processes to receive events on the
shared QP and transfer ownership, so that any process may modify the
QP. The latter allows the creating process to exit, while a remaining
process can still transition it for path migration purposes.
Signed-off-by: Sean Hefty <sean.hefty@intel.com>
Signed-off-by: Roland Dreier <roland@purestorage.com>
Support the creation of XRC INI and TGT QPs. To handle the case where
a CQ or PD is not provided, we allocate them internally with the xrcd.
Signed-off-by: Sean Hefty <sean.hefty@intel.com>
Signed-off-by: Roland Dreier <roland@purestorage.com>
Allow the user to create XRC SRQs. This patch is based on a patch
from Jack Morgenstrein <jackm@dev.mellanox.co.il>.
Signed-off-by: Sean Hefty <sean.hefty@intel.com>
Signed-off-by: Roland Dreier <roland@purestorage.com>
Support creating and destroying XRC domains. Any sharing of the XRCD
is managed above the low-level driver.
Signed-off-by: Sean Hefty <sean.hefty@intel.com>
Signed-off-by: Roland Dreier <roland@purestorage.com>
Currently, there is only a single ("basic") type of SRQ, but with XRC
support we will add a second. Prepare for this by defining an SRQ type
and setting all current users to IB_SRQT_BASIC.
Signed-off-by: Sean Hefty <sean.hefty@intel.com>
Signed-off-by: Roland Dreier <roland@purestorage.com>
Use the per port counter attached to all QPs created on that port to
implement port level packets/bytes performance counters a la IB.
Derived from a patch by Eli Cohen <eli@mellanox.co.il>
Signed-off-by: Or Gerlitz <ogerlitz@mellanox.co.il>
Signed-off-by: Roland Dreier <roland@purestorage.com>
Allocate flow counter per Ethernet/IBoE port, and attach this counter
to all the QPs created on that port. Based on patch by Eli Cohen
<eli@mellanox.co.il>.
Signed-off-by: Or Gerlitz <ogerlitz@mellanox.co.il>
Signed-off-by: Roland Dreier <roland@purestorage.com>
IBoE doesn't use LIDs. Use the GID change event to update the IB core
cache for addition/deletion of GIDs.
Signed-off-by: Eli Cohen <eli@mellanox.co.il>
Signed-off-by: Or Gerlitz <ogerlitz@mellanox.co.il>
Signed-off-by: Roland Dreier <roland@purestorage.com>
The same packet steering mechanism would be used both for IB and Ethernet,
Both multicasts and unicasts.
This commit prepares the general infrastructure for this.
Signed-off-by: Yevgeny Petrilin <yevgenyp@mellanox.co.il>
Signed-off-by: David S. Miller <davem@davemloft.net>
HW revision is derived from device ID and rev id.
Signed-off-by: Eugenia Emantayev <eugenia@mellanox.co.il>
Signed-off-by: Yevgeny Petrilin <yevgenyp@mellanox.co.il>
Signed-off-by: David S. Miller <davem@davemloft.net>
The newest device firmware stores IB vs. Ethernet protocol in two bits
in members_count field of multicast group table (0: Infiniband, 1:
Ethernet). When changing the QP members count for a multicast group,
it important not to reset this information. When calling multicast
attach first time, the protocol type should be specified. In this
patch we always set it IB, but in the future we will handle Ethernet
too. When looking for a QP, the protocol type shoud be checked too.
Signed-off-by: Aleksey Senin <alekseys@voltaire.com>
Signed-off-by: Roland Dreier <rolandd@cisco.com>
Some systems have PCI addresses that don't fit in unsigned long (eg some
32-bit PowerPC 440 systems have 36-bit bus addresses). Fix up mlx4 drivers
by using phys_addr_t where appropriate, so we don't truncate any PCI
resource addresses before ioremapping them.
Signed-off-by: Roland Dreier <rolandd@cisco.com>
* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/roland/infiniband: (42 commits)
IB/qib: Fix refcount leak in lkey/rkey validation
IB/qib: Improve SERDES tunning on QMH boards
IB/qib: Unnecessary delayed completions on RC connection
IB/qib: Issue pre-emptive NAKs on eager buffer overflow
IB/qib: RDMA lkey/rkey validation is inefficient for large MRs
IB/qib: Change QPN increment
IB/qib: Add fix missing from earlier patch
IB/qib: Change receive queue/QPN selection
IB/qib: Fix interrupt mitigation
IB/qib: Avoid duplicate writes to the rcv head register
IB/qib: Add a few new SERDES tunings
IB/qib: Reset packet list after freeing
IB/qib: New SERDES init routine and improvements to SI quality
IB/qib: Clear WAIT_SEND flags when setting QP to error state
IB/qib: Fix context allocation with multiple HCAs
IB/qib: Fix multi-Florida HCA host panic on reboot
IB/qib: Handle transitions from ACTIVE_DEFERRED to ACTIVE better
IB/qib: UD send with immediate receive completion has wrong size
IB/qib: Set port physical state even if other fields are invalid
IB/qib: Generate completion callback on errors
...
ib_create_send_mad() can return ERR_PTR(-ENOMEM) here.
Signed-off-by: Dan Carpenter <error27@gmail.com>
Signed-off-by: Roland Dreier <rolandd@cisco.com>
mlx4_ib_free_cq_buf() should not be called under spin_lock_irq() since
it calls dma_free_coherent(), which needs irqs enabled. Fix this by
deferring the free to outside the locked region.
This was found due to the
WARN_ON(irqs_disabled());
in swiotlb_free_coherent().
Signed-off-by: Vladimir Sokolovsky <vlad@mellanox.co.il>
Signed-off-by: Roland Dreier <rolandd@cisco.com>
Use netif_running() and netif_carrier_ok() to report link state,
exactly as is done to report Ethernet link state in sysfs.
Signed-off-by: Eli Cohen <eli@mellanox.co.il>
Signed-off-by: Roland Dreier <rolandd@cisco.com>
The link rate is the product of the link speed in the link width. For
Etherent ports the rate is 10G, so we use 1 for the width and 4 for
speed to get the correct rate.
Signed-off-by: Eli Cohen <eli@mellanox.co.il>
Signed-off-by: Roland Dreier <rolandd@cisco.com>
We must fully update the control segment before marking it as valid,
so that hardware doesn't start executing it before we're ready.
Signed-off-by: Eli Cohen <eli@mellanox.co.il>
[ Move VLAN control bit setting to before wmb(). - Roland ]
Signed-off-by: Roland Dreier <rolandd@cisco.com>
dev_base_lock is the legacy way to lock the device list, and is planned
to disappear. (writers hold RTNL, readers hold RCU lock)
Convert rdma_translate_ip() and update_ipv6_gids() to RCU locking.
Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
Acked-by: Roland Dreier <rolandd@cisco.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/roland/infiniband: (63 commits)
IB/qib: clean up properly if pci_set_consistent_dma_mask() fails
IB/qib: Allow driver to load if PCIe AER fails
IB/qib: Fix uninitialized pointer if CONFIG_PCI_MSI not set
IB/qib: Fix extra log level in qib_early_err()
RDMA/cxgb4: Remove unnecessary KERN_<level> use
RDMA/cxgb3: Remove unnecessary KERN_<level> use
IB/core: Add link layer type information to sysfs
IB/mlx4: Add VLAN support for IBoE
IB/core: Add VLAN support for IBoE
IB/mlx4: Add support for IBoE
mlx4_en: Change multicast promiscuous mode to support IBoE
mlx4_core: Update data structures and constants for IBoE
mlx4_core: Allow protocol drivers to find corresponding interfaces
IB/uverbs: Return link layer type to userspace for query port operation
IB/srp: Sync buffer before posting send
IB/srp: Use list_first_entry()
IB/srp: Reduce number of BUSY conditions
IB/srp: Eliminate two forward declarations
IB/mlx4: Signal node desc changes to SM by using FW to generate trap 144
IB: Replace EXTRA_CFLAGS with ccflags-y
...
This patch allows IBoE traffic to be encapsulated in 802.1Q tagged
VLAN frames. The VLAN tag is encoded in the GID and derived from it
by a simple computation.
The netdev notifier callback is modified to catch VLAN device
addition/removal and the port's GID table is updated to reflect the
change, so that for each netdevice there is an entry in the GID table.
When the port's GID table is exhausted, GID entries will not be added.
Only children of the main interfaces can add to the GID table; if a
VLAN interface is added on another VLAN interface (e.g. "vconfig add
eth2.6 8"), then that interfaces will not add an entry to the GID
table.
Signed-off-by: Eli Cohen <eli@mellanox.co.il>
Signed-off-by: Roland Dreier <rolandd@cisco.com>
Add 802.1q VLAN support to IBoE. The VLAN tag is encoded within the
GID derived from a link local address in the following way:
GID[11] GID[12] contain the VLAN ID when the GID contains a VLAN.
The 3 bits user priority field of the packets are identical to the 3
bits of the SL.
In case of rdma_cm apps, the TOS field is used to generate the SL
field by doing a shift right of 5 bits effectively taking to 3 MS bits
of the TOS field.
Signed-off-by: Eli Cohen <eli@mellanox.co.il>
Signed-off-by: Roland Dreier <rolandd@cisco.com>
Add support for IBoE to mlx4_ib. The bulk of the code is handling the
new address vector fields; mlx4 needs the MAC address of a remote node
to include it in a WQE (for datagrams) or in the QP context (for
connected QPs). Address resolution is done by assuming all unicast
GIDs are either link-local IPv6 addresses.
Multicast group attach/detach needs to update the NIC's multicast
filters; but since attaching a QP to a multicast group can be done
before the QP is bound to a port, for IBoE we need to keep track of
all multicast groups that a QP is attached too before it transitions
from INIT to RTR (since it does not have a port in the INIT state).
Signed-off-by: Eli Cohen <eli@mellanox.co.il>
[ Many things cleaned up and otherwise monkeyed with; hope I didn't
introduce too many bugs. - Roland ]
Signed-off-by: Roland Dreier <rolandd@cisco.com>
The Node Description cannot be changed via MADs (it is read-only).
Until now, it was changed in the driver via sysfs, and the new Node
Description was simply inserted by the driver into MAD responses
(replacing the description returned by FW).
System startup scripts use the sysfs interface to change the node
description at driver startup to show the hostname, etc. However, this
has a race condition: the SM could discover the original FW node
description rather than the system-specific description if it queried the
port before the startup scripts finish running.
For mlx4, we fix this with a new FW command (SET_NODE) that allows
passing the new node description to FW. When this command is invoked,
FW sends a trap 144 to the SM. When it gets this trap, the SM can
query the node to obtain the new node description -- thus eliminating
the effects of the race.
This patch simply calls SET_NODE command when a new node description
is entered via sysfs (thus causing trap 144 to be issued by the FW).
We ignore all failures of the SET_NODE command (including those caused
by using a device FW that predates the SET_NODE command), since in
that case things work just as before.
Signed-off-by: Jack Morgenstein <jackm@dev.mellanox.co.il>
Signed-off-by: Roland Dreier <rolandd@cisco.com>
Add support for packing IBoE packet headers.
Signed-off-by: Eli Cohen <eli@mellanox.co.il>
[ Clean up and fix ib_ud_header_init() a bit. - Roland ]
Signed-off-by: Roland Dreier <rolandd@cisco.com>
Fix the limit on the size of max fast registration WRs that can be
posted to match hardware capabilities.
Signed-off-by: Eli Cohen <eli@mellanox.co.il>
Signed-off-by: Roland Dreier <rolandd@cisco.com>
Add a new parameter to ib_register_device() so that low-level device
drivers can pass in a pointer to a callback function that will be
called for each port that is registered in sysfs. This allows
low-level device drivers to create files in
/sys/class/infiniband/<hca>/ports/<N>/
without having to poke through the internals of the RDMA sysfs handling.
There is no need for an unregister function since the kobject
reference will go to zero when ib_unregister_device() is called.
Signed-off-by: Ralph Campbell <ralph.campbell@qlogic.com>
Signed-off-by: Roland Dreier <rolandd@cisco.com>
Add support for masked atomic operations (masked compare and swap,
masked fetch and add).
Signed-off-by: Vladimir Sokolovsky <vlad@mellanox.co.il>
Signed-off-by: Roland Dreier <rolandd@cisco.com>
* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/roland/infiniband:
IB/mlx4: Check correct variable for allocation failure
RDMA/nes: Correct cap.max_inline_data assignment in nes_query_qp()
RDMA/cm: Set num_paths when manually assigning path records
IB/cm: Fix device_create() return value check
The intent here is to check the "mfrpl->mapped_page_list" allocation.
We checked "mfrpl->ibfrpl.page_list" earlier.
Signed-off-by: Dan Carpenter <error27@gmail.com>
Signed-off-by: Roland Dreier <rolandd@cisco.com>
percpu.h is included by sched.h and module.h and thus ends up being
included when building most .c files. percpu.h includes slab.h which
in turn includes gfp.h making everything defined by the two files
universally available and complicating inclusion dependencies.
percpu.h -> slab.h dependency is about to be removed. Prepare for
this change by updating users of gfp and slab facilities include those
headers directly instead of assuming availability. As this conversion
needs to touch large number of source files, the following script is
used as the basis of conversion.
http://userweb.kernel.org/~tj/misc/slabh-sweep.py
The script does the followings.
* Scan files for gfp and slab usages and update includes such that
only the necessary includes are there. ie. if only gfp is used,
gfp.h, if slab is used, slab.h.
* When the script inserts a new include, it looks at the include
blocks and try to put the new include such that its order conforms
to its surrounding. It's put in the include block which contains
core kernel includes, in the same order that the rest are ordered -
alphabetical, Christmas tree, rev-Xmas-tree or at the end if there
doesn't seem to be any matching order.
* If the script can't find a place to put a new include (mostly
because the file doesn't have fitting include block), it prints out
an error message indicating which .h file needs to be added to the
file.
The conversion was done in the following steps.
1. The initial automatic conversion of all .c files updated slightly
over 4000 files, deleting around 700 includes and adding ~480 gfp.h
and ~3000 slab.h inclusions. The script emitted errors for ~400
files.
2. Each error was manually checked. Some didn't need the inclusion,
some needed manual addition while adding it to implementation .h or
embedding .c file was more appropriate for others. This step added
inclusions to around 150 files.
3. The script was run again and the output was compared to the edits
from #2 to make sure no file was left behind.
4. Several build tests were done and a couple of problems were fixed.
e.g. lib/decompress_*.c used malloc/free() wrappers around slab
APIs requiring slab.h to be added manually.
5. The script was run on all .h files but without automatically
editing them as sprinkling gfp.h and slab.h inclusions around .h
files could easily lead to inclusion dependency hell. Most gfp.h
inclusion directives were ignored as stuff from gfp.h was usually
wildly available and often used in preprocessor macros. Each
slab.h inclusion directive was examined and added manually as
necessary.
6. percpu.h was updated not to include slab.h.
7. Build test were done on the following configurations and failures
were fixed. CONFIG_GCOV_KERNEL was turned off for all tests (as my
distributed build env didn't work with gcov compiles) and a few
more options had to be turned off depending on archs to make things
build (like ipr on powerpc/64 which failed due to missing writeq).
* x86 and x86_64 UP and SMP allmodconfig and a custom test config.
* powerpc and powerpc64 SMP allmodconfig
* sparc and sparc64 SMP allmodconfig
* ia64 SMP allmodconfig
* s390 SMP allmodconfig
* alpha SMP allmodconfig
* um on x86_64 SMP allmodconfig
8. percpu.h modifications were reverted so that it could be applied as
a separate patch and serve as bisection point.
Given the fact that I had only a couple of failures from tests on step
6, I'm fairly confident about the coverage of this conversion patch.
If there is a breakage, it's likely to be something in one of the arch
headers which should be easily discoverable easily on most builds of
the specific arch.
Signed-off-by: Tejun Heo <tj@kernel.org>
Guess-its-ok-by: Christoph Lameter <cl@linux-foundation.org>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Lee Schermerhorn <Lee.Schermerhorn@hp.com>
ib_ud_header_init() first clears header and then fills up the various
fields. Later on, it tests header->immediate_present, which it has
already cleared, so the condition is always false. Fix this by adding
an immediate_present parameter and setting header->immediate_present
as is done with grh_present. Also remove unused calculation of
header_len.
Signed-off-by: Eli Cohen <eli@mellanox.co.il>
Signed-off-by: Roland Dreier <rolandd@cisco.com>
struct ib_qp already holds a pointer to the ib device. No need to dive to the
hw device object to retrieve it.
Signed-off-by: Eli Cohen <eli@mellanox.co.il>
Signed-off-by: Roland Dreier <rolandd@cisco.com>
In mlx4_ib_post_recv(), we should check the queue for overflow using
recv_cq instead of send_cq (current code looks like a copy-and-paste
mistake).
Signed-off-by: Or Gerlitz <ogerlitz@voltaire.com>
Signed-off-by: Roland Dreier <rolandd@cisco.com>
As for memfree mthca hardware, ConnectX also requires SRQ WQE scatter
entries to be initialized with the invalid L_Key at SRQ creation time.
Signed-off-by: Jack Morgenstein <jackm@dev.mellanox.co.il>
Signed-off-by: Roland Dreier <rolandd@cisco.com>
Current code has a limitation: an LSO header is not allowed to cross a
64 byte boundary. This patch removes this limitation by setting the
WQE RR for large headers thus allowing LSO headers of any size. The
extra buffer reserved for MLX4_IB_QP_LSO QPs has been doubled, from 64
to 128 bytes, assuming this is reasonable upper limit for header
length. Also, this patch will cause IB_DEVICE_UD_TSO to be set only
for HCA FW versions that set MLX4_DEV_CAP_FLAG_BLH; e.g. FW version
2.6.000 and higher.
Signed-off-by: Eli Cohen <eli@mellanox.co.il>
Signed-off-by: Roland Dreier <rolandd@cisco.com>
There is no such flag DE - the field is reserved and should be zero.
Signed-off-by: Eli Cohen <eli@mellanox.co.il>
Signed-off-by: Roland Dreier <rolandd@cisco.com>
Userspace apps are supposed to release all ib device resources if they
receive a fatal async event (IBV_EVENT_DEVICE_FATAL). However, the
app has no way of knowing when the device has come back up, except to
repeatedly attempt ibv_open_device() until it succeeds.
However, currently there is no protection against the open succeeding
while the device is in being removed following the fatal event. In
this case, the open will succeed, but as a result the device waits in
the middle of its removal until the new app releases its resources --
and the new app will not do so, since the open succeeded at a point
following the fatal event generation.
This patch adds an "active" flag to the device. The active flag is set
to false (in the fatal event flow) before the "fatal" event is
generated, so any subsequent ibv_dev_open() call to the device will
fail until the device comes back up, thus preventing the above
deadlock.
Signed-off-by: Jack Morgenstein <jackm@dev.mellanox.co.il>
Signed-off-by: Roland Dreier <rolandd@cisco.com>
mlx4_ib_lock_cqs()/mlx4_ib_unlock_cqs() are helper functions that
lock/unlock both CQs attached to a QP in the proper order to avoid
AB-BA deadlocks. Annotate this so sparse can understand what's going
on (and warn us if we misuse these functions).
Signed-off-by: Roland Dreier <rolandd@cisco.com>
Replace open-coded reimplementations with printk_once().
Signed-off-by: Marcin Slusarz <marcin.slusarz@gmail.com>
Signed-off-by: Roland Dreier <rolandd@cisco.com>
The ConnectX Programmer's Reference Manual states that the "SO" bit
must be set when posting Fast Register and Local Invalidate send work
requests. When this bit is set, the work request will be executed
only after all previous work requests on the send queue have been
executed. (If the bit is not set, Fast Register and Local Invalidate
WQEs may begin execution too early, which violates the defined
semantics for these operations)
This fixes the issue with NFS/RDMA reported in
<http://lists.openfabrics.org/pipermail/general/2009-April/059253.html>
Signed-off-by: Jack Morgenstein <jackm@dev.mellanox.co.il>
Cc: <stable@kernel.org>
Signed-off-by: Roland Dreier <rolandd@cisco.com>
The low-level mlx4 driver modified the page-list addresses for fast
register work requests post send to big-endian, and set a "present"
bit. This caused problems later when the consumer attempted to unmap
the pages using the page-list (using the list addresses which were
assumed to be still in CPU-endian order). Fix the mlx4 driver to
allocate two buffers and use a private buffer for the hardware-format
bus addresses.
This patch fixes <https://bugs.openfabrics.org/show_bug.cgi?id=1571>,
an NFS/RDMA server crash. The cause of the crash was found by Vu Pham
of Mellanox. The fix is along the lines suggested by Steve Wise in
comment #21 in bug 1571.
Signed-off-by: Jack Morgenstein <jackm@dev.mellanox.co.il>
Signed-off-by: Roland Dreier <rolandd@cisco.com>
The PAT work on x86 has finally made pgprot_writecombine() a usable API
for modular drivers. As the comment indicates, this is exactly what we
want to use in mlx4_ib to map BlueFlame pages up to userspace, since
using WC for these pages improves small message latency significantly.
Signed-off-by: Roland Dreier <rolandd@cisco.com>
According to the ConnectX programmer's reference manual, all
operations should be stopped, all QPs should be torn down and all WQEs
flushed before the CLOSE_PORT command is invoked. In some cases
reversing the order of operations (as implemented now) could cause
a loss of completions.
Signed-off-by: Yevgeny Petrilin <yevgenyp@mellanox.co.il>
Signed-off-by: Roland Dreier <rolandd@cisco.com>
When snooping a PortInfo MAD, its client_reregister bit is checked.
If the bit is ON then a CLIENT_REREGISTER event is dispatched,
otherwise a LID_CHANGE event is dispatched. This way of decision
ignores the cases where the MAD changes the LID along with an
instruction to reregister (so a necessary LID_CHANGE event won't be
dispatched) or the MAD is neither of these (and an unnecessary
LID_CHANGE event will be dispatched).
This causes problems at least with IPoIB, which will do a "light"
flush on reregister, rather than the "heavy" flush required due to a
LID change.
Fix this by dispatching a CLIENT_REREGISTER event if the
client_reregister bit is set, but also compare the LID in the MAD to
the current LID. If and only if they are not identical then a
LID_CHANGE event is dispatched.
Signed-off-by: Moni Shoua <monis@voltaire.com>
Signed-off-by: Jack Morgenstein <jackm@dev.mellanox.co.il>
Signed-off-by: Yossi Etigin <yosefe@voltaire.com>
Signed-off-by: Roland Dreier <rolandd@cisco.com>
The base versions handle constant folding just fine, use them
directly. The replacements are OK in the include/ files as they are
not exported to userspace so we don't need the __ prefixed versions.
This patch does not affect code generation at all.
Signed-off-by: Harvey Harrison <harvey.harrison@gmail.com>
Signed-off-by: Roland Dreier <rolandd@cisco.com>
The current work request posting code writes the LSO segment before
writing any data segments. This leaves a window where the LSO segment
overwrites the stamping in one cacheline that the HCA prefetches
before the rest of the cacheline is filled with the correct data
segments. When the HCA processes this work request, a local
protection error may result.
Fix this by saving the LSO header size field off and writing it only
after all data segments are written. This fix is a cleaned-up version
of a patch from Jack Morgenstein <jackm@dev.mellanox.co.il>.
This fixes <https://bugs.openfabrics.org/show_bug.cgi?id=1383>.
Reported-by: Jack Morgenstein <jackm@dev.mellanox.co.il>
Signed-off-by: Roland Dreier <rolandd@cisco.com>
* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/roland/infiniband:
IB/iser: Add dependency on INFINIBAND_ADDR_TRANS
IPoIB: Do not join broadcast group if interface is brought down
RDMA/nes: Fix for NIPQUAD removal
IPoIB: Fix loss of connectivity after bonding failover on both sides
IB/mlx4: Don't register IB device for adapters with no IB ports
mlx4_core: Fix warning from min()
IB/ehca: spin_lock_irqsave() takes an unsigned long
If the mlx4_ib driver finds an adapter that has only ethernet ports, the
current code will register an IB device with 0 ports. Nothing useful or
sensible can be done with such a device, so just skip registering it.
Signed-off-by: Roland Dreier <rolandd@cisco.com>
Commit f780a9f1 ("mlx4_core: Add ethernet fields to CQE struct")
introduced a bug in how wc->sl is set in mlx4_ib_poll_one() -- since
cqe->sl_vid is a big-endian value, the shift must be done after
converting to host endianness.
This bug was found using sparse endianness checking.
Signed-off-by: Roland Dreier <rolandd@cisco.com>
When resizing a CQ, when copying over unpolled CQEs from the old CQE
buffer to the new buffer, the ownership bit must be set appropriately
for the new buffer, or the ownership bit in the new buffer gets
corrupted.
Signed-off-by: Jack Morgenstein <jackm@dev.mellanox.co.il>
Signed-off-by: Roland Dreier <rolandd@cisco.com>
When using MSI-X mode, create a completion event queue for each CPU.
Report the number of completion EQs in a new struct mlx4_caps member,
num_comp_vectors, and extend the mlx4_cq_alloc() interface with a
vector parameter so that consumers can specify which completion EQ
should be used to report events for the CQ being created.
Signed-off-by: Yevgeny Petrilin <yevgenyp@mellanox.co.il>
Signed-off-by: Roland Dreier <rolandd@cisco.com>
When resizing a CQ, MTTs associated with the old CQE buffer were not
freed. As a result, if any app used resize CQ repeatedly, all MTTs
were eventually exhausted, which led to all memory registration
operations failing until the driver is reloaded.
Once the RESIZE_CQ command returns successfully from FW, FW no longer
accesses the old CQ buffer, so it is safe to deallocate the MTT
entries used by the old CQ buffer.
Finally, if the RESIZE_CQ command fails, the MTTs allocated for the
new CQEs buffer also need to be de-allocated.
This fixes <https://bugs.openfabrics.org/show_bug.cgi?id=1416>.
Signed-off-by: Jack Morgenstein <jackm@dev.mellanox.co.il>
Signed-off-by: Roland Dreier <rolandd@cisco.com>
Set mr->umem to NULL in mlx4_ib_alloc_fast_reg_mr(). Otherwise
ib_dereg_mr() may invoke ib_umem_release() on a random pointer value
and get an oops.
Signed-off-by: Vladimir Sokolovsky <vlad@mellanox.co.il>
Signed-off-by: Roland Dreier <rolandd@cisco.com>
Multi-protocol adapters support different port types. Each consumer
of mlx4_core queries for supported port types; in particular mlx4_ib
can no longer assume that all physical ports belong to it. Port type
is configured through a sysfs interface. When the type of a port is
changed, all mlx4 interfaces are unregistered, and then registered
again with the new port types.
Signed-off-by: Yevgeny Petrilin <yevgenyp@mellanox.co.il>
Signed-off-by: Roland Dreier <rolandd@cisco.com>
To allow allocating an aligned range of consecutive QP numbers, add an
interface to reserve an aligned range of QP numbers and have the QP
allocation function always take a QP number.
This will be used for RSS support in the mlx4_en Ethernet driver and
also potentially by IPoIB RSS support.
Signed-off-by: Yevgeny Petrilin <yevgenyp@mellanox.co.il>
Signed-off-by: Roland Dreier <rolandd@cisco.com>
Set RLKEY bit in the HW context for kernel QPs so that kernel QPs can
use the reserved L_Key for memory reference.
Signed-off-by: Vladimir Sokolovsky <vlad@mellanox.co.il>
Signed-off-by: Roland Dreier <rolandd@cisco.com>
* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/roland/infiniband:
IPoIB: Fix deadlock on RTNL between bcast join comp and ipoib_stop()
RDMA/nes: Fix client side QP destroy
IB/mlx4: Fix up fast register page list format
mlx4_core: Set RAE and init mtt_sz field in FRMR MPT entries
Byte swap the addresses in the page list for fast register work requests
to big endian to match what the HCA expectx. Also, the addresses must
have the "present" bit set so that the HCA knows it can access them.
Otherwise the HCA will fault the first time it accesses the memory
region.
Signed-off-by: Vladimir Sokolovsky <vlad@mellanox.co.il>
Signed-off-by: Roland Dreier <rolandd@cisco.com>
Initialize the L_Key and R_Key for memory regions returned from
mlx4_ib_alloc_fast_reg_mr(). Otherwise callers just get garbage for
the memory keys and can't do anything useful with these MRs.
Signed-off-by: Vladimir Sokolovsky <vlad@mellanox.co.il>
Signed-off-by: Roland Dreier <rolandd@cisco.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Current code limits the max message size to 2K for UD QPs, while MTU
might be as big as 4K. This patch sets the maximum message size to
4K, which is needed for UD to work correctly on fabrics with a 4K MTU.
Signed-off-by: Alex Naslednikov <xalex@mellanox.co.il>
Signed-off-by: Eli Cohen <eli@mellanox.co.il>
Signed-off-by: Roland Dreier <rolandd@cisco.com>