Previously, flows were manipulated by userspace specifying a full,
unmasked flow key. This adds significant burden onto flow
serialization/deserialization, particularly when dumping flows.
This patch adds an alternative way to refer to flows using a
variable-length "unique flow identifier" (UFID). At flow setup time,
userspace may specify a UFID for a flow, which is stored with the flow
and inserted into a separate table for lookup, in addition to the
standard flow table. Flows created using a UFID must be fetched or
deleted using the UFID.
All flow dump operations may now be made more terse with OVS_UFID_F_*
flags. For example, the OVS_UFID_F_OMIT_KEY flag allows responses to
omit the flow key from a datapath operation if the flow has a
corresponding UFID. This significantly reduces the time spent assembling
and transacting netlink messages. With all OVS_UFID_F_OMIT_* flags
enabled, the datapath only returns the UFID and statistics for each flow
during flow dump, increasing ovs-vswitchd revalidator performance by 40%
or more.
Signed-off-by: Joe Stringer <joestringer@nicira.com>
Acked-by: Pravin B Shelar <pshelar@nicira.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
The first user will be the next patch.
Signed-off-by: Joe Stringer <joestringer@nicira.com>
Acked-by: Pravin B Shelar <pshelar@nicira.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
"next" is not updated, causing an endless loop for buckets with more than
one element.
Fixes: 88d6ed15ac ("rhashtable: Convert bucket iterators to take table and index")
Signed-off-by: Patrick McHardy <kaber@trash.net>
Acked-by: Thomas Graf <tgraf@suug.ch>
Signed-off-by: David S. Miller <davem@davemloft.net>
The kernel forcefully applies MTU values received in router
advertisements provided the new MTU is less than the current. This
behavior is undesirable when the user space is managing the MTU. Instead
a sysctl flag 'accept_ra_mtu' is introduced such that the user space
can control whether or not RA provided MTU updates should be applied. The
default behavior is unchanged; user space must explicitly set this flag
to 0 for RA MTUs to be ignored.
Signed-off-by: Harout Hedeshian <harouth@codeaurora.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
In SRIOV, both the PF and the VF may attempt device recovery whenever they
assume that the device is not functioning. When the PF driver resets the
device, the VF should detect this and attempt to reinitialize itself.
The VF must be able to reset itself under all circumstances, even
if the PF is not responsive.
The VF shall reset itself in the following cases:
1. Commands are not processed within reasonable time over the communication channel.
This is done considering device state and the correct return code based on
the command as was done in the native mode, done in the next patch.
2. The VF driver receives an internal error event reported by the PF on the
communication channel. This occurs when the PF driver resets the device or
when VF is out of sync with the PF.
Add 'VF reset' capability, which allows the VF to reinitialize itself even when the
PF is not responsive.
As PF and VF may run their reset flow simulantanisly, there are several cases
that are handled:
- Prevent freeing VF resources upon FLR, when PF is in its unloading stage.
- Prevent PF getting VF commands before it has finished initializing its resources.
- Upon VF startup, check that comm-channel is online before sending
commands to the PF and getting timed-out.
Signed-off-by: Yishai Hadas <yishaih@mellanox.com>
Signed-off-by: Or Gerlitz <ogerlitz@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
We need to manage interface state to sync between reset flow and some other
relative cases such as remove_one. This has to be done to prevent certain
races. For example in case software stack is down as a result of unload call,
the remove_one should skip the unload phase.
Implement the remove_one case, handling AER and other cases comes next.
The interface can be up/down, upon remove_one, the state will include an extra
bit indicating that the device is cleaned-up, forcing other tasks to finish
before the final cleanup.
Signed-off-by: Yishai Hadas <yishaih@mellanox.com>
Signed-off-by: Or Gerlitz <ogerlitz@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
We activate reset flow upon command fatal errors, when the device enters an
erroneous state, and must be reset.
The cases below are assumed to be fatal: FW command timed-out, an error from FW
on closing commands, pci is offline when posting/pending a command.
In those cases we place the device into an error state: chip is reset, pending
commands are awakened and completed immediately. Subsequent commands will
return immediately.
The return code in the above cases will depend on the command. Commands which
free and close resources will return success (because the chip was reset, so
callers may safely free their kernel resources). Other commands will return -EIO.
Since the device's state was marked as error, the catas poller will
detect this and restart the device's software stack (as is done when a FW
internal error is directly detected). The device state is protected by a
persistent mutex lives on its mlx4_dev, as such no need any more for the
hcr_mutex which is removed.
Signed-off-by: Yishai Hadas <yishaih@mellanox.com>
Signed-off-by: Or Gerlitz <ogerlitz@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
This includes:
- resetting the chip when a fatal error is detected (the current code
does not do this).
- exposing the ability to enter error state from outside the catas code
by calling its functionality. (E.g. FW Command timeout, AER error).
- managing a persistent device state. This is needed to sync between
reset flow cases.
Signed-off-by: Yishai Hadas <yishaih@mellanox.com>
Signed-off-by: Or Gerlitz <ogerlitz@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Using a WQ per device instead of a single global WQ, this allows
independent reset handling per device even when SRIOV is used.
This comes as a pre-patch for supporting chip reset
for both native and SRIOV.
Signed-off-by: Yishai Hadas <yishaih@mellanox.com>
Signed-off-by: Or Gerlitz <ogerlitz@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
When an HCA enters an internal error state, this is detected by the driver.
The driver then should reset the HCA and restart the software stack.
Keep ports information and some SRIOV configuration in a persistent area
to have it valid across reset.
Signed-off-by: Yishai Hadas <yishaih@mellanox.com>
Signed-off-by: Or Gerlitz <ogerlitz@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Maintain a persistent memory that should survive reset flow/PCI error.
This comes as a preparation for coming series to support above flows.
Signed-off-by: Yishai Hadas <yishaih@mellanox.com>
Signed-off-by: Or Gerlitz <ogerlitz@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
In the vxlan transmit path there is no need to reference the socket
for a tunnel which is needed for the receive side. We do, however,
need the vxlan_dev flags. This patch eliminate references
to the socket in the transmit path, and changes VXLAN_F_UNSHAREABLE
to be VXLAN_F_RCV_FLAGS. This mask is used to store the flags
applicable to receive (GBP, CSUM6_RX, and REMCSUM_RX) in the
vxlan_sock flags.
Signed-off-by: Tom Herbert <therbert@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
The UDP tunnel transmit functions udp_tunnel_xmit_skb and
udp_tunnel6_xmit_skb include a socket argument. The socket being
passed to the functions (from VXLAN) is a UDP created for receive
side. The only thing that the socket is used for in the transmit
functions is to get the setting for checksum (enabled or zero).
This patch removes the argument and and adds a nocheck argument
for checksum setting. This eliminates the unnecessary dependency
on a UDP socket for UDP tunnel transmit.
Signed-off-by: Tom Herbert <therbert@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Netlink FDB messages are sent in the link netns. The header of these messages
contains the ifindex (ndm_ifindex) of the netdevice, but this ifindex is
unusable in case of x-netns vxlan.
I named the new attribute NDA_NDM_IFINDEX_NETNSID, to avoid confusion with
NDA_IFINDEX.
Signed-off-by: Nicolas Dichtel <nicolas.dichtel@6wind.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
* fix network-manager which was broken by the previous changes
* fix delete-station events, which were broken by me making the
genlmsg_end() mistake
* fix a timer left running during suspend in some race conditions
that would cause an annoying (but harmless) warning
* (less important, but in the tree already) remove 80+80 MHz rate
reporting since the spec doesn't distinguish it from 160 MHz;
as the bitrate they're both 160 MHz bandwidth
-----BEGIN PGP SIGNATURE-----
iQIcBAABCAAGBQJUvUZlAAoJEDBSmw7B7bqrfNMQAJT5jjOSjmwW8Zdvujvda/qt
bFpYa9t0NsN3izzMpjPSrCwPrHN5qE86ZA8TcZrIzejPH4rpltjaXB6JNHZardVo
deCUWU9xotoPELoE0Xex9mHPEkYYvOaht/P8A/88qP1S2PykMmj9fqNeijyUvwuo
Jlsh0wKe4Jq6bCmdxvy/bde84ceAQcuh2TKNov1S0tB38tRY9qSu2n6ZGpoMNcEe
CWuW+23jL1uAvt6Ljk2fTKdLR8iyXykfM0UMX2/4R2PMnJXK/dQqV/eeXTjpxtMk
UV4aIMcSS19D7HxICKOXOdZLdMMuXaFUnUMlGLBtXZz3n9lZoP7RtVIHoib8VBXZ
tY7xQrK6YNvwZ4SZZPuv/yr6YWP2+vBM2FUfXjzD+or6uYsej203a5q0RsOY+3Tp
6Yklr+mQNlrOtpMsHMSgrBUUZsAK1I95ALMVVqaq1hgf1cDvRIDHlOo4A7bjwNFw
eA3L1o4O1/E/IGp4v6+2Iukc9rIwm11sNr/wuD8dDkZTradUPH1iY6J5sxJNb2Nq
YdpCnQ/lNXj650+z9/G2omSA6DTTTOtXJPxKR+FOHZVKDpZYtF6TxKb0S79fINps
6ZlWIna5bUiXF1b6xad1x+vtyjNMgTvkg6mR+WQnvF57Ri8hucbtpv5wpA5bhYUQ
Fbz9VZF2nfMeIbXfTaWi
=Bvmr
-----END PGP SIGNATURE-----
Merge tag 'mac80211-next-for-davem-2015-01-19' of git://git.kernel.org/pub/scm/linux/kernel/git/jberg/mac80211-next
Some further updates for net-next:
* fix network-manager which was broken by the previous changes
* fix delete-station events, which were broken by me making the
genlmsg_end() mistake
* fix a timer left running during suspend in some race conditions
that would cause an annoying (but harmless) warning
* (less important, but in the tree already) remove 80+80 MHz rate
reporting since the spec doesn't distinguish it from 160 MHz;
as the bitrate they're both 160 MHz bandwidth
Signed-off-by: David S. Miller <davem@davemloft.net>
This tc action allows you to retrieve the connection tracking mark
This action has been used heavily by openwrt for a few years now.
There are known limitations currently:
doesn't work for initial packets, since we only query the ct table.
Fine given use case is for returning packets
no implicit defrag.
frags should be rare so fix later..
won't work for more complex tasks, e.g. lookup of other extensions
since we have no means to store results
we still have a 2nd lookup later on via normal conntrack path.
This shouldn't break anything though since skb->nfct isn't altered.
V2:
remove unnecessary braces (Jiri)
change the action identifier to 14 (Jiri)
Fix some stylistic issues caught by checkpatch
V3:
Move module params to bottom (Cong)
Get rid of tcf_hashinfo_init and friends and conform to newer API (Cong)
Acked-by: Jiri Pirko <jiri@resnulli.us>
Signed-off-by: Felix Fietkau <nbd@openwrt.org>
Signed-off-by: Jamal Hadi Salim <jhs@mojatatu.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Implement rtnl_link_ops->get_link_net() callback so that IFLA_LINK_NETNSID is
added to rtnetlink messages.
Signed-off-by: Nicolas Dichtel <nicolas.dichtel@6wind.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
This patch adds a new attribute (IFLA_LINK_NETNSID) which contains the 'link'
netns id when this netns is different from the netns where the interface
stands (for example for x-net interfaces like ip tunnels).
With this attribute, it's possible to interpret correctly all advertised
information (like IFLA_LINK, etc.).
Signed-off-by: Nicolas Dichtel <nicolas.dichtel@6wind.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
With this patch, a user can define an id for a peer netns by providing a FD or a
PID. These ids are local to the netns where it is added (ie valid only into this
netns).
The main function (ie the one exported to other module), peernet2id(), allows to
get the id of a peer netns. If no id has been assigned by the user, this
function allocates one.
These ids will be used in netlink messages to point to a peer netns, for example
in case of a x-netns interface.
Signed-off-by: Nicolas Dichtel <nicolas.dichtel@6wind.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
In the ipip tunnel, the skb->queue_mapping is lost in ipip_rcv().
All skb will be queued to the same cell->napi_skbs. The
gro_cell_poll is pinned to one core under load. In production traffic,
we also see severe rx_dropped in the tunl iface and it is probably due to
this limit: skb_queue_len(&cell->napi_skbs) > netdev_max_backlog.
This patch is trying to alloc_percpu(struct gro_cell) and schedule
gro_cell_poll to process the skb in the same core.
Signed-off-by: Martin KaFai Lau <kafai@fb.com>
Acked-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Contrary to common expectations for an "int" return, these functions
return only a positive value -- if used correctly they cannot even
return 0 because the message header will necessarily be in the skb.
This makes the very common pattern of
if (genlmsg_end(...) < 0) { ... }
be a whole bunch of dead code. Many places also simply do
return nlmsg_end(...);
and the caller is expected to deal with it.
This also commonly (at least for me) causes errors, because it is very
common to write
if (my_function(...))
/* error condition */
and if my_function() does "return nlmsg_end()" this is of course wrong.
Additionally, there's not a single place in the kernel that actually
needs the message length returned, and if anyone needs it later then
it'll be very easy to just use skb->len there.
Remove this, and make the functions void. This removes a bunch of dead
code as described above. The patch adds lines because I did
- return nlmsg_end(...);
+ nlmsg_end(...);
+ return 0;
I could have preserved all the function's return values by returning
skb->len, but instead I've audited all the places calling the affected
functions and found that none cared. A few places actually compared
the return value with <= 0 in dump functionality, but that could just
be changed to < 0 with no change in behaviour, so I opted for the more
efficient version.
One instance of the error I've made numerous times now is also present
in net/phonet/pn_netlink.c in the route_dumpit() function - it didn't
check for <0 or <=0 and thus broke out of the loop every single time.
I've preserved this since it will (I think) have caused the messages to
userspace to be formatted differently with just a single message for
every SKB returned to userspace. It's possible that this isn't needed
for the tools that actually use this, but I don't even know what they
are so couldn't test that changing this behaviour would be acceptable.
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Johan Hedberg says:
====================
pull request: bluetooth-next 2015-01-16
Here are some more bluetooth & ieee802154 patches intended for 3.20:
- Refactoring & cleanups of ieee802154 & 6lowpan code
- Various fixes to the btmrvl driver
- Fixes for Bluetooth Low Energy Privacy feature handling
- Added build-time sanity checks for sockaddr sizes
- Fixes for Security Manager registration on LE-only controllers
- Refactoring of broken inquiry mode handling to a generic quirk
Please let me know if there are any issues pulling. Thanks.
====================
Signed-off-by: David S. Miller <davem@davemloft.net>
This patch benefits from newly introduced switchdev notifier and uses it
to propagate fdb learn events from rocker driver to bridge. That avoids
direct function calls and possible use by other listeners (ovs).
Suggested-by: Thomas Graf <tgraf@suug.ch>
Signed-off-by: Jiri Pirko <jiri@resnulli.us>
Signed-off-by: Scott Feldman <sfeldma@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
This patch introduces new notifier for purposes of exposing events which happen
on switch driver side. The consumers of the event messages are mainly involved
masters, namely bridge and ovs.
Suggested-by: Thomas Graf <tgraf@suug.ch>
Signed-off-by: Jiri Pirko <jiri@resnulli.us>
Signed-off-by: Scott Feldman <sfeldma@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
This action provides a possibility to exec custom BPF code.
Signed-off-by: Jiri Pirko <jiri@resnulli.us>
Signed-off-by: David S. Miller <davem@davemloft.net>
Remove the function hci_conn_change_link_key() that is not used anywhere.
This was partially found by using a static code analysis program called
cppcheck.
Signed-off-by: Rickard Strandqvist <rickard_strandqvist@spectrumdigital.se>
Signed-off-by: Johan Hedberg <johan.hedberg@intel.com>
When we put our declared work task in the global workqueue with
schedule_delayed_work(), its delay parameter is always zero.
Therefore, we should define a regular work in rhashtable structure
instead of a delayed work.
By the way, we add a condition to check whether resizing functions
are NULL before cancelling the work, avoiding to cancel an
uninitialized work.
Lastly, while we wait for all work items we submitted before to run
to completion with cancel_delayed_work(), ht->mutex has been taken in
rhashtable_destroy(). Moreover, cancel_delayed_work() doesn't return
until all work items are accomplished, and when work items are
scheduled, the work's function - rht_deferred_worker() will be called.
However, as rht_deferred_worker() also needs to acquire the lock,
deadlock might happen at the moment as the lock is already held before.
So if the cancel work function is moved out of the lock covered scope,
this will avoid the deadlock.
Fixes: 97defe1 ("rhashtable: Per bucket locks & deferred expansion/shrinking")
Signed-off-by: Ying Xue <ying.xue@windriver.com>
Cc: Thomas Graf <tgraf@suug.ch>
Acked-by: Thomas Graf <tgraf@suug.ch>
Signed-off-by: David S. Miller <davem@davemloft.net>
We have
* a lot of regulatory code changes to deal with the
way newer Intel devices handle this
* a change to drop packets while disconnecting from
an AP instead of trying to wait for them
* a new attempt at improving the tailroom accounting
to not kick in too much for performance reasons
* improvements in wireless link statistics
* many other small improvements and small fixes that
didn't seem necessary for 3.19 (e.g. in hwsim which
is testing only code)
-----BEGIN PGP SIGNATURE-----
iQIcBAABCAAGBQJUt7WEAAoJEDBSmw7B7bqrVBoP/2EViE62HMmXdqG1SZWz8q9o
Iigq8STC/sT2WCx1pYm+tKuVW4LD2O3mCriGNP8A3RwzDZ6H7sKJYb1gV6QCPV6f
4+yT5VSAB3D3lHmp/bbyNsmKCBQ5uS4LVgDrokrkbGpacDu94PYS5Wv9t3x6PBVB
5Xjky6g6A/pSuxTIstSO9k5xkzNjaB1TxvVRz/gJrGcFQVkDFSlVbuTHUVxs8p+p
k6mwY/2WYijZkswWZVQTJLQlF9vRI7PYkKs5m8gz4pjNU48oFJoyu4IP3Z1Xj/Sm
zgT1C9rgp0Du74HYO2niGAvLWgKajAZuW5hIacDndUPjYQQBLgGs/bCJGSntM+x9
XoOdPixdFPT/58ijyYZlmHc8rxPOd2kHsVbwGplp8f195S4VO04D+ejfOaoAUFwX
v/kMvO3XIFmEH1jjkDAC3OTcRMYVMuENyWl7pFzxHIzPeRiEpQUd9iSdM4yol0F2
ZyWvKud4U75Sh+aCiDIIBETtdfCRFe12hgKs4COYbI/UYkGPTPrNei/uisopdubT
JC+7pZOYdSgoX12yVi6ds6DmKE/ZpIQyhIK4wTWgVoszbnfdb9Mw7mJEThwNRjeK
JJPsbuty7u8HWjXzEqHLoTV3BFv1cgRSJc5Wt0zfME+LzD7iQpEpv+QBAguwwChD
Osn55Z3FnKEmBdGcOIje
=vaEW
-----END PGP SIGNATURE-----
Merge tag 'mac80211-next-for-davem-2015-01-15' of git://git.kernel.org/pub/scm/linux/kernel/git/jberg/mac80211-next
Here's a big pile of changes for this round.
We have
* a lot of regulatory code changes to deal with the
way newer Intel devices handle this
* a change to drop packets while disconnecting from
an AP instead of trying to wait for them
* a new attempt at improving the tailroom accounting
to not kick in too much for performance reasons
* improvements in wireless link statistics
* many other small improvements and small fixes that
didn't seem necessary for 3.19 (e.g. in hwsim which
is testing only code)
Conflicts:
drivers/staging/rtl8723au/os_dep/ioctl_cfg80211.c
Minor overlapping changes.
Signed-off-by: David S. Miller <davem@davemloft.net>
RAW sockets with hdrinc suffer from contention on rt_uncached_lock
spinlock.
One solution is to use percpu lists, since most routes are destroyed
by the cpu that created them.
It is unclear why we even have to put these routes in uncached_list,
as all outgoing packets should be freed when a device is dismantled.
Signed-off-by: Eric Dumazet <edumazet@google.com>
Fixes: caacf05e5a ("ipv4: Properly purge netdev references on uncached routes.")
Signed-off-by: David S. Miller <davem@davemloft.net>
For some reason, we made the bandwidth separate flags, which
is rather confusing - a single rate cannot have different
bandwidths at the same time.
Change this to no longer be flags but use a separate field
for the bandwidth ('bw') instead.
While at it, add support for 5 and 10 MHz rates - these are
reported as regular legacy rates with their real bitrate,
but tagged as 5/10 now to make it easier to distinguish them.
In the nl80211 API, the flags are preserved, but the code
now can also clearly only set a single one of the flags.
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
These rates are treated the same as 160 MHz in the spec, so
it makes no sense to distinguish them. As no driver uses them
yet, this is also not a problem, just remove them.
In the userspace API the field remains reserved to preserve
API and ABI.
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
These rates are treated the same as 160 MHz in the spec,
so it makes no sense to distinguish them. As no driver
uses them yet, this is also not a problem, just remove
them.
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
Pablo Neira Ayuso says:
====================
netfilter updates for net-next
The following patchset contains netfilter updates for net-next, just a
bunch of cleanups and small enhancement to selectively flush conntracks
in ctnetlink, more specifically the patches are:
1) Rise default number of buckets in conntrack from 16384 to 65536 in
systems with >= 4GBytes, patch from Marcelo Leitner.
2) Small refactor to save one level on indentation in xt_osf, from
Joe Perches.
3) Remove unnecessary sizeof(char) in nf_log, from Fabian Frederick.
4) Another small cleanup to remove redundant variable in nfnetlink,
from Duan Jiong.
5) Fix compilation warning in nfnetlink_cthelper on parisc, from
Chen Gang.
6) Fix wrong format in debugging for ctseqadj, from Gao feng.
7) Selective conntrack flushing through the mark for ctnetlink, patch
from Kristian Evensen.
8) Remove nf_ct_conntrack_flush_report() exported symbol now that is
not required anymore after the selective flushing patch, again from
Kristian.
====================
Signed-off-by: David S. Miller <davem@davemloft.net>
Introduces support for the group policy extension to the VXLAN virtual
port. The extension is disabled by default and only enabled if the user
has provided the respective configuration.
ovs-vsctl add-port br0 vxlan0 -- \
set Interface vxlan0 type=vxlan options:exts=gbp
The configuration interface to enable the extension is based on a new
attribute OVS_VXLAN_EXT_GBP nested inside OVS_TUNNEL_ATTR_EXTENSION
which can carry additional extensions as needed in the future.
The group policy metadata is stored as binary blob (struct ovs_vxlan_opts)
internally just like Geneve options but transported as nested Netlink
attributes to user space.
Renames the existing TUNNEL_OPTIONS_PRESENT to TUNNEL_GENEVE_OPT with the
binary value kept intact, a new flag TUNNEL_VXLAN_OPT is introduced.
The attributes OVS_TUNNEL_KEY_ATTR_VXLAN_OPTS and existing
OVS_TUNNEL_KEY_ATTR_GENEVE_OPTS are implemented mutually exclusive.
Signed-off-by: Thomas Graf <tgraf@suug.ch>
Signed-off-by: David S. Miller <davem@davemloft.net>
A VXLAN net_device looking for an appropriate socket may only consider
a socket which has a matching set of flags/extensions enabled. If
incompatible flags are enabled, return a conflict to have the caller
create a distinct socket with distinct port.
The OVS VXLAN port is kept unaware of extensions at this point.
Signed-off-by: Thomas Graf <tgraf@suug.ch>
Signed-off-by: David S. Miller <davem@davemloft.net>
Implements supports for the Group Policy VXLAN extension [0] to provide
a lightweight and simple security label mechanism across network peers
based on VXLAN. The security context and associated metadata is mapped
to/from skb->mark. This allows further mapping to a SELinux context
using SECMARK, to implement ACLs directly with nftables, iptables, OVS,
tc, etc.
The group membership is defined by the lower 16 bits of skb->mark, the
upper 16 bits are used for flags.
SELinux allows to manage label to secure local resources. However,
distributed applications require ACLs to implemented across hosts. This
is typically achieved by matching on L2-L4 fields to identify the
original sending host and process on the receiver. On top of that,
netlabel and specifically CIPSO [1] allow to map security contexts to
universal labels. However, netlabel and CIPSO are relatively complex.
This patch provides a lightweight alternative for overlay network
environments with a trusted underlay. No additional control protocol
is required.
Host 1: Host 2:
Group A Group B Group B Group A
+-----+ +-------------+ +-------+ +-----+
| lxc | | SELinux CTX | | httpd | | VM |
+--+--+ +--+----------+ +---+---+ +--+--+
\---+---/ \----+---/
| |
+---+---+ +---+---+
| vxlan | | vxlan |
+---+---+ +---+---+
+------------------------------+
Backwards compatibility:
A VXLAN-GBP socket can receive standard VXLAN frames and will assign
the default group 0x0000 to such frames. A Linux VXLAN socket will
drop VXLAN-GBP frames. The extension is therefore disabled by default
and needs to be specifically enabled:
ip link add [...] type vxlan [...] gbp
In a mixed environment with VXLAN and VXLAN-GBP sockets, the GBP socket
must run on a separate port number.
Examples:
iptables:
host1# iptables -I OUTPUT -m owner --uid-owner 101 -j MARK --set-mark 0x200
host2# iptables -I INPUT -m mark --mark 0x200 -j DROP
OVS:
# ovs-ofctl add-flow br0 'in_port=1,actions=load:0x200->NXM_NX_TUN_GBP_ID[],NORMAL'
# ovs-ofctl add-flow br0 'in_port=2,tun_gbp_id=0x200,actions=drop'
[0] https://tools.ietf.org/html/draft-smith-vxlan-group-policy
[1] http://lwn.net/Articles/204905/
Signed-off-by: Thomas Graf <tgraf@suug.ch>
Signed-off-by: David S. Miller <davem@davemloft.net>
Conflicts:
drivers/net/xen-netfront.c
Minor overlapping changes in xen-netfront.c, mostly to do
with some buffer management changes alongside the split
of stats into TX and RX.
Signed-off-by: David S. Miller <davem@davemloft.net>
Pull networking fixes from David Miller:
1) Don't use uninitialized data in IPVS, from Dan Carpenter.
2) conntrack race fixes from Pablo Neira Ayuso.
3) Fix TX hangs with i40e, from Jesse Brandeburg.
4) Fix budget return from poll calls in dnet and alx, from Eric
Dumazet.
5) Fix bugus "if (unlikely(x) < 0)" test in AF_PACKET, from Christoph
Jaeger.
6) Fix bug introduced by conversion to list_head in TIPC retransmit
code, from Jon Paul Maloy.
7) Don't use GFP_NOIO under spinlock in USB kaweth driver, from Alexey
Khoroshilov.
8) Fix bridge build with INET disabled, from Arnd Bergmann.
9) Fix netlink array overrun for PROBE attributes in openvswitch, from
Thomas Graf.
10) Don't hold spinlock across synchronize_irq() in tg3 driver, from
Prashant Sreedharan.
* git://git.kernel.org/pub/scm/linux/kernel/git/davem/net: (44 commits)
tg3: Release tp->lock before invoking synchronize_irq()
tg3: tg3_reset_task() needs to use rtnl_lock to synchronize
tg3: tg3_timer() should grab tp->lock before checking for tp->irq_sync
team: avoid possible underflow of count_pending value for notify_peers and mcast_rejoin
openvswitch: packet messages need their own probe attribtue
i40e: adds FCoE configure option
cxgb4vf: Fix queue allocation for 40G adapter
netdevice: Add missing parentheses in macro
bridge: only provide proxy ARP when CONFIG_INET is enabled
neighbour: fix base_reachable_time(_ms) not effective immediatly when changed
net: fec: fix MDIO bus assignement for dual fec SoC's
xen-netfront: use different locks for Rx and Tx stats
drivers: net: cpsw: fix multicast flush in dual emac mode
cxgb4vf: Initialize mdio_addr before using it
net: Corrected the comment describing the ndo operations to reflect the actual prototype for couple of operations
usb/kaweth: use GFP_ATOMIC under spin_lock in usb_start_wait_urb()
MAINTAINERS: add me as ibmveth maintainer
tipc: fix bug in broadcast retransmit code
update ip-sysctl.txt documentation (v2)
net/at91_ether: prepare and unprepare clock
...
User space is currently sending a OVS_FLOW_ATTR_PROBE for both flow
and packet messages. This leads to an out-of-bounds access in
ovs_packet_cmd_execute() because OVS_FLOW_ATTR_PROBE >
OVS_PACKET_ATTR_MAX.
Introduce a new OVS_PACKET_ATTR_PROBE with the same numeric value
as OVS_FLOW_ATTR_PROBE to grow the range of accepted packet attributes
while maintaining to be binary compatible with existing OVS binaries.
Fixes: 05da589 ("openvswitch: Add support for OVS_FLOW_ATTR_PROBE.")
Reported-by: Sander Eikelenboom <linux@eikelenboom.it>
Tracked-down-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Thomas Graf <tgraf@suug.ch>
Reviewed-by: Jesse Gross <jesse@nicira.com>
Acked-by: Pravin B Shelar <pshelar@nicira.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
For example, one could conceivably call
for_each_netdev_in_bond_rcu(condition ? bond1 : bond2, slave)
and get an unexpected result.
Signed-off-by: Benjamin Poirier <bpoirier@suse.de>
Signed-off-by: David S. Miller <davem@davemloft.net>
Pull block layer fixes from Jens Axboe:
"The major part is an update to the NVMe driver, fixing various issues
around surprise removal and hung controllers. Most of that is from
Keith, and parts are simple blk-mq fixes or exports/additions of minor
functions to aid this effort, and parts are changes directly to the
NVMe driver.
Apart from the above, this contains:
- Small blk-mq change from me, killing an unused member of the
hardware queue structure.
- Small fix from Ming Lei, fixing up a few drivers that didn't
properly check for ERR_PTR() returns from blk_mq_init_queue()"
* 'for-linus' of git://git.kernel.dk/linux-block:
NVMe: Fix locking on abort handling
NVMe: Start and stop h/w queues on reset
NVMe: Command abort handling fixes
NVMe: Admin queue removal handling
NVMe: Reference count admin queue usage
NVMe: Start all requests
blk-mq: End unstarted requests on a dying queue
blk-mq: Allow requests to never expire
blk-mq: Add helper to abort requeued requests
blk-mq: Let drivers cancel requeue_work
blk-mq: Export if requests were started
blk-mq: Wake tasks entering queue on dying
blk-mq: get rid of ->cmd_size in the hardware queue
block: fix checking return value of blk_mq_init_queue
block: wake up waiters when a queue is marked dying
NVMe: Fix double free irq
blk-mq: Export freeze/unfreeze functions
blk-mq: Exit queue on alloc failure
Add support for remote checksum offload in VXLAN. This uses a
reserved bit to indicate that RCO is being done, and uses the low order
reserved eight bits of the VNI to hold the start and offset values in a
compressed manner.
Start is encoded in the low order seven bits of VNI. This is start >> 1
so that the checksum start offset is 0-254 using even values only.
Checksum offset (transport checksum field) is indicated in the high
order bit in the low order byte of the VNI. If the bit is set, the
checksum field is for UDP (so offset = start + 6), else checksum
field is for TCP (so offset = start + 16). Only TCP and UDP are
supported in this implementation.
Remote checksum offload for VXLAN is described in:
https://tools.ietf.org/html/draft-herbert-vxlan-rco-00
Tested by running 200 TCP_STREAM connections with VXLAN (over IPv4).
With UDP checksums and Remote Checksum Offload
IPv4
Client
11.84% CPU utilization
Server
12.96% CPU utilization
9197 Mbps
IPv6
Client
12.46% CPU utilization
Server
14.48% CPU utilization
8963 Mbps
With UDP checksums, no remote checksum offload
IPv4
Client
15.67% CPU utilization
Server
14.83% CPU utilization
9094 Mbps
IPv6
Client
16.21% CPU utilization
Server
14.32% CPU utilization
9058 Mbps
No UDP checksums
IPv4
Client
15.03% CPU utilization
Server
23.09% CPU utilization
9089 Mbps
IPv6
Client
16.18% CPU utilization
Server
26.57% CPU utilization
8954 Mbps
Signed-off-by: Tom Herbert <therbert@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
This patch introduces udp_offload_callbacks which has the same
GRO functions (but not a GSO function) as offload_callbacks,
except there is an argument to a udp_offload struct passed to
gro_receive and gro_complete functions. This additional argument
can be used to retrieve the per port structure of the encapsulation
for use in gro processing (mostly by doing container_of on the
structure).
Signed-off-by: Tom Herbert <therbert@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Send the netdetect configuration information in the response to
NL8021_CMD_GET_WOWLAN commands. This includes the scan interval,
SSIDs to match and frequencies to scan.
Additionally, add the NL80211_WOWLAN_TRIG_NET_DETECT with
NL80211_ATTR_WOWLAN_TRIGGERS_SUPPORTED.
Signed-off-by: Luciano Coelho <luciano.coelho@intel.com>
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
A self-managed device will sometimes need to set its regdomain synchronously.
Notably it should be set before usermode has a chance to query it. Expose
a new API to accomplish this which requires the RTNL.
Signed-off-by: Arik Nemtsov <arikx.nemtsov@intel.com>
Reviewed-by: Ilan Peer <ilan.peer@intel.com>
Reviewed-by: Emmanuel Grumbach <emmanuel.grumbach@intel.com>
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
pfn_to_mfn(page_to_pfn(p)) is a common use case so add a generic
helper for it.
Signed-off-by: David Vrabel <david.vrabel@citrix.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
it was agreement that WRITE_ONCE(x, val) is better than ASSIGN_ONCE(val, x)
Lets change that for 3.19 as 3.19 has no user yet, but the first users
will hit linux-next soon.
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v2.0.14 (GNU/Linux)
iQIcBAABAgAGBQJUtXl3AAoJEBF7vIC1phx8QLsQAIthwLT9FC94YWYrKYkC1CT3
CY8gvjVXIki+CVs/a1qdcvuKXj4qLHBDO2ignhPvbqesmHyqIa010gsIT7sm6/d4
GyDY5AIzoeBFW8jYwVZGr+OMxjgZCTTopqB/vO//CI4aVHB2pE9Jw4cNA6Rl8uWo
SRXJXBWIw6nKElpTvqjV2PQXiOFqssfW5asjedzHfwpeJucjk3q59K54/RG19EUs
uqFVoElQMMiJ8QPlb91gYWH4AAdoc/G+YQkxlwt95638U6TKX/5FHc02glvgMNtI
EYt4/Aavn/NQLGUMeDF2IxptUCbr4F7GcSkYgYFeaYWPssgOAkMdtxyDT9SwwEt5
Oyq27CJ9TGPzdjRACteR+NUm6I5wOsy/mLX54BoumUGrApYtZzIvJDDEmIuWDzl8
OdrSR1+D9oi31gcL4qMJEIAe4s+ALJ/witFcdAoEXPugZsV4mkjBB7sEew1qrha9
WovyVsqlTnKtpNaoqPWH6jdBdsW49YsEVLVxsS5BT+48ZAYQEDXcAHJ1t65mQfsk
0c9hIwiOgt5Y1nNe0CMaQ6zLpzBRhf+wUAsg/fDZgB/P0O+Yg9XnGlNqL6WPV9xP
HK3PVgabzYcAr+ahEwvN3Ng5SE1FJ5xUv6kXhnVVjkOrfCMbgrckNHHVr4mvzK0n
E4VlqRGrSzfSqE7RuS2s
=GROK
-----END PGP SIGNATURE-----
Merge tag 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/borntraeger/linux
Pull WRITE_ONCE argument order change from Christian Borntraeger:
"As discussed on LKML[1] it was agreed that WRITE_ONCE(x, val) is
better than ASSIGN_ONCE(val, x)
Lets change that for 3.19 as 3.19 has no user yet, but the first users
will hit linux-next soon"
[1] http://marc.info/?l=linux-kernel&m=142081181707596
* tag 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/borntraeger/linux:
kernel: Change ASSIGN_ONCE(val, x) to WRITE_ONCE(x, val)
The same macros are used for rx as well. So rename it.
Signed-off-by: Jiri Pirko <jiri@resnulli.us>
Signed-off-by: David S. Miller <davem@davemloft.net>
tc code implicitly considers skb->protocol even in case of accelerated
vlan paths and expects vlan protocol type here. However, on rx path,
if the vlan header was already stripped, skb->protocol contains value
of next header. Similar situation is on tx path.
So for skbs that use skb->vlan_tci for tagging, use skb->vlan_proto instead.
Reported-by: Jamal Hadi Salim <jhs@mojatatu.com>
Signed-off-by: Jiri Pirko <jiri@resnulli.us>
Acked-by: Jamal Hadi Salim <jhs@mojatatu.com>
Signed-off-by: David S. Miller <davem@davemloft.net>