Without this deallocate won't work properly due to the mismatch
of the bio/request size and the actual payload size.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: James Smart <james.smart@broadcom.com>
Reviewed-by: Johannes Thumshirn <jthumshirn@suse.de>
Signed-off-by: Sagi Grimberg <sagi@grimberg.me>
This patch performs dma sync operations on nvme_command
and nvme_completion.
nvme_command is synced
(a) on receiving of the recv queue completion for cpu access.
(b) before posting recv wqe back to rdma adapter for device access.
nvme_completion is synced
(a) on receiving of the recv queue completion of associated
nvme_command for cpu access.
(b) before posting send wqe to rdma adapter for device access.
This patch is generated for git://git.infradead.org/nvme-fabrics.git
Branch: nvmf-4.10
Signed-off-by: Parav Pandit <parav@mellanox.com>
Reviewed-by: Max Gurtovoy <maxg@mellanox.com>
We only need to call delete_ctrl once, so given that both
keep-alive timeout and any other fatal error can trigger it,
just make sure we only call delete_ctrl once.
Signed-off-by: Sagi Grimberg <sagi@grimberg.me>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Make sure they are not running and we can free the controller
safely.
Signed-off-by: Roy Shterman <roys@lightbitslabs.com>
Signed-off-by: Sagi Grimberg <sagi@grimberg.me>
Reviewed-by: Christoph Hellwig <hch@lst.de>
No reason for them to be kept around if we are
deleting the subsystem, so instead of passively
wait for the host to disconnect, actively delete
the controllers.
Signed-off-by: Sagi Grimberg <sagi@grimberg.me>
Reviewed-by: Johannes Thumshirn <jthumshirn@suse.de>
Reviewed-by: Max Gurtovoy <maxg@mellanox.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Correct logic in disconnect queue LS handling.
Rework so that queue searching and error reporting is above the
section to send back a ls rjt
Signed-off-by: James Smart <james.smart@broadcom.com>
Signed-off-by: Sagi Grimberg <sagi@grimberg.me>
The tc could return NET_XMIT_CN as one congestion notification, but
it does not mean the packet is lost. Other modules like ipvlan,
macvlan, and others treat NET_XMIT_CN as success too.
So batman-adv should handle NET_XMIT_CN also as NET_XMIT_SUCCESS.
Signed-off-by: Gao Feng <gfree.wind@gmail.com>
[sven@narfation.org: Moved NET_XMIT_CN handling to batadv_send_skb_packet]
Signed-off-by: Sven Eckelmann <sven@narfation.org>
Signed-off-by: Simon Wunderlich <sw@simonwunderlich.de>
It could decrease one condition check to collect some statements in the
first condition block.
Signed-off-by: Gao Feng <gfree.wind@gmail.com>
Signed-off-by: Sven Eckelmann <sven@narfation.org>
Signed-off-by: Simon Wunderlich <sw@simonwunderlich.de>
09748a22f4 ("batman-adv: add generic netlink family for batman-adv")
introduced the new batman_adv.h which describes the netlink attributes and
commands of batman-adv. But the Kbuild entry to install the header was not
added.
All currently known tools ship their own copy of batman_adv.h but it should
be installed anyway to later be able to migrate to the system batman_adv.h.
Signed-off-by: Sven Eckelmann <sven@narfation.org>
Signed-off-by: Simon Wunderlich <sw@simonwunderlich.de>
The bridge loop avoidance (BLA) feature of batman-adv sends packets to
probe for Mesh/LAN packet loops. Those packets are not sent by real
clients and should therefore not be added to the translation table (TT).
Signed-off-by: Simon Wunderlich <simon.wunderlich@open-mesh.com>
If we try to allocate memory pages to back an xfs_buf that we're trying
to read, it's possible that we'll be so short on memory that the page
allocation fails. For a blocking read we'll just wait, but for
readahead we simply dump all the pages we've collected so far.
Unfortunately, after dumping the pages we neglect to clear the
_XBF_PAGES state, which means that the subsequent call to xfs_buf_free
thinks that b_pages still points to pages we own. It then double-frees
the b_pages pages.
This results in screaming about negative page refcounts from the memory
manager, which xfs oughtn't be triggering. To reproduce this case,
mount a filesystem where the size of the inodes far outweighs the
availalble memory (a ~500M inode filesystem on a VM with 300MB memory
did the trick here) and run bulkstat in parallel with other memory
eating processes to put a huge load on the system. The "check summary"
phase of xfs_scrub also works for this purpose.
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: Eric Sandeen <sandeen@redhat.com>
The only caller of this new function is inside of an #ifdef checking
for CONFIG_BRIDGE_IGMP_SNOOPING, so we should move the implementation
there too, in order to avoid this harmless warning:
net/bridge/br_forward.c:177:13: error: 'maybe_deliver_addr' defined but not used [-Werror=unused-function]
Fixes: 6db6f0eae6 ("bridge: multicast to unicast")
Signed-off-by: Arnd Bergmann <arnd@arndb.de>
Acked-by: Stephen Hemminger <stephen@networkplumber.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
We currently used len instead of prefix_len for the strncmp() in
fdinfo on the prog_tag. It still worked as we matched on the correct
output line also with first 8 instead of 10 chars, but lets fix it
properly to use the intended length.
Fixes: 62b6466026 ("bpf: add prog tag test case to bpf selftests")
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: Alexei Starovoitov <ast@kernel.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
Rafał Miłecki says:
====================
net-next: Broadcom PHY driver cleanup
I will probably need to use broadcom.ko for PHY connected to interface
of bgmac supported device so I started looking at it willing to
understand it better.
I found AUXCTL part of the driver / lib a bit confusing and hard to read
so I'm trying to clean it up a bit. I hope this patchset makes following
AUXCTL operations much easier making it clear which defines are for
registers and which for values.
There is no functional change in this pachset.
====================
Signed-off-by: David S. Miller <davem@davemloft.net>
1) Use 0x%02x format for register number. This follows some other
defines and makes it easier to distinct register from values.
2) Put register define above values and sort the values. It makes
reading header code easier.
3) Use 0x%04x format for all values. It's about consistency with other
values (and most of the header) not a personal preference.
4) Separate define for reading shift value with an extre empty line.
It's user for all AUXCTL registers in a bcm54xx_auxctl_read.
Signed-off-by: Rafał Miłecki <rafal@milecki.pl>
Reviewed-by: Florian Fainelli <f.fainelli@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
We had two defines for the same bit (both were used with the
MII_BCM54XX_AUXCTL_SHDWSEL_MISC register).
Signed-off-by: Rafał Miłecki <rafal@milecki.pl>
Reviewed-by: Florian Fainelli <f.fainelli@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Starting with commit 5b4e290051 ("net: phy: broadcom: add
bcm54xx_auxctl_read") we have a reading helper so use it and avoid code
duplication.
It also means we don't need MII_BCM54XX_AUXCTL_SHDWSEL_MISC define as
it's the same as MII_BCM54XX_AUXCTL_SHDWSEL_MISC just for reading needs
(same value shifted by 12 bits).
Signed-off-by: Rafał Miłecki <rafal@milecki.pl>
Reviewed-by: Florian Fainelli <f.fainelli@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
- fix reference count handling on fragmentation error, by Sven Eckelmann
-----BEGIN PGP SIGNATURE-----
iQJKBAABCgA0FiEE1ilQI7G+y+fdhnrfoSvjmEKSnqEFAliI0D8WHHN3QHNpbW9u
d3VuZGVybGljaC5kZQAKCRChK+OYQpKeoU2WD/9EdJuxSK0++Ka/sdXNEmy931dl
ojXukizbSIg5+vEs0GCP+E5btvCQkiz1e43AAljHjE4FTv4nu9BeHs7g4msQzZt1
7Pgy7A5wOz3UP/GN5QxEStGXYNmMeeKuaLklwxx+I619pMBX83bwlcfi8Xf40BWP
twmeyC11TCXbyyR7sH8nbDUiuZdP4soO3yr2WzTndHYui5UKTPlQ/5/VJLFtnIYw
ZDGYqAy7cg7wHO4khd2HaWuetIW4QIk59rqJX9pmwhDADuKL4XE9O9K3KawFH873
lYv226vc6d7Nc6mBfD0zDPXKImOsvi3padtVgXr2AYI8bBlh7eDp9NUFLITly8Jo
GIr8ABp5u6ZbN8A+16C5yHj5BZArQTe5EjZCS9yLLZQ0a9dXp4ZlEQuHSN9BJo05
CWYbpoj5Hunooh2EwOYfwWQa7kbDlDh/q7+qIqJcqhd+KT9cCHvPH3bsmi8gZMrh
fxyMouSX5LQmwnAgs/r4KUQ/5eCcf81o+SVTi/yvEf1Y9pfIqFna2IU/oREpjwY5
csFrY8Czc/O2E20+dzrCJAKK+Sa1U+WnUGlguqIQFIue/mMcOg6vvRcj8K2uz/7x
jsnfjA0ZsK2mRVpaWN/36RVcNyMQhrM35sVv7mnvwCWAViGBaPOzOzc+bv4fdzq0
8KOz8mx3qqU3ZIRKkw==
=xS4v
-----END PGP SIGNATURE-----
Merge tag 'batadv-net-for-davem-20170125' of git://git.open-mesh.org/linux-merge
Simon Wunderlich says:
====================
Here is a batman-adv bugfix:
- fix reference count handling on fragmentation error, by Sven Eckelmann
====================
Signed-off-by: David S. Miller <davem@davemloft.net>
commit 17bedab272 ("bpf: xdp: Allow head adjustment in XDP prog")
added a new XDP helper to prepend and remove data from a frame.
Make virtio_net reject programs making use of this helper until
proper support is added.
Signed-off-by: Jakub Kicinski <jakub.kicinski@netronome.com>
Acked-by: John Fastabend <john.r.fastabend@intel.com>
Acked-by: Jason Wang <jasowang@redhat.com>
Acked-by: Michael S. Tsirkin <mst@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
In the small buffer case during driver unload we currently use
put_page instead of dev_kfree_skb. Resolve this by adding a check
for virtnet mode when checking XDP queue type. Also name the
function so that the code reads correctly to match the additional
check.
Fixes: bb91accf27 ("virtio-net: XDP support for small buffers")
Signed-off-by: John Fastabend <john.r.fastabend@intel.com>
Acked-by: Jason Wang <jasowang@redhat.com>
Acked-by: Michael S. Tsirkin <mst@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Hayes Wang says:
====================
r8152: fix scheduling napi
v3:
simply the argument for patch #3. Replace &tp->napi with napi.
v2:
Add smp_mb__after_atomic() for patch #1.
v1:
Scheduling the napi during the following periods would let it be ignored.
And the events wouldn't be handled until next napi_schedule() is called.
1. after napi_disable and before napi_enable().
2. after all actions of napi function is completed and before calling
napi_complete().
If no next napi_schedule() is called, tx or rx would stop working.
In order to avoid these situations, the followings solutions are applied.
1. prevent start_xmit() from calling napi_schedule() during runtime suspend
or after napi_disable().
2. re-schedule the napi for tx if it is necessary.
3. check if any rx is finished or not after napi_enable().
====================
Signed-off-by: David S. Miller <davem@davemloft.net>
Schedule the napi after napi_enable() for rx, if it is necessary.
If the rx is completed when napi is disabled, the sheduling of napi
would be lost. Then, no one handles the rx packet until next napi
is scheduled.
Signed-off-by: Hayes Wang <hayeswang@realtek.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Re-schedule napi after napi_complete() for tx, if it is necessay.
In r8152_poll(), if the tx is completed after tx_bottom() and before
napi_complete(), the scheduling of napi would be lost. Then, no
one handles the next tx until the next napi_schedule() is called.
Signed-off-by: Hayes Wang <hayeswang@realtek.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Stop the tx when the napi is disabled to prevent napi_schedule() is
called.
Signed-off-by: Hayes Wang <hayeswang@realtek.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Adjust the setting of the flag of SELECTIVE_SUSPEND to prevent start_xmit()
from calling napi_schedule() directly during runtime suspend.
After calling napi_disable() or clearing the flag of WORK_ENABLE,
scheduling the napi is useless.
Signed-off-by: Hayes Wang <hayeswang@realtek.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
This reverts commit 3846fd9b86.
There were some precursor commits missing for this around connector
locking, we should probably merge Lyude's nouveau avoid the problem patch.
Commit 448b4482c6 ("net: dsa: Add lockdep class to tx queues to avoid
lockdep splat") removed the netif_device_detach() call done in
dsa_slave_suspend() which is necessary, and paired with a corresponding
netif_device_attach(), bring it back.
Fixes: 448b4482c6 ("net: dsa: Add lockdep class to tx queues to avoid lockdep splat")
Signed-off-by: Florian Fainelli <f.fainelli@gmail.com>
Reviewed-by: Andrew Lunn <andrew@lunn.ch>
Signed-off-by: David S. Miller <davem@davemloft.net>
Previous patches have moved the temperature sensor code into the
Marvell PHYs. A few now dead references to NET_DSA_HWMON were left
behind. Go reap them.
Reported-by: Valentin Rothberg <valentinrothberg@gmail.com>
Signed-off-by: Andrew Lunn <andrew@lunn.ch>
Signed-off-by: David S. Miller <davem@davemloft.net>
PIO buffer allocation can fail for two valid reasons:
- we've run out of them (results in -ENOSPC)
- the NIC configuration doesn't support them (results in -EPERM)
Since both these failures are expected netif_err is excessive.
Signed-off-by: Bert Kenward <bkenward@solarflare.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Sunil Goutham says:
====================
thunderx: More ethtool support and BGX configuration changes
These patches adds support to set queue sizes from ethtool and changes
the way serdes lane configuration is done by BGX driver on 81/83xx
platforms.
====================
Signed-off-by: David S. Miller <davem@davemloft.net>
For DLMs and SLMs on 80/81/83xx, many lane configurations
across different boards are coming up. Also kernel doesn't have
any way to identify board type/info and since firmware does,
just get rid of figuring out lane to serdes config and take
whatever has been programmed by low level firmware.
Signed-off-by: Sunil Goutham <sgoutham@cavium.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Adds support to set Rx/Tx queue sizes from ethtool. Fixes
an issue with retrieving queue size. Also sets SQ's CQ_LIMIT
based on configured Tx queue size such that HW doesn't process
SQEs when there is no sufficient space in CQ.
Signed-off-by: Sunil Goutham <sgoutham@cavium.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Geert Uytterhoeven says:
====================
net: phy: leds: Fix truncated LED trigger names and crashes
I started seeing crashes during s2ram and poweroff on all my ARM boards,
like:
Unable to handle kernel NULL pointer dereference at virtual address 00000000
...
[<c04116d4>] (__list_del_entry_valid) from [<c05e8948>] (led_trigger_unregister+0x34/0xcc)
[<c05e8948>] (led_trigger_unregister) from [<c05336c4>] (phy_led_triggers_unregister+0x28/0x34)
[<c05336c4>] (phy_led_triggers_unregister) from [<c0531d44>] (phy_detach+0x30/0x74)
[<c0531d44>] (phy_detach) from [<c0538bdc>] (sh_eth_close+0x64/0x9c)
[<c0538bdc>] (sh_eth_close) from [<c04d4ce0>] (dpm_run_callback+0x48/0xc8)
or:
list_del corruption. prev->next should be dede6540, but was 2e323931
------------[ cut here ]------------
kernel BUG at lib/list_debug.c:52!
...
[<c02f6d70>] (__list_del_entry_valid) from [<c0425168>] (led_trigger_unregister+0x34/0xcc)
[<c0425168>] (led_trigger_unregister) from [<c03a05a0>] (phy_led_triggers_unregister+0x28/0x34)
[<c03a05a0>] (phy_led_triggers_unregister) from [<c039ec04>] (phy_detach+0x30/0x74)
[<c039ec04>] (phy_detach) from [<c03a4fc0>] (sh_eth_close+0x6c/0xa4)
[<c03a4fc0>] (sh_eth_close) from [<c0483234>] (__dev_close_many+0xac/0xd0)
As the only clue was a kernel message like
sh-eth ee700000.ethernet eth0: No phy led trigger registered for speed(100)
I had to bisected this, leading to commit 4567d686f5 ("phy:
increase size of MII_BUS_ID_SIZE and bus_id"). Reverting that commit
fixed the issue.
More investigation revealed the crashes are due to the combination of
two things:
- Truncated LED trigger names, leading to duplicate names, and
registration failures,
- Bad error handling in case of registration failures.
Both are fixed by this patch series.
Changes compared to v1:
- Add Reviewed-by,
- New patch "net: phy: leds: Break dependency of phy.h on
phy_led_triggers.h",
- Drop moving the include of <linux/phy_led_triggers.h>, as
<linux/phy.h> no longer includes it,
- #include <linux/phy.h> from <linux/phy_led_triggers.h>.
====================
Signed-off-by: David S. Miller <davem@davemloft.net>
Commit 4567d686f5 ("phy: increase size of MII_BUS_ID_SIZE and
bus_id") increased the size of MII bus IDs, but forgot to update the
private definition in <linux/phy_led_triggers.h>.
This may cause:
1. Truncation of LED trigger names,
2. Duplicate LED trigger names,
3. Failures registering LED triggers,
4. Crashes due to bad error handling in the LED trigger failure path.
To fix this, and prevent the definitions going out of sync again in the
future, let the PHY LED trigger code use the existing MII_BUS_ID_SIZE
definition.
Example:
- Before I had triggers "ee700000.etherne:01:100Mbps" and
"ee700000.etherne:01:10Mbps",
- After the increase of MII_BUS_ID_SIZE, both became
"ee700000.ethernet-ffffffff:01:" => FAIL,
- Now, the triggers are "ee700000.ethernet-ffffffff:01:100Mbps" and
"ee700000.ethernet-ffffffff:01:10Mbps", which are unique again.
Fixes: 4567d686f5 ("phy: increase size of MII_BUS_ID_SIZE and bus_id")
Fixes: 2e0bc452f4 ("net: phy: leds: add support for led triggers on phy link state change")
Signed-off-by: Geert Uytterhoeven <geert+renesas@glider.be>
Reviewed-by: Andrew Lunn <andrew@lunn.ch>
Signed-off-by: David S. Miller <davem@davemloft.net>
<linux/phy.h> includes <linux/phy_led_triggers.h>, which is not really
needed. Drop the include from <linux/phy.h>, and add it to all users
that didn't include it explicitly.
Suggested-by: Andrew Lunn <andrew@lunn.ch>
Signed-off-by: Geert Uytterhoeven <geert+renesas@glider.be>
Reviewed-by: Andrew Lunn <andrew@lunn.ch>
Signed-off-by: David S. Miller <davem@davemloft.net>
phy_attach_direct() ignores errors returned by
phy_led_triggers_register(). I think that's OK, as LED triggers can be
considered a non-critical feature.
However, this causes problems later:
- phy_led_trigger_change_speed() will access the array
phy_device.phy_led_triggers, which has been freed in the error path
of phy_led_triggers_register(), which may lead to a crash.
- phy_led_triggers_unregister() will access the same array, leading to
crashes during s2ram or poweroff, like:
Unable to handle kernel NULL pointer dereference at virtual address
00000000
...
[<c04116d4>] (__list_del_entry_valid) from [<c05e8948>] (led_trigger_unregister+0x34/0xcc)
[<c05e8948>] (led_trigger_unregister) from [<c05336c4>] (phy_led_triggers_unregister+0x28/0x34)
[<c05336c4>] (phy_led_triggers_unregister) from [<c0531d44>] (phy_detach+0x30/0x74)
[<c0531d44>] (phy_detach) from [<c0538bdc>] (sh_eth_close+0x64/0x9c)
[<c0538bdc>] (sh_eth_close) from [<c04d4ce0>] (dpm_run_callback+0x48/0xc8)
or:
list_del corruption. prev->next should be dede6540, but was 2e323931
------------[ cut here ]------------
kernel BUG at lib/list_debug.c:52!
...
[<c02f6d70>] (__list_del_entry_valid) from [<c0425168>] (led_trigger_unregister+0x34/0xcc)
[<c0425168>] (led_trigger_unregister) from [<c03a05a0>] (phy_led_triggers_unregister+0x28/0x34)
[<c03a05a0>] (phy_led_triggers_unregister) from [<c039ec04>] (phy_detach+0x30/0x74)
[<c039ec04>] (phy_detach) from [<c03a4fc0>] (sh_eth_close+0x6c/0xa4)
[<c03a4fc0>] (sh_eth_close) from [<c0483234>] (__dev_close_many+0xac/0xd0)
To fix this, clear phy_device.phy_num_led_triggers in the error path of
phy_led_triggers_register() fails.
Note that the "No phy led trigger registered for speed" message will
still be printed on link speed changes, which is a good cue that
something went wrong with the LED triggers.
Fixes: 2e0bc452f4 ("net: phy: leds: add support for led triggers on phy link state change")
Signed-off-by: Geert Uytterhoeven <geert+renesas@glider.be>
Reviewed-by: Florian Fainelli <f.fainelli@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
When the binding was defined, I was not aware that mt2701 was an earlier
version of the SoC. For sake of consistency, the ethernet driver should
use mt2701 inside the compat string as this is the earliest SoC with the
ethernet core.
The ethernet driver is currently of no real use until we finish and
upstream the DSA driver. There are no users of this binding yet. It should
be safe to fix this now before it is too late and we need to provide
backward compatibility for the mt7623-eth compat string.
Reported-by: Sean Wang <sean.wang@mediatek.com>
Signed-off-by: John Crispin <john@phrozen.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
When the binding was defined, I was not aware that mt2701 was an earlier
version of the SoC. For sake of consistency, the ethernet driver should
use mt2701 inside the compat string as this is the earliest SoC with the
ethernet core.
The ethernet driver is currently of no real use until we finish and
upstream the DSA driver. There are no users of this binding yet. It should
be safe to fix this now before it is too late and we need to provide
backward compatibility for the mt7623-eth compat string.
Reported-by: Sean Wang <sean.wang@mediatek.com>
Signed-off-by: John Crispin <john@phrozen.org>
Reviewed-by: Matthias Brugger <matthias.bgg@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Without TFO, any subsequent connect() call after a successful one returns
-1 EISCONN. The last API update ensured that __inet_stream_connect() can
return -1 EINPROGRESS in response to sendmsg() when TFO is in use to
indicate that the connection is now in progress. Unfortunately since this
function is used both for connect() and sendmsg(), it has the undesired
side effect of making connect() now return -1 EINPROGRESS as well after
a successful call, while at the same time poll() returns POLLOUT. This
can confuse some applications which happen to call connect() and to
check for -1 EISCONN to ensure the connection is usable, and for which
EINPROGRESS indicates a need to poll, causing a loop.
This problem was encountered in haproxy where a call to connect() is
precisely used in certain cases to confirm a connection's readiness.
While arguably haproxy's behaviour should be improved here, it seems
important to aim at a more robust behaviour when the goal of the new
API is to make it easier to implement TFO in existing applications.
This patch simply ensures that we preserve the same semantics as in
the non-TFO case on the connect() syscall when using TFO, while still
returning -1 EINPROGRESS on sendmsg(). For this we simply tell
__inet_stream_connect() whether we're doing a regular connect() or in
fact connecting for a sendmsg() call.
Cc: Wei Wang <weiwan@google.com>
Cc: Yuchung Cheng <ycheng@google.com>
Cc: Eric Dumazet <edumazet@google.com>
Signed-off-by: Willy Tarreau <w@1wt.eu>
Signed-off-by: David S. Miller <davem@davemloft.net>
Wei Wang says:
====================
net/tcp-fastopen: Add new userspace API support
The patch series is to add support for new userspace API for TCP fastopen
sockets.
In the current code, user has to call sendto()/sendmsg() with special flag
MSG_FASTOPEN for TCP fastopen sockets. This API is quite different from the
normal TCP socket API and can be cumbersome for applications to make use
fastopen sockets.
So this new patch introduces a new way of using TCP fastopen sockets which
is similar to normal TCP sockets with a new sockopt TCP_FASTOPEN_CONNECT.
More details about it is described in the third patch.
(First 2 patches are preparations for the third patch.)
====================
Signed-off-by: David S. Miller <davem@davemloft.net>
This patch adds a new socket option, TCP_FASTOPEN_CONNECT, as an
alternative way to perform Fast Open on the active side (client). Prior
to this patch, a client needs to replace the connect() call with
sendto(MSG_FASTOPEN). This can be cumbersome for applications who want
to use Fast Open: these socket operations are often done in lower layer
libraries used by many other applications. Changing these libraries
and/or the socket call sequences are not trivial. A more convenient
approach is to perform Fast Open by simply enabling a socket option when
the socket is created w/o changing other socket calls sequence:
s = socket()
create a new socket
setsockopt(s, IPPROTO_TCP, TCP_FASTOPEN_CONNECT …);
newly introduced sockopt
If set, new functionality described below will be used.
Return ENOTSUPP if TFO is not supported or not enabled in the
kernel.
connect()
With cookie present, return 0 immediately.
With no cookie, initiate 3WHS with TFO cookie-request option and
return -1 with errno = EINPROGRESS.
write()/sendmsg()
With cookie present, send out SYN with data and return the number of
bytes buffered.
With no cookie, and 3WHS not yet completed, return -1 with errno =
EINPROGRESS.
No MSG_FASTOPEN flag is needed.
read()
Return -1 with errno = EWOULDBLOCK/EAGAIN if connect() is called but
write() is not called yet.
Return -1 with errno = EWOULDBLOCK/EAGAIN if connection is
established but no msg is received yet.
Return number of bytes read if socket is established and there is
msg received.
The new API simplifies life for applications that always perform a write()
immediately after a successful connect(). Such applications can now take
advantage of Fast Open by merely making one new setsockopt() call at the time
of creating the socket. Nothing else about the application's socket call
sequence needs to change.
Signed-off-by: Wei Wang <weiwan@google.com>
Acked-by: Eric Dumazet <edumazet@google.com>
Acked-by: Yuchung Cheng <ycheng@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Remove __sk_dst_reset() in the failure handling because __sk_dst_reset()
will eventually get called when sk is released. No need to handle it in
the protocol specific connect call.
This is also to make the code path consistent with ipv4.
Signed-off-by: Wei Wang <weiwan@google.com>
Acked-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Refactor the cookie check logic in tcp_send_syn_data() into a function.
This function will be called else where in later changes.
Signed-off-by: Wei Wang <weiwan@google.com>
Acked-by: Eric Dumazet <edumazet@google.com>
Acked-by: Yuchung Cheng <ycheng@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Michael Chan says:
====================
bnxt_en: Fix RTNL lock usage in bnxt_sp_task().
There are 2 function calls from bnxt_sp_task() that have buggy RTNL
usage. These 2 functions take RTNL lock under some conditions, but
some callers (such as open, ethtool) have already taken RTNL. These
3 patches fix the issue by making it clear that callers must take
RTNL. If the caller is bnxt_sp_task() which does not automatically
take RTNL, we add a common scheme for bnxt_sp_task() to call these
functions properly under RTNL.
====================
Signed-off-by: David S. Miller <davem@davemloft.net>
bnxt_get_port_module_status() calls bnxt_update_link() which expects
RTNL to be held. In bnxt_sp_task() that does not hold RTNL, we need to
call it with a prior call to bnxt_rtnl_lock_sp() and the call needs to
be moved to the end of bnxt_sp_task().
Signed-off-by: Michael Chan <michael.chan@broadcom.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
bnxt_update_link() is called from multiple code paths. Most callers,
such as open, ethtool, already hold RTNL. Only the caller bnxt_sp_task()
does not. So it is a bug to take RTNL inside bnxt_update_link().
Fix it by removing the RTNL inside bnxt_update_link(). The function
now expects the caller to always hold RTNL.
In bnxt_sp_task(), call bnxt_rtnl_lock_sp() before calling
bnxt_update_link(). We also need to move the call to the end of
bnxt_sp_task() since it will be clearing the BNXT_STATE_IN_SP_TASK bit.
Signed-off-by: Michael Chan <michael.chan@broadcom.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
In bnxt_sp_task(), we set a bit BNXT_STATE_IN_SP_TASK so that bnxt_close()
will synchronize and wait for bnxt_sp_task() to finish. Some functions
in bnxt_sp_task() require us to clear BNXT_STATE_IN_SP_TASK and then
acquire rtnl_lock() to prevent race conditions.
There are some bugs related to this logic. This patch refactors the code
to have common bnxt_rtnl_lock_sp() and bnxt_rtnl_unlock_sp() to handle
the RTNL and the clearing/setting of the bit. Multiple functions will
need the same logic. We also need to move bnxt_reset() to the end of
bnxt_sp_task(). Functions that clear BNXT_STATE_IN_SP_TASK must be the
last functions to be called in bnxt_sp_task(). The common scheme will
handle the condition properly.
Signed-off-by: Michael Chan <michael.chan@broadcom.com>
Signed-off-by: David S. Miller <davem@davemloft.net>