linux/net
Jason Baron 790ba4566c tcp: set SOCK_NOSPACE under memory pressure
Under tcp memory pressure, calling epoll_wait() in edge triggered
mode after -EAGAIN, can result in an indefinite hang in epoll_wait(),
even when there is sufficient memory available to continue making
progress. The problem is that when __sk_mem_schedule() returns 0
under memory pressure, we do not set the SOCK_NOSPACE flag in the
tcp write paths (tcp_sendmsg() or do_tcp_sendpages()). Then, since
SOCK_NOSPACE is used to trigger wakeups when incoming acks create
sufficient new space in the write queue, all outstanding packets
are acked, but we never wake up with the the EPOLLOUT that we are
expecting from epoll_wait().

This issue is currently limited to epoll() when used in edge trigger
mode, since 'tcp_poll()', does in fact currently set SOCK_NOSPACE.
This is sufficient for poll()/select() and epoll() in level trigger
mode. However, in edge trigger mode, epoll() is relying on the write
path to set SOCK_NOSPACE. EPOLL(7) says that in edge-trigger mode we
can only call epoll_wait() after read/write return -EAGAIN. Thus, in
the case of the socket write, we are relying on the fact that
tcp_sendmsg()/network write paths are going to issue a wakeup for
us at some point in the future when we get -EAGAIN.

Normally, epoll() edge trigger works fine when we've exceeded the
sk->sndbuf because in that case we do set SOCK_NOSPACE. However, when
we return -EAGAIN from the write path b/c we are over the tcp memory
limits and not b/c we are over the sndbuf, we are never going to get
another wakeup.

I can reproduce this issue, using SO_SNDBUF, since __sk_mem_schedule()
will return 0, or failure more readily with SO_SNDBUF:

1) create socket and set SO_SNDBUF to N
2) add socket as edge trigger
3) write to socket and block in epoll on -EAGAIN
4) cause tcp mem pressure via: echo "<small val>" > net.ipv4.tcp_mem

The fix here is simply to set SOCK_NOSPACE in sk_stream_wait_memory()
when the socket is non-blocking. Note that SOCK_NOSPACE, in addition
to waking up outstanding waiters is also used to expand the size of
the sk->sndbuf. However, we will not expand it by setting it in this
case because tcp_should_expand_sndbuf(), ensures that no expansion
occurs when we are under tcp memory pressure.

Note that we could still hang if sk->sk_wmem_queue is 0, when we get
the -EAGAIN. In this case the SOCK_NOSPACE bit will not help, since we
are waiting for and event that will never happen. I believe
that this case is harder to hit (and did not hit in my testing),
in that over the tcp 'soft' memory limits, we continue to guarantee a
minimum write buffer size. Perhaps, we could return -ENOSPC in this
case, or maybe we simply issue a wakeup in this case, such that we
keep retrying the write. Note that this case is not specific to
epoll() ET, but rather would affect blocking sockets as well. So I
view this patch as bringing epoll() edge-trigger into sync with the
current poll()/select()/epoll() level trigger and blocking sockets
behavior.

Signed-off-by: Jason Baron <jbaron@akamai.com>
Acked-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-05-09 17:38:36 -04:00
..
6lowpan 6lowpan: nhc: add other known rfc6282 compressions 2015-02-14 23:08:44 +01:00
9p 9p: patches for 4.1 merge window 2015-04-18 17:45:30 -04:00
802 net: Kill dev_rebuild_header 2015-03-02 16:43:41 -05:00
8021q vlan: implement ndo_get_iflink 2015-04-02 14:05:00 -04:00
appletalk appletalk: Use eth_<foo>_addr instead of memset 2015-03-03 17:01:37 -05:00
atm Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net-next 2015-04-15 09:00:47 -07:00
ax25 ax25: Fix the build when CONFIG_INET is disabled 2015-03-05 13:17:39 -05:00
batman-adv dev: introduce dev_get_iflink() 2015-04-02 14:04:59 -04:00
bluetooth Bluetooth: hidp: Fix regression with older userspace and flags validation 2015-04-18 11:01:08 -04:00
bridge ebtables: Use eth_proto_is_802_3 2015-05-05 19:24:42 -04:00
caif Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net 2015-03-20 18:51:09 -04:00
can can: introduce new raw socket option to join the given CAN filters 2015-04-01 11:28:22 +02:00
ceph crush: straw2 bucket type with an efficient 64-bit crush_ln() 2015-04-22 18:33:43 +03:00
core tcp: set SOCK_NOSPACE under memory pressure 2015-05-09 17:38:36 -04:00
dcb net/dcb: Add IEEE QCN attribute 2015-03-06 21:50:02 -05:00
dccp inet: fix possible panic in reqsk_queue_unlink() 2015-04-24 11:39:15 -04:00
decnet netfilter: Pass socket pointer down through okfn(). 2015-04-07 15:25:55 -04:00
dns_resolver Merge commit 'v3.16' into next 2014-10-01 00:44:04 +10:00
dsa net: dsa: Add lockdep class to tx queues to avoid lockdep splat 2015-05-09 16:05:54 -04:00
ethernet etherdev: Fix sparse error, make test usable by other functions 2015-05-05 19:24:42 -04:00
hsr net/hsr: Fix NULL pointer dereference and refcnt bugs when deleting a HSR interface. 2015-03-01 13:40:23 -05:00
ieee802154 ieee802154: don't export static symbol 2015-03-14 17:11:31 +01:00
ipv4 tcp: add TCPWinProbe and TCPKeepAlive SNMP counters 2015-05-09 16:42:32 -04:00
ipv6 net: fix two sparse warnings introduced by IGMP/MLD parsing exports 2015-05-04 19:19:54 -04:00
ipx net: Remove iocb argument from sendmsg and recvmsg 2015-03-02 13:06:31 -05:00
irda Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net 2015-03-09 23:38:02 -04:00
iucv Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net 2015-04-02 16:16:53 -04:00
key xfrm: simplify xfrm_address_t use 2015-03-31 13:58:35 -04:00
l2tp Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net 2015-04-06 22:34:15 -04:00
lapb lapb: move EXPORT_SYMBOL after functions. 2014-10-24 15:51:42 -04:00
llc net: Remove iocb argument from sendmsg and recvmsg 2015-03-02 13:06:31 -05:00
mac80211 mac80211: add missing documentation for rate_ctrl_lock 2015-05-06 16:00:32 +02:00
mac802154 mac802154: cleanup concurrent check 2015-03-27 19:18:50 +01:00
mpls mpls: Prevent use of implicit NULL label as outgoing label 2015-04-22 14:24:54 -04:00
netfilter Merge git://git.kernel.org/pub/scm/linux/kernel/git/pablo/nf 2015-04-27 23:12:34 -04:00
netlabel netlink: implement nla_put_in_addr and nla_put_in6_addr 2015-03-31 13:58:35 -04:00
netlink net: fix crash in build_skb() 2015-04-25 15:49:49 -04:00
netrom net: Kill dev_rebuild_header 2015-03-02 16:43:41 -05:00
nfc nfc: Fix portid type in urelease_work 2015-04-13 16:35:16 -04:00
openvswitch openvswitch: Use eth_proto_is_802_3 2015-05-05 19:24:42 -04:00
packet af_packet: pass checksum validation status to the user 2015-03-23 22:01:28 -04:00
phonet net: Remove iocb argument from sendmsg and recvmsg 2015-03-02 13:06:31 -05:00
rds Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net 2015-04-14 15:44:14 -04:00
rfkill Last round of updates for net-next: 2015-02-04 14:57:45 -08:00
rose net: Kill dev_rebuild_header 2015-03-02 16:43:41 -05:00
rxrpc new helper: msg_data_left() 2015-04-11 15:53:35 -04:00
sched sch_choke: Use flow_keys_digest 2015-05-04 00:09:09 -04:00
sctp sctp: avoid to repeatedly declare external variables 2015-03-25 11:40:16 -04:00
sunrpc NFS client updates for Linux 4.1 2015-04-26 17:33:59 -07:00
switchdev switchdev: fix stp update API to work with layered netdevices 2015-03-23 16:44:56 -04:00
tipc tipc: send explicit not supported error in nl compat 2015-05-09 16:40:03 -04:00
unix Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net 2015-04-27 14:05:19 -07:00
vmw_vsock net: Remove iocb argument from sendmsg and recvmsg 2015-03-02 13:06:31 -05:00
wimax wimax: convert printk to pr_foo() 2014-10-07 20:28:44 -04:00
wireless cfg80211: change GO_CONCURRENT to IR_CONCURRENT for STA 2015-05-06 15:50:02 +02:00
x25 net: Remove iocb argument from sendmsg and recvmsg 2015-03-02 13:06:31 -05:00
xfrm Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net 2015-04-14 15:44:14 -04:00
Kconfig kconfig: use bool instead of boolean for type definition attributes 2015-01-07 13:08:04 +01:00
Makefile mpls: Refactor how the mpls module is built 2015-03-04 00:26:06 -05:00
compat.c net: switch importing msghdr from userland to {compat_,}import_iovec() 2015-04-09 00:02:26 -04:00
socket.c VFS: net/: d_inode() annotations 2015-04-15 15:06:56 -04:00
sysctl_net.c