linux/net/core/stream.c

213 lines
5.2 KiB
C
Raw Normal View History

/*
* SUCS NET3:
*
* Generic stream handling routines. These are generic for most
* protocols. Even IP. Tonight 8-).
* This is used because TCP, LLC (others too) layer all have mostly
* identical sendmsg() and recvmsg() code.
* So we (will) share it here.
*
* Authors: Arnaldo Carvalho de Melo <acme@conectiva.com.br>
* (from old tcp.c code)
* Alan Cox <alan@lxorguk.ukuu.org.uk> (Borrowed comments 8-))
*/
#include <linux/module.h>
#include <linux/net.h>
#include <linux/signal.h>
#include <linux/tcp.h>
#include <linux/wait.h>
#include <net/sock.h>
/**
* sk_stream_write_space - stream socket write_space callback.
[PATCH] DocBook: changes and extensions to the kernel documentation I have recompiled Linux kernel 2.6.11.5 documentation for me and our university students again. The documentation could be extended for more sources which are equipped by structured comments for recent 2.6 kernels. I have tried to proceed with that task. I have done that more times from 2.6.0 time and it gets boring to do same changes again and again. Linux kernel compiles after changes for i386 and ARM targets. I have added references to some more files into kernel-api book, I have added some section names as well. So please, check that changes do not break something and that categories are not too much skewed. I have changed kernel-doc to accept "fastcall" and "asmlinkage" words reserved by kernel convention. Most of the other changes are modifications in the comments to make kernel-doc happy, accept some parameters description and do not bail out on errors. Changed <pid> to @pid in the description, moved some #ifdef before comments to correct function to comments bindings, etc. You can see result of the modified documentation build at http://cmp.felk.cvut.cz/~pisa/linux/lkdb-2.6.11.tar.gz Some more sources are ready to be included into kernel-doc generated documentation. Sources has been added into kernel-api for now. Some more section names added and probably some more chaos introduced as result of quick cleanup work. Signed-off-by: Pavel Pisa <pisa@cmp.felk.cvut.cz> Signed-off-by: Martin Waitz <tali@admingilde.org> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2005-05-01 23:59:25 +08:00
* @sk: socket
*
* FIXME: write proper description
*/
void sk_stream_write_space(struct sock *sk)
{
struct socket *sock = sk->sk_socket;
net: sock_def_readable() and friends RCU conversion sk_callback_lock rwlock actually protects sk->sk_sleep pointer, so we need two atomic operations (and associated dirtying) per incoming packet. RCU conversion is pretty much needed : 1) Add a new structure, called "struct socket_wq" to hold all fields that will need rcu_read_lock() protection (currently: a wait_queue_head_t and a struct fasync_struct pointer). [Future patch will add a list anchor for wakeup coalescing] 2) Attach one of such structure to each "struct socket" created in sock_alloc_inode(). 3) Respect RCU grace period when freeing a "struct socket_wq" 4) Change sk_sleep pointer in "struct sock" by sk_wq, pointer to "struct socket_wq" 5) Change sk_sleep() function to use new sk->sk_wq instead of sk->sk_sleep 6) Change sk_has_sleeper() to wq_has_sleeper() that must be used inside a rcu_read_lock() section. 7) Change all sk_has_sleeper() callers to : - Use rcu_read_lock() instead of read_lock(&sk->sk_callback_lock) - Use wq_has_sleeper() to eventually wakeup tasks. - Use rcu_read_unlock() instead of read_unlock(&sk->sk_callback_lock) 8) sock_wake_async() is modified to use rcu protection as well. 9) Exceptions : macvtap, drivers/net/tun.c, af_unix use integrated "struct socket_wq" instead of dynamically allocated ones. They dont need rcu freeing. Some cleanups or followups are probably needed, (possible sk_callback_lock conversion to a spinlock for example...). Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2010-04-29 19:01:49 +08:00
struct socket_wq *wq;
if (sk_stream_is_writeable(sk) && sock) {
clear_bit(SOCK_NOSPACE, &sock->flags);
net: sock_def_readable() and friends RCU conversion sk_callback_lock rwlock actually protects sk->sk_sleep pointer, so we need two atomic operations (and associated dirtying) per incoming packet. RCU conversion is pretty much needed : 1) Add a new structure, called "struct socket_wq" to hold all fields that will need rcu_read_lock() protection (currently: a wait_queue_head_t and a struct fasync_struct pointer). [Future patch will add a list anchor for wakeup coalescing] 2) Attach one of such structure to each "struct socket" created in sock_alloc_inode(). 3) Respect RCU grace period when freeing a "struct socket_wq" 4) Change sk_sleep pointer in "struct sock" by sk_wq, pointer to "struct socket_wq" 5) Change sk_sleep() function to use new sk->sk_wq instead of sk->sk_sleep 6) Change sk_has_sleeper() to wq_has_sleeper() that must be used inside a rcu_read_lock() section. 7) Change all sk_has_sleeper() callers to : - Use rcu_read_lock() instead of read_lock(&sk->sk_callback_lock) - Use wq_has_sleeper() to eventually wakeup tasks. - Use rcu_read_unlock() instead of read_unlock(&sk->sk_callback_lock) 8) sock_wake_async() is modified to use rcu protection as well. 9) Exceptions : macvtap, drivers/net/tun.c, af_unix use integrated "struct socket_wq" instead of dynamically allocated ones. They dont need rcu freeing. Some cleanups or followups are probably needed, (possible sk_callback_lock conversion to a spinlock for example...). Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2010-04-29 19:01:49 +08:00
rcu_read_lock();
wq = rcu_dereference(sk->sk_wq);
if (skwq_has_sleeper(wq))
net: sock_def_readable() and friends RCU conversion sk_callback_lock rwlock actually protects sk->sk_sleep pointer, so we need two atomic operations (and associated dirtying) per incoming packet. RCU conversion is pretty much needed : 1) Add a new structure, called "struct socket_wq" to hold all fields that will need rcu_read_lock() protection (currently: a wait_queue_head_t and a struct fasync_struct pointer). [Future patch will add a list anchor for wakeup coalescing] 2) Attach one of such structure to each "struct socket" created in sock_alloc_inode(). 3) Respect RCU grace period when freeing a "struct socket_wq" 4) Change sk_sleep pointer in "struct sock" by sk_wq, pointer to "struct socket_wq" 5) Change sk_sleep() function to use new sk->sk_wq instead of sk->sk_sleep 6) Change sk_has_sleeper() to wq_has_sleeper() that must be used inside a rcu_read_lock() section. 7) Change all sk_has_sleeper() callers to : - Use rcu_read_lock() instead of read_lock(&sk->sk_callback_lock) - Use wq_has_sleeper() to eventually wakeup tasks. - Use rcu_read_unlock() instead of read_unlock(&sk->sk_callback_lock) 8) sock_wake_async() is modified to use rcu protection as well. 9) Exceptions : macvtap, drivers/net/tun.c, af_unix use integrated "struct socket_wq" instead of dynamically allocated ones. They dont need rcu freeing. Some cleanups or followups are probably needed, (possible sk_callback_lock conversion to a spinlock for example...). Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2010-04-29 19:01:49 +08:00
wake_up_interruptible_poll(&wq->wait, POLLOUT |
POLLWRNORM | POLLWRBAND);
net: sock_def_readable() and friends RCU conversion sk_callback_lock rwlock actually protects sk->sk_sleep pointer, so we need two atomic operations (and associated dirtying) per incoming packet. RCU conversion is pretty much needed : 1) Add a new structure, called "struct socket_wq" to hold all fields that will need rcu_read_lock() protection (currently: a wait_queue_head_t and a struct fasync_struct pointer). [Future patch will add a list anchor for wakeup coalescing] 2) Attach one of such structure to each "struct socket" created in sock_alloc_inode(). 3) Respect RCU grace period when freeing a "struct socket_wq" 4) Change sk_sleep pointer in "struct sock" by sk_wq, pointer to "struct socket_wq" 5) Change sk_sleep() function to use new sk->sk_wq instead of sk->sk_sleep 6) Change sk_has_sleeper() to wq_has_sleeper() that must be used inside a rcu_read_lock() section. 7) Change all sk_has_sleeper() callers to : - Use rcu_read_lock() instead of read_lock(&sk->sk_callback_lock) - Use wq_has_sleeper() to eventually wakeup tasks. - Use rcu_read_unlock() instead of read_unlock(&sk->sk_callback_lock) 8) sock_wake_async() is modified to use rcu protection as well. 9) Exceptions : macvtap, drivers/net/tun.c, af_unix use integrated "struct socket_wq" instead of dynamically allocated ones. They dont need rcu freeing. Some cleanups or followups are probably needed, (possible sk_callback_lock conversion to a spinlock for example...). Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2010-04-29 19:01:49 +08:00
if (wq && wq->fasync_list && !(sk->sk_shutdown & SEND_SHUTDOWN))
sock_wake_async(wq, SOCK_WAKE_SPACE, POLL_OUT);
net: sock_def_readable() and friends RCU conversion sk_callback_lock rwlock actually protects sk->sk_sleep pointer, so we need two atomic operations (and associated dirtying) per incoming packet. RCU conversion is pretty much needed : 1) Add a new structure, called "struct socket_wq" to hold all fields that will need rcu_read_lock() protection (currently: a wait_queue_head_t and a struct fasync_struct pointer). [Future patch will add a list anchor for wakeup coalescing] 2) Attach one of such structure to each "struct socket" created in sock_alloc_inode(). 3) Respect RCU grace period when freeing a "struct socket_wq" 4) Change sk_sleep pointer in "struct sock" by sk_wq, pointer to "struct socket_wq" 5) Change sk_sleep() function to use new sk->sk_wq instead of sk->sk_sleep 6) Change sk_has_sleeper() to wq_has_sleeper() that must be used inside a rcu_read_lock() section. 7) Change all sk_has_sleeper() callers to : - Use rcu_read_lock() instead of read_lock(&sk->sk_callback_lock) - Use wq_has_sleeper() to eventually wakeup tasks. - Use rcu_read_unlock() instead of read_unlock(&sk->sk_callback_lock) 8) sock_wake_async() is modified to use rcu protection as well. 9) Exceptions : macvtap, drivers/net/tun.c, af_unix use integrated "struct socket_wq" instead of dynamically allocated ones. They dont need rcu freeing. Some cleanups or followups are probably needed, (possible sk_callback_lock conversion to a spinlock for example...). Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2010-04-29 19:01:49 +08:00
rcu_read_unlock();
}
}
EXPORT_SYMBOL(sk_stream_write_space);
/**
* sk_stream_wait_connect - Wait for a socket to get into the connected state
[PATCH] DocBook: changes and extensions to the kernel documentation I have recompiled Linux kernel 2.6.11.5 documentation for me and our university students again. The documentation could be extended for more sources which are equipped by structured comments for recent 2.6 kernels. I have tried to proceed with that task. I have done that more times from 2.6.0 time and it gets boring to do same changes again and again. Linux kernel compiles after changes for i386 and ARM targets. I have added references to some more files into kernel-api book, I have added some section names as well. So please, check that changes do not break something and that categories are not too much skewed. I have changed kernel-doc to accept "fastcall" and "asmlinkage" words reserved by kernel convention. Most of the other changes are modifications in the comments to make kernel-doc happy, accept some parameters description and do not bail out on errors. Changed <pid> to @pid in the description, moved some #ifdef before comments to correct function to comments bindings, etc. You can see result of the modified documentation build at http://cmp.felk.cvut.cz/~pisa/linux/lkdb-2.6.11.tar.gz Some more sources are ready to be included into kernel-doc generated documentation. Sources has been added into kernel-api for now. Some more section names added and probably some more chaos introduced as result of quick cleanup work. Signed-off-by: Pavel Pisa <pisa@cmp.felk.cvut.cz> Signed-off-by: Martin Waitz <tali@admingilde.org> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2005-05-01 23:59:25 +08:00
* @sk: sock to wait on
* @timeo_p: for how long to wait
*
* Must be called with the socket locked.
*/
int sk_stream_wait_connect(struct sock *sk, long *timeo_p)
{
struct task_struct *tsk = current;
DEFINE_WAIT(wait);
int done;
do {
int err = sock_error(sk);
if (err)
return err;
if ((1 << sk->sk_state) & ~(TCPF_SYN_SENT | TCPF_SYN_RECV))
return -EPIPE;
if (!*timeo_p)
return -EAGAIN;
if (signal_pending(tsk))
return sock_intr_errno(*timeo_p);
prepare_to_wait(sk_sleep(sk), &wait, TASK_INTERRUPTIBLE);
sk->sk_write_pending++;
done = sk_wait_event(sk, timeo_p,
!sk->sk_err &&
!((1 << sk->sk_state) &
~(TCPF_ESTABLISHED | TCPF_CLOSE_WAIT)));
finish_wait(sk_sleep(sk), &wait);
sk->sk_write_pending--;
} while (!done);
return 0;
}
EXPORT_SYMBOL(sk_stream_wait_connect);
/**
* sk_stream_closing - Return 1 if we still have things to send in our buffers.
[PATCH] DocBook: changes and extensions to the kernel documentation I have recompiled Linux kernel 2.6.11.5 documentation for me and our university students again. The documentation could be extended for more sources which are equipped by structured comments for recent 2.6 kernels. I have tried to proceed with that task. I have done that more times from 2.6.0 time and it gets boring to do same changes again and again. Linux kernel compiles after changes for i386 and ARM targets. I have added references to some more files into kernel-api book, I have added some section names as well. So please, check that changes do not break something and that categories are not too much skewed. I have changed kernel-doc to accept "fastcall" and "asmlinkage" words reserved by kernel convention. Most of the other changes are modifications in the comments to make kernel-doc happy, accept some parameters description and do not bail out on errors. Changed <pid> to @pid in the description, moved some #ifdef before comments to correct function to comments bindings, etc. You can see result of the modified documentation build at http://cmp.felk.cvut.cz/~pisa/linux/lkdb-2.6.11.tar.gz Some more sources are ready to be included into kernel-doc generated documentation. Sources has been added into kernel-api for now. Some more section names added and probably some more chaos introduced as result of quick cleanup work. Signed-off-by: Pavel Pisa <pisa@cmp.felk.cvut.cz> Signed-off-by: Martin Waitz <tali@admingilde.org> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2005-05-01 23:59:25 +08:00
* @sk: socket to verify
*/
static inline int sk_stream_closing(struct sock *sk)
{
return (1 << sk->sk_state) &
(TCPF_FIN_WAIT1 | TCPF_CLOSING | TCPF_LAST_ACK);
}
void sk_stream_wait_close(struct sock *sk, long timeout)
{
if (timeout) {
DEFINE_WAIT(wait);
do {
prepare_to_wait(sk_sleep(sk), &wait,
TASK_INTERRUPTIBLE);
if (sk_wait_event(sk, &timeout, !sk_stream_closing(sk)))
break;
} while (!signal_pending(current) && timeout);
finish_wait(sk_sleep(sk), &wait);
}
}
EXPORT_SYMBOL(sk_stream_wait_close);
/**
* sk_stream_wait_memory - Wait for more memory for a socket
[PATCH] DocBook: changes and extensions to the kernel documentation I have recompiled Linux kernel 2.6.11.5 documentation for me and our university students again. The documentation could be extended for more sources which are equipped by structured comments for recent 2.6 kernels. I have tried to proceed with that task. I have done that more times from 2.6.0 time and it gets boring to do same changes again and again. Linux kernel compiles after changes for i386 and ARM targets. I have added references to some more files into kernel-api book, I have added some section names as well. So please, check that changes do not break something and that categories are not too much skewed. I have changed kernel-doc to accept "fastcall" and "asmlinkage" words reserved by kernel convention. Most of the other changes are modifications in the comments to make kernel-doc happy, accept some parameters description and do not bail out on errors. Changed <pid> to @pid in the description, moved some #ifdef before comments to correct function to comments bindings, etc. You can see result of the modified documentation build at http://cmp.felk.cvut.cz/~pisa/linux/lkdb-2.6.11.tar.gz Some more sources are ready to be included into kernel-doc generated documentation. Sources has been added into kernel-api for now. Some more section names added and probably some more chaos introduced as result of quick cleanup work. Signed-off-by: Pavel Pisa <pisa@cmp.felk.cvut.cz> Signed-off-by: Martin Waitz <tali@admingilde.org> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2005-05-01 23:59:25 +08:00
* @sk: socket to wait for memory
* @timeo_p: for how long
*/
int sk_stream_wait_memory(struct sock *sk, long *timeo_p)
{
int err = 0;
long vm_wait = 0;
long current_timeo = *timeo_p;
tcp: set SOCK_NOSPACE under memory pressure Under tcp memory pressure, calling epoll_wait() in edge triggered mode after -EAGAIN, can result in an indefinite hang in epoll_wait(), even when there is sufficient memory available to continue making progress. The problem is that when __sk_mem_schedule() returns 0 under memory pressure, we do not set the SOCK_NOSPACE flag in the tcp write paths (tcp_sendmsg() or do_tcp_sendpages()). Then, since SOCK_NOSPACE is used to trigger wakeups when incoming acks create sufficient new space in the write queue, all outstanding packets are acked, but we never wake up with the the EPOLLOUT that we are expecting from epoll_wait(). This issue is currently limited to epoll() when used in edge trigger mode, since 'tcp_poll()', does in fact currently set SOCK_NOSPACE. This is sufficient for poll()/select() and epoll() in level trigger mode. However, in edge trigger mode, epoll() is relying on the write path to set SOCK_NOSPACE. EPOLL(7) says that in edge-trigger mode we can only call epoll_wait() after read/write return -EAGAIN. Thus, in the case of the socket write, we are relying on the fact that tcp_sendmsg()/network write paths are going to issue a wakeup for us at some point in the future when we get -EAGAIN. Normally, epoll() edge trigger works fine when we've exceeded the sk->sndbuf because in that case we do set SOCK_NOSPACE. However, when we return -EAGAIN from the write path b/c we are over the tcp memory limits and not b/c we are over the sndbuf, we are never going to get another wakeup. I can reproduce this issue, using SO_SNDBUF, since __sk_mem_schedule() will return 0, or failure more readily with SO_SNDBUF: 1) create socket and set SO_SNDBUF to N 2) add socket as edge trigger 3) write to socket and block in epoll on -EAGAIN 4) cause tcp mem pressure via: echo "<small val>" > net.ipv4.tcp_mem The fix here is simply to set SOCK_NOSPACE in sk_stream_wait_memory() when the socket is non-blocking. Note that SOCK_NOSPACE, in addition to waking up outstanding waiters is also used to expand the size of the sk->sndbuf. However, we will not expand it by setting it in this case because tcp_should_expand_sndbuf(), ensures that no expansion occurs when we are under tcp memory pressure. Note that we could still hang if sk->sk_wmem_queue is 0, when we get the -EAGAIN. In this case the SOCK_NOSPACE bit will not help, since we are waiting for and event that will never happen. I believe that this case is harder to hit (and did not hit in my testing), in that over the tcp 'soft' memory limits, we continue to guarantee a minimum write buffer size. Perhaps, we could return -ENOSPC in this case, or maybe we simply issue a wakeup in this case, such that we keep retrying the write. Note that this case is not specific to epoll() ET, but rather would affect blocking sockets as well. So I view this patch as bringing epoll() edge-trigger into sync with the current poll()/select()/epoll() level trigger and blocking sockets behavior. Signed-off-by: Jason Baron <jbaron@akamai.com> Acked-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2015-05-06 23:52:23 +08:00
bool noblock = (*timeo_p ? false : true);
DEFINE_WAIT(wait);
if (sk_stream_memory_free(sk))
current_timeo = vm_wait = (prandom_u32() % (HZ / 5)) + 2;
while (1) {
sk_set_bit(SOCKWQ_ASYNC_NOSPACE, sk);
prepare_to_wait(sk_sleep(sk), &wait, TASK_INTERRUPTIBLE);
if (sk->sk_err || (sk->sk_shutdown & SEND_SHUTDOWN))
goto do_error;
tcp: set SOCK_NOSPACE under memory pressure Under tcp memory pressure, calling epoll_wait() in edge triggered mode after -EAGAIN, can result in an indefinite hang in epoll_wait(), even when there is sufficient memory available to continue making progress. The problem is that when __sk_mem_schedule() returns 0 under memory pressure, we do not set the SOCK_NOSPACE flag in the tcp write paths (tcp_sendmsg() or do_tcp_sendpages()). Then, since SOCK_NOSPACE is used to trigger wakeups when incoming acks create sufficient new space in the write queue, all outstanding packets are acked, but we never wake up with the the EPOLLOUT that we are expecting from epoll_wait(). This issue is currently limited to epoll() when used in edge trigger mode, since 'tcp_poll()', does in fact currently set SOCK_NOSPACE. This is sufficient for poll()/select() and epoll() in level trigger mode. However, in edge trigger mode, epoll() is relying on the write path to set SOCK_NOSPACE. EPOLL(7) says that in edge-trigger mode we can only call epoll_wait() after read/write return -EAGAIN. Thus, in the case of the socket write, we are relying on the fact that tcp_sendmsg()/network write paths are going to issue a wakeup for us at some point in the future when we get -EAGAIN. Normally, epoll() edge trigger works fine when we've exceeded the sk->sndbuf because in that case we do set SOCK_NOSPACE. However, when we return -EAGAIN from the write path b/c we are over the tcp memory limits and not b/c we are over the sndbuf, we are never going to get another wakeup. I can reproduce this issue, using SO_SNDBUF, since __sk_mem_schedule() will return 0, or failure more readily with SO_SNDBUF: 1) create socket and set SO_SNDBUF to N 2) add socket as edge trigger 3) write to socket and block in epoll on -EAGAIN 4) cause tcp mem pressure via: echo "<small val>" > net.ipv4.tcp_mem The fix here is simply to set SOCK_NOSPACE in sk_stream_wait_memory() when the socket is non-blocking. Note that SOCK_NOSPACE, in addition to waking up outstanding waiters is also used to expand the size of the sk->sndbuf. However, we will not expand it by setting it in this case because tcp_should_expand_sndbuf(), ensures that no expansion occurs when we are under tcp memory pressure. Note that we could still hang if sk->sk_wmem_queue is 0, when we get the -EAGAIN. In this case the SOCK_NOSPACE bit will not help, since we are waiting for and event that will never happen. I believe that this case is harder to hit (and did not hit in my testing), in that over the tcp 'soft' memory limits, we continue to guarantee a minimum write buffer size. Perhaps, we could return -ENOSPC in this case, or maybe we simply issue a wakeup in this case, such that we keep retrying the write. Note that this case is not specific to epoll() ET, but rather would affect blocking sockets as well. So I view this patch as bringing epoll() edge-trigger into sync with the current poll()/select()/epoll() level trigger and blocking sockets behavior. Signed-off-by: Jason Baron <jbaron@akamai.com> Acked-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2015-05-06 23:52:23 +08:00
if (!*timeo_p) {
if (noblock)
set_bit(SOCK_NOSPACE, &sk->sk_socket->flags);
goto do_nonblock;
tcp: set SOCK_NOSPACE under memory pressure Under tcp memory pressure, calling epoll_wait() in edge triggered mode after -EAGAIN, can result in an indefinite hang in epoll_wait(), even when there is sufficient memory available to continue making progress. The problem is that when __sk_mem_schedule() returns 0 under memory pressure, we do not set the SOCK_NOSPACE flag in the tcp write paths (tcp_sendmsg() or do_tcp_sendpages()). Then, since SOCK_NOSPACE is used to trigger wakeups when incoming acks create sufficient new space in the write queue, all outstanding packets are acked, but we never wake up with the the EPOLLOUT that we are expecting from epoll_wait(). This issue is currently limited to epoll() when used in edge trigger mode, since 'tcp_poll()', does in fact currently set SOCK_NOSPACE. This is sufficient for poll()/select() and epoll() in level trigger mode. However, in edge trigger mode, epoll() is relying on the write path to set SOCK_NOSPACE. EPOLL(7) says that in edge-trigger mode we can only call epoll_wait() after read/write return -EAGAIN. Thus, in the case of the socket write, we are relying on the fact that tcp_sendmsg()/network write paths are going to issue a wakeup for us at some point in the future when we get -EAGAIN. Normally, epoll() edge trigger works fine when we've exceeded the sk->sndbuf because in that case we do set SOCK_NOSPACE. However, when we return -EAGAIN from the write path b/c we are over the tcp memory limits and not b/c we are over the sndbuf, we are never going to get another wakeup. I can reproduce this issue, using SO_SNDBUF, since __sk_mem_schedule() will return 0, or failure more readily with SO_SNDBUF: 1) create socket and set SO_SNDBUF to N 2) add socket as edge trigger 3) write to socket and block in epoll on -EAGAIN 4) cause tcp mem pressure via: echo "<small val>" > net.ipv4.tcp_mem The fix here is simply to set SOCK_NOSPACE in sk_stream_wait_memory() when the socket is non-blocking. Note that SOCK_NOSPACE, in addition to waking up outstanding waiters is also used to expand the size of the sk->sndbuf. However, we will not expand it by setting it in this case because tcp_should_expand_sndbuf(), ensures that no expansion occurs when we are under tcp memory pressure. Note that we could still hang if sk->sk_wmem_queue is 0, when we get the -EAGAIN. In this case the SOCK_NOSPACE bit will not help, since we are waiting for and event that will never happen. I believe that this case is harder to hit (and did not hit in my testing), in that over the tcp 'soft' memory limits, we continue to guarantee a minimum write buffer size. Perhaps, we could return -ENOSPC in this case, or maybe we simply issue a wakeup in this case, such that we keep retrying the write. Note that this case is not specific to epoll() ET, but rather would affect blocking sockets as well. So I view this patch as bringing epoll() edge-trigger into sync with the current poll()/select()/epoll() level trigger and blocking sockets behavior. Signed-off-by: Jason Baron <jbaron@akamai.com> Acked-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2015-05-06 23:52:23 +08:00
}
if (signal_pending(current))
goto do_interrupted;
sk_clear_bit(SOCKWQ_ASYNC_NOSPACE, sk);
if (sk_stream_memory_free(sk) && !vm_wait)
break;
set_bit(SOCK_NOSPACE, &sk->sk_socket->flags);
sk->sk_write_pending++;
net: Fix the condition passed to sk_wait_event() This patch fixes the condition (3rd arg) passed to sk_wait_event() in sk_stream_wait_memory(). The incorrect check in sk_stream_wait_memory() causes the following soft lockup in tcp_sendmsg() when the global tcp memory pool has exhausted. >>> snip <<< localhost kernel: BUG: soft lockup - CPU#3 stuck for 11s! [sshd:6429] localhost kernel: CPU 3: localhost kernel: RIP: 0010:[sk_stream_wait_memory+0xcd/0x200] [sk_stream_wait_memory+0xcd/0x200] sk_stream_wait_memory+0xcd/0x200 localhost kernel: localhost kernel: Call Trace: localhost kernel: [sk_stream_wait_memory+0x1b1/0x200] sk_stream_wait_memory+0x1b1/0x200 localhost kernel: [<ffffffff802557c0>] autoremove_wake_function+0x0/0x40 localhost kernel: [ipv6:tcp_sendmsg+0x6e6/0xe90] tcp_sendmsg+0x6e6/0xce0 localhost kernel: [sock_aio_write+0x126/0x140] sock_aio_write+0x126/0x140 localhost kernel: [xfs:do_sync_write+0xf1/0x130] do_sync_write+0xf1/0x130 localhost kernel: [<ffffffff802557c0>] autoremove_wake_function+0x0/0x40 localhost kernel: [hrtimer_start+0xe3/0x170] hrtimer_start+0xe3/0x170 localhost kernel: [vfs_write+0x185/0x190] vfs_write+0x185/0x190 localhost kernel: [sys_write+0x50/0x90] sys_write+0x50/0x90 localhost kernel: [system_call+0x7e/0x83] system_call+0x7e/0x83 >>> snip <<< What is happening is, that the sk_wait_event() condition passed from sk_stream_wait_memory() evaluates to true for the case of tcp global memory exhaustion. This is because both sk_stream_memory_free() and vm_wait are true which causes sk_wait_event() to *not* call schedule_timeout(). Hence sk_stream_wait_memory() returns immediately to the caller w/o sleeping. This causes the caller to again try allocation, which again fails and again calls sk_stream_wait_memory(), and so on. [ Bug introduced by commit c1cbe4b7ad0bc4b1d98ea708a3fecb7362aa4088 ("[NET]: Avoid atomic xchg() for non-error case") -DaveM ] Signed-off-by: Nagendra Singh Tomar <tomer_iisc@yahoo.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2010-10-03 07:45:06 +08:00
sk_wait_event(sk, &current_timeo, sk->sk_err ||
(sk->sk_shutdown & SEND_SHUTDOWN) ||
(sk_stream_memory_free(sk) &&
!vm_wait));
sk->sk_write_pending--;
if (vm_wait) {
vm_wait -= current_timeo;
current_timeo = *timeo_p;
if (current_timeo != MAX_SCHEDULE_TIMEOUT &&
(current_timeo -= vm_wait) < 0)
current_timeo = 0;
vm_wait = 0;
}
*timeo_p = current_timeo;
}
out:
finish_wait(sk_sleep(sk), &wait);
return err;
do_error:
err = -EPIPE;
goto out;
do_nonblock:
err = -EAGAIN;
goto out;
do_interrupted:
err = sock_intr_errno(*timeo_p);
goto out;
}
EXPORT_SYMBOL(sk_stream_wait_memory);
int sk_stream_error(struct sock *sk, int flags, int err)
{
if (err == -EPIPE)
err = sock_error(sk) ? : -EPIPE;
if (err == -EPIPE && !(flags & MSG_NOSIGNAL))
send_sig(SIGPIPE, current, 0);
return err;
}
EXPORT_SYMBOL(sk_stream_error);
void sk_stream_kill_queues(struct sock *sk)
{
/* First the read buffer. */
__skb_queue_purge(&sk->sk_receive_queue);
/* Next, the error queue. */
__skb_queue_purge(&sk->sk_error_queue);
/* Next, the write queue. */
WARN_ON(!skb_queue_empty(&sk->sk_write_queue));
/* Account for returned memory. */
[NET] CORE: Introducing new memory accounting interface. This patch introduces new memory accounting functions for each network protocol. Most of them are renamed from memory accounting functions for stream protocols. At the same time, some stream memory accounting functions are removed since other functions do same thing. Renaming: sk_stream_free_skb() -> sk_wmem_free_skb() __sk_stream_mem_reclaim() -> __sk_mem_reclaim() sk_stream_mem_reclaim() -> sk_mem_reclaim() sk_stream_mem_schedule -> __sk_mem_schedule() sk_stream_pages() -> sk_mem_pages() sk_stream_rmem_schedule() -> sk_rmem_schedule() sk_stream_wmem_schedule() -> sk_wmem_schedule() sk_charge_skb() -> sk_mem_charge() Removeing sk_stream_rfree(): consolidates into sock_rfree() sk_stream_set_owner_r(): consolidates into skb_set_owner_r() sk_stream_mem_schedule() The following functions are added. sk_has_account(): check if the protocol supports accounting sk_mem_uncharge(): do the opposite of sk_mem_charge() In addition, to achieve consolidation, updating sk_wmem_queued is removed from sk_mem_charge(). Next, to consolidate memory accounting functions, this patch adds memory accounting calls to network core functions. Moreover, present memory accounting call is renamed to new accounting call. Finally we replace present memory accounting calls with new interface in TCP and SCTP. Signed-off-by: Takahiro Yasui <tyasui@redhat.com> Signed-off-by: Hideo Aoki <haoki@redhat.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2007-12-31 16:11:19 +08:00
sk_mem_reclaim(sk);
WARN_ON(sk->sk_wmem_queued);
WARN_ON(sk->sk_forward_alloc);
/* It is _impossible_ for the backlog to contain anything
* when we get here. All user references to this socket
* have gone away, only the net layer knows can touch it.
*/
}
EXPORT_SYMBOL(sk_stream_kill_queues);