This reverts commit 15e83ed788.
As explained by Johannes Berg, the optimization made here is
invalid. Or, at best, incomplete.
Not only destructor invocation, but conntract entry releasing
must be executed outside of hw IRQ context.
So just checking "skb->destructor" is insufficient.
Signed-off-by: David S. Miller <davem@davemloft.net>
CodingStyle cleanups
EXPORT_SYMBOL should immediately follow the symbol declaration.
Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Remove rtnl_unlock() which had no corresponding rtnl_lock().
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Acked-by: Herbert Xu <herbert@gondor.apana.org.au>
Signed-off-by: David S. Miller <davem@davemloft.net>
This patch adds the functions __netpoll_setup/__netpoll_cleanup
which is designed to be called recursively through ndo_netpoll_seutp.
They must be called with RTNL held, and the caller must initialise
np->dev and ensure that it has a valid reference count.
Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
Signed-off-by: David S. Miller <davem@davemloft.net>
This patch adds ndo_netpoll_setup as the initialisation primitive
to complement ndo_netpoll_cleanup.
Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
Signed-off-by: David S. Miller <davem@davemloft.net>
As it stands, netpoll_setup and netpoll_cleanup have no locking
protection whatsoever. So chaos ensures if two entities try to
perform them on the same device.
This patch adds RTNL to the equation. The code has been rearranged so
that bits that do not need RTNL protection are now moved to the top of
netpoll_setup.
Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
Signed-off-by: David S. Miller <davem@davemloft.net>
The use of RCU in netpoll is incorrect in a number of places:
1) The initial setting is lacking a write barrier.
2) The synchronize_rcu is in the wrong place.
3) Read barriers are missing.
4) Some places are even missing rcu_read_lock.
5) npinfo is zeroed after freeing.
This patch fixes those issues. As most users are in BH context,
this also converts the RCU usage to the BH variant.
Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
Acked-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Since we have to NULL npinfo regardless of whether there is a
ndo_netpoll_cleanup, it makes sense to do this unconditionally
in netpoll_cleanup rather than having every driver do it by
themselves.
Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
Signed-off-by: David S. Miller <davem@davemloft.net>
netpoll does an interesting work in zap_completion_queue(), but this was
before we did skb orphaning before delivering packets to device.
It now makes sense to add a test in dev_kfree_skb_irq() to not queue a
skb if already orphaned, and to remove netpoll zap_completion_queue() as
a bonus.
Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
This whole patchset is for adding netpoll support to bridge and bonding
devices. I already tested it for bridge, bonding, bridge over bonding,
and bonding over bridge. It looks fine now.
To make bridge and bonding support netpoll, we need to adjust
some netpoll generic code. This patch does the following things:
1) introduce two new priv_flags for struct net_device:
IFF_IN_NETPOLL which identifies we are processing a netpoll;
IFF_DISABLE_NETPOLL is used to disable netpoll support for a device
at run-time;
2) introduce one new method for netdev_ops:
->ndo_netpoll_cleanup() is used to clean up netpoll when a device is
removed.
3) introduce netpoll_poll_dev() which takes a struct net_device * parameter;
export netpoll_send_skb() and netpoll_poll_dev() which will be used later;
4) hide a pointer to struct netpoll in struct netpoll_info, ditto.
5) introduce ->real_dev for struct netpoll.
6) introduce a new status NETDEV_BONDING_DESLAE, which is used to disable
netconsole before releasing a slave, to avoid deadlocks.
Cc: David Miller <davem@davemloft.net>
Cc: Neil Horman <nhorman@tuxdriver.com>
Signed-off-by: WANG Cong <amwang@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
percpu.h is included by sched.h and module.h and thus ends up being
included when building most .c files. percpu.h includes slab.h which
in turn includes gfp.h making everything defined by the two files
universally available and complicating inclusion dependencies.
percpu.h -> slab.h dependency is about to be removed. Prepare for
this change by updating users of gfp and slab facilities include those
headers directly instead of assuming availability. As this conversion
needs to touch large number of source files, the following script is
used as the basis of conversion.
http://userweb.kernel.org/~tj/misc/slabh-sweep.py
The script does the followings.
* Scan files for gfp and slab usages and update includes such that
only the necessary includes are there. ie. if only gfp is used,
gfp.h, if slab is used, slab.h.
* When the script inserts a new include, it looks at the include
blocks and try to put the new include such that its order conforms
to its surrounding. It's put in the include block which contains
core kernel includes, in the same order that the rest are ordered -
alphabetical, Christmas tree, rev-Xmas-tree or at the end if there
doesn't seem to be any matching order.
* If the script can't find a place to put a new include (mostly
because the file doesn't have fitting include block), it prints out
an error message indicating which .h file needs to be added to the
file.
The conversion was done in the following steps.
1. The initial automatic conversion of all .c files updated slightly
over 4000 files, deleting around 700 includes and adding ~480 gfp.h
and ~3000 slab.h inclusions. The script emitted errors for ~400
files.
2. Each error was manually checked. Some didn't need the inclusion,
some needed manual addition while adding it to implementation .h or
embedding .c file was more appropriate for others. This step added
inclusions to around 150 files.
3. The script was run again and the output was compared to the edits
from #2 to make sure no file was left behind.
4. Several build tests were done and a couple of problems were fixed.
e.g. lib/decompress_*.c used malloc/free() wrappers around slab
APIs requiring slab.h to be added manually.
5. The script was run on all .h files but without automatically
editing them as sprinkling gfp.h and slab.h inclusions around .h
files could easily lead to inclusion dependency hell. Most gfp.h
inclusion directives were ignored as stuff from gfp.h was usually
wildly available and often used in preprocessor macros. Each
slab.h inclusion directive was examined and added manually as
necessary.
6. percpu.h was updated not to include slab.h.
7. Build test were done on the following configurations and failures
were fixed. CONFIG_GCOV_KERNEL was turned off for all tests (as my
distributed build env didn't work with gcov compiles) and a few
more options had to be turned off depending on archs to make things
build (like ipr on powerpc/64 which failed due to missing writeq).
* x86 and x86_64 UP and SMP allmodconfig and a custom test config.
* powerpc and powerpc64 SMP allmodconfig
* sparc and sparc64 SMP allmodconfig
* ia64 SMP allmodconfig
* s390 SMP allmodconfig
* alpha SMP allmodconfig
* um on x86_64 SMP allmodconfig
8. percpu.h modifications were reverted so that it could be applied as
a separate patch and serve as bisection point.
Given the fact that I had only a couple of failures from tests on step
6, I'm fairly confident about the coverage of this conversion patch.
If there is a breakage, it's likely to be something in one of the arch
headers which should be easily discoverable easily on most builds of
the specific arch.
Signed-off-by: Tejun Heo <tj@kernel.org>
Guess-its-ok-by: Christoph Lameter <cl@linux-foundation.org>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Lee Schermerhorn <Lee.Schermerhorn@hp.com>
v2: update according to Frans' comments.
Currently, if we leave spaces before dst port,
netconsole will silently accept it as 0. Warn about this.
Also, when spaces appear in other places, make them
visible in error messages.
Signed-off-by: WANG Cong <amwang@redhat.com>
Cc: David Miller <davem@davemloft.net>
Acked-by: Neil Horman <nhorman@tuxdriver.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Stanse found that one error path in netpoll_setup dereferences npinfo
even though it is NULL. Avoid that by adding new label and go to that
instead.
Signed-off-by: Jiri Slaby <jslaby@suse.cz>
Cc: Daniel Borkmann <danborkmann@googlemail.com>
Cc: David S. Miller <davem@davemloft.net>
Acked-by: chavey@google.com
Acked-by: Matt Mackall <mpm@selenic.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
WARN_ONCE for ndo_start_xmit() enable interrupts in netpoll_send_skb(),
because the NETPOLL API requires that interrupts remain disabled in
netpoll_send_skb().
Signed-off-by: Dongdong Deng <dongdong.deng@windriver.com>
Acked-by: Matt Mackall <mpm@selenic.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Using early netconsole and gianfar driver this error pops up:
netconsole: timeout waiting for carrier
It appears that net/core/netpoll.c:netpoll_setup() is using
cond_resched() in a loop waiting for a carrier.
The thing is that cond_resched() is a no-op when system_state !=
SYSTEM_RUNNING, and so drivers/net/phy/phy.c's state_queue is never
scheduled, therefore link detection doesn't work.
I belive that the main problem is in cond_resched()[1], but despite
how the cond_resched() story ends, it might be a good idea to call
msleep(1) instead of cond_resched(), as suggested by Andrew Morton.
[1] http://lkml.org/lkml/2009/7/7/463
Signed-off-by: Anton Vorontsov <avorontsov@ru.mvista.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Some PHYs require longer timeouts for carrier detection, and
auto-negotiation process may take indefinite amount of time.
It may be inconvenient to force longer timeouts for sane PHYs,
so let's introduce a kernel command line option.
Since we're using module_param(), the option also can be
changed in runtime.
Signed-off-by: Anton Vorontsov <avorontsov@ru.mvista.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
We would like to get rid of netdev->trans_start = jiffies; that about all net
drivers have to use in their start_xmit() function, and use txq->trans_start
instead.
This can be done generically in core network, as suggested by David.
Some devices, (particularly loopback) dont need trans_start update, because
they dont have transmit watchdog. We could add a new device flag, or rely
on fact that txq->tran_start can be updated is txq->xmit_lock_owner is
different than -1. Use a helper function to hide our choice.
Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Patch to add the ability to detect drops in hardware interfaces via dropwatch.
Adds a tracepoint to net_rx_action to signal everytime a napi instance is
polled. The dropmon code then periodically checks to see if the rx_frames
counter has changed, and if so, adds a drop notification to the netlink
protocol, using the reserved all-0's vector to indicate the drop location was in
hardware, rather than somewhere in the code.
Signed-off-by: Neil Horman <nhorman@tuxdriver.com>
include/linux/net_dropmon.h | 8 ++
include/trace/napi.h | 11 +++
net/core/dev.c | 5 +
net/core/drop_monitor.c | 124 ++++++++++++++++++++++++++++++++++++++++++--
net/core/net-traces.c | 4 +
net/core/netpoll.c | 2
6 files changed, 149 insertions(+), 5 deletions(-)
Signed-off-by: David S. Miller <davem@davemloft.net>
It looks like the dev in netpoll_poll can be NULL - at lease it's
checked at the function beginning. Thus the dev->netde_ops dereference
looks dangerous.
Signed-off-by: Pavel Emelyanov <xemul@openvz.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
Allows for the removal of byteswapping in some places and
the removal of HIPQUAD (replaced by %pI4).
Signed-off-by: Harvey Harrison <harvey.harrison@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
A few months back a race was discused between the netpoll napi service
path, and the fast path through net_rx_action:
http://kerneltrap.org/mailarchive/linux-netdev/2007/10/16/345470
A patch was submitted for that bug, but I think we missed a case.
Consider the following scenario:
INITIAL STATE
CPU0 has one napi_struct A on its poll_list
CPU1 is calling netpoll_send_skb and needs to call poll_napi on the same
napi_struct A that CPU0 has on its list
CPU0 CPU1
net_rx_action poll_napi
!list_empty (returns true) locks poll_lock for A
poll_one_napi
napi->poll
netif_rx_complete
__napi_complete
(removes A from poll_list)
list_entry(list->next)
In the above scenario, net_rx_action assumes that the per-cpu poll_list is
exclusive to that cpu. netpoll of course violates that, and because the netpoll
path can dequeue from the poll list, its possible for CPU0 to detect a non-empty
list at the top of the while loop in net_rx_action, but have it become empty by
the time it calls list_entry. Since the poll_list isn't surrounded by any other
structure, the returned data from that list_entry call in this situation is
garbage, and any number of crashes can result based on what exactly that garbage
is.
Given that its not fasible for performance reasons to place exclusive locks
arround each cpus poll list to provide that mutal exclusion, I think the best
solution is modify the netpoll path in such a way that we continue to guarantee
that the poll_list for a cpu is in fact exclusive to that cpu. To do this I've
implemented the patch below. It adds an additional bit to the state field in
the napi_struct. When executing napi->poll from the netpoll_path, this bit will
be set. When a driver calls netif_rx_complete, if that bit is set, it will not
remove the napi_struct from the poll_list. That work will be saved for the next
iteration of net_rx_action.
I've tested this and it seems to work well. About the biggest drawback I can
see to it is the fact that it might result in an extra loop through
net_rx_action in the event that the device is actually contended for (i.e. the
netpoll path actually preforms all the needed work no the device, and the call
to net_rx_action winds up doing nothing, except removing the napi_struct from
the poll_list. However I think this is probably a small price to pay, given
that the alternative is a crash.
Signed-off-by: Neil Horman <nhorman@tuxdriver.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
This patch moves neigh_setup and hard_start_xmit into the network device ops
structure. For bisection, fix all the previously converted drivers as well.
Bonding driver took the biggest hit on this.
Added a prefetch of the hard_start_xmit in the fast path to try and reduce
any impact this would have.
Signed-off-by: Stephen Hemminger <shemminger@vyatta.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
This patch changes the network device internal API to move adminstrative
operations out of the network device structure and into a separate structure.
This patch involves some hackery to maintain compatablity between the
new and old model, so all 300+ drivers don't have to be changed at once.
For drivers that aren't converted yet, the netdevice_ops virt function list
still resides in the net_device structure. For old protocols, the new
net_device_ops are copied out to the old net_device pointers.
After the transistion is completed the nag message can be changed to
an WARN_ON, and the compatiablity code can be made configurable.
Some function pointers aren't moved:
* destructor can't be in net_device_ops because
it may need to be referenced after the module is unloaded.
* neighbor setup is manipulated in a couple of places that need special
consideration
* hard_start_xmit is in the fast path for transmit.
Signed-off-by: Stephen Hemminger <shemminger@vyatta.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
The first argument to csum_partial is const void *
casts to char/u8 * are not necessary
Signed-off-by: Joe Perches <joe@perches.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
This converts pretty much everything to print_mac. There were
a few things that had conflicts which I have just dropped for
now, no harm done.
I've built an allyesconfig with this and looked at the files
that weren't built very carefully, but it's a huge patch.
Signed-off-by: Johannes Berg <johannes@sipsolutions.net>
Signed-off-by: David S. Miller <davem@davemloft.net>
When support for multiple TX queues were added, the
netif_tx_lock() routines we converted to iterate over
all TX queues and grab each queue's spinlock.
This causes heartburn for lockdep and it's not a healthy
thing to do with lots of TX queues anyways.
So modify this to use a top-level lock and a "frozen"
state for the individual TX queues.
Signed-off-by: David S. Miller <davem@davemloft.net>
This effectively "flips the switch" by making the core networking
and multiqueue-aware drivers use the new TX multiqueue structures.
Non-multiqueue drivers need no changes. The interfaces they use such
as netif_stop_queue() degenerate into an operation on TX queue zero.
So everything "just works" for them.
Code that really wants to do "X" to all TX queues now invokes a
routine that does so, such as netif_tx_wake_all_queues(),
netif_tx_stop_all_queues(), etc.
pktgen and netpoll required a little bit more surgery than the others.
In particular the pktgen changes, whilst functional, could be largely
improved. The initial check in pktgen_xmit() will sometimes check the
wrong queue, which is mostly harmless. The thing to do is probably to
invoke fill_packet() earlier.
The bulk of the netpoll changes is to make the code operate solely on
the TX queue indicated by by the SKB queue mapping.
Setting of the SKB queue mapping is entirely confined inside of
net/core/dev.c:dev_pick_tx(). If we end up needing any kind of
special semantics (drops, for example) it will be implemented here.
Finally, we now have a "real_num_tx_queues" which is where the driver
indicates how many TX queues are actually active.
With IGB changes from Jeff Kirsher.
Signed-off-by: David S. Miller <davem@davemloft.net>
This patch adds needed_headroom/needed_tailroom members to struct
net_device and updates many places that allocate sbks to use them. Not
all of them can be converted though, and I'm sure I missed some (I
mostly grepped for LL_RESERVED_SPACE)
Signed-off-by: Johannes Berg <johannes@sipsolutions.net>
Signed-off-by: David S. Miller <davem@davemloft.net>
zap_completion_queue() retrieves skbs from completion_queue where they have
zero skb->users counter. Before dev_kfree_skb_any() it should be non-zero
yet, so it's increased now.
Reported-and-tested-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Jarek Poplawski <jarkao2@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
Based upon a report by Andrew Morton and code analysis done
by Jarek Poplawski.
This reverts 33f807ba0d ("[NETPOLL]:
Kill NETPOLL_RX_DROP, set but never tested.") and
c7b6ea24b4 ("[NETPOLL]: Don't need
rx_flags.").
The rx_flags did get tested for zero vs. non-zero and therefore we do
need those tests and that code which sets NETPOLL_RX_DROP et al.
Signed-off-by: David S. Miller <davem@davemloft.net>
There are some place, that calculate the ARP header length. These
calculations are correct, but
a) some operate with "magic" constants,
b) enlarge the code length (sometimes at the cost of coding style),
c) are not informative from the first glance.
The proposal is to introduce a helper, that includes all the good
sides of these calculations.
Signed-off-by: Pavel Emelyanov <xemul@openvz.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
The rx_flags variable is redundant. Turning rx on/off is done
via setting the rx_np pointer.
Signed-off-by: Stephen Hemminger <shemminger@linux-foundation.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
The local_mac is managed by the network device, no need to keep a
spare copy and all the management problems that could cause.
Signed-off-by: Stephen Hemminger <shemminger@linux-foundation.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
Restructure code slightly to improve readability:
* dereference device once
* change obvious while() loop
* let poll_napi() handle null list itself
Signed-off-by: Stephen Hemminger <shemminger@linux-foundation.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
Use standard routine for flushing queue.
Signed-off-by: Stephen Hemminger <shemminger@linux-foundation.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
netpoll_poll_lock() synchronizes the ->poll() invocation
code paths, but once we have the lock we have to make
sure that NAPI_STATE_SCHED is still set. Otherwise we
get:
cpu 0 cpu 1
net_rx_action() poll_napi()
netpoll_poll_lock() ... spin on ->poll_lock
->poll()
netif_rx_complete
netpoll_poll_unlock() acquire ->poll_lock()
->poll()
netif_rx_complete()
CRASH
Based upon a bug report from Tina Yang.
Signed-off-by: David S. Miller <davem@davemloft.net>
Many places get the queue_mapping field from skb to pass it to the
netif_subqueue_stopped() which will be 0 in any case.
Make the helper that works with sk_buff
Signed-off-by: Pavel Emelyanov <xemul@openvz.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
Add inline for common usage of hardware header creation, and
fix bug in IPV6 mcast where the assumption about negative return is
an errno. Negative return from hard_header means not enough space
was available,(ie -N bytes).
Signed-off-by: Stephen Hemminger <shemminger@linux-foundation.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
This patch makes most of the generic device layer network
namespace safe. This patch makes dev_base_head a
network namespace variable, and then it picks up
a few associated variables. The functions:
dev_getbyhwaddr
dev_getfirsthwbytype
dev_get_by_flags
dev_get_by_name
__dev_get_by_name
dev_get_by_index
__dev_get_by_index
dev_ioctl
dev_ethtool
dev_load
wireless_process_ioctl
were modified to take a network namespace argument, and
deal with it.
vlan_ioctl_set and brioctl_set were modified so their
hooks will receive a network namespace argument.
So basically anthing in the core of the network stack that was
affected to by the change of dev_base was modified to handle
multiple network namespaces. The rest of the network stack was
simply modified to explicitly use &init_net the initial network
namespace. This can be fixed when those components of the network
stack are modified to handle multiple network namespaces.
For now the ifindex generator is left global.
Fundametally ifindex numbers are per namespace, or else
we will have corner case problems with migration when
we get that far.
At the same time there are assumptions in the network stack
that the ifindex of a network device won't change. Making
the ifindex number global seems a good compromise until
the network stack can cope with ifindex changes when
you change namespaces, and the like.
Signed-off-by: Eric W. Biederman <ebiederm@xmission.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Based upon initial work by Keiichi Kii <k-keiichi@bx.jp.nec.com>.
This patch introduces support for dynamic reconfiguration (adding, removing
and/or modifying parameters of netconsole targets at runtime) using a
userspace interface exported via configfs. Documentation is also updated
accordingly.
Issues and brief design overview:
(1) Kernel-initiated creation / destruction of kernel objects is not
possible with configfs -- the lifetimes of the "config items" is managed
exclusively from userspace. But netconsole must support boot/module
params too, and these are parsed in kernel and hence netpolls must be
setup from the kernel. Joel Becker suggested to separately manage the
lifetimes of the two kinds of netconsole_target objects -- those created
via configfs mkdir(2) from userspace and those specified from the
boot/module option string. This adds complexity and some redundancy here
and also means that boot/module param-created targets are not exposed
through the configfs namespace (and hence cannot be updated / destroyed
dynamically). However, this saves us from locking / refcounting
complexities that would need to be introduced in configfs to support
kernel-initiated item creation / destroy there.
(2) In configfs, item creation takes place in the call chain of the
mkdir(2) syscall in the driver subsystem. If we used an ioctl(2) to
create / destroy objects from userspace, the special userspace program is
able to fill out the structure to be passed into the ioctl and hence
specify attributes such as local interface that are required at the time
we set up the netpoll. For configfs, this information is not available at
the time of mkdir(2). So, we keep all newly-created targets (via
configfs) disabled by default. The user is expected to set various
attributes appropriately (including the local network interface if
required) and then write(2) "1" to the "enabled" attribute. Thus,
netpoll_setup() is then called on the set parameters in the context of
_this_ write(2) on the "enabled" attribute itself. This design enables
the user to reconfigure existing netconsole targets at runtime to be
attached to newly-come-up interfaces that may not have existed when
netconsole was loaded or when the targets were actually created. All this
effectively enables us to get rid of custom ioctls.
(3) Ultra-paranoid configfs attribute show() and store() operations, with
sanity and input range checking, using only safe string primitives, and
compliant with the recommendations in Documentation/filesystems/sysfs.txt.
(4) A new function netpoll_print_options() is created in the netpoll API,
that just prints out the configured parameters for a netpoll structure.
netpoll_parse_options() is modified to use that and it is also exported to
be used from netconsole.
Signed-off-by: Satyam Sharma <satyam@infradead.org>
Acked-by: Keiichi Kii <k-keiichi@bx.jp.nec.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
Several devices have multiple independant RX queues per net
device, and some have a single interrupt doorbell for several
queues.
In either case, it's easier to support layouts like that if the
structure representing the poll is independant from the net
device itself.
The signature of the ->poll() call back goes from:
int foo_poll(struct net_device *dev, int *budget)
to
int foo_poll(struct napi_struct *napi, int budget)
The caller is returned the number of RX packets processed (or
the number of "NAPI credits" consumed if you want to get
abstract). The callee no longer messes around bumping
dev->quota, *budget, etc. because that is all handled in the
caller upon return.
The napi_struct is to be embedded in the device driver private data
structures.
Furthermore, it is the driver's responsibility to disable all NAPI
instances in it's ->stop() device close handler. Since the
napi_struct is privatized into the driver's private data structures,
only the driver knows how to get at all of the napi_struct instances
it may have per-device.
With lots of help and suggestions from Rusty Russell, Roland Dreier,
Michael Chan, Jeff Garzik, and Jamal Hadi Salim.
Bug fixes from Thomas Graf, Roland Dreier, Peter Zijlstra,
Joseph Fannin, Scott Wood, Hans J. Koch, and Michael Chan.
[ Ported to current tree and all drivers converted. Integrated
Stephen's follow-on kerneldoc additions, and restored poll_list
handling to the old style to fix mutual exclusion issues. -DaveM ]
Signed-off-by: Stephen Hemminger <shemminger@linux-foundation.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
This reverts commit 29578624e3.
Ingo Molnar reports complete breakage with his e1000 card (no
networking, card reports transmit timeouts), and bisected it down to
this commit. Let's figure out what went wrong, but not keep breaking
machines until we do.
Cc: Ingo Molnar <mingo@elte.hu>
Cc: Olaf Kirch <olaf.kirch@oracle.com>
Cc: David Miller <davem@davemloft.net>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Keep netpoll/poll_napi from messing with the poll_list.
Only net_rx_action is allowed to manipulate the list.
Signed-off-by: Olaf Kirch <olaf.kirch@oracle.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
93ec2c723e applied excessive duct tape to
the netpoll beast's netpoll_cleanup(), thus substituting one leak with
another, and opening up a little buglet :-)
net_device->npinfo (netpoll_info) is a shared and refcounted object and
cannot simply be set NULL the first time netpoll_cleanup() is called.
Otherwise, further netpoll_cleanup()'s see np->dev->npinfo == NULL and
become no-ops, thus leaking. And it's a bug too: the first call to
netpoll_cleanup() would thus (annoyingly) "disable" other (still alive)
netpolls too. Maybe nobody noticed this because netconsole (only user
of netpoll) never supported multiple netpoll objects earlier.
This is a trivial and obvious one-line fixlet.
Signed-off-by: Satyam Sharma <ssatyam@cse.iitk.ac.in>
Signed-off-by: David S. Miller <davem@davemloft.net>
Add the multiqueue hardware device support API to the core network
stack. Allow drivers to allocate multiple queues and manage them at
the netdev level if they choose to do so.
Added a new field to sk_buff, namely queue_mapping, for drivers to
know which tx_ring to select based on OS classification of the flow.
Signed-off-by: Peter P Waskiewicz Jr <peter.p.waskiewicz.jr@intel.com>
Signed-off-by: Patrick McHardy <kaber@trash.net>
Signed-off-by: David S. Miller <davem@davemloft.net>
>From my recent patch:
> > #1
> > Until kernel ver. 2.6.21 (including) cancel_rearming_delayed_work()
> > required a work function should always (unconditionally) rearm with
> > delay > 0 - otherwise it would endlessly loop. This patch replaces
> > this function with cancel_delayed_work(). Later kernel versions don't
> > require this, so here it's only for uniformity.
But Oleg Nesterov <oleg@tv-sign.ru> found:
> But 2.6.22 doesn't need this change, why it was merged?
>
> In fact, I suspect this change adds a race,
...
His description was right (thanks), so this patch reverts #1.
Signed-off-by: Jarek Poplawski <jarkao2@o2.pl>
Signed-off-by: David S. Miller <davem@davemloft.net>
#1
Until kernel ver. 2.6.21 (including) cancel_rearming_delayed_work()
required a work function should always (unconditionally) rearm with
delay > 0 - otherwise it would endlessly loop. This patch replaces
this function with cancel_delayed_work(). Later kernel versions don't
require this, so here it's only for uniformity.
#2
After deleting a timer in cancel_[rearming_]delayed_work() there could
stay a last skb queued in npinfo->txq causing a memory leak after
kfree(npinfo).
Initial patch & testing by: Jason Wessel <jason.wessel@windriver.com>
Signed-off-by: Jarek Poplawski <jarkao2@o2.pl>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
If sky2 device poll routine is called from netpoll_send_skb, it would
deadlock. The netpoll_send_skb held the netif_tx_lock, and the poll
routine could acquire it to clean up skb's. Other drivers might use
same locking model.
The driver is correct, netpoll should not introduce more locking
problems than it causes already. So change the code to drop lock
before calling poll handler.
Signed-off-by: Stephen Hemminger <shemminger@linux.foundation.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
Remove includes of <linux/smp_lock.h> where it is not used/needed.
Suggested by Al Viro.
Builds cleanly on x86_64, i386, alpha, ia64, powerpc, sparc,
sparc64, and arm (all 59 defconfigs).
Signed-off-by: Randy Dunlap <randy.dunlap@oracle.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
When a transmitted packet is looped back directly, CHECKSUM_PARTIAL
maps to the semantics of CHECKSUM_UNNECESSARY. Therefore we should
treat it as such in the stack.
Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
Signed-off-by: David S. Miller <davem@davemloft.net>
To clearly state the intent of copying to linear sk_buffs, _offset being a
overly long variant but interesting for the sake of saving some bytes.
Signed-off-by: Arnaldo Carvalho de Melo <acme@ghostprotocols.net>
For the common, open coded 'skb->h.raw = skb->data' operation, so that we can
later turn skb->h.raw into a offset, reducing the size of struct sk_buff in
64bit land while possibly keeping it as a pointer on 32bit.
This one touches just the most simple cases:
skb->h.raw = skb->data;
skb->h.raw = {skb_push|[__]skb_pull}()
The next ones will handle the slightly more "complex" cases.
Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
skb_push updates and returns skb->data, so we can just call
skb_reset_network_header after the call to skb_push.
Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
For the common, open coded 'skb->nh.raw = skb->data' operation, so that we can
later turn skb->nh.raw into a offset, reducing the size of struct sk_buff in
64bit land while possibly keeping it as a pointer on 32bit.
This one touches just the most simple case, next will handle the slightly more
"complex" cases.
Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
For the common, open coded 'skb->mac.raw = skb->data' operation, so that we can
later turn skb->mac.raw into a offset, reducing the size of struct sk_buff in
64bit land while possibly keeping it as a pointer on 32bit.
This one touches just the most simple case, next will handle the slightly more
"complex" cases.
Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
In net poll mode, the current checksum function doesn't consider the
kind of packet which is padded to reach a specific minimum length. I
believe that's the problem causing my test case failed. The following
patch fixed this issue.
Signed-off-by: Aubrey.Li <aubreylee@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
current -git doesnt boot on my laptop due to netpoll not unlocking the
tx lock in the else branch.
booted this up on my laptop with lockdep enabled and there are no
locking complaints and it works fine.
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
During boot we get:
netconsole: device eth0 not up yet, forcing it
e1000: eth0: e1000_watchdog: NIC Link is Up 100 Mbps Full Duplex
WARNING (!__warned) at kernel/softirq.c:137 local_bh_enable()
Call Trace:
[<ffffffff80235baf>] local_bh_enable+0x41/0xa3
[<ffffffff8045ab8e>] netpoll_send_skb+0x116/0x144
[<ffffffff8045b1ee>] netpoll_send_udp+0x263/0x271
[<ffffffff803d41ec>] write_msg+0x42/0x5e
[<ffffffff80230c9b>] __call_console_drivers+0x5f/0x70
[<ffffffff80230d19>] _call_console_drivers+0x6d/0x71
[<ffffffff802313f0>] release_console_sem+0x148/0x1ec
[<ffffffff802316ce>] register_console+0x1b1/0x1ba
[<ffffffff803d4178>] init_netconsole+0x54/0x68
[<ffffffff802071ae>] init+0x152/0x308
[<ffffffff804dac8b>] _spin_unlock_irq+0x14/0x30
[<ffffffff8022c15e>] schedule_tail+0x43/0x9f
[<ffffffff8020a758>] child_rip+0xa/0x12
Herbert sayeth:
Normally networking isn't invoked with interrupts turned off, but I
suppose we don't have a choice here. This is unique being a place where you
can get called with BH on, off, or IRQs off.
Given that this is only used for printk, the easiest solution is probably
just to disable local IRQs instead of BH.
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
Back in 2.4 arp requests that were recevied by netpoll were processed
in netconsole_receive_skb, where they were responded to using the src
mac of the request sender. In the 2.6 kernel arp_reply is responsible
for this function, but instead of using the src mac address of the
incomming request, the stored mac address that was registered for the
netconsole application is used. While this is usually ok, it can lead
to failures in netpoll in some situations (specifically situations
where a network may have two gateways, as arp requests from one may be
responded to using the mac address of the other). This patch reverts
the behavior to what we had in 2.4, in which all arp requests are sent
back using the src address of the request sender.
Signed-off-by: Neil Horman <nhorman@tuxdriver.com>
Acked-by: Chris Lalancette <clalance@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Conflicts:
drivers/infiniband/core/iwcm.c
drivers/net/chelsio/cxgb2.c
drivers/net/wireless/bcm43xx/bcm43xx_main.c
drivers/net/wireless/prism54/islpci_eth.c
drivers/usb/core/hub.h
drivers/usb/input/hid-core.c
net/core/netpoll.c
Fix up merge failures with Linus's head and fix new compilation failures.
Signed-Off-By: David Howells <dhowells@redhat.com>
The beast had a long and not very happy history. At one
point, a friend (netdump) had asked that he open up a little.
Well, the friend was long gone now, and the beast had
this dangling piece hanging (netpoll_queue).
It wasn't hard to stitch the netpoll_queue back in
where it belonged and make everything tidy.
Signed-off-by: Stephen Hemminger <shemminger@osdl.org>
The netpoll beast was still not happy. If the beast got
clogged pipes, it tended to stare blankly off in space
for a long time.
The problem couldn't be completely fixed because the
beast talked with irq's disabled. But it could be made
less painful and shorter.
Signed-off-by: Stephen Hemminger <shemminger@osdl.org>
When the netpoll beast got busy, he tended to babble.
Instead of talking out of his large mouth as normal,
he tended to try to snort out other orifices. This lead
to words (skbs) ending up in odd places (like NIT) that
he did not intend.
The normal way of talking wouldn't work, but he could
at least change to using the same tone all the time.
Signed-off-by: Stephen Hemminger <shemminger@osdl.org>
The beast was not always healthy. When it was sick,
it tended to be laconic and not tell anyone the real problem.
A few small changes had it telling the world about its
problems, if they really wanted to hear.
Signed-off-by: Stephen Hemminger <shemminger@osdl.org>
When the netpoll beast got really busy, it tended to clog
things, so it stored them for later. But the beast was putting
all it's skb's in one basket. This was bad because maybe some
pipes were clogged and others were not.
Signed-off-by: Stephen Hemminger <shemminger@osdl.org>
After looking harder, Steve noticed that the netpoll
beast leaked a little every time it shutdown for a nap.
Not a big leak, but a nuisance kind of thing.
He took out his refcount duct tape and patched the
leak. It was overkill since there was already other
locking in that area, but it looked clean and wouldn't
attract fleas.
Signed-off-by: Stephen Hemminger <shemminger@osdl.org>
It was a dark and stormy night when Steve first saw the
netpoll beast. The beast was odd, and misshapen but not
extremely ugly.
"Let me take off one of your warts" he said. This wart
is where you tried to make an skb list yourself. If the
beast had ever run out of memory, he would have stupefied
himself unnecessarily.
The first try was painful, so he tried again till the bleeding
stopped. And again, and again...
Signed-off-by: Stephen Hemminger <shemminger@osdl.org>
Need to fully initialize skb to keep lower layers and queueing happy.
Signed-off-by: Stephen Hemminger <shemminger@osdl.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
Replace CHECKSUM_HW by CHECKSUM_PARTIAL (for outgoing packets, whose
checksum still needs to be completed) and CHECKSUM_COMPLETE (for
incoming packets, device supplied full checksum).
Patch originally from Herbert Xu, updated by myself for 2.6.18-rc3.
Signed-off-by: Patrick McHardy <kaber@trash.net>
Signed-off-by: David S. Miller <davem@davemloft.net>
The netpoll system currently has a rx to tx path via:
netpoll_rx
__netpoll_rx
arp_reply
netpoll_send_skb
dev->hard_start_tx
This rx->tx loop places network drivers at risk of inadvertently causing a
deadlock or BUG halt by recursively trying to acquire a spinlock that is
used in both their rx and tx paths (this problem was origionally reported
to me in the 3c59x driver, which shares a spinlock between the
boomerang_interrupt and boomerang_start_xmit routines).
This patch breaks this loop, by queueing arp frames, so that they can be
responded to after all receive operations have been completed. Tested by
myself and the reported with successful results.
Specifically it was tested with netdump. Heres the BZ with details:
https://bugzilla.redhat.com/bugzilla/show_bug.cgi?id=194055
Signed-off-by: Neil Horman <nhorman@tuxdriver.com>
Acked-by: Matt Mackall <mpm@selenic.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
When transmitting a skb in netpoll_send_skb(), only retry a limited number
of times if the device queue is stopped.
Signed-off-by: Jeremy Fitzhardinge <jeremy@goop.org>
Acked-by: Matt Mackall <mpm@selenic.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
Various drivers use xmit_lock internally to synchronise with their
transmission routines. They do so without setting xmit_lock_owner.
This is fine as long as netpoll is not in use.
With netpoll it is possible for deadlocks to occur if xmit_lock_owner
isn't set. This is because if a printk occurs while xmit_lock is held
and xmit_lock_owner is not set can cause netpoll to attempt to take
xmit_lock recursively.
While it is possible to resolve this by getting netpoll to use
trylock, it is suboptimal because netpoll's sole objective is to
maximise the chance of getting the printk out on the wire. So
delaying or dropping the message is to be avoided as much as possible.
So the only alternative is to always set xmit_lock_owner. The
following patch does this by introducing the netif_tx_lock family of
functions that take care of setting/unsetting xmit_lock_owner.
I renamed xmit_lock to _xmit_lock to indicate that it should not be
used directly. I didn't provide irq versions of the netif_tx_lock
functions since xmit_lock is meant to be a BH-disabling lock.
This is pretty much a straight text substitution except for a small
bug fix in winbond. It currently uses
netif_stop_queue/spin_unlock_wait to stop transmission. This is
unsafe as an IRQ can potentially wake up the queue. So it is safer to
use netif_tx_disable.
The hamradio bits used spin_lock_irq but it is unnecessary as
xmit_lock must never be taken in an IRQ handler.
Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
Signed-off-by: David S. Miller <davem@davemloft.net>
This patch turns the RTNL from a semaphore to a new 2.6.16 mutex and
gets rid of some of the leftover legacy.
Signed-off-by: Stephen Hemminger <shemminger@osdl.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
This replaces a memcmp() with is_zero_ether_addr().
Signed-off-by: Kris Katterjohn <kjak@users.sourceforge.net>
Signed-off-by: David S. Miller <davem@davemloft.net>
To help in reducing the number of include dependencies, several files were
touched as they were getting needed headers indirectly for stuff they use.
Thanks also to Alan Menegotto for pointing out that net/dccp/proto.c had
linux/dccp.h include twice.
Signed-off-by: Arnaldo Carvalho de Melo <acme@mandriva.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Here is the patch that introduces the generic skb_checksum_complete
which also checks for hardware RX checksum faults. If that happens,
it'll call netdev_rx_csum_fault which currently prints out a stack
trace with the device name. In future it can turn off RX checksum.
I've converted every spot under net/ that does RX checksum checks to
use skb_checksum_complete or __skb_checksum_complete with the
exceptions of:
* Those places where checksums are done bit by bit. These will call
netdev_rx_csum_fault directly.
* The following have not been completely checked/converted:
ipmr
ip_vs
netfilter
dccp
This patch is based on patches and suggestions from Stephen Hemminger
and David S. Miller.
Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
Signed-off-by: David S. Miller <davem@davemloft.net>
The following patch renames __in_dev_get() to __in_dev_get_rtnl() and
introduces __in_dev_get_rcu() to cover the second case.
1) RCU with refcnt should use in_dev_get().
2) RCU without refcnt should use __in_dev_get_rcu().
3) All others must hold RTNL and use __in_dev_get_rtnl().
There is one exception in net/ipv4/route.c which is in fact a pre-existing
race condition. I've marked it as such so that we remember to fix it.
This patch is based on suggestions and prior work by Suzanne Wood and
Paul McKenney.
Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
Signed-off-by: David S. Miller <davem@davemloft.net>
This converts the final 20 DEFINE_SPINLOCK holdouts. (another 580 places
are already using DEFINE_SPINLOCK). Build tested on x86.
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
This fixes a race during initialization with the NAPI softirq
processing by using an RCU approach.
This race was discovered when refill_skbs() was added to
the setup code.
Signed-off-by: Matt Mackall <mpm@selenic.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
we could do one thing (see the patch below): i think it would be useful
to fill up the netlogging skb queue straight at initialization time.
Especially if netpoll is used for dumping alone, the system might not be
in a situation to fill up the queue at the point of crash, so better be
a bit more prepared and keep the pipeline filled.
[ I've modified this to be called earlier - mpm ]
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Signed-off-by: Matt Mackall <mpm@selenic.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Add limited retry logic to netpoll_send_skb
Each time we attempt to send, decrement our per-device retry counter.
On every successful send, we reset the counter.
We delay 50us between attempts with up to 20000 retries for a total of
1 second. After we've exhausted our retries, subsequent failed
attempts will try only once until reset by success.
Signed-off-by: Matt Mackall <mpm@selenic.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Minor netpoll_send_skb restructuring
Restructure to avoid confusing goto and move some bits out of the
retry loop.
Signed-off-by: Matt Mackall <mpm@selenic.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
This fixes an obvious deadlock in the netpoll code. netpoll_rx takes the
npinfo->rx_lock. netpoll_rx is also the only caller of arp_reply (through
__netpoll_rx). As such, it is not necessary to take this lock.
Signed-off-by: Jeff Moyer <jmoyer@redhat.com>
Signed-off-by: Matt Mackall <mpm@selenic.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Initialize npinfo->rx_flags. The way it stands now, this will have random
garbage, and so will incur a locking penalty even when an rx_hook isn't
registered and we are not active in the netpoll polling code.
Signed-off-by: Jeff Moyer <jmoyer@redhat.com>
Signed-off-by: Matt Mackall <mpm@selenic.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
This patch provides support for registering multiple netpoll clients to the
same network device. Only one of these clients may register an rx_hook,
however. In practice, this restriction has not been problematic. It is
worth mentioning, though, that the current design can be easily extended to
allow for the registration of multiple rx_hooks.
The basic idea of the patch is that the rx_np pointer in the netpoll_info
structure points to the struct netpoll that has rx_hook filled in. Aside
from this one case, there is no need for a pointer from the struct
net_device to an individual struct netpoll.
A lock is introduced to protect the setting and clearing of the np_rx
pointer. The pointer will only be cleared upon netpoll client module
removal, and the lock should be uncontested.
Signed-off-by: Jeff Moyer <jmoyer@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
This patch introduces a netpoll_info structure, which the struct net_device
will now point to instead of pointing to a struct netpoll. The reason for
this is two-fold: 1) fields such as the rx_flags, poll_owner, and poll_lock
should be maintained per net_device, not per netpoll; and 2) this is a first
step in providing support for multiple netpoll clients to register against the
same net_device.
The struct netpoll is now pointed to by the netpoll_info structure. As
such, the previous behaviour of the code is preserved.
Signed-off-by: Jeff Moyer <jmoyer@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Initial git repository build. I'm not bothering with the full history,
even though we have it. We can create a separate "historical" git
archive of that later if we want to, and in the meantime it's about
3.2GB when imported into git - space that would just make the early
git days unnecessarily complicated, when we don't have a lot of good
infrastructure for it.
Let it rip!