percpu.h is included by sched.h and module.h and thus ends up being
included when building most .c files. percpu.h includes slab.h which
in turn includes gfp.h making everything defined by the two files
universally available and complicating inclusion dependencies.
percpu.h -> slab.h dependency is about to be removed. Prepare for
this change by updating users of gfp and slab facilities include those
headers directly instead of assuming availability. As this conversion
needs to touch large number of source files, the following script is
used as the basis of conversion.
http://userweb.kernel.org/~tj/misc/slabh-sweep.py
The script does the followings.
* Scan files for gfp and slab usages and update includes such that
only the necessary includes are there. ie. if only gfp is used,
gfp.h, if slab is used, slab.h.
* When the script inserts a new include, it looks at the include
blocks and try to put the new include such that its order conforms
to its surrounding. It's put in the include block which contains
core kernel includes, in the same order that the rest are ordered -
alphabetical, Christmas tree, rev-Xmas-tree or at the end if there
doesn't seem to be any matching order.
* If the script can't find a place to put a new include (mostly
because the file doesn't have fitting include block), it prints out
an error message indicating which .h file needs to be added to the
file.
The conversion was done in the following steps.
1. The initial automatic conversion of all .c files updated slightly
over 4000 files, deleting around 700 includes and adding ~480 gfp.h
and ~3000 slab.h inclusions. The script emitted errors for ~400
files.
2. Each error was manually checked. Some didn't need the inclusion,
some needed manual addition while adding it to implementation .h or
embedding .c file was more appropriate for others. This step added
inclusions to around 150 files.
3. The script was run again and the output was compared to the edits
from #2 to make sure no file was left behind.
4. Several build tests were done and a couple of problems were fixed.
e.g. lib/decompress_*.c used malloc/free() wrappers around slab
APIs requiring slab.h to be added manually.
5. The script was run on all .h files but without automatically
editing them as sprinkling gfp.h and slab.h inclusions around .h
files could easily lead to inclusion dependency hell. Most gfp.h
inclusion directives were ignored as stuff from gfp.h was usually
wildly available and often used in preprocessor macros. Each
slab.h inclusion directive was examined and added manually as
necessary.
6. percpu.h was updated not to include slab.h.
7. Build test were done on the following configurations and failures
were fixed. CONFIG_GCOV_KERNEL was turned off for all tests (as my
distributed build env didn't work with gcov compiles) and a few
more options had to be turned off depending on archs to make things
build (like ipr on powerpc/64 which failed due to missing writeq).
* x86 and x86_64 UP and SMP allmodconfig and a custom test config.
* powerpc and powerpc64 SMP allmodconfig
* sparc and sparc64 SMP allmodconfig
* ia64 SMP allmodconfig
* s390 SMP allmodconfig
* alpha SMP allmodconfig
* um on x86_64 SMP allmodconfig
8. percpu.h modifications were reverted so that it could be applied as
a separate patch and serve as bisection point.
Given the fact that I had only a couple of failures from tests on step
6, I'm fairly confident about the coverage of this conversion patch.
If there is a breakage, it's likely to be something in one of the arch
headers which should be easily discoverable easily on most builds of
the specific arch.
Signed-off-by: Tejun Heo <tj@kernel.org>
Guess-its-ok-by: Christoph Lameter <cl@linux-foundation.org>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Lee Schermerhorn <Lee.Schermerhorn@hp.com>
Currently we treat IGMPv3 reports as if it were an IGMPv2/v1 report.
This is broken as IGMPv3 reports are formatted differently. So we
end up suppressing a bogus multicast group (which should be harmless
as long as the leading reserved field is zero).
In fact, IGMPv3 does not allow membership report suppression so
we should simply ignore IGMPv3 membership reports as a host.
This patch does exactly that. I kept the case statement for it
so people won't accidentally add it back thinking that we overlooked
this case.
Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
Signed-off-by: David S. Miller <davem@davemloft.net>
Almost all igmp functions accessing inet->mc_list are protected by
rtnl_lock(), but there is one exception which is ip_mc_sf_allow(),
so there is a chance of either ip_mc_drop_socket or ip_mc_leave_group
remove an entry while ip_mc_sf_allow is running causing a crash.
Signed-off-by: Flavio Leitner <fleitner@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
__net_init/__net_exit are apparently not going away, so use them
to full extent.
In some cases __net_init was removed, because it was called from
__net_exit code.
Signed-off-by: Alexey Dobriyan <adobriyan@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Not including net/atm/
Compiled tested x86 allyesconfig only
Added a > 80 column line or two, which I ignored.
Existing checkpatch plaints willfully, cheerfully ignored.
Signed-off-by: Joe Perches <joe@perches.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
We need to use next_det_device_rcu() in RCU protected section.
We also can avoid in_dev_get()/in_dev_put() overhead (code size mainly)
in rcu_read_lock() sections.
Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
Acked-by: Stephen Hemminger <shemminger@vyatta.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
This also needs to be optimized for large number of devices.
Signed-off-by: Stephen Hemminger <shemminger@vyatta.com>
Acked-by: Eric Dumazet <eric.dumazet@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
This patch fixes commit e36b9d16c6. The approach
there is to call dev_close()/dev_open() whenever the device type is changed in
order to remap the device IP multicast addresses to HW multicast addresses.
This approach suffers from 2 drawbacks:
*. It assumes tha the device is UP when calling dev_close(), or otherwise
dev_close() has no affect. It is worth to mention that initscripts (Redhat)
and sysconfig (Suse) doesn't act the same in this matter.
*. dev_close() has other side affects, like deleting entries from the routing
table, which might be unnecessary.
The fix here is to directly remap the IP multicast addresses to HW multicast
addresses for a bonding device that changes its type, and nothing else.
Reported-by: Jason Gunthorpe <jgunthorpe@obsidianresearch.com>
Signed-off-by: Moni Shoua <monis@voltaire.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Define three accessors to get/set dst attached to a skb
struct dst_entry *skb_dst(const struct sk_buff *skb)
void skb_dst_set(struct sk_buff *skb, struct dst_entry *dst)
void skb_dst_drop(struct sk_buff *skb)
This one should replace occurrences of :
dst_release(skb->dst)
skb->dst = NULL;
Delete skb->dst field
Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Define skb_rtable(const struct sk_buff *skb) accessor to get rtable from skb
Delete skb->rtable field
Setting rtable is not allowed, just set dst instead as rtable is an alias.
Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
After some discussion offline with Christoph Lameter and David Stevens
regarding multicast behaviour in Linux, I'm submitting a slightly
modified patch from the one Christoph submitted earlier.
This patch provides a new socket option IP_MULTICAST_ALL.
In this case, default behaviour is _unchanged_ from the current
Linux standard. The socket option is set by default to provide
original behaviour. Sockets wishing to receive data only from
multicast groups they join explicitly will need to clear this
socket option.
Signed-off-by: Nivedita Singhvi <niv@us.ibm.com>
Signed-off-by: Christoph Lameter<cl@linux.com>
Acked-by: David Stevens <dlstevens@us.ibm.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
This patch makes the followinf proc entries per-netns:
/proc/net/igmp
/proc/net/mcfilter
Signed-off-by: Alexey Dobriyan <adobriyan@gmail.com>
Acked-by: Daniel Lezcano <dlezcano@fr.ibm.com>
Acked-by: Benjamin Thery <benjamin.thery@bull.net>
Signed-off-by: David S. Miller <davem@davemloft.net>
Clean up the various different email addresses of mine listed in the code
to a single current and valid address. As Dave says his network merges
for 2.6.28 are now done this seems a good point to send them in where
they won't risk disrupting real changes.
Signed-off-by: Alan Cox <alan@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
This patch add mc_count to struct in_device and updates
increment/decrement/initilaize of this field in IPv4 and in IPv6.
- Also printing the vfs /proc entry (/proc/net/igmp) is adjusted to
use the new mc_count.
Signed-off-by: Rami Rosen <ramirose@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
This patch makes the multicast socket to be per namespace.
When a network namespace is created, other than the init_net and a
multicast packet is received, the kernel goes to a hang or a kernel panic.
How to reproduce ?
* create a child network namespace
* create a pair virtual device veth
* ip link add type veth
* move one side to the pair network device to the child namespace
* ip link set netns <childpid> dev veth1
* ping -I veth0 224.0.0.1
The bug appears because the function ip_mc_init_dev does not initialize
the different multicast fields as it exits because it is not the init_net.
BUG: soft lockup - CPU#0 stuck for 61s! [avahi-daemon:2695]
Modules linked in:
irq event stamp: 50350
hardirqs last enabled at (50349): [<c03ee949>] _spin_unlock_irqrestore+0x34/0x39
hardirqs last disabled at (50350): [<c03ec639>] schedule+0x9f/0x5ff
softirqs last enabled at (45712): [<c0374d4b>] ip_setsockopt+0x8e7/0x909
softirqs last disabled at (45710): [<c03ee682>] _spin_lock_bh+0x8/0x27
Pid: 2695, comm: avahi-daemon Not tainted (2.6.27-rc2-00029-g0872073 #3)
EIP: 0060:[<c03ee47c>] EFLAGS: 00000297 CPU: 0
EIP is at __read_lock_failed+0x8/0x10
EAX: c4f38810 EBX: c4f38810 ECX: 00000000 EDX: c04cc22e
ESI: fb0000e0 EDI: 00000011 EBP: 0f02000a ESP: c4e3faa0
DS: 007b ES: 007b FS: 00d8 GS: 0033 SS: 0068
CR0: 8005003b CR2: 44618a40 CR3: 04e37000 CR4: 000006d0
DR0: 00000000 DR1: 00000000 DR2: 00000000 DR3: 00000000
DR6: ffff0ff0 DR7: 00000400
[<c02311f8>] ? _raw_read_lock+0x23/0x25
[<c0390666>] ? ip_check_mc+0x1c/0x83
[<c036d478>] ? ip_route_input+0x229/0xe92
[<c022e2e4>] ? trace_hardirqs_on_thunk+0xc/0x10
[<c0104c9c>] ? do_IRQ+0x69/0x7d
[<c0102e64>] ? restore_nocheck_notrace+0x0/0xe
[<c036fdba>] ? ip_rcv+0x227/0x505
[<c0358764>] ? netif_receive_skb+0xfe/0x2b3
[<c03588d2>] ? netif_receive_skb+0x26c/0x2b3
[<c035af31>] ? process_backlog+0x73/0xbd
[<c035a8cd>] ? net_rx_action+0xc1/0x1ae
[<c01218a8>] ? __do_softirq+0x7b/0xef
[<c0121953>] ? do_softirq+0x37/0x4d
[<c035b50d>] ? dev_queue_xmit+0x3d4/0x40b
[<c0122037>] ? local_bh_enable+0x96/0xab
[<c035b50d>] ? dev_queue_xmit+0x3d4/0x40b
[<c012181e>] ? _local_bh_enable+0x79/0x88
[<c035fcb8>] ? neigh_resolve_output+0x20f/0x239
[<c0373118>] ? ip_finish_output+0x1df/0x209
[<c0373364>] ? ip_dev_loopback_xmit+0x62/0x66
[<c0371db5>] ? ip_local_out+0x15/0x17
[<c0372013>] ? ip_push_pending_frames+0x25c/0x2bb
[<c03891b8>] ? udp_push_pending_frames+0x2bb/0x30e
[<c038a189>] ? udp_sendmsg+0x413/0x51d
[<c038a1a9>] ? udp_sendmsg+0x433/0x51d
[<c038f927>] ? inet_sendmsg+0x35/0x3f
[<c034f092>] ? sock_sendmsg+0xb8/0xd1
[<c012d554>] ? autoremove_wake_function+0x0/0x2b
[<c022e6de>] ? copy_from_user+0x32/0x5e
[<c022e6de>] ? copy_from_user+0x32/0x5e
[<c034f238>] ? sys_sendmsg+0x18d/0x1f0
[<c0175e90>] ? pipe_write+0x3cb/0x3d7
[<c0170347>] ? do_sync_write+0xbe/0x105
[<c012d554>] ? autoremove_wake_function+0x0/0x2b
[<c03503b2>] ? sys_socketcall+0x176/0x1b0
[<c01085ea>] ? syscall_trace_enter+0x6c/0x7b
[<c0102e1a>] ? syscall_call+0x7/0xb
Signed-off-by: Daniel Lezcano <dlezcano@fr.ibm.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Without CONFIG_NET_NS, namespace is always &init_net.
Compiler will be able to omit namespace comparisons with this patch.
Signed-off-by: YOSHIFUJI Hideaki <yoshfuji@linux-ipv6.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
This patch removes CVS keywords that weren't updated for a long time
from comments.
Signed-off-by: Adrian Bunk <bunk@kernel.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
This patch adds needed_headroom/needed_tailroom members to struct
net_device and updates many places that allocate sbks to use them. Not
all of them can be converted though, and I'm sure I missed some (I
mostly grepped for LL_RESERVED_SPACE)
Signed-off-by: Johannes Berg <johannes@sipsolutions.net>
Signed-off-by: David S. Miller <davem@davemloft.net>
Introduce per-sock inlines: sock_net(), sock_net_set()
and per-inet_timewait_sock inlines: twsk_net(), twsk_net_set().
Without CONFIG_NET_NS, no namespace other than &init_net exists.
Let's explicitly define them to help compiler optimizations.
Signed-off-by: YOSHIFUJI Hideaki <yoshfuji@linux-ipv6.org>
Introduce per-net_device inlines: dev_net(), dev_net_set().
Without CONFIG_NET_NS, no namespace other than &init_net exists.
Let's explicitly define them to help compiler optimizations.
Signed-off-by: YOSHIFUJI Hideaki <yoshfuji@linux-ipv6.org>
(Anonymous) unions can help us to avoid ugly casts.
A common cast it the (struct rtable *)skb->dst one.
Defining an union like :
union {
struct dst_entry *dst;
struct rtable *rtable;
};
permits to use skb->rtable in place.
Signed-off-by: Eric Dumazet <dada1@cosmosbay.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Do not calls hooks from device notifiers and disallow configuration from
ioctl/netlink layer.
Signed-off-by: Denis V. Lunev <den@openvz.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
Needed to propagate it down to the ip_route_output_flow.
Signed-off-by: Denis V. Lunev <den@openvz.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
in_dev_find() need a namespace to pass it to fib_get_table(), so add
an argument.
Signed-off-by: Denis V. Lunev <den@openvz.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
Partial sparse warning fix. The other conditional locking
is too much for sparse to handle.
Signed-off-by: Stephen Hemminger <shemminger@vyatta.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
These are scattered over the code, but almost all the
"critical" places already have the proper struct net
at hand except for snmp proc showing function and routing
rtnl handler.
Signed-off-by: Pavel Emelyanov <xemul@openvz.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
Most callers of the LOCAL_OUT chain will set the IP packet length and
header checksum before doing so. They also share the same output
function dst_output.
This patch creates a new function called ip_local_out which does all
of that and converts the appropriate users over to it.
Apart from removing duplicate code, it will also help in merging the
IPsec output path once the same thing is done for IPv6.
Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
Signed-off-by: David S. Miller <davem@davemloft.net>
Many-many code in the kernel initialized the timer->function
and timer->data together with calling init_timer(timer). There
is already a helper for this. Use it for networking code.
The patch is HUGE, but makes the code 130 lines shorter
(98 insertions(+), 228 deletions(-)).
Signed-off-by: Pavel Emelyanov <xemul@openvz.org>
Acked-by: Arnaldo Carvalho de Melo <acme@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
This concerns the ipv4 and ipv6 code mostly, but also the netlink
and unix sockets.
The netlink code is an example of how to use the __seq_open_private()
call - it saves the net namespace on this private.
Signed-off-by: Pavel Emelyanov <xemul@openvz.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
Fix a bunch of sparse warnings. Mostly about 0 used as
NULL pointer, and shadowed variable declarations.
One notable case was that hash size should have been unsigned.
Signed-off-by: Stephen Hemminger <shemminger@linux-foundation.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
This patch makes most of the generic device layer network
namespace safe. This patch makes dev_base_head a
network namespace variable, and then it picks up
a few associated variables. The functions:
dev_getbyhwaddr
dev_getfirsthwbytype
dev_get_by_flags
dev_get_by_name
__dev_get_by_name
dev_get_by_index
__dev_get_by_index
dev_ioctl
dev_ethtool
dev_load
wireless_process_ioctl
were modified to take a network namespace argument, and
deal with it.
vlan_ioctl_set and brioctl_set were modified so their
hooks will receive a network namespace argument.
So basically anthing in the core of the network stack that was
affected to by the change of dev_base was modified to handle
multiple network namespaces. The rest of the network stack was
simply modified to explicitly use &init_net the initial network
namespace. This can be fixed when those components of the network
stack are modified to handle multiple network namespaces.
For now the ifindex generator is left global.
Fundametally ifindex numbers are per namespace, or else
we will have corner case problems with migration when
we get that far.
At the same time there are assumptions in the network stack
that the ifindex of a network device won't change. Making
the ifindex number global seems a good compromise until
the network stack can cope with ifindex changes when
you change namespaces, and the like.
Signed-off-by: Eric W. Biederman <ebiederm@xmission.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
This patch makes /proc/net per network namespace. It modifies the global
variables proc_net and proc_net_stat to be per network namespace.
The proc_net file helpers are modified to take a network namespace argument,
and all of their callers are fixed to pass &init_net for that argument.
This ensures that all of the /proc/net files are only visible and
usable in the initial network namespace until the code behind them
has been updated to be handle multiple network namespaces.
Making /proc/net per namespace is necessary as at least some files
in /proc/net depend upon the set of network devices which is per
network namespace, and even more files in /proc/net have contents
that are relevant to a single network namespace.
Signed-off-by: Eric W. Biederman <ebiederm@xmission.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
This patch converts the ipv4_devconf config members (everything except
sysctl) to an array. This allows easier manipulation which will be
needed later on to provide better management of default config values.
Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
Signed-off-by: David S. Miller <davem@davemloft.net>
Cleanup of dev_base list use, with the aim to simplify making device
list per-namespace. In almost every occasion, use of dev_base variable
and dev->next pointer could be easily replaced by for_each_netdev
loop. A few most complicated places were converted to using
first_netdev()/next_netdev().
Signed-off-by: Pavel Emelianov <xemul@openvz.org>
Acked-by: Kirill Korotaev <dev@openvz.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
So that it is also an offset from skb->head, reduces its size from 8 to 4 bytes
on 64bit architectures, allowing us to combine the 4 bytes hole left by the
layer headers conversion, reducing struct sk_buff size to 256 bytes, i.e. 4
64byte cachelines, and since the sk_buff slab cache is SLAB_HWCACHE_ALIGN...
:-)
Many calculations that previously required that skb->{transport,network,
mac}_header be first converted to a pointer now can be done directly, being
meaningful as offsets or pointers.
Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Renaming skb->h to skb->transport_header, skb->nh to skb->network_header and
skb->mac to skb->mac_header, to match the names of the associated helpers
(skb[_[re]set]_{transport,network,mac}_header).
Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
This time we have to set it to skb->tail that is not anymore equal to
skb->data, so we either add a new helper or just add the skb->tail - skb->data
offset, for now do the later.
Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
For the places where we need a pointer to the transport header, it is
still legal to touch skb->h.raw directly if just adding to,
subtracting from or setting it to another layer header.
Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
The seq_file operations stuff can be marked constant to
get it out of dirty cache.
Signed-off-by: Stephen Hemminger <shemminger@linux-foundation.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
But only in the cases where its a newly allocated skb, i.e. one where skb->tail
is equal to skb->data, or just after skb_reserve, where this requirement is
maintained.
Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Kill warning about unused variable `in_dev' when CONFIG_IP_MULTICAST
is not set.
Signed-off-by: Geert Uytterhoeven <Geert.Uytterhoeven@sonycom.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
In active-backup mode, the current bonding code duplicates IGMP
traffic to all slaves, so that switches are up to date in case of a
failover from an active to a backup interface. If bonding then fails
back to the original active interface, it is likely that the "active
slave" switch's IGMP forwarding for the port will be out of date until
some event occurs to refresh the switch (e.g., a membership query).
This patch alters the behavior of bonding to no longer flood
IGMP to all ports, and to issue IGMP JOINs to the newly active port at
the time of a failover. This insures that switches are kept up to date
for all cases.
"GOELLESCH Niels" <niels.goellesch@eurocontrol.int> originally
reported this problem, and included a patch. His original patch was
modified by Jay Vosburgh to additionally remove the existing IGMP flood
behavior, use RCU, streamline code paths, fix trailing white space, and
adjust for style.
Signed-off-by: Jay Vosburgh <fubar@us.ibm.com>
Signed-off-by: Jeff Garzik <jeff@garzik.org>