We should get rid of all own SCTP debug printk macros and use the ones
that the kernel offers anyway instead. This makes the code more readable
and conform to the kernel code, and offers all the features of dynamic
debbuging that pr_debug() et al has, such as only turning on/off portions
of debug messages at runtime through debugfs. The runtime cost of having
CONFIG_DYNAMIC_DEBUG enabled, but none of the debug statements printing,
is negligible [1]. If kernel debugging is completly turned off, then these
statements will also compile into "empty" functions.
While we're at it, we also need to change the Kconfig option as it /now/
only refers to the ifdef'ed code portions in outqueue.c that enable further
debugging/tracing of SCTP transaction fields. Also, since SCTP_ASSERT code
was enabled with this Kconfig option and has now been removed, we
transform those code parts into WARNs resp. where appropriate BUG_ONs so
that those bugs can be more easily detected as probably not many people
have SCTP debugging permanently turned on.
To turn on all SCTP debugging, the following steps are needed:
# mount -t debugfs none /sys/kernel/debug
# echo -n 'module sctp +p' > /sys/kernel/debug/dynamic_debug/control
This can be done more fine-grained on a per file, per line basis and others
as described in [2].
[1] https://www.kernel.org/doc/ols/2009/ols2009-pages-39-46.pdf
[2] Documentation/dynamic-debug-howto.txt
Signed-off-by: Daniel Borkmann <dborkman@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
In order to avoid making code that deals with printing both, IPv4 and
IPv6 addresses, unnecessary complicated as for example ...
if (sa.sa_family == AF_INET6)
printk("... %pI6 ...", ..sin6_addr);
else
printk("... %pI4 ...", ..sin_addr.s_addr);
... it would be better to introduce a format specifier that can deal
with those kind of situations internally; just as we have a "struct
sockaddr" for generic mapping into "struct sockaddr_in" or "struct
sockaddr_in6" as e.g. done in "union sctp_addr". Then, we could
reduce the above statement into something like:
printk("... %pIS ..", &sockaddr);
In case our pointer is NULL, pointer() then deals with that already at
an earlier point in time internally. While we're at it, support for both
%piS/%pIS, where 'S' stands for sockaddr, comes (almost) for free.
Additionally to that, postfix specifiers 'p', 'f' and 's' are supported
as suggested and initially implemented in 2009 by Joe Perches [1].
Handling of those additional specifiers orientate on the initial RFC that
was proposed. Also we support IPv6 compressed format specified by 'c' and
various other IPv4 extensions as stated in the documentation part.
Likely, there are many other areas than just SCTP in the kernel to make
use of this extension as well.
[1] http://patchwork.ozlabs.org/patch/31480/
Signed-off-by: Daniel Borkmann <dborkman@redhat.com>
CC: Joe Perches <joe@perches.com>
CC: linux-kernel@vger.kernel.org
Signed-off-by: David S. Miller <davem@davemloft.net>
Stephen Hemminger says:
====================
Here is current updates for vxlan in net-next. It includes Mike's changes
to handle multiple destinations and lots of little cosmetic stuff.
This is a fresh vxlan-next repository which was forked from net-next.
====================
Signed-off-by: David S. Miller <davem@davemloft.net>
Two of the x25 ioctl cases have error paths that break out of the function without
unlocking the socket, leading to this warning:
================================================
[ BUG: lock held when returning to user space! ]
3.10.0-rc7+ #36 Not tainted
------------------------------------------------
trinity-child2/31407 is leaving the kernel with locks still held!
1 lock held by trinity-child2/31407:
#0: (sk_lock-AF_X25){+.+.+.}, at: [<ffffffffa024b6da>] x25_ioctl+0x8a/0x740 [x25]
Signed-off-by: Dave Jones <davej@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Following typical setup to implement a ~100 ms RTT and big
amount of reorders has very poor performance because netem
implements the time queue using a linked list.
-----------------------------------------------------------
ETH=eth0
IFB=ifb0
modprobe ifb
ip link set dev $IFB up
tc qdisc add dev $ETH ingress 2>/dev/null
tc filter add dev $ETH parent ffff: \
protocol ip u32 match u32 0 0 flowid 1:1 action mirred egress \
redirect dev $IFB
ethtool -K $ETH gro off tso off gso off
tc qdisc add dev $IFB root netem delay 50ms 10ms limit 100000
tc qd add dev $ETH root netem delay 50ms limit 100000
---------------------------------------------------------
Switch netem time queue to a rb tree, so this kind of setup can work at
high speed.
Signed-off-by: Eric Dumazet <edumazet@google.com>
Cc: Stephen Hemminger <stephen@networkplumber.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
A standard Gobi 3000 reference design module.
Reported-by: Richard Weinberger <richard@nod.at>
Signed-off-by: Bjørn Mork <bjorn@mork.no>
Signed-off-by: David S. Miller <davem@davemloft.net>
The MC8305 module got an additional entry added based solely on
information from a Windows driver *.inf file. We now have the
actual descriptor layout from one of these modules, and it
consists of two alternate configurations where cfg #1 is a
normal Gobi 2k layout and cfg #2 is MBIM only, using interface
numbers 5 and 6 for MBIM control and data. The extra Windows
driver entry for interface number 5 was most likely a bug.
Deleting the bogus entry to avoid unnecessary qmi_wwan probe
failures when using the MBIM configuration.
Reported-by: Lana Black <sickmind@lavabit.com>
Signed-off-by: Bjørn Mork <bjorn@mork.no>
Signed-off-by: David S. Miller <davem@davemloft.net>
Change Low Latency Sockets code for select and poll so that
when LLS is disabled sched_clock() is never called.
Also, avoid sending POLL_LL to sockets if disabled.
Reported-by: Andi Kleen <andi@firstfloor.org>
Signed-off-by: Eliezer Tamir <eliezer.tamir@linux.intel.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Our use of sched_clock is OK because we don't mind the side effects
of calling it and occasionally waking up on a different CPU.
When CONFIG_DEBUG_PREEMPT is on, disable preempt before calling
sched_clock() so we don't trigger a debug_smp_processor_id() warning.
Reported-by: Cody P Schafer <devel-lists@codyps.com>
Signed-off-by: Eliezer Tamir <eliezer.tamir@linux.intel.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
There is a race in neighbour code, because neigh_destroy() uses
skb_queue_purge(&neigh->arp_queue) without holding neighbour lock,
while other parts of the code assume neighbour rwlock is what
protects arp_queue
Convert all skb_queue_purge() calls to the __skb_queue_purge() variant
Use __skb_queue_head_init() instead of skb_queue_head_init()
to make clear we do not use arp_queue.lock
And hold neigh->lock in neigh_destroy() to close the race.
Reported-by: Joe Jin <joe.jin@oracle.com>
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
This patch fixed that enh_desc value is always zero.
Due to calling order of stmmac_selec_desc_mode(), enh_desc value is always zero.
Even though mac is set to use enhanced dma descriptor, if enh_desc is zero,
functions related dma descriptor are not working correctly.
Signed-off-by: Byungho An <bh74.an@samsung.com>
Acked-by: Giuseppe Cavallaro <peppe.cavallaro@st.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
This patch fixed operator typo from & to ==.
Due to incorrect operator, the result is incorrect.
Signed-off-by: Byungho An <bh74.an@samsung.com>
Acked-by: Giuseppe Cavallaro <peppe.cavallaro@st.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Instead of mixing printk and pr_<level> forms,
just use pr_<level>
Miscellaneous changes around these conversions:
Add a missing newline to avoid message interleaving,
coalesce formats, reflow modified lines to 80 columns.
Signed-off-by: Joe Perches <joe@perches.com>
Acked-by: Ian Campbell <ian.campbell@citrix.com>
Acked-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Setup the multicast list of the net_device instead of
clearing it blindly. This restores the multicast groups
in case of a link down/up event or when resuming from
suspend.
Signed-off-by: Christoph Muellner <christoph.muellner@theobroma-systems.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
There is no reason to skip ECMP lookup when oif is specified, but this implies
to check oif given by user when selecting another route.
When the new route does not match oif requirement, we simply keep the initial
one.
Spotted-by: dingzhi <zhi.ding@6wind.com>
Signed-off-by: Nicolas Dichtel <nicolas.dichtel@6wind.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Because of commit 218774dc34 ("ipv6: add
anti-spoofing checks for 6to4 and 6rd") the sit driver dropped packets
for 2002::/16 destinations and sources even when configured to work as a
tunnel with fixed endpoint. We may only apply the 6rd/6to4 anti-spoofing
checks if the device is not in pointopoint mode.
This was an oversight from me in the above commit, sorry. Thanks to
Roman Mamedov for reporting this!
Reported-by: Roman Mamedov <rm@romanrm.ru>
Cc: David Miller <davem@davemloft.net>
Cc: YOSHIFUJI Hideaki <yoshfuji@linux-ipv6.org>
Signed-off-by: Hannes Frederic Sowa <hannes@stressinduktion.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
John W. Linville says:
====================
Yet one more pull request for wireless updates intended for 3.11...
For the mac80211 bits, Johannes says:
"Here we have a few memory leak fixes related to BSS struct handling
mostly from Ben, including a fix for a more theoretical problem
(associating while a BSS struct times out) from myself, a compilation
warning fix from Arend, mesh fixes from Thomas, tracking the beacon
bitrate (Alex), a bandwidth change event fix (Ilan) and some initial
work for 5/10 MHz channels from Simon."
Regarding the iwlwifi bits, Johannes says:
"Emmanuel removed some unneeded/unsupported module parameters and adds a
Bluetooth 1x1 lookup-table for some upcoming products. From Alex I have
an older patch to add low-power receive support, this depended on a
mac80211 commit that only just came in with the merge from wireless-next
I did. Ilan made beacon timings better, and Eytan added some debug
statements for thermal throttling. I have a few cleanups, a fix for a
long-standing but rare warning, and, arguably the most important patch
here, the firmware API version bump for the 7260/3160 devices."
Also included is a Bluetooth pull -- Gustavo says:
"Here goes a set of patches to 3.11. The biggest work here is from Andre Guedes
on the move of the Discovery to use the new request framework. Other than that
Johan provided a bunch of fixes to the L2CAP code. The rest are just small
fixes and clean ups."
On top of all that, there are a variety of updates and fixes to
brcmfmac, rt2x00, wil6210, ath9k, ath10k, and a few others here and
there. This also includes a pull of the wireless tree, in order to
prevent some merge conflicts.
====================
Signed-off-by: David S. Miller <davem@davemloft.net>
Openvswitch uses function from NET_IPGRE_DEMUX module.
Add Kconfig dependency to fix following compilation errors:
http://marc.info/?l=linux-netdev&m=137244035226634
CC: Jesse Gross <jesse@nicira.com>
Reported-by: Randy Dunlap <rdunlap@infradead.org>
Signed-off-by: Pravin Shelar <pshelar@nicira.com>
Acked-by: Randy Dunlap <rdunlap@infradead.org>
Acked-by: Jesse Gross <jesse@nicira.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
A number of places treated features wrongly, listing not-supported
features instead of supported ones. Also, the get_drvinfo ethtool
callback isn't needed, and alx_get_pauseparam can be simplified.
Reported-by: Ben Hutchings <ben@decadent.org.uk>
Signed-off-by: Johannes Berg <johannes@sipsolutions.net>
Signed-off-by: David S. Miller <davem@davemloft.net>
In two places, parts of MAC addresses are used as u32/u16
values. This can cause alignment problems, use put_unaligned
and get_unaligned to fix this.
Reported-by: Ben Hutchings <ben@decadent.org.uk>
Signed-off-by: Johannes Berg <johannes@sipsolutions.net>
Signed-off-by: David S. Miller <davem@davemloft.net>
As suggested by Ben Hutchings, use separate fields to track
current link speed and duplex setting.
Reported-by: Ben Hutchings <ben@decadent.org.uk>
Signed-off-by: Johannes Berg <johannes@sipsolutions.net>
Signed-off-by: David S. Miller <davem@davemloft.net>
The ring sizes should be unsigned, pointed out by Ben Hutchings.
Reported-by: Ben Hutchings <ben@decadent.org.uk>
Signed-off-by: Johannes Berg <johannes@sipsolutions.net>
Signed-off-by: David S. Miller <davem@davemloft.net>
That select doesn't make any sense, remove it.
Signed-off-by: Johannes Berg <johannes@sipsolutions.net>
Signed-off-by: David S. Miller <davem@davemloft.net>
100mbit half duplex is ADVERTISED_100baseT_Half, not
ADVERTISED_10baseT_Half.
Signed-off-by: Johannes Berg <johannes@sipsolutions.net>
Signed-off-by: David S. Miller <davem@davemloft.net>
Even when alx_setup_speed_duplex() is called, we still
need to call alx_cfg_mac_flowcontrol() and set hw->flowctrl
if flow control changed.
This was a bug I accidentally introduced while simplifying
the original driver.
Reported-by: Ben Hutchings <ben@decadent.org.uk>
Signed-off-by: Johannes Berg <johannes@sipsolutions.net>
Signed-off-by: David S. Miller <davem@davemloft.net>
When the firmware supports the UPDATE_QP command, if the VF link is disabled,
block all QPs opened by the VF, by programming the UPDATE_QP command to drop
all RX & TX traffic to/from these QPs. Operates only in VST mode.
Signed-off-by: Rony Efraim <ronye@mellanox.com>
Signed-off-by: Or Gerlitz <ogerlitz@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Within VST mode, enable modifying the vlan and/or qos
for a VF without requiring unbind/rebind.
This requires firmware which supports the UPDATE_QP command.
(If the command is not available, we fall back to requiring
unbind/bind to activate these changes).
To avoid race conditions with modify-qp on QPs that are affected
by update-qp, this operation is performed on the comm_wq.
If the update operation succeeds for all the necessary QPs, a
vlan_unregister is performed for the abandoned vlan id.
Signed-off-by: Jack Morgenstein <jackm@dev.mellanox.co.il>
Signed-off-by: Or Gerlitz <ogerlitz@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Pablo Neira Ayuso says:
====================
The following batch contains Netfilter/IPVS updates for net-next,
they are:
* Enforce policy to several nfnetlink subsystem, from Daniel
Borkmann.
* Use xt_socket to match the third packet (to perform simplistic
socket-based stateful filtering), from Eric Dumazet.
* Avoid large timeout for picked up from the middle TCP flows,
from Florian Westphal.
* Exclude IPVS from struct net if IPVS is disabled and removal
of unnecessary included header file, from JunweiZhang.
* Release SCTP connection immediately under load, to mimic current
TCP behaviour, from Julian Anastasov.
* Replace and enhance SCTP state machine, from Julian Anastasov.
* Add tweak to reduce sync traffic in the presence of persistence,
also from Julian Anastasov.
* Add tweak for the IPVS SH scheduler not to reject connections
directed to a server, choose a new one instead, from Alexander
Frolkin.
* Add support for sloppy TCP and SCTP modes, that creates state
information on any packet, not only initial handshake packets,
from Alexander Frolkin.
====================
Signed-off-by: David S. Miller <davem@davemloft.net>
The common case is that TCP/IP checksums have already been
verified, e.g. by hardware (rx checksum offload), or conntrack.
Userspace can use this flag to determine when the checksum
has not been validated yet.
If the flag is set, this doesn't necessarily mean that the packet has
an invalid checksum, e.g. if NIC doesn't support rx checksum.
Userspace that sucessfully enabled NFQA_CFG_F_GSO queue feature flag can
infer that IP/TCP checksum has already been validated if either the
SKB_INFO attribute is not present or the NFQA_SKB_CSUM_NOTVERIFIED
flag is unset.
Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
Combine the multiple pr_debugs in bond_set_dev_addr into one pr_debug.
Signed-off-by: Nikolay Aleksandrov <nikolay@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Marc Kleine-Budde says:
====================
this is a pull-request for net-next/master. It consists of three
patches by Fabio Estevam and me, which convert the flexcan transceiver
switching to DT[1] and a patch by Sachin Kamat, which cleans up the
at91_can driver a bit.
[1] These patches touch arch/arm/mach-imx, so I collected Acked-bys
from Shawn Guo and Sascha Hauer.
====================
Signed-off-by: David S. Miller <davem@davemloft.net>
Use standard PM state macros PCI_Dx instead of numeric 0/1/2..
Signed-off-by: Yijing Wang <wangyijing@huawei.com>
Cc: netdev@vger.kernel.org
Cc: linux-kernel@vger.kernel.org
Signed-off-by: David S. Miller <davem@davemloft.net>
Use standard PM state macros PCI_Dx instead of numeric 0/1/2..
Signed-off-by: Yijing Wang <wangyijing@huawei.com>
Cc: netdev@vger.kernel.org
Cc: linux-kernel@vger.kernel.org
Signed-off-by: David S. Miller <davem@davemloft.net>
This patch fixes the following warning introduced in e4fc408e0e
("packet: nlmon: virtual netlink monitoring device for packet
sockets") reported by Dan Carpenter:
warning: "drivers/net/nlmon.c:31 nlmon_is_valid_mtu()
warn: always true condition '(new_mtu <= ((~0 >> 1))) =>
(s32min-s32max <= s32max)'"
Thus, we should simply remove the test against INT_MAX. Next to that
we also need to explicitly cast the sizeof() case as the comparison
is type promoted to unsigned long so negative values are then
valid instead of invalid. While at it, this also adds a comment about
Netlink and MTUs.
Reported-by: Dan Carpenter <dan.carpenter@oracle.com>
Signed-off-by: Daniel Borkmann <dborkman@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Cosmetic patch to add a newline after logging the device's MACID.
Signed-off-by: Daniel Mack <zonque@gmail.com>
Acked-by: Mugunthan V N <mugunthanvnm@ti.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
This makes the error handling much more simpler than open-coding everything and
in addition makes the probe function smaller an tidier.
Signed-off-by: Andy Shevchenko <andriy.shevchenko@linux.intel.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
We may use nice macros to prefix our messages with proper device name.
Signed-off-by: Andy Shevchenko <andriy.shevchenko@linux.intel.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
There is no much sense to mark functions inline that are going to be used in
the other compile modules.
Signed-off-by: Andy Shevchenko <andriy.shevchenko@linux.intel.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
I tested with the AX88179 usb dongle, if without .reset_resume hook,
after S3/S4 resume you have to enable network interface or reload the
dirver module manually otherwise the network interface can not work.
Signed-off-by: David Chang <dchang@suse.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Correct a typo in description of driver_info, it should be Gigabit
Signed-off-by: David Chang <dchang@suse.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Commit d2d68ba9 (ipv4: Cache input routes in fib_info nexthops)
assmued that "locally destined, and routed packets, never trigger
PMTU events or redirects that will be processed by us".
However, it seems that tunnel devices do trigger PMTU events in certain
cases. At least ip_gre, ip6_gre, sit, and ipip do use the inner flow's
skb_dst(skb)->ops->update_pmtu to propage mtu information from the
outer flows. These can cause the inner flow mtu to be decreased. If
next hop exceptions are not consulted for pmtu, IP fragmentation will
not be done properly for these routes.
It also seems that we really need to have the PMTU information always
for netfilter TCPMSS clamp-to-pmtu feature to work properly.
So for the time being, cache separate copies of input routes for
each next hop exception.
Signed-off-by: Timo Teräs <timo.teras@iki.fi>
Reviewed-by: Julian Anastasov <ja@ssi.bg>
Signed-off-by: David S. Miller <davem@davemloft.net>
RFC3590/RFC3810 specifies we should resend MLD reports as soon as a
valid link-local address is available.
We now use the valid_ll_addr_cnt to check if it is necessary to resend
a new report.
Changes since Flavio Leitner's version:
a) adapt for valid_ll_addr_cnt
b) resend first reports directly in the path and just arm the timer for
mc_qrv-1 resends.
Reported-by: Flavio Leitner <fleitner@redhat.com>
Cc: Hideaki YOSHIFUJI <yoshfuji@linux-ipv6.org>
Cc: David Stevens <dlstevens@us.ibm.com>
Signed-off-by: Hannes Frederic Sowa <hannes@stressinduktion.org>
Signed-off-by: Flavio Leitner <fbl@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
To reduce the number of unnecessary router solicitations, MLDv2 and IGMPv3
messages we need to track the number of valid (as in non-optimistic,
no-dad-failed and non-tentative) link-local addresses. Therefore, this
patch implements a valid_ll_addr_cnt in struct inet6_dev.
We now only emit router solicitations if the first link-local address
finishes duplicate address detection.
The changes for MLDv2 and IGMPv3 are in a follow-up patch.
While there, also simplify one if statement(one minor nit I made in one
of my previous patches):
if (!...)
do();
else
return;
<<into>>
if (...)
return;
do();
Cc: Flavio Leitner <fbl@redhat.com>
Cc: YOSHIFUJI Hideaki <yoshfuji@linux-ipv6.org>
Cc: David Stevens <dlstevens@us.ibm.com>
Suggested-by: David Stevens <dlstevens@us.ibm.com>
Signed-off-by: Hannes Frederic Sowa <hannes@stressinduktion.org>
Acked-by: Flavio Leitner <fbl@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
A simple semantic change, when a slave's MAC is cloned by the bond
master then set addr_assign_type to NET_ADDR_STOLEN instead of
NET_ADDR_SET. Also use bond_set_dev_addr() in BOND_FOM_ACTIVE mode
to change the bond's MAC address because the assign_type has to be
set properly.
Signed-off-by: Nikolay Aleksandrov <nikolay@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
In struct bonding there's a member called dev_addr_from_first which is
used to denote when the bond dev should clone the first slave's MAC
address but since we have netdev's addr_assign_type variable that is not
necessary. We clone the first slave's MAC each time we have a random MAC
set to the bond device. This has the nice side-effect of also fixing an
inconsistency - when the MAC address of the bond dev is set after its
creation, but prior to having slaves, it's not kept and the first slave's
MAC is cloned. The only way to keep the MAC was to create the bond device
with the MAC address set (e.g. through ip link). In all cases if the
bond device is left without any slaves - its MAC gets reset to a random
one as before.
Signed-off-by: Nikolay Aleksandrov <nikolay@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
We have a member called setup_by_slave in struct bonding to denote if the
bond dev has different type than ARPHRD_ETHER, but that is already denoted
in bond's netdev type variable if it was setup by the slave, so use that
instead of the member.
Signed-off-by: Nikolay Aleksandrov <nikolay@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Since (c05cdb1 netlink: allow large data transfers from user-space),
netlink splats if it invokes skb_clone on large netlink skbs since:
* skb_shared_info was not correctly initialized.
* skb->destructor is not set in the cloned skb.
This was spotted by trinity:
[ 894.990671] BUG: unable to handle kernel paging request at ffffc9000047b001
[ 894.991034] IP: [<ffffffff81a212c4>] skb_clone+0x24/0xc0
[...]
[ 894.991034] Call Trace:
[ 894.991034] [<ffffffff81ad299a>] nl_fib_input+0x6a/0x240
[ 894.991034] [<ffffffff81c3b7e6>] ? _raw_read_unlock+0x26/0x40
[ 894.991034] [<ffffffff81a5f189>] netlink_unicast+0x169/0x1e0
[ 894.991034] [<ffffffff81a601e1>] netlink_sendmsg+0x251/0x3d0
Fix it by:
1) introducing a new netlink_skb_clone function that is used in nl_fib_input,
that sets our special skb->destructor in the cloned skb. Moreover, handle
the release of the large cloned skb head area in the destructor path.
2) not allowing large skbuffs in the netlink broadcast path. I cannot find
any reasonable use of the large data transfer using netlink in that path,
moreover this helps to skip extra skb_clone handling.
I found two more netlink clients that are cloning the skbs, but they are
not in the sendmsg path. Therefore, the sole client cloning that I found
seems to be the fib frontend.
Thanks to Eric Dumazet for helping to address this issue.
Reported-by: Fengguang Wu <fengguang.wu@intel.com>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
Signed-off-by: David S. Miller <davem@davemloft.net>