Here is a respin of patch.
I'll send a short patch to make SFQ more fair in presence of large
packets as well.
Thanks
[PATCH v3 net-next-2.6] net_sched: sch_sfq: better struct layouts
This patch shrinks sizeof(struct sfq_sched_data)
from 0x14f8 (or more if spinlocks are bigger) to 0x1180 bytes, and
reduce text size as well.
text data bss dec hex filename
4821 152 0 4973 136d old/net/sched/sch_sfq.o
4627 136 0 4763 129b new/net/sched/sch_sfq.o
All data for a slot/flow is now grouped in a compact and cache friendly
structure, instead of being spreaded in many different points.
struct sfq_slot {
struct sk_buff *skblist_next;
struct sk_buff *skblist_prev;
sfq_index qlen; /* number of skbs in skblist */
sfq_index next; /* next slot in sfq chain */
struct sfq_head dep; /* anchor in dep[] chains */
unsigned short hash; /* hash value (index in ht[]) */
short allot; /* credit for this slot */
};
Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
Cc: Jarek Poplawski <jarkao2@gmail.com>
Cc: Patrick McHardy <kaber@trash.net>
Signed-off-by: David S. Miller <davem@davemloft.net>
When deploying SFQ/IFB here at work, I found the allot management was
pretty wrong in sfq, even changing allot from short to int...
We should init allot for each new flow, not using a previous value found
in slot.
Before patch, I saw bursts of several packets per flow, apparently
denying the default "quantum 1514" limit I had on my SFQ class.
class sfq 11:1 parent 11:
(dropped 0, overlimits 0 requeues 0)
backlog 0b 7p requeues 0
allot 11546
class sfq 11:46 parent 11:
(dropped 0, overlimits 0 requeues 0)
backlog 0b 1p requeues 0
allot -23873
class sfq 11:78 parent 11:
(dropped 0, overlimits 0 requeues 0)
backlog 0b 5p requeues 0
allot 11393
After patch, better fairness among each flow, allot limit being
respected, allot is positive :
class sfq 11:e parent 11:
(dropped 0, overlimits 0 requeues 86)
backlog 0b 3p requeues 86
allot 596
class sfq 11:94 parent 11:
(dropped 0, overlimits 0 requeues 0)
backlog 0b 3p requeues 0
allot 1468
class sfq 11:a4 parent 11:
(dropped 0, overlimits 0 requeues 0)
backlog 0b 4p requeues 0
allot 650
class sfq 11:bb parent 11:
(dropped 0, overlimits 0 requeues 0)
backlog 0b 3p requeues 0
allot 596
Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
We currently return for each active SFQ slot the number of packets in
queue. We can also give number of bytes accounted for these packets.
tc -s class show dev ifb0
Before patch :
class sfq 11:3d9 parent 11:
(dropped 0, overlimits 0 requeues 0)
backlog 0b 3p requeues 0
allot 1266
After patch :
class sfq 11:3e4 parent 11:
(dropped 0, overlimits 0 requeues 0)
backlog 4380b 3p requeues 0
allot 1212
Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Flush the routing cache only of entries that match the
network namespace in which the purge event occurred.
Signed-off-by: David S. Miller <davem@davemloft.net>
Acked-by: Eric Dumazet <eric.dumazet@gmail.com>
hna_local_fill_buffer must return the number of added hna entries and
not the last checked hash bucket.
Signed-off-by: Sven Eckelmann <sven@narfation.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
Remove "pktgen: " prefix string from one pr_info.
pr_fmt adds it, so this is a duplicate.
Signed-off-by: Joe Perches <joe@perches.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
In dev_queue_xmit_nit(), we have to clone skbs as we need to mangle skbs,
however, we don't need to clone skbs for all the packet_types.
Except for the first packet_type, we increase skb->users instead of
skb_clone().
Signed-off-by: Changli Gao <xiaosuo@gmail.com>
Acked-by: Eric Dumazet <eric.dumazet@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
This patch modifies IPsec6 to fragment IPv6 packets that are
locally generated as needed.
This version of the patch only fragments in tunnel mode, so that fragment
headers will not be obscured by ESP in transport mode.
Signed-off-by: David L Stevens <dlstevens@us.ibm.com>
Acked-by: Herbert Xu <herbert@gondor.apana.org.au>
Signed-off-by: David S. Miller <davem@davemloft.net>
When device is being set to down, neigh_ifdown was being called
twice. Once from addrconf notifier and once from ndisc notifier.
Signed-off-by: Stephen Hemminger <shemminger@vyatta.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Remove (unnecessary) casts to make code cleaner.
Signed-off-by: Stephen Hemminger <shemminger@vyatta.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
When loopback device is being brought down, then keep the route table
entries because they are special. The entries in the local table for
linklocal routes and ::1 address should not be purged.
This is a sub optimal solution to the problem and should be replaced
by a better fix in future.
Signed-off-by: Stephen Hemminger <shemminger@vyatta.com>
Acked-by: "Eric W. Biederman" <ebiederm@xmission.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Get the sctp partial delivery point using SCTP_PARTIAL_DELIVERY_POINT
socket option should return 0 if success, not -ENOTSUPP.
Signed-off-by: Wei Yongjun <yjwei@cn.fujitsu.com>
Acked-by: Vlad Yasevich <vladislav.yasevich@hp.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Replace skb->csum_start - skb_headroom(skb) with skb_checksum_start_offset().
Note for usb/smsc95xx: skb->data - skb->head == skb_headroom(skb).
Signed-off-by: Michał Mirosław <mirq-linux@rere.qmqm.pl>
Signed-off-by: David S. Miller <davem@davemloft.net>
This patch fixes a missing ntohs() for bridge IPv6 multicast snooping.
Signed-off-by: David L Stevens <dlstevens@us.ibm.com>
Acked-by: Herbert Xu <herbert@gondor.apana.org.au>
Signed-off-by: David S. Miller <davem@davemloft.net>
Special care is taken inside sk_port_alloc to avoid overwriting
skc_node/skc_nulls_node. We should also avoid overwriting
skc_bind_node/skc_portaddr_node.
The patch fixes the following crash:
BUG: unable to handle kernel paging request at fffffffffffffff0
IP: [<ffffffff812ec6dd>] udp4_lib_lookup2+0xad/0x370
[<ffffffff812ecc22>] __udp4_lib_lookup+0x282/0x360
[<ffffffff812ed63e>] __udp4_lib_rcv+0x31e/0x700
[<ffffffff812bba45>] ? ip_local_deliver_finish+0x65/0x190
[<ffffffff812bbbf8>] ? ip_local_deliver+0x88/0xa0
[<ffffffff812eda35>] udp_rcv+0x15/0x20
[<ffffffff812bba45>] ip_local_deliver_finish+0x65/0x190
[<ffffffff812bbbf8>] ip_local_deliver+0x88/0xa0
[<ffffffff812bb2cd>] ip_rcv_finish+0x32d/0x6f0
[<ffffffff8128c14c>] ? netif_receive_skb+0x99c/0x11c0
[<ffffffff812bb94b>] ip_rcv+0x2bb/0x350
[<ffffffff8128c14c>] netif_receive_skb+0x99c/0x11c0
Signed-off-by: Leonard Crestez <lcrestez@ixiacom.com>
Signed-off-by: Octavian Purdila <opurdila@ixiacom.com>
Acked-by: Eric Dumazet <eric.dumazet@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Add dev_close_many and dev_deactivate_many to factorize another
sync-rcu operation on the netdevice unregister path.
$ modprobe dummy numdummies=10000
$ ip link set dev dummy* up
$ time rmmod dummy
Without the patch With the patch
real 0m 24.63s real 0m 5.15s
user 0m 0.00s user 0m 0.00s
sys 0m 6.05s sys 0m 5.14s
Signed-off-by: Octavian Purdila <opurdila@ixiacom.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
B.A.T.M.A.N. (better approach to mobile ad-hoc networking) is a routing
protocol for multi-hop ad-hoc mesh networks. The networks may be wired or
wireless. See http://www.open-mesh.org/ for more information and user space
tools.
Signed-off-by: Sven Eckelmann <sven@narfation.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
Move the calcualation of the Tx hash for a given hash range into a separate
function and define the skb_tx_hash(), which calculates a Tx hash for a
[0; dev->real_num_tx_queues - 1] hash values range, using this
function (__skb_tx_hash()).
Signed-off-by: Vladislav Zolotarov <vladz@broadcom.com>
Signed-off-by: Eilon Greenstein <eilong@broadcom.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
The first big packets sent to a "low-MTU" client correctly
triggers the creation of a temporary route containing the reduced MTU.
But after the temporary route has expired, new ICMP6 "packet too big"
will be sent, rt6_pmtu_discovery will find the previous EXPIRED route
check that its mtu isn't bigger then in icmp packet and do nothing
before the temporary route will not deleted by gc.
I make the simple experiment:
while :; do
time ( dd if=/dev/zero bs=10K count=1 | ssh hostname dd of=/dev/null ) || break;
done
The "time" reports real 0m0.197s if a temporary route isn't expired, but
it reports real 0m52.837s (!!!!) immediately after a temporare route has
expired.
Signed-off-by: Andrey Vagin <avagin@openvz.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
Like RTAX_ADVMSS, make the default calculation go through a dst_ops
method rather than caching the computation in the routing cache
entries.
Now dst metrics are pretty much left as-is when new entries are
created, thus optimizing metric sharing becomes a real possibility.
Signed-off-by: David S. Miller <davem@davemloft.net>
Make all RTAX_ADVMSS metric accesses go through a new helper function,
dst_metric_advmss().
Leave the actual default metric as "zero" in the real metric slot,
and compute the actual default value dynamically via a new dst_ops
AF specific callback.
For stacked IPSEC routes, we use the advmss of the path which
preserves existing behavior.
Unlike ipv4/ipv6, DecNET ties the advmss to the mtu and thus updates
advmss on pmtu updates. This inconsistency in advmss handling
results in more raw metric accesses than I wish we ended up with.
Signed-off-by: David S. Miller <davem@davemloft.net>
On suspend, there might be usb wireless drivers which wrongly trigger
the warning in ieee80211_work_work. If an usb driver doesn't have a
suspend hook, the usb stack will disconnect the device. On disconnect,
a mac80211 driver calls ieee80211_unregister_hw, which calls dev_close,
which calls ieee80211_stop, and in the end calls ieee80211_work_purge->
ieee80211_work_work.
The problem is that this call to ieee80211_work_purge comes after
mac80211 is suspended, triggering the warning even when we don't have
work queued in work_list (the expected case when already suspended),
because it always calls ieee80211_work_work.
So, just call ieee80211_work_work in ieee80211_work_purge if we really
have to abort work. This addresses the warning reported at
https://bugzilla.kernel.org/show_bug.cgi?id=24402
Signed-off-by: Herton Ronaldo Krzesinski <herton@mandriva.com.br>
Signed-off-by: John W. Linville <linville@tuxdriver.com>
dev_open will eventually call ieee80211_ibss_join which sets up the
skb used for beacons/probe-responses however it is possible to
receive beacons that attempt to merge before this occurs causing
a null pointer dereference. Check ssid_len as that is the last
thing set in ieee80211_ibss_join.
This occurs quite easily in the presence of adhoc nodes with hidden SSID's
revised previous patch to check further up based on irc feedback
Signed-off-by: Tim Harvey <harvey.tim@gmail.com>
Reviewed-by: Johannes Berg <johannes@sipsolutions.net>
Signed-off-by: John W. Linville <linville@tuxdriver.com>
Always go through a new ip4_dst_hoplimit() helper, just like ipv6.
This allowed several simplifications:
1) The interim dst_metric_hoplimit() can go as it's no longer
userd.
2) The sysctl_ip_default_ttl entry no longer needs to use
ipv4_doint_and_flush, since the sysctl is not cached in
routing cache metrics any longer.
3) ipv4_doint_and_flush no longer needs to be exported and
therefore can be marked static.
When ipv4_doint_and_flush_strategy was removed some time ago,
the external declaration in ip.h was mistakenly left around
so kill that off too.
We have to move the sysctl_ip_default_ttl declaration into
ipv4's route cache definition header net/route.h, because
currently net/ip.h (where the declaration lives now) has
a back dependency on net/route.h
Signed-off-by: David S. Miller <davem@davemloft.net>
This is for consistency with ipv4. Using "-1" makes
no sense.
It was made this way a long time ago merely to be consistent
with how the ipv6 socket hoplimit "default" is stored.
Signed-off-by: David S. Miller <davem@davemloft.net>
After commit c1f19b51d1 (net: support time stamping in phy devices.),
kernel might crash if CONFIG_NETWORK_PHY_TIMESTAMPING=y and
skb_defer_rx_timestamp() handles a packet without an ethernet header.
Fixes kernel bugzilla #24102
Reference: https://bugzilla.kernel.org/show_bug.cgi?id=24102
Reported-and-tested-by: Andrew Watts <akwatts@ymail.com>
Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
The nf_pre_routing functions in bridging have collected two
distinct ways of returning NF_DROP over the years, inline and
via goto. There is no reason for preferring either one.
So this patch arbitrarily picks the inline variant and converts
the all the gotos.
Also removes a redundant comment.
Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
Signed-off-by: David S. Miller <davem@davemloft.net>
While an interface is down, many implementations of
ethtool_ops::get_link, including the default, ethtool_op_get_link(),
will report the last link state seen while the interface was up. In
general the current physical link state is not available if the
interface is down.
Define ETHTOOL_GLINK to reflect whether the interface *and* any
physical port have a working link, and consistently return 0 when the
interface is down.
Signed-off-by: Ben Hutchings <bhutchings@solarflare.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
The ATM subsystem was incorrectly creating the 'device' link for ATM
nodes in sysfs. This led to incorrect device/parent relationships
exposed by sysfs and udev. Instead of rolling the 'device' link by hand
in the generic ATM code, pass each ATM driver's bus device down to the
sysfs code and let sysfs do this stuff correctly.
Signed-off-by: Dan Williams <dcbw@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
We know for sure pktgen is going to write skb->data right after
*_alloc_skb, causing unnecessary cache misses.
Idea is to add a prefetchw() call to prefetch the first cache line
indicated by skb->data. On systems with Adjacent Cache Line Prefetch,
it's probably two cache lines are prefetched.
With this patch, pktgen on Intel SR1625 server with two E5530
quad-core processors and a single ixgbe-based NIC went from 8.63Mpps
to 9.03Mpps, with 4.6% improvement.
Signed-off-by: Junchang Wang <junchangwang@gmail.com>
Acked-by: Eric Dumazet <eric.dumazet@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
SCTP_SET_PEER_PRIMARY_ADDR does not accpet v4mapped address, using
v4mapped address in SCTP_SET_PEER_PRIMARY_ADDR socket option will
get -EADDRNOTAVAIL error if v4map is enabled. This patch try to
fix it by mapping v4mapped address to v4 address if allowed.
Signed-off-by: Wei Yongjun <yjwei@cn.fujitsu.com>
Acked-by: Vlad Yasevich <vladislav.yasevich@hp.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
IS_ERR() already implies unlikely(), so it can be omitted here.
Signed-off-by: Tobias Klauser <tklauser@distanz.ch>
Signed-off-by: David S. Miller <davem@davemloft.net>
Add TFC padding to all packets smaller than the boundary configured
on the xfrm state. If the boundary is larger than the PMTU, limit
padding to the PMTU.
Signed-off-by: Martin Willi <martin@strongswan.org>
Acked-by: Herbert Xu <herbert@gondor.apana.org.au>
Signed-off-by: David S. Miller <davem@davemloft.net>
Add TFC padding to all packets smaller than the boundary configured
on the xfrm state. If the boundary is larger than the PMTU, limit
padding to the PMTU.
Signed-off-by: Martin Willi <martin@strongswan.org>
Acked-by: Herbert Xu <herbert@gondor.apana.org.au>
Signed-off-by: David S. Miller <davem@davemloft.net>
The XFRMA_TFCPAD attribute for XFRM state installation configures
Traffic Flow Confidentiality by padding ESP packets to a specified
length.
Signed-off-by: Martin Willi <martin@strongswan.org>
Acked-by: Herbert Xu <herbert@gondor.apana.org.au>
Signed-off-by: David S. Miller <davem@davemloft.net>
skb1 should be passed as parameter to sk_rcvqueues_full() here.
Signed-off-by: Jiri Pirko <jpirko@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
If br_multicast_new_group returns NULL, we would return 0 (no error) to
the caller of br_multicast_add_group, which is not what we want. Instead
br_multicast_new_group should return ERR_PTR(-ENOMEM) in this case.
Also propagate the error number returned by br_mdb_rehash properly.
Signed-off-by: Tobias Klauser <tklauser@distanz.ch>
Signed-off-by: David S. Miller <davem@davemloft.net>