SFQ as implemented in Linux is very limited, with at most 127 flows
and limit of 127 packets. [ So if 127 flows are active, we have one
packet per flow ]
This patch brings to SFQ following features to cope with modern needs.
- Ability to specify a smaller per flow limit of inflight packets.
(default value being at 127 packets)
- Ability to have up to 65408 active flows (instead of 127)
- Ability to have head drops instead of tail drops
(to drop old packets from a flow)
Example of use : No more than 20 packets per flow, max 8000 flows, max
20000 packets in SFQ qdisc, hash table of 65536 slots.
tc qdisc add ... sfq \
flows 8000 \
depth 20 \
headdrop \
limit 20000 \
divisor 65536
Ram usage :
2 bytes per hash table entry (instead of previous 1 byte/entry)
32 bytes per flow on 64bit arches, instead of 384 for QFQ, so much
better cache hit ratio.
Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
CC: Dave Taht <dave.taht@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
This adds a new ndo_get_fcoe_hbainfo() call in
net_device_ops for FCoE protocol stack.
If supported by the underlying device, the FCoE protocol
stack will call this to get device specific information
from the underlying device.
This information will then be utilized by the FCoE protocol
stack to register Fiber Channel HBA attributes with the
Fiber Channel Management Service via Fabric Device
Management Interface (FDMI) as per the T11 FC-GS
specification.
Changes in v2:
- As per comments from David Miller aligning the parameters
of the ndo_get_fcoe_hbainfo()
Signed-off-by: Neerav Parikh <Neerav.Parikh@intel.com>
Tested-by: Ross Brattain <ross.b.brattain@intel.com>
Signed-off-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Recently Dave noticed that a test we did in ipv6_add_addr to see if we next hop
route for the interface we're adding an addres to was wrong (see commit
7ffbcecbee). for one, it never triggers, and two,
it was completely wrong to begin with. This test was meant to cover this
section of RFC 4429:
3.3 Modifications to RFC 2462 Stateless Address Autoconfiguration
* (modifies section 5.5) A host MAY choose to configure a new address
as an Optimistic Address. A host that does not know the SLLAO
of its router SHOULD NOT configure a new address as Optimistic.
A router SHOULD NOT configure an Optimistic Address.
This patch should bring us into proper compliance with the above clause. Since
we only add a SLAAC address after we've received a RA which may or may not
contain a source link layer address option, we can pass a pointer to that option
to addrconf_prefix_rcv (which may be null if the option is not present), and
only set the optimistic flag if the option was found in the RA.
Change notes:
(v2) modified the new parameter to addrconf_prefix_rcv to be a bool rather than
a pointer to make its use more clear as per request from davem.
Signed-off-by: Neil Horman <nhorman@tuxdriver.com>
CC: "David S. Miller" <davem@davemloft.net>
CC: Hideaki YOSHIFUJI <yoshfuji@linux-ipv6.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
The nfcid1 is the NFC-A identifier.
It is exported as an attribute of the target info
(returned as a response to NFC_CMD_GET_TARGET).
Signed-off-by: Ilan Elias <ilane@ti.com>
Acked-by: Samuel Ortiz <sameo@linux.intel.com>
Signed-off-by: John W. Linville <linville@tuxdriver.com>
Add support for NCI Interface Error Notification.
When this notification is received and we're during a
data exchange transaction, indicate an error to the NFC
core layer via the data exchange callback.
Signed-off-by: Ilan Elias <ilane@ti.com>
Signed-off-by: John W. Linville <linville@tuxdriver.com>
Addition, deletion, and modification of NCI constants.
Changes in NCI commands, responses, and notifications structures.
Signed-off-by: Ilan Elias <ilane@ti.com>
Signed-off-by: John W. Linville <linville@tuxdriver.com>
All implementations have been converted to implement set_rxnfc
instead.
Signed-off-by: Ben Hutchings <bhutchings@solarflare.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Define special location values for RX NFC that request the driver to
select the actual rule location. This allows for implementation on
devices that use hash-based filter lookup, whereas currently the API is
more suited to devices with TCAM lookup or linear search.
In ethtool_set_rxnfc() and the compat wrapper ethtool_ioctl(), copy
the structure back to user-space after insertion so that the actual
location is returned.
Signed-off-by: Ben Hutchings <bhutchings@solarflare.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
SMSC generation 4 LAN chips integrate an IEEE 802.3 ethernet physical layer.
The ethernet driver for this family of devices needs to access the SMSC PHY
registers and bit-fields.
So, this patch moves these constants to a place where it can be used for both
the PHY and LAN drivers.
Signed-off-by: Javier Martinez Canillas <javier@dowhile0.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
Commit 1e39f384bb ("evm: fix build problems") makes the stub version
of security_old_inode_init_security() return 0 when CONFIG_SECURITY is
not set.
But that makes callers such as reiserfs_security_init() assume that
security_old_inode_init_security() has set name, value, and len
arguments properly - but security_old_inode_init_security() left them
uninitialized which then results in interesting failures.
Revert security_old_inode_init_security() to the old behavior of
returning EOPNOTSUPP since both callers (reiserfs and ocfs2) handle this
just fine.
[ Also fixed the S_PRIVATE(inode) case of the actual non-stub
security_old_inode_init_security() function to return EOPNOTSUPP
for the same reason, as pointed out by Mimi Zohar.
It got incorrectly changed to match the new function in commit
fb88c2b6cbb1: "evm: fix security/security_old_init_security return
code". - Linus ]
Reported-by: Jorge Bastos <mysql.jorge@decimal.pt>
Acked-by: James Morris <jmorris@namei.org>
Acked-by: Mimi Zohar <zohar@us.ibm.com>
Signed-off-by: Jan Kara <jack@suse.cz>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
We should not forget to try for real server with port 0
in the backup server when processing the sync message. We should
do it in all cases because the backup server can use different
forwarding method.
Signed-off-by: Julian Anastasov <ja@ssi.bg>
Signed-off-by: Simon Horman <horms@verge.net.au>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
During some debugging I needed to look into how /proc/net/ipv6_route
operated and in my digging I found its calling fib6_clean_all() which uses
"write_lock_bh(&table->tb6_lock)" before doing the walk of the table. I
found this on 2.6.32, but reading the code I believe the same basic idea
exists currently. Looking at the rtnetlink code they are only calling
"read_lock_bh(&table->tb6_lock);" via fib6_dump_table(). While I realize
reading from proc isn't the recommended way of fetching the ipv6 route
table; taking a write lock seems unnecessary and would probably cause
network performance issues.
To verify this I loaded up the ipv6 route table and then ran iperf in 3
cases:
* doing nothing
* reading ipv6 route table via proc
(while :; do cat /proc/net/ipv6_route > /dev/null; done)
* reading ipv6 route table via rtnetlink
(while :; do ip -6 route show table all > /dev/null; done)
* Load the ipv6 route table up with:
* for ((i = 0;i < 4000;i++)); do ip route add unreachable 2000::$i; done
* iperf commands:
* client: iperf -i 1 -V -c <ipv6 addr>
* server: iperf -V -s
* iperf results - 3 runs each (in Mbits/sec)
* nothing: client: 927,927,927 server: 927,927,927
* proc: client: 179,97,96,113 server: 142,112,133
* iproute: client: 928,927,928 server: 927,927,927
lock_stat shows taking the write lock is causing the slowdown. Using this
info I decided to write a version of fib6_clean_all() which replaces
write_lock_bh(&table->tb6_lock) with read_lock_bh(&table->tb6_lock). With
this new function I see the same results as with my rtnetlink iperf test.
Signed-off-by: Josh Hunt <joshhunt00@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
While it's not too late fix the recently added RQLEN diag extension
to report rqlen and wqlen in the same way as TCP does.
I.e. for listening sockets the ack backlog length (which is the input
queue length for socket) in rqlen and the max ack backlog length in
wqlen, and what the CINQ/OUTQ ioctls do for established.
Signed-off-by: Pavel Emelyanov <xemul@parallels.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Currently tcp diag reports rqlen and wqlen values similar to how
the CINQ/COUTQ iotcls do. To make unix diag report these values
in the same way move the respective code into helpers.
Signed-off-by: Pavel Emelyanov <xemul@parallels.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
[ Fix indentation of sock_diag*() calls. -DaveM ]
Signed-off-by: Pavel Emelyanov <xemul@parallels.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Add a routine that dumps memory-related values of a socket.
It's made as an array to make it possible to add more stuff
here later without breaking compatibility.
Since v1: The SK_MEMINFO_ constants are in userspace
visible part of sock_diag.h, the rest is under __KERNEL__.
Signed-off-by: Pavel Emelyanov <xemul@parallels.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
The headers check complains it should include the linux/types.h
withing, thus add this one.
Signed-off-by: Pavel Emelyanov <xemul@parallels.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Properly toss existing components around the ifdef __KERNEL__
and include the header into the header-y target.
Signed-off-by: Pavel Emelyanov <xemul@parallels.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Commit 2a95ea6c0d ("procfs: do not overflow get_{idle,iowait}_time
for nohz") did not take into account that one some architectures jiffies
and cputime use different units.
This causes get_idle_time() to return numbers in the wrong units, making
the idle time fields in /proc/stat wrong.
Instead of converting the usec value returned by
get_cpu_{idle,iowait}_time_us to units of jiffies, use the new function
usecs_to_cputime64 to convert it to the correct unit of cputime64_t.
Signed-off-by: Andreas Schwab <schwab@linux-m68k.org>
Acked-by: Michal Hocko <mhocko@suse.cz>
Cc: Arnd Bergmann <arnd@arndb.de>
Cc: "Artem S. Tashkinov" <t.artem@mailcity.com>
Cc: Dave Jones <davej@redhat.com>
Cc: Alexey Dobriyan <adobriyan@gmail.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: "Luck, Tony" <tony.luck@intel.com>
Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Cc: Martin Schwidefsky <schwidefsky@de.ibm.com>
Cc: Heiko Carstens <heiko.carstens@de.ibm.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
It just obscures that the netdevice pointer and the expires value are
implemented in the dst_entry sub-object of the ipv6 route.
And it makes grepping for dst_entry member uses much harder too.
Signed-off-by: David S. Miller <davem@davemloft.net>
Also, create and use an rt6_bind_neighbour() in net/ipv6/route.c to
consolidate some common logic.
Signed-off-by: David S. Miller <davem@davemloft.net>
In order to perform a proper universal hash on a vector of integers,
we have to use different universal hashes on each vector element.
Which means we need 4 different hash randoms for ipv6.
Signed-off-by: David S. Miller <davem@davemloft.net>
Use the new macro and struct names in xt_ecn.h, and put the old
definitions into a definition-forwarding ipt_ecn.h.
Signed-off-by: Jan Engelhardt <jengelh@medozas.de>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
Prepare the ECN match for augmentation by an IPv6 counterpart. Since
no symbol dependencies to ipv6.ko are added, having a single ecn match
module is the more so welcome.
Signed-off-by: Jan Engelhardt <jengelh@medozas.de>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
Unlike all of the other cpuid bits, the TSC deadline timer bit is set
unconditionally, regardless of what userspace wants.
This is broken in several ways:
- if userspace doesn't use KVM_CREATE_IRQCHIP, and doesn't emulate the TSC
deadline timer feature, a guest that uses the feature will break
- live migration to older host kernels that don't support the TSC deadline
timer will cause the feature to be pulled from under the guest's feet;
breaking it
- guests that are broken wrt the feature will fail.
Fix by not enabling the feature automatically; instead report it to userspace.
Because the feature depends on KVM_CREATE_IRQCHIP, which we cannot guarantee
will be called, we expose it via a KVM_CAP_TSC_DEADLINE_TIMER and not
KVM_GET_SUPPORTED_CPUID.
Fixes the Illumos guest kernel, which uses the TSC deadline timer feature.
[avi: add the KVM_CAP + documentation]
Reported-by: Alexey Zaytsev <alexey.zaytsev@gmail.com>
Tested-by: Alexey Zaytsev <alexey.zaytsev@gmail.com>
Signed-off-by: Jan Kiszka <jan.kiszka@siemens.com>
Signed-off-by: Avi Kivity <avi@redhat.com>
We currently have two ways to account traffic in netfilter:
- iptables chain and rule counters:
# iptables -L -n -v
Chain INPUT (policy DROP 3 packets, 867 bytes)
pkts bytes target prot opt in out source destination
8 1104 ACCEPT all -- lo * 0.0.0.0/0 0.0.0.0/0
- use flow-based accounting provided by ctnetlink:
# conntrack -L
tcp 6 431999 ESTABLISHED src=192.168.1.130 dst=212.106.219.168 sport=58152 dport=80 packets=47 bytes=7654 src=212.106.219.168 dst=192.168.1.130 sport=80 dport=58152 packets=49 bytes=66340 [ASSURED] mark=0 use=1
While trying to display real-time accounting statistics, we require
to pool the kernel periodically to obtain this information. This is
OK if the number of flows is relatively low. However, in case that
the number of flows is huge, we can spend a considerable amount of
cycles to iterate over the list of flows that have been obtained.
Moreover, if we want to obtain the sum of the flow accounting results
that match some criteria, we have to iterate over the whole list of
existing flows, look for matchings and update the counters.
This patch adds the extended accounting infrastructure for
nfnetlink which aims to allow displaying real-time traffic accounting
without the need of complicated and resource-consuming implementation
in user-space. Basically, this new infrastructure allows you to create
accounting objects. One accounting object is composed of packet and
byte counters.
In order to manipulate create accounting objects, you require the
new libnetfilter_acct library. It contains several examples of use:
libnetfilter_acct/examples# ./nfacct-add http-traffic
libnetfilter_acct/examples# ./nfacct-get
http-traffic = { pkts = 000000000000, bytes = 000000000000 };
Then, you can use one of this accounting objects in several iptables
rules using the new nfacct match (which comes in a follow-up patch):
# iptables -I INPUT -p tcp --sport 80 -m nfacct --nfacct-name http-traffic
# iptables -I OUTPUT -p tcp --dport 80 -m nfacct --nfacct-name http-traffic
The idea is simple: if one packet matches the rule, the nfacct match
updates the counters.
Thanks to Patrick McHardy, Eric Dumazet, Changli Gao for reviewing and
providing feedback for this contribution.
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
Aim of this patch is to provide full range of rps_flow_cnt on 64bit arches.
Theorical limit on number of flows is 2^32
Fix some buggy RPS/RFS macros as well.
Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
CC: Tom Herbert <therbert@google.com>
CC: Xi Wang <xi.wang@gmail.com>
CC: Laurent Chavey <chavey@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Conflicts:
net/bluetooth/l2cap_core.c
Just two overlapping changes, one added an initialization of
a local variable, and another change added a new local variable.
Signed-off-by: David S. Miller <davem@davemloft.net>
In order to accommodate a 64K buffer we need 64K/PAGE_SIZE plus one more page
in order to allow for a buffer which does not start on a page boundary.
Signed-off-by: Ian Campbell <ian.campbell@citrix.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
The NAT range to nlattr conversation callbacks and helpers are entirely
dead code and are also useless since there are no NAT ranges in conntrack
context, they are only used for initially selecting a tuple. The final NAT
information is contained in the selected tuples of the conntrack entry.
Signed-off-by: Patrick McHardy <kaber@trash.net>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
The only remaining user of NAT protocol module reference counting is NAT
ctnetlink support. Since this is a fairly short sequence of code, convert
over to use RCU and remove module reference counting.
Module unregistration is already protected by RCU using synchronize_rcu(),
so no further changes are necessary.
Signed-off-by: Patrick McHardy <kaber@trash.net>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
Export the NAT definitions to userspace. So far userspace (specifically,
iptables) has been copying the headers files from include/net. Also
rename some structures and definitions in preparation for IPv6 NAT.
Since these have never been officially exported, this doesn't affect
existing userspace code.
Signed-off-by: Patrick McHardy <kaber@trash.net>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
This partially reworks bc01befdcf
which added userspace expectation support.
This patch removes the nf_ct_userspace_expect_list since now we
force to use the new iptables CT target feature to add the helper
extension for conntracks that have attached expectations from
userspace.
A new version of the proof-of-concept code to implement userspace
helpers from userspace is available at:
http://people.netfilter.org/pablo/userspace-conntrack-helpers/nf-ftp-helper-POC.tar.bz2
This patch also modifies the CT target to allow to set the
conntrack's userspace helper status flags. This flag is used
to tell the conntrack system to explicitly allocate the helper
extension.
This helper extension is useful to link the userspace expectations
with the master conntrack that is being tracked from one userspace
helper.
This feature fixes a problem in the current approach of the
userspace helper support. Basically, if the master conntrack that
has got a userspace expectation vanishes, the expectations point to
one invalid memory address. Thus, triggering an oops in the
expectation deletion event path.
I decided not to add a new revision of the CT target because
I only needed to add a new flag for it. I'll document in this
issue in the iptables manpage. I have also changed the return
value from EINVAL to EOPNOTSUPP if one flag not supported is
specified. Thus, in the future adding new features that only
require a new flag can be added without a new revision.
There is no official code using this in userspace (apart from
the proof-of-concept) that uses this infrastructure but there
will be some by beginning 2012.
Reported-by: Sam Roberts <vieuxtech@gmail.com>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
skb->truesize might be big even for a small packet.
Its even bigger after commit 87fb4b7b53 (net: more accurate skb
truesize) and big MTU.
We should allow queueing at least one packet per receiver, even with a
low RCVBUF setting.
Reported-by: Michal Simek <monstr@monstr.eu>
Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
They don't need to disable interrupts anymore, we only run in process
context now.
Acked-by: Marcel Holtmann <marcel@holtmann.org>
Signed-off-by: Gustavo F. Padovan <padovan@profusion.mobi>
To achive Man-In-The-Middle (MITM) level security with Low Energy,
we have to enable User Passkey Comparison. This commit modifies the
hard-coded JUST-WORKS pairing mechanism to support query via the MGMT
interface of Passkey comparison and User Confirmation.
Signed-off-by: Brian Gix <bgix@codeaurora.org>
Acked-by: Marcel Holtmann<marcel@holtmann.org>
Signed-off-by: Gustavo F. Padovan <padovan@profusion.mobi>
The struct hci_proto and all related register/unregister and dispatching
code was removed. HCI core code now call directly the SCO and L2CAP
event functions.
Signed-off-by: Ulisses Furquim <ulisses@profusion.mobi>
Acked-by: Marcel Holtmann <marcel@holtmann.org>
Signed-off-by: Gustavo F. Padovan <padovan@profusion.mobi>