Pull networking updates from David Miller:
1) Unified UDP encapsulation offload methods for drivers, from
Alexander Duyck.
2) Make DSA binding more sane, from Andrew Lunn.
3) Support QCA9888 chips in ath10k, from Anilkumar Kolli.
4) Several workqueue usage cleanups, from Bhaktipriya Shridhar.
5) Add XDP (eXpress Data Path), essentially running BPF programs on RX
packets as soon as the device sees them, with the option to mirror
the packet on TX via the same interface. From Brenden Blanco and
others.
6) Allow qdisc/class stats dumps to run lockless, from Eric Dumazet.
7) Add VLAN support to b53 and bcm_sf2, from Florian Fainelli.
8) Simplify netlink conntrack entry layout, from Florian Westphal.
9) Add ipv4 forwarding support to mlxsw spectrum driver, from Ido
Schimmel, Yotam Gigi, and Jiri Pirko.
10) Add SKB array infrastructure and convert tun and macvtap over to it.
From Michael S Tsirkin and Jason Wang.
11) Support qdisc packet injection in pktgen, from John Fastabend.
12) Add neighbour monitoring framework to TIPC, from Jon Paul Maloy.
13) Add NV congestion control support to TCP, from Lawrence Brakmo.
14) Add GSO support to SCTP, from Marcelo Ricardo Leitner.
15) Allow GRO and RPS to function on macsec devices, from Paolo Abeni.
16) Support MPLS over IPV4, from Simon Horman.
* git://git.kernel.org/pub/scm/linux/kernel/git/davem/net-next: (1622 commits)
xgene: Fix build warning with ACPI disabled.
be2net: perform temperature query in adapter regardless of its interface state
l2tp: Correctly return -EBADF from pppol2tp_getname.
net/mlx5_core/health: Remove deprecated create_singlethread_workqueue
net: ipmr/ip6mr: update lastuse on entry change
macsec: ensure rx_sa is set when validation is disabled
tipc: dump monitor attributes
tipc: add a function to get the bearer name
tipc: get monitor threshold for the cluster
tipc: make cluster size threshold for monitoring configurable
tipc: introduce constants for tipc address validation
net: neigh: disallow transition to NUD_STALE if lladdr is unchanged in neigh_update()
MAINTAINERS: xgene: Add driver and documentation path
Documentation: dtb: xgene: Add MDIO node
dtb: xgene: Add MDIO node
drivers: net: xgene: ethtool: Use phy_ethtool_gset and sset
drivers: net: xgene: Use exported functions
drivers: net: xgene: Enable MDIO driver
drivers: net: xgene: Add backward compatibility
drivers: net: phy: xgene: Add MDIO driver
...
Merge updates from Andrew Morton:
- a few misc bits
- ocfs2
- most(?) of MM
* emailed patches from Andrew Morton <akpm@linux-foundation.org>: (125 commits)
thp: fix comments of __pmd_trans_huge_lock()
cgroup: remove unnecessary 0 check from css_from_id()
cgroup: fix idr leak for the first cgroup root
mm: memcontrol: fix documentation for compound parameter
mm: memcontrol: remove BUG_ON in uncharge_list
mm: fix build warnings in <linux/compaction.h>
mm, thp: convert from optimistic swapin collapsing to conservative
mm, thp: fix comment inconsistency for swapin readahead functions
thp: update Documentation/{vm/transhuge,filesystems/proc}.txt
shmem: split huge pages beyond i_size under memory pressure
thp: introduce CONFIG_TRANSPARENT_HUGE_PAGECACHE
khugepaged: add support of collapse for tmpfs/shmem pages
shmem: make shmem_inode_info::lock irq-safe
khugepaged: move up_read(mmap_sem) out of khugepaged_alloc_page()
thp: extract khugepaged from mm/huge_memory.c
shmem, thp: respect MADV_{NO,}HUGEPAGE for file mappings
shmem: add huge pages support
shmem: get_unmapped_area align huge page
shmem: prepare huge= mount option and sysfs knob
mm, rmap: account shmem thp pages
...
Unix sockets can consume a significant amount of system memory, hence
they should be accounted to kmemcg.
Since unix socket buffers are always allocated from process context, all
we need to do to charge them to kmemcg is set __GFP_ACCOUNT in
sock->sk_allocation mask.
Eric asked:
> 1) What happens when a buffer, allocated from socket <A> lands in a
> different socket <B>, maybe owned by another user/process.
>
> Who owns it now, in term of kmemcg accounting ?
We never move memcg charges. E.g. if two processes from different
cgroups are sharing a memory region, each page will be charged to the
process which touched it first. Or if two processes are working with
the same directory tree, inodes and dentries will be charged to the
first user. The same is fair for unix socket buffers - they will be
charged to the sender.
> 2) Has performance impact been evaluated ?
I ran netperf STREAM_STREAM with default options in a kmemcg on a 4 core
x2 HT box. The results are below:
# clients bandwidth (10^6bits/sec)
base patched
1 67643 +- 725 64874 +- 353 - 4.0 %
4 193585 +- 2516 186715 +- 1460 - 3.5 %
8 194820 +- 377 187443 +- 1229 - 3.7 %
So the accounting doesn't come for free - it takes ~4% of performance.
I believe we could optimize it by using per cpu batching not only on
charge, but also on uncharge in memcg core, but that's beyond the scope
of this patch set - I'll take a look at this later.
Anyway, if performance impact is found to be unacceptable, it is always
possible to disable kmem accounting at boot time (cgroup.memory=nokmem)
or not use memory cgroups at runtime at all (thanks to jump labels
there'll be no overhead even if they are compiled in).
Link: http://lkml.kernel.org/r/fcfe6cae27a59fbc5e40145664b3cf085a560c68.1464079538.git.vdavydov@virtuozzo.com
Signed-off-by: Vladimir Davydov <vdavydov@virtuozzo.com>
Cc: "David S. Miller" <davem@davemloft.net>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Michal Hocko <mhocko@kernel.org>
Cc: Eric Dumazet <eric.dumazet@gmail.com>
Cc: Minchan Kim <minchan@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
If 'tunnel' is NULL we should return -EBADF but the 'end_put_sess' path
unconditionally sets 'error' back to zero. Rework the error path so it
more closely matches pppol2tp_sendmsg.
Fixes: fd558d186d ("l2tp: Split pppol2tp patch into separate l2tp and ppp parts")
Signed-off-by: Phil Turnbull <phil.turnbull@oracle.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Currently lastuse is updated on entry creation and cache hit, but it should
also be updated on entry change. Since both on add and update the ttl array
is updated we can simply update the lastuse in ipmr_update_thresholds.
Signed-off-by: Nikolay Aleksandrov <nikolay@cumulusnetworks.com>
CC: Roopa Prabhu <roopa@cumulusnetworks.com>
CC: Donald Sharp <sharpd@cumulusnetworks.com>
CC: David S. Miller <davem@davemloft.net>
Signed-off-by: David S. Miller <davem@davemloft.net>
In this commit, we dump the monitor attributes when queried.
The link monitor attributes are separated into two kinds:
1. general attributes per bearer
2. specific attributes per node/peer
This style resembles the socket attributes and the nametable
publications per socket.
Reviewed-by: Jon Maloy <jon.maloy@ericsson.com>
Signed-off-by: Parthasarathy Bhuvaragan <parthasarathy.bhuvaragan@ericsson.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Introduce a new function to get the bearer name from
its id. This is used in subsequent commit.
Reviewed-by: Jon Maloy <jon.maloy@ericsson.com>
Signed-off-by: Parthasarathy Bhuvaragan <parthasarathy.bhuvaragan@ericsson.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
In this commit, we add support to fetch the configured
cluster monitoring threshold.
Reviewed-by: Jon Maloy <jon.maloy@ericsson.com>
Signed-off-by: Parthasarathy Bhuvaragan <parthasarathy.bhuvaragan@ericsson.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
In this commit, we introduce support to configure the minimum
threshold to activate the new link monitoring algorithm.
Reviewed-by: Jon Maloy <jon.maloy@ericsson.com>
Signed-off-by: Parthasarathy Bhuvaragan <parthasarathy.bhuvaragan@ericsson.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
In this commit, we introduce defines for tipc address size,
offset and mask specification for Zone.Cluster.Node.
There is no functional change in this commit.
Reviewed-by: Jon Maloy <jon.maloy@ericsson.com>
Signed-off-by: Parthasarathy Bhuvaragan <parthasarathy.bhuvaragan@ericsson.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
NUD_STALE is used when the caller(e.g. arp_process()) can't guarantee
neighbour reachability. If the entry was NUD_VALID and lladdr is unchanged,
the entry state should not be changed.
Currently the code puts an extra "NUD_CONNECTED" condition. So if old state
was NUD_DELAY or NUD_PROBE (they are NUD_VALID but not NUD_CONNECTED), the
state can be changed to NUD_STALE.
This may cause problem. Because NUD_STALE lladdr doesn't guarantee
reachability, when we send traffic, the state will be changed to
NUD_DELAY. In normal case, if we get no confirmation (by dst_confirm()),
we will change the state to NUD_PROBE and send probe traffic. But now the
state may be reset to NUD_STALE again(e.g. by broadcast ARP packets),
so the probe traffic will not be sent. This situation may happen again and
again, and packets will be sent to an non-reachable lladdr forever.
The fix is to remove the "NUD_CONNECTED" condition. After that the
"NEIGH_UPDATE_F_WEAK_OVERRIDE" condition (used by IPv6) in that branch will
be redundant, so remove it.
This change may increase probe traffic, but it's essential since NUD_STALE
lladdr is unreliable. To ensure correctness, we prefer to resolve lladdr,
when we can't get confirmation, even while remote packets try to set
NUD_STALE state.
Signed-off-by: Chunhui He <hchunhui@mail.ustc.edu.cn>
Signed-off-by: Julian Anastasov <ja@ssi.bg>
Reviewed-by: Hannes Frederic Sowa <hannes@stressinduktion.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
Pull s390 updates from Martin Schwidefsky:
"There are a couple of new things for s390 with this merge request:
- a new scheduling domain "drawer" is added to reflect the unusual
topology found on z13 machines. Performance tests showed up to 8
percent gain with the additional domain.
- the new crc-32 checksum crypto module uses the vector-galois-field
multiply and sum SIMD instruction to speed up crc-32 and crc-32c.
- proper __ro_after_init support, this requires RO_AFTER_INIT_DATA in
the generic vmlinux.lds linker script definitions.
- kcov instrumentation support. A prerequisite for that is the
inline assembly basic block cleanup, which is the reason for the
net/iucv/iucv.c change.
- support for 2GB pages is added to the hugetlbfs backend.
Then there are two removals:
- the oprofile hardware sampling support is dead code and is removed.
The oprofile user space uses the perf interface nowadays.
- the ETR clock synchronization is removed, this has been superseeded
be the STP clock synchronization. And it always has been
"interesting" code..
And the usual bug fixes and cleanups"
* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/s390/linux: (82 commits)
s390/pci: Delete an unnecessary check before the function call "pci_dev_put"
s390/smp: clean up a condition
s390/cio/chp : Remove deprecated create_singlethread_workqueue
s390/chsc: improve channel path descriptor determination
s390/chsc: sanitize fmt check for chp_desc determination
s390/cio: make fmt1 channel path descriptor optional
s390/chsc: fix ioctl CHSC_INFO_CU command
s390/cio/device_ops: fix kernel doc
s390/cio: allow to reset channel measurement block
s390/console: Make preferred console handling more consistent
s390/mm: fix gmap tlb flush issues
s390/mm: add support for 2GB hugepages
s390: have unique symbol for __switch_to address
s390/cpuinfo: show maximum thread id
s390/ptrace: clarify bits in the per_struct
s390: stack address vs thread_info
s390: remove pointless load within __switch_to
s390: enable kcov support
s390/cpumf: use basic block for ecctr inline assembly
s390/hypfs: use basic block for diag inline assembly
...
After the previous patch, struct tc_action should be enough
to represent the generic tc action, tcf_common is not necessary
any more. This patch gets rid of it to make tc action code
more readable.
Cc: Jamal Hadi Salim <jhs@mojatatu.com>
Signed-off-by: Cong Wang <xiyou.wangcong@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
struct tc_action is confusing, currently we use it for two purposes:
1) Pass in arguments and carry out results from helper functions
2) A generic representation for tc actions
The first one is error-prone, since we need to make sure we don't
miss anything. This patch aims to get rid of this use, by moving
tc_action into tcf_common, so that they are allocated together
in hashtable and can be cast'ed easily.
And together with the following patch, we could really make
tc_action a generic representation for all tc actions and each
type of action can inherit from it.
Cc: Jamal Hadi Salim <jhs@mojatatu.com>
Signed-off-by: Cong Wang <xiyou.wangcong@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
After a612769774 ("udp: prevent bugcheck if filter truncates packet
too much"), there followed various other fixes for similar cases such
as f4979fcea7 ("rose: limit sk_filter trim to payload").
Latter introduced a new helper sk_filter_trim_cap(), where we can pass
the trim limit directly to the socket filter handling. Make use of it
here as well with sizeof(struct udphdr) as lower cap limit and drop the
extra skb->len test in UDP's input path.
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Cc: Willem de Bruijn <willemb@google.com>
Acked-by: Willem de Bruijn <willemb@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Pull timer updates from Thomas Gleixner:
"This update provides the following changes:
- The rework of the timer wheel which addresses the shortcomings of
the current wheel (cascading, slow search for next expiring timer,
etc). That's the first major change of the wheel in almost 20
years since Finn implemted it.
- A large overhaul of the clocksource drivers init functions to
consolidate the Device Tree initialization
- Some more Y2038 updates
- A capability fix for timerfd
- Yet another clock chip driver
- The usual pile of updates, comment improvements all over the place"
* 'timers-core-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (130 commits)
tick/nohz: Optimize nohz idle enter
clockevents: Make clockevents_subsys static
clocksource/drivers/time-armada-370-xp: Fix return value check
timers: Implement optimization for same expiry time in mod_timer()
timers: Split out index calculation
timers: Only wake softirq if necessary
timers: Forward the wheel clock whenever possible
timers/nohz: Remove pointless tick_nohz_kick_tick() function
timers: Optimize collect_expired_timers() for NOHZ
timers: Move __run_timers() function
timers: Remove set_timer_slack() leftovers
timers: Switch to a non-cascading wheel
timers: Reduce the CPU index space to 256k
timers: Give a few structs and members proper names
hlist: Add hlist_is_singular_node() helper
signals: Use hrtimer for sigtimedwait()
timers: Remove the deprecated mod_timer_pinned() API
timers, net/ipv4/inet: Initialize connection request timers as pinned
timers, drivers/tty/mips_ejtag: Initialize the poll timer as pinned
timers, drivers/tty/metag_da: Initialize the poll timer as pinned
...
I was seeing a lot of these:
BUG: sleeping function called from invalid context at mm/slab.h:388
in_atomic(): 0, irqs_disabled(): 0, pid: 14971, name: trinity-c2
Preemption disabled at:[<ffffffff819bcd46>] rhashtable_walk_start+0x46/0x150
[<ffffffff81149abb>] preempt_count_add+0x1fb/0x280
[<ffffffff83295722>] _raw_spin_lock+0x12/0x40
[<ffffffff811aac87>] console_unlock+0x2f7/0x930
[<ffffffff811ab5bb>] vprintk_emit+0x2fb/0x520
[<ffffffff811aba6a>] vprintk_default+0x1a/0x20
[<ffffffff812c171a>] printk+0x94/0xb0
[<ffffffff811d6ed0>] print_stack_trace+0xe0/0x170
[<ffffffff8115835e>] ___might_sleep+0x3be/0x460
[<ffffffff81158490>] __might_sleep+0x90/0x1a0
[<ffffffff8139b823>] kmem_cache_alloc+0x153/0x1e0
[<ffffffff819bca1e>] rhashtable_walk_init+0xfe/0x2d0
[<ffffffff82ec64de>] sctp_transport_walk_start+0x1e/0x60
[<ffffffff82edd8ad>] sctp_transport_seq_start+0x4d/0x150
[<ffffffff8143a82b>] seq_read+0x27b/0x1180
[<ffffffff814f97fc>] proc_reg_read+0xbc/0x180
[<ffffffff813d471b>] __vfs_read+0xdb/0x610
[<ffffffff813d4d3a>] vfs_read+0xea/0x2d0
[<ffffffff813d615b>] SyS_pread64+0x11b/0x150
[<ffffffff8100334c>] do_syscall_64+0x19c/0x410
[<ffffffff832960a5>] return_from_SYSCALL_64+0x0/0x6a
[<ffffffffffffffff>] 0xffffffffffffffff
Apparently we always need to call rhashtable_walk_stop(), even when
rhashtable_walk_start() fails:
* rhashtable_walk_start - Start a hash table walk
* @iter: Hash table iterator
*
* Start a hash table walk. Note that we take the RCU lock in all
* cases including when we return an error. So you must always call
* rhashtable_walk_stop to clean up.
otherwise we never call rcu_read_unlock() and we get the splat above.
Fixes: 53fa1036 ("sctp: fix some rhashtable functions using in sctp proc/diag")
See-also: 53fa1036 ("sctp: fix some rhashtable functions using in sctp proc/diag")
See-also: f2dba9c6 ("rhashtable: Introduce rhashtable_walk_*")
Cc: Xin Long <lucien.xin@gmail.com>
Cc: Herbert Xu <herbert@gondor.apana.org.au>
Cc: stable@vger.kernel.org
Signed-off-by: Vegard Nossum <vegard.nossum@oracle.com>
Acked-by: Marcelo Ricardo Leitner <marcelo.leitner@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Pull locking updates from Ingo Molnar:
"The locking tree was busier in this cycle than the usual pattern - a
couple of major projects happened to coincide.
The main changes are:
- implement the atomic_fetch_{add,sub,and,or,xor}() API natively
across all SMP architectures (Peter Zijlstra)
- add atomic_fetch_{inc/dec}() as well, using the generic primitives
(Davidlohr Bueso)
- optimize various aspects of rwsems (Jason Low, Davidlohr Bueso,
Waiman Long)
- optimize smp_cond_load_acquire() on arm64 and implement LSE based
atomic{,64}_fetch_{add,sub,and,andnot,or,xor}{,_relaxed,_acquire,_release}()
on arm64 (Will Deacon)
- introduce smp_acquire__after_ctrl_dep() and fix various barrier
mis-uses and bugs (Peter Zijlstra)
- after discovering ancient spin_unlock_wait() barrier bugs in its
implementation and usage, strengthen its semantics and update/fix
usage sites (Peter Zijlstra)
- optimize mutex_trylock() fastpath (Peter Zijlstra)
- ... misc fixes and cleanups"
* 'locking-core-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (67 commits)
locking/atomic: Introduce inc/dec variants for the atomic_fetch_$op() API
locking/barriers, arch/arm64: Implement LDXR+WFE based smp_cond_load_acquire()
locking/static_keys: Fix non static symbol Sparse warning
locking/qspinlock: Use __this_cpu_dec() instead of full-blown this_cpu_dec()
locking/atomic, arch/tile: Fix tilepro build
locking/atomic, arch/m68k: Remove comment
locking/atomic, arch/arc: Fix build
locking/Documentation: Clarify limited control-dependency scope
locking/atomic, arch/rwsem: Employ atomic_long_fetch_add()
locking/atomic, arch/qrwlock: Employ atomic_fetch_add_acquire()
locking/atomic, arch/mips: Convert to _relaxed atomics
locking/atomic, arch/alpha: Convert to _relaxed atomics
locking/atomic: Remove the deprecated atomic_{set,clear}_mask() functions
locking/atomic: Remove linux/atomic.h:atomic_fetch_or()
locking/atomic: Implement atomic{,64,_long}_fetch_{add,sub,and,andnot,or,xor}{,_relaxed,_acquire,_release}()
locking/atomic: Fix atomic64_relaxed() bits
locking/atomic, arch/xtensa: Implement atomic_fetch_{add,sub,and,or,xor}()
locking/atomic, arch/x86: Implement atomic{,64}_fetch_{add,sub,and,or,xor}()
locking/atomic, arch/tile: Implement atomic{,64}_fetch_{add,sub,and,or,xor}()
locking/atomic, arch/sparc: Implement atomic{,64}_fetch_{add,sub,and,or,xor}()
...
The head skb for GSO packets won't travel through the inner depths of
SCTP stack as it doesn't contain any chunks on it. That means skb->sk
doesn't get set and then when sctp_recvmsg() calls
sctp_inet6_skb_msgname() on the head_skb it panics, as this last needs
to check flags at the socket (sp->v4mapped).
The fix is to initialize skb->sk for th head skb once we are able to do
it. That is, when the first chunk is processed.
Signed-off-by: Marcelo Ricardo Leitner <marcelo.leitner@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
The check for a -ve error is redundant, remove it and just
immediately return the return value from the call to
seq_open_net.
Signed-off-by: Colin Ian King <colin.king@canonical.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Default kernel behavior is to delete IPv6 addresses on link
down, which entails deletion of the multicast and the
subnet-router anycast addresses. These deletions do not
happen with sysctl setting to keep global IPv6 addresses on
link down, so every link down/up causes an increment of the
anycast and multicast refcounts. These bogus refcounts may
stop these addrs from being removed on subsequent calls to
delete them. The solution is to leave the groups for the
multicast and subnet anycast on link down for the callflow
when global IPv6 addresses are kept.
Fixes: f1705ec197 ("net: ipv6: Make address flushing on ifdown optional")
Signed-off-by: Mike Manning <mmanning@brocade.com>
Acked-by: David Ahern <dsa@cumulusnetworks.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Commit 486bdee013 ("sctp: add support for RPS and RFS")
saves skb->hash into sk->sk_rxhash so that the inet_* can
record it to flow table.
But sctp uses sock_common_recvmsg as .recvmsg instead
of inet_recvmsg, sock_common_recvmsg doesn't invoke
sock_rps_record_flow to record the flow. It may cause
that the receiver has no chances to record the flow if
it doesn't send msg or poll the socket.
So this patch fixes it by using inet_recvmsg as .recvmsg
in sctp.
Fixes: 486bdee013 ("sctp: add support for RPS and RFS")
Signed-off-by: Xin Long <lucien.xin@gmail.com>
Acked-by: Marcelo Ricardo Leitner <marcelo.leitner@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Commit 8626c56c82 ("bridge: fix potential use-after-free when hook
returns QUEUE or STOLEN verdict") caused LLDP packets arriving through a
bridge port to be re-injected to the Rx path with skb->dev set to the
bridge device, but this breaks the lldpad daemon.
The lldpad daemon opens a packet socket with protocol set to ETH_P_LLDP
for any valid device on the system, which doesn't not include soft
devices such as bridge and VLAN.
Since packet sockets (ptype_base) are processed in the Rx path after the
Rx handler, LLDP packets with skb->dev set to the bridge device never
reach the lldpad daemon.
Fix this by making the bridge's Rx handler re-inject LLDP packets with
RX_HANDLER_PASS, which effectively restores the behaviour prior to the
mentioned commit.
This means netfilter will never receive LLDP packets coming through a
bridge port, as I don't see a way in which we can have okfn() consume
the packet without breaking existing behaviour. I've already carried out
a similar fix for STP packets in commit 56fae404fb ("bridge: Fix
incorrect re-injection of STP packets").
Fixes: 8626c56c82 ("bridge: fix potential use-after-free when hook returns QUEUE or STOLEN verdict")
Signed-off-by: Ido Schimmel <idosch@mellanox.com>
Reviewed-by: Jiri Pirko <jiri@mellanox.com>
Cc: Florian Westphal <fw@strlen.de>
Cc: John Fastabend <john.fastabend@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
This patch makes sctp support ipv6 nonlocal bind by adding
sp->inet.freebind and net->ipv6.sysctl.ip_nonlocal_bind
check in sctp_v6_available as what sctp did to support
ipv4 nonlocal bind (commit cdac4e0774).
Reported-by: Shijoe George <spanjikk@redhat.com>
Signed-off-by: Xin Long <lucien.xin@gmail.com>
Acked-by: Marcelo Ricardo Leitner <marcelo.leitner@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
This patch fixes the __output_custom() routine we currently use with
bpf_skb_copy(). I missed that when len is larger than the size of the
current handle, we can issue multiple invocations of copy_func, and
__output_custom() advances destination but also source buffer by the
written amount of bytes. When we have __output_custom(), this is actually
wrong since in that case the source buffer points to a non-linear object,
in our case an skb, which the copy_func helper is supposed to walk.
Therefore, since this is non-linear we thus need to pass the offset into
the helper, so that copy_func can use it for extracting the data from
the source object.
Therefore, adjust the callback signatures properly and pass offset
into the skb_header_pointer() invoked from bpf_skb_copy() callback. The
__DEFINE_OUTPUT_COPY_BODY() is adjusted to accommodate for two things:
i) to pass in whether we should advance source buffer or not; this is
a compile-time constant condition, ii) to pass in the offset for
__output_custom(), which we do with help of __VA_ARGS__, so everything
can stay inlined as is currently. Both changes allow for adapting the
__output_* fast-path helpers w/o extra overhead.
Fixes: 555c8a8623 ("bpf: avoid stack copy and use skb ctx for event output")
Fixes: 7e3f977edd ("perf, events: add non-linear data support for raw records")
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: Alexei Starovoitov <ast@kernel.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
gcc-4.9 and higher warn about the newly added NSCI code:
net/ncsi/ncsi-manage.c: In function 'ncsi_process_next_channel':
net/ncsi/ncsi-manage.c:1003:2: error: 'old_state' may be used uninitialized in this function [-Werror=maybe-uninitialized]
The warning is a false positive and therefore harmless, but it would be good to
avoid it anyway. I have determined that the barrier in the spin_unlock_irqsave()
is what confuses gcc to the point that it cannot track whether the variable
was unused or not.
This rearranges the code in a way that makes it obvious to gcc that old_state
is always initialized at the time of use, functionally this should not
change anything.
Signed-off-by: Arnd Bergmann <arnd@arndb.de>
Acked-by: Gavin Shan <gwshan@linux.vnet.ibm.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Change the ageing_time type in br_set_ageing_time() from u32 to what it
is expected to be, i.e. a clock_t.
Signed-off-by: Vivien Didelot <vivien.didelot@savoirfairelinux.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
br_stp_enable_bridge() does take the br->lock spinlock. Fix its wrongly
pasted comment and use the same as br_stp_disable_bridge().
Signed-off-by: Vivien Didelot <vivien.didelot@savoirfairelinux.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Following the work that have been done on offloading classifiers like u32
and flower, now the match-all classifier hw offloading is possible. if
the interface supports tc offloading.
To control the offloading, two tc flags have been introduced: skip_sw and
skip_hw. Typical usage:
tc filter add dev eth25 parent ffff: \
matchall skip_sw \
action mirred egress mirror \
dev eth27
Signed-off-by: Yotam Gigi <yotamg@mellanox.com>
Signed-off-by: Jiri Pirko <jiri@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
The matchall classifier matches every packet and allows the user to apply
actions on it. This filter is very useful in usecases where every packet
should be matched, for example, packet mirroring (SPAN) can be setup very
easily using that filter.
Signed-off-by: Jiri Pirko <jiri@mellanox.com>
Signed-off-by: Yotam Gigi <yotamg@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Pablo Neira Ayuso says:
====================
Netfilter/IPVS updates for net-next
The following patchset contains Netfilter/IPVS updates for net-next,
they are:
1) Count pre-established connections as active in "least connection"
schedulers such that pre-established connections to avoid overloading
backend servers on peak demands, from Michal Kubecek via Simon Horman.
2) Address a race condition when resizing the conntrack table by caching
the bucket size when fulling iterating over the hashtable in these
three possible scenarios: 1) dump via /proc/net/nf_conntrack,
2) unlinking userspace helper and 3) unlinking custom conntrack timeout.
From Liping Zhang.
3) Revisit early_drop() path to perform lockless traversal on conntrack
eviction under stress, use del_timer() as synchronization point to
avoid two CPUs evicting the same entry, from Florian Westphal.
4) Move NAT hlist_head to nf_conn object, this simplifies the existing
NAT extension and it doesn't increase size since recent patches to
align nf_conn, from Florian.
5) Use rhashtable for the by-source NAT hashtable, also from Florian.
6) Don't allow --physdev-is-out from OUTPUT chain, just like
--physdev-out is not either, from Hangbin Liu.
7) Automagically set on nf_conntrack counters if the user tries to
match ct bytes/packets from nftables, from Liping Zhang.
8) Remove possible_net_t fields in nf_tables set objects since we just
simply pass the net pointer to the backend set type implementations.
9) Fix possible off-by-one in h323, from Toby DiPasquale.
10) early_drop() may be called from ctnetlink patch, so we must hold
rcu read size lock from them too, this amends Florian's patch #3
coming in this batch, from Liping Zhang.
11) Use binary search to validate jump offset in x_tables, this
addresses the O(n!) validation that was introduced recently
resolve security issues with unpriviledge namespaces, from Florian.
12) Fix reference leak to connlabel in error path of nft_ct, from Zhang.
13) Three updates for nft_log: Fix log prefix leak in error path. Bail
out on loglevel larger than debug in nft_log and set on the new
NF_LOG_F_COPY_LEN flag when snaplen is specified. Again from Zhang.
14) Allow to filter rule dumps in nf_tables based on table and chain
names.
15) Simplify connlabel to always use 128 bits to store labels and
get rid of unused function in xt_connlabel, from Florian.
16) Replace set_expect_timeout() by mod_timer() from the h323 conntrack
helper, by Gao Feng.
17) Put back x_tables module reference in nft_compat on error, from
Liping Zhang.
18) Add a reference count to the x_tables extensions cache in
nft_compat, so we can remove them when unused and avoid a crash
if the extensions are rmmod, again from Zhang.
====================
Signed-off-by: David S. Miller <davem@davemloft.net>
Here is the big tty and serial driver update for 4.8-rc1.
Lots of good cleanups from Jiri on a number of vt and other tty related
things, and the normal driver updates. Full details are in the
shortlog.
All of these have been in linux-next for a while with no reported
issues.
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v2
iFYEABECABYFAleVPbQPHGdyZWdAa3JvYWguY29tAAoJEDFH1A3bLfspWXgAn046
QCMeFya4J1zjYjcGXJzNfGMUAKCHxha8Xe65cc0LDz8mNB0MgzjHEg==
=ED8v
-----END PGP SIGNATURE-----
Merge tag 'tty-4.8-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/tty
Pull tty/serial driver updates from Greg KH:
"Here is the big tty and serial driver update for 4.8-rc1.
Lots of good cleanups from Jiri on a number of vt and other tty
related things, and the normal driver updates. Full details are in
the shortlog.
All of these have been in linux-next for a while with no reported
issues"
* tag 'tty-4.8-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/tty: (90 commits)
tty/serial: atmel: enforce tasklet init and termination sequences
serial: sh-sci: Stop transfers in sci_shutdown()
serial: 8250_ingenic: drop #if conditional surrounding earlycon code
serial: 8250_mtk: drop !defined(MODULE) conditional
serial: 8250_uniphier: drop !defined(MODULE) conditional
earlycon: mark earlycon code as __used iif the caller is built-in
tty/serial/8250: use mctrl_gpio helpers
serial: mctrl_gpio: enable API usage only for initialized mctrl_gpios struct
serial: mctrl_gpio: add modem control read routine
tty/serial/8250: make UART_MCR register access consistent
serial: 8250_mid: Read RX buffer on RX DMA timeout for DNV
serial: 8250_dma: Export serial8250_rx_dma_flush()
dmaengine: hsu: Export hsu_dma_get_status()
tty: serial: 8250: add CON_CONSDEV to flags
tty: serial: samsung: add byte-order aware bit functions
tty: serial: samsung: fixup accessors for endian
serial: sirf: make fifo functions static
serial: mps2-uart: make driver explicitly non-modular
serial: mvebu-uart: free the IRQ in ->shutdown()
serial/bcm63xx_uart: use correct alias naming
...
that caused misdirected requests, tagged for stable.
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v2
iQEcBAABCAAGBQJXk4kSAAoJEEp/3jgCEfOLvpsIAKJTs1ELIQ5RmfvwdvqyqI0N
DoSA6rBYIwQvjqBjevJw72w5HKR7hhJoxaEjXEFrw9zbRmXMNnlk5xZpgD8vy5E3
1iCA8LtscFp4ld4ZNWIus45mUpf6/a5ugPd9Mr3V5C4J05LWqZeXufpAHNHyFbII
++hTu6J/RAg8DddEUhBcDl7c65tQpc8ai0h8ll0pLRYNFLPeCoYO3yTitEYax4fR
i6erB3+7pNWnZIsPnUTrXS4B2NG5kPmflVkD7UH9i14PwdQ4QO85LSXD1o8xYrpa
Occ9EvgFuT8zTJHckCEcT2Y0dINz2uHiE05DUea3Udz82keV9zKeZhZUDwJ95RE=
=P1qk
-----END PGP SIGNATURE-----
Merge tag 'ceph-for-4.7-rc8' of git://github.com/ceph/ceph-client
Pull ceph fix from Ilya Dryomov:
"A fix for a long-standing bug in the incremental osdmap handling code
that caused misdirected requests, tagged for stable"
The tag is signed with a brand new key - Sage is on vacation and I
didn't anticipate this"
* tag 'ceph-for-4.7-rc8' of git://github.com/ceph/ceph-client:
libceph: apply new_state before new_up_client on incrementals
We "cache" the loaded match/target modules and reuse them, but when the
modules are removed, we still point to them. Then we may end up with
invalid memory references when using iptables-compat to add rules later.
Input the following commands will reproduce the kernel crash:
# iptables-compat -A INPUT -j LOG
# iptables-compat -D INPUT -j LOG
# rmmod xt_LOG
# iptables-compat -A INPUT -j LOG
BUG: unable to handle kernel paging request at ffffffffa05a9010
IP: [<ffffffff813f783e>] strcmp+0xe/0x30
Call Trace:
[<ffffffffa05acc43>] nft_target_select_ops+0x83/0x1f0 [nft_compat]
[<ffffffffa058a177>] nf_tables_expr_parse+0x147/0x1f0 [nf_tables]
[<ffffffffa058e541>] nf_tables_newrule+0x301/0x810 [nf_tables]
[<ffffffff8141ca00>] ? nla_parse+0x20/0x100
[<ffffffffa057fa8f>] nfnetlink_rcv+0x33f/0x53d [nfnetlink]
[<ffffffffa057f94b>] ? nfnetlink_rcv+0x1fb/0x53d [nfnetlink]
[<ffffffff817116b8>] netlink_unicast+0x178/0x220
[<ffffffff81711a5b>] netlink_sendmsg+0x2fb/0x3a0
[<ffffffff816b7fc8>] sock_sendmsg+0x38/0x50
[<ffffffff816b8a7e>] ___sys_sendmsg+0x28e/0x2a0
[<ffffffff816bcb7e>] ? release_sock+0x1e/0xb0
[<ffffffff81804ac5>] ? _raw_spin_unlock_bh+0x35/0x40
[<ffffffff816bcbe2>] ? release_sock+0x82/0xb0
[<ffffffff816b93d4>] __sys_sendmsg+0x54/0x90
[<ffffffff816b9422>] SyS_sendmsg+0x12/0x20
[<ffffffff81805172>] entry_SYSCALL_64_fastpath+0x1a/0xa9
So when nobody use the related match/target module, there's no need to
"cache" it. And nft_[match|target]_release are useless anymore, remove
them.
Signed-off-by: Liping Zhang <liping.zhang@spreadtrum.com>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
If the user specify the invalid NFTA_MATCH_INFO/NFTA_TARGET_INFO attr
or memory alloc fail, we should call module_put to the related match
or target. Otherwise, we cannot remove the module even nobody use it.
Signed-off-by: Liping Zhang <liping.zhang@spreadtrum.com>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
Simplify the code without any side effect. The set_expect_timeout is
used to modify the timer expired time. It tries to delete timer, and
add it again. So we could use mod_timer directly.
Signed-off-by: Gao Feng <fgao@ikuai8.com>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
The conntrack label extension is currently variable-sized, e.g. if
only 2 labels are used by iptables rules then the labels->bits[] array
will only contain one element.
We track size of each label storage area in the 'words' member.
But in nftables and openvswitch we always have to ask for worst-case
since we don't know what bit will be used at configuration time.
As most arches are 64bit we need to allocate 24 bytes in this case:
struct nf_conn_labels {
u8 words; /* 0 1 */
/* XXX 7 bytes hole, try to pack */
long unsigned bits[2]; /* 8 24 */
Make bits a fixed size and drop the words member, it simplifies
the code and only increases memory requirements on x86 when
less than 64bit labels are required.
We still only allocate the extension if its needed.
Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
Currently, osd_weight and osd_state fields are updated in the encoding
order. This is wrong, because an incremental map may look like e.g.
new_up_client: { osd=6, addr=... } # set osd_state and addr
new_state: { osd=6, xorstate=EXISTS } # clear osd_state
Suppose osd6's current osd_state is EXISTS (i.e. osd6 is down). After
applying new_up_client, osd_state is changed to EXISTS | UP. Carrying
on with the new_state update, we flip EXISTS and leave osd6 in a weird
"!EXISTS but UP" state. A non-existent OSD is considered down by the
mapping code
2087 for (i = 0; i < pg->pg_temp.len; i++) {
2088 if (ceph_osd_is_down(osdmap, pg->pg_temp.osds[i])) {
2089 if (ceph_can_shift_osds(pi))
2090 continue;
2091
2092 temp->osds[temp->size++] = CRUSH_ITEM_NONE;
and so requests get directed to the second OSD in the set instead of
the first, resulting in OSD-side errors like:
[WRN] : client.4239 192.168.122.21:0/2444980242 misdirected client.4239.1:2827 pg 2.5df899f2 to osd.4 not [1,4,6] in e680/680
and hung rbds on the client:
[ 493.566367] rbd: rbd0: write 400000 at 11cc00000 (0)
[ 493.566805] rbd: rbd0: result -6 xferred 400000
[ 493.567011] blk_update_request: I/O error, dev rbd0, sector 9330688
The fix is to decouple application from the decoding and:
- apply new_weight first
- apply new_state before new_up_client
- twiddle osd_state flags if marking in
- clear out some of the state if osd is destroyed
Fixes: http://tracker.ceph.com/issues/14901
Cc: stable@vger.kernel.org # 3.15+: 6dd74e44dc1d: libceph: set 'exists' flag for newly up osd
Cc: stable@vger.kernel.org # 3.15+
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
Reviewed-by: Josh Durgin <jdurgin@redhat.com>
sock_cmsg_send() can return different error codes and not only
-EINVAL, and we should properly propagate them.
Fixes: c14ac9451c ("sock: enable timestamping using control messages")
Signed-off-by: Soheil Hassas Yeganeh <soheil@google.com>
Cc: Willem de Bruijn <willemb@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
This is the first NFC pull request for 4.8. We have:
- A fairly large NFC digital stack patchset:
* RTOX fixes.
* Proper DEP RWT support.
* ACK and NACK PDUs handling fixes, in both initiator
and target modes.
* A few memory leak fixes.
- A conversion of the nfcsim driver to use the digital stack.
The driver supports the DEP protocol in both NFC-A and NFC-F.
- Error injection through debugfs for the nfcsim driver.
- Improvements to the port100 driver for the Sony USB chipset, in
particular to the command abort and cancellation code paths.
- A few minor fixes for the pn533, trf7970a and fdp drivers.
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1
iQIcBAABAgAGBQJXjp/3AAoJEIqAPN1PVmxKjpEP/2bgptSwf2cCql+Jv+YaLLny
PjKr+qBXxqvogclucaGg5iIFnVjJ/pHd+csCeqHWpT1jAK7DtK1IJZjg6K9nD7vO
OQQ49+oxcFLTTy00rbfCFzxRaDDnkhD/qafXkTomEMiH7QVK0qssaxm2FVFVEblW
1NTmcsEUnbDbccQhXnxh+N7Xt2CAgsMIbbyHM+4yQuqGtSYjFd164c3jTL13W4a5
SQEJZkCtI7DIdFd6SiXkTGNjlN/fqXuUqXsf2EHxdFb7EE0c067uHpudp2hFdAem
WmAYjjmIuTRFwRFKPJMLUakSen3XbBKVUbtDnIMYykNWYnC4CmedrCrX3YRw4GQt
hZgkj6o5IweSSf6foIgihurE6m5jqd2mAcauwYC/K9wW5nHLaKg8fd9gAngoWY7P
MKBOCyjqIPWkNDC5tne6qftpsDhCrBcdrAtbkorx0lHF20OFto7Gjzxx1Ca+fnJg
N9/fMulQJu8rz3FYpvfvogQMJjkjeFUfyZDa3/ft/ySU6qohxDwXOFaZ82lieTAo
PztTq8tY7GDrdJdyvvHx78RpRVCJT8qHzBRIiiZRpt9MM/aPSepLcozwM97WrJDa
sPvz0jol4d12VIy02j2ArPjMon1MrQePed+Y1y2OtBt8rGiSUxC94t5LE3aMqPs/
a9tNLZYL/nixpLeXbeWa
=yrVP
-----END PGP SIGNATURE-----
Merge tag 'nfc-next-4.8-1' of git://git.kernel.org/pub/scm/linux/kernel/git/sameo/nfc-next
Samuel Ortiz says:
====================
NFC 4.8 pull request
This is the first NFC pull request for 4.8. We have:
- A fairly large NFC digital stack patchset:
* RTOX fixes.
* Proper DEP RWT support.
* ACK and NACK PDUs handling fixes, in both initiator
and target modes.
* A few memory leak fixes.
- A conversion of the nfcsim driver to use the digital stack.
The driver supports the DEP protocol in both NFC-A and NFC-F.
- Error injection through debugfs for the nfcsim driver.
- Improvements to the port100 driver for the Sony USB chipset, in
particular to the command abort and cancellation code paths.
- A few minor fixes for the pn533, trf7970a and fdp drivers.
====================
Signed-off-by: David S. Miller <davem@davemloft.net>
The IFLA_XDP_ATTACHED nested attribute is meant for read-only, and while
do_setlink properly ignores it, it should be more paranoid and reject
commands that try to set it.
Signed-off-by: Brenden Blanco <bblanco@plumgrid.com>
Acked-by: Alexei Starovoitov <ast@kernel.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
If the table and/or chain attributes are set in a rule dump request,
we filter out the rules based on this selection.
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
There's a similar problem in xt_NFLOG, and was fixed by commit 7643507fe8
("netfilter: xt_NFLOG: nflog-range does not truncate packets"). Only set
copy_len here does not work, so we should enable NF_LOG_F_COPY_LEN also.
Signed-off-by: Liping Zhang <liping.zhang@spreadtrum.com>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
User can specify the log level larger than 7(debug level) via
nfnetlink, this is invalid. So in this case, we should report
EINVAL to the userspace.
Signed-off-by: Liping Zhang <liping.zhang@spreadtrum.com>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
Suppose that we specify the NFTA_LOG_PREFIX, then NFTA_LOG_LEVEL
and NFTA_LOG_GROUP are specified together or nf_logger_find_get
call returns fail, i.e. expr init fail, memory leak will happen.
Signed-off-by: Liping Zhang <liping.zhang@spreadtrum.com>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
Add nf_ct_helper_init(), nf_conntrack_helpers_register() and
nf_conntrack_helpers_unregister() functions to avoid repetitive
opencoded initialization in helpers.
This patch keeps an id parameter for nf_ct_helper_init() not to break
helper matching by name that has been inconsistently exposed to
userspace through ports, eg. ftp-2121, and through an incremental id,
eg. tftp-1.
Signed-off-by: Gao Feng <fgao@ikuai8.com>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>