Commit Graph

44644 Commits

Author SHA1 Message Date
Michael S. Tsirkin f83f12d660 vsock/virtio: fix src/dst cid format
These fields are 64 bit, using le32_to_cpu and friends
on these will not do the right thing.
Fix this up.

Cc: stable@vger.kernel.org
Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
2016-12-15 06:59:19 +02:00
Michael S. Tsirkin 819483d806 vsock/virtio: mark an internal function static
virtio_transport_alloc_pkt is only used locally, make it static.

Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
2016-12-15 06:59:19 +02:00
Michael S. Tsirkin 6c7efafdd5 vsock/virtio: add a missing __le annotation
guest cid is read from config space, therefore it's in little endian
format and is treated as such, annotate it accordingly.

Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
2016-12-15 06:59:18 +02:00
Linus Torvalds a57cb1c1d7 Merge branch 'akpm' (patches from Andrew)
Merge more updates from Andrew Morton:

 - a few misc things

 - kexec updates

 - DMA-mapping updates to better support networking DMA operations

 - IPC updates

 - various MM changes to improve DAX fault handling

 - lots of radix-tree changes, mainly to the test suite. All leading up
   to reimplementing the IDA/IDR code to be a wrapper layer over the
   radix-tree. However the final trigger-pulling patch is held off for
   4.11.

* emailed patches from Andrew Morton <akpm@linux-foundation.org>: (114 commits)
  radix tree test suite: delete unused rcupdate.c
  radix tree test suite: add new tag check
  radix-tree: ensure counts are initialised
  radix tree test suite: cache recently freed objects
  radix tree test suite: add some more functionality
  idr: reduce the number of bits per level from 8 to 6
  rxrpc: abstract away knowledge of IDR internals
  tpm: use idr_find(), not idr_find_slowpath()
  idr: add ida_is_empty
  radix tree test suite: check multiorder iteration
  radix-tree: fix replacement for multiorder entries
  radix-tree: add radix_tree_split_preload()
  radix-tree: add radix_tree_split
  radix-tree: add radix_tree_join
  radix-tree: delete radix_tree_range_tag_if_tagged()
  radix-tree: delete radix_tree_locate_item()
  radix-tree: improve multiorder iterators
  btrfs: fix race in btrfs_free_dummy_fs_info()
  radix-tree: improve dump output
  radix-tree: make radix_tree_find_next_bit more useful
  ...
2016-12-14 17:25:18 -08:00
Matthew Wilcox 444306129a rxrpc: abstract away knowledge of IDR internals
Add idr_get_cursor() / idr_set_cursor() APIs, and remove the reference
to IDR_SIZE.

Link: http://lkml.kernel.org/r/1480369871-5271-65-git-send-email-mawilcox@linuxonhyperv.com
Signed-off-by: Matthew Wilcox <mawilcox@microsoft.com>
Reviewed-by: David Howells <dhowells@redhat.com>
Tested-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Cc: Konstantin Khlebnikov <koct9i@gmail.com>
Cc: Ross Zwisler <ross.zwisler@linux.intel.com>
Cc: Matthew Wilcox <mawilcox@microsoft.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2016-12-14 16:04:10 -08:00
Linus Torvalds dcdaa2f948 Merge branch 'stable-4.10' of git://git.infradead.org/users/pcmoore/audit
Pull audit updates from Paul Moore:
 "After the small number of patches for v4.9, we've got a much bigger
  pile for v4.10.

  The bulk of these patches involve a rework of the audit backlog queue
  to enable us to move the netlink multicasting out of the task/thread
  that generates the audit record and into the kernel thread that emits
  the record (just like we do for the audit unicast to auditd).

  While we were playing with the backlog queue(s) we fixed a number of
  other little problems with the code, and from all the testing so far
  things look to be in much better shape now. Doing this also allowed us
  to re-enable disabling IRQs for some netns operations ("netns: avoid
  disabling irq for netns id").

  The remaining patches fix some small problems that are well documented
  in the commit descriptions, as well as adding session ID filtering
  support"

* 'stable-4.10' of git://git.infradead.org/users/pcmoore/audit:
  audit: use proper refcount locking on audit_sock
  netns: avoid disabling irq for netns id
  audit: don't ever sleep on a command record/message
  audit: handle a clean auditd shutdown with grace
  audit: wake up kauditd_thread after auditd registers
  audit: rework audit_log_start()
  audit: rework the audit queue handling
  audit: rename the queues and kauditd related functions
  audit: queue netlink multicast sends just like we do for unicast sends
  audit: fixup audit_init()
  audit: move kaudit thread start from auditd registration to kaudit init (#2)
  audit: add support for session ID user filter
  audit: fix formatting of AUDIT_CONFIG_CHANGE events
  audit: skip sessionid sentinel value when auto-incrementing
  audit: tame initialization warning len_abuf in audit_log_execve_info
  audit: less stack usage for /proc/*/loginuid
2016-12-14 14:06:40 -08:00
Ilya Dryomov 45ee2c1d66 libceph: remove now unused finish_request() wrapper
Kill the wrapper and rename __finish_request() to finish_request().

Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
2016-12-14 22:39:08 +01:00
Ilya Dryomov c297eb4269 libceph: always signal completion when done
r_safe_completion is currently, and has always been, signaled only if
on-disk ack was requested.  It's there for fsync and syncfs, which wait
for in-flight writes to flush - all data write requests set ONDISK.

However, the pool perm check code introduced in 4.2 sends a write
request with only ACK set.  An unfortunately timed syncfs can then hang
forever: r_safe_completion won't be signaled because only an unsafe
reply was requested.

We could patch ceph_osdc_sync() to skip !ONDISK write requests, but
that is somewhat incomplete and yet another special case.  Instead,
rename this completion to r_done_completion and always signal it when
the OSD client is done with the request, whether unsafe, safe, or
error.  This is a bit cleaner and helps with the cancellation code.

Reported-by: Yan, Zheng <zyan@redhat.com>
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
2016-12-14 22:39:08 +01:00
Paul Moore fba143c66a netns: avoid disabling irq for netns id
Bring back commit bc51dddf98 ("netns: avoid disabling irq for netns
id") now that we've fixed some audit multicast issues that caused
problems with original attempt.  Additional information, and history,
can be found in the links below:

 * https://github.com/linux-audit/audit-kernel/issues/22
 * https://github.com/linux-audit/audit-kernel/issues/23

Signed-off-by: Cong Wang <xiyou.wangcong@gmail.com>
Signed-off-by: Paul Moore <paul@paul-moore.com>
2016-12-14 13:06:04 -05:00
Steve Wise 39384f04d0 rds_rdma: log the connection reject message
Acked-by: Santosh Shilimkar <santosh.shilimkar@oracle.com>
Signed-off-by: Steve Wise <swise@opengridcomputing.com>
Signed-off-by: Doug Ledford <dledford@redhat.com>
2016-12-14 11:38:28 -05:00
Cedric Izoard d8da0b5d64 mac80211: Ensure enough headroom when forwarding mesh pkt
When a buffer is duplicated during MESH packet forwarding,
this patch ensures that the new buffer has enough headroom.

Signed-off-by: Cedric Izoard <cedric.izoard@ceva-dsp.com>
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
2016-12-13 16:08:37 +01:00
Johannes Berg ec4efc4a10 mac80211: don't call drv_set_default_unicast_key() for VLANs
Since drivers know nothing about AP_VLAN interfaces, trying to
call drv_set_default_unicast_key() just results in a warning
and no call to the driver. Avoid the warning by not calling the
driver for this on AP_VLAN interfaces.

This means that drivers that somehow need this call for AP mode
will fail to work properly in the presence of VLAN interfaces,
but the current drivers don't seem to use it, and mac80211 will
select and indicate the key - so drivers should be OK now.

Reported-by: Jouni Malinen <j@w1.fi>
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
2016-12-13 15:57:59 +01:00
Linus Torvalds e71c3978d6 Merge branch 'smp-hotplug-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull smp hotplug updates from Thomas Gleixner:
 "This is the final round of converting the notifier mess to the state
  machine. The removal of the notifiers and the related infrastructure
  will happen around rc1, as there are conversions outstanding in other
  trees.

  The whole exercise removed about 2000 lines of code in total and in
  course of the conversion several dozen bugs got fixed. The new
  mechanism allows to test almost every hotplug step standalone, so
  usage sites can exercise all transitions extensively.

  There is more room for improvement, like integrating all the
  pointlessly different architecture mechanisms of synchronizing,
  setting cpus online etc into the core code"

* 'smp-hotplug-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (60 commits)
  tracing/rb: Init the CPU mask on allocation
  soc/fsl/qbman: Convert to hotplug state machine
  soc/fsl/qbman: Convert to hotplug state machine
  zram: Convert to hotplug state machine
  KVM/PPC/Book3S HV: Convert to hotplug state machine
  arm64/cpuinfo: Convert to hotplug state machine
  arm64/cpuinfo: Make hotplug notifier symmetric
  mm/compaction: Convert to hotplug state machine
  iommu/vt-d: Convert to hotplug state machine
  mm/zswap: Convert pool to hotplug state machine
  mm/zswap: Convert dst-mem to hotplug state machine
  mm/zsmalloc: Convert to hotplug state machine
  mm/vmstat: Convert to hotplug state machine
  mm/vmstat: Avoid on each online CPU loops
  mm/vmstat: Drop get_online_cpus() from init_cpu_node_state/vmstat_cpu_dead()
  tracing/rb: Convert to hotplug state machine
  oprofile/nmi timer: Convert to hotplug state machine
  net/iucv: Use explicit clean up labels in iucv_init()
  x86/pci/amd-bus: Convert to hotplug state machine
  x86/oprofile/nmi: Convert to hotplug state machine
  ...
2016-12-12 19:25:04 -08:00
Tobias Klauser f6c0d1a3ed crush: include mapper.h in mapper.c
Include linux/crush/mapper.h in crush/mapper.c to get the prototypes of
crush_find_rule and crush_do_rule which are defined there. This fixes
the following GCC warnings when building with 'W=1':

  net/ceph/crush/mapper.c:40:5: warning: no previous prototype for ‘crush_find_rule’ [-Wmissing-prototypes]
  net/ceph/crush/mapper.c:793:5: warning: no previous prototype for ‘crush_do_rule’ [-Wmissing-prototypes]

Signed-off-by: Tobias Klauser <tklauser@distanz.ch>
[idryomov@gmail.com: corresponding !__KERNEL__ include]
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
2016-12-12 23:54:26 +01:00
Ilya Dryomov b3bbd3f2ab libceph: no need to drop con->mutex for ->get_authorizer()
->get_authorizer(), ->verify_authorizer_reply(), ->sign_message() and
->check_message_signature() shouldn't be doing anything with or on the
connection (like closing it or sending messages).

Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
Reviewed-by: Sage Weil <sage@redhat.com>
2016-12-12 23:09:21 +01:00
Ilya Dryomov 0dde584882 libceph: drop len argument of *verify_authorizer_reply()
The length of the reply is protocol-dependent - for cephx it's
ceph_x_authorize_reply.  Nothing sensible can be passed from the
messenger layer anyway.

Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
Reviewed-by: Sage Weil <sage@redhat.com>
2016-12-12 23:09:21 +01:00
Ilya Dryomov 5c056fdc5b libceph: verify authorize reply on connect
After sending an authorizer (ceph_x_authorize_a + ceph_x_authorize_b),
the client gets back a ceph_x_authorize_reply, which it is supposed to
verify to ensure the authenticity and protect against replay attacks.
The code for doing this is there (ceph_x_verify_authorizer_reply(),
ceph_auth_verify_authorizer_reply() + plumbing), but it is never
invoked by the the messenger.

AFAICT this goes back to 2009, when ceph authentication protocols
support was added to the kernel client in 4e7a5dcd1b ("ceph:
negotiate authentication protocol; implement AUTH_NONE protocol").

The second param of ceph_connection_operations::verify_authorizer_reply
is unused all the way down.  Pass 0 to facilitate backporting, and kill
it in the next commit.

Cc: stable@vger.kernel.org
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
Reviewed-by: Sage Weil <sage@redhat.com>
2016-12-12 23:09:21 +01:00
Ilya Dryomov 5418d0a2c8 libceph: no need for GFP_NOFS in ceph_monc_init()
It's called during inital setup, when everything should be allocated
with GFP_KERNEL.

Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
Reviewed-by: Sage Weil <sage@redhat.com>
2016-12-12 23:09:21 +01:00
Ilya Dryomov 7af3ea189a libceph: stop allocating a new cipher on every crypto request
This is useless and more importantly not allowed on the writeback path,
because crypto_alloc_skcipher() allocates memory with GFP_KERNEL, which
can recurse back into the filesystem:

    kworker/9:3     D ffff92303f318180     0 20732      2 0x00000080
    Workqueue: ceph-msgr ceph_con_workfn [libceph]
     ffff923035dd4480 ffff923038f8a0c0 0000000000000001 000000009eb27318
     ffff92269eb28000 ffff92269eb27338 ffff923036b145ac ffff923035dd4480
     00000000ffffffff ffff923036b145b0 ffffffff951eb4e1 ffff923036b145a8
    Call Trace:
     [<ffffffff951eb4e1>] ? schedule+0x31/0x80
     [<ffffffff951eb77a>] ? schedule_preempt_disabled+0xa/0x10
     [<ffffffff951ed1f4>] ? __mutex_lock_slowpath+0xb4/0x130
     [<ffffffff951ed28b>] ? mutex_lock+0x1b/0x30
     [<ffffffffc0a974b3>] ? xfs_reclaim_inodes_ag+0x233/0x2d0 [xfs]
     [<ffffffff94d92ba5>] ? move_active_pages_to_lru+0x125/0x270
     [<ffffffff94f2b985>] ? radix_tree_gang_lookup_tag+0xc5/0x1c0
     [<ffffffff94dad0f3>] ? __list_lru_walk_one.isra.3+0x33/0x120
     [<ffffffffc0a98331>] ? xfs_reclaim_inodes_nr+0x31/0x40 [xfs]
     [<ffffffff94e05bfe>] ? super_cache_scan+0x17e/0x190
     [<ffffffff94d919f3>] ? shrink_slab.part.38+0x1e3/0x3d0
     [<ffffffff94d9616a>] ? shrink_node+0x10a/0x320
     [<ffffffff94d96474>] ? do_try_to_free_pages+0xf4/0x350
     [<ffffffff94d967ba>] ? try_to_free_pages+0xea/0x1b0
     [<ffffffff94d863bd>] ? __alloc_pages_nodemask+0x61d/0xe60
     [<ffffffff94ddf42d>] ? cache_grow_begin+0x9d/0x560
     [<ffffffff94ddfb88>] ? fallback_alloc+0x148/0x1c0
     [<ffffffff94ed84e7>] ? __crypto_alloc_tfm+0x37/0x130
     [<ffffffff94de09db>] ? __kmalloc+0x1eb/0x580
     [<ffffffffc09fe2db>] ? crush_choose_firstn+0x3eb/0x470 [libceph]
     [<ffffffff94ed84e7>] ? __crypto_alloc_tfm+0x37/0x130
     [<ffffffff94ed9c19>] ? crypto_spawn_tfm+0x39/0x60
     [<ffffffffc08b30a3>] ? crypto_cbc_init_tfm+0x23/0x40 [cbc]
     [<ffffffff94ed857c>] ? __crypto_alloc_tfm+0xcc/0x130
     [<ffffffff94edcc23>] ? crypto_skcipher_init_tfm+0x113/0x180
     [<ffffffff94ed7cc3>] ? crypto_create_tfm+0x43/0xb0
     [<ffffffff94ed83b0>] ? crypto_larval_lookup+0x150/0x150
     [<ffffffff94ed7da2>] ? crypto_alloc_tfm+0x72/0x120
     [<ffffffffc0a01dd7>] ? ceph_aes_encrypt2+0x67/0x400 [libceph]
     [<ffffffffc09fd264>] ? ceph_pg_to_up_acting_osds+0x84/0x5b0 [libceph]
     [<ffffffff950d40a0>] ? release_sock+0x40/0x90
     [<ffffffff95139f94>] ? tcp_recvmsg+0x4b4/0xae0
     [<ffffffffc0a02714>] ? ceph_encrypt2+0x54/0xc0 [libceph]
     [<ffffffffc0a02b4d>] ? ceph_x_encrypt+0x5d/0x90 [libceph]
     [<ffffffffc0a02bdf>] ? calcu_signature+0x5f/0x90 [libceph]
     [<ffffffffc0a02ef5>] ? ceph_x_sign_message+0x35/0x50 [libceph]
     [<ffffffffc09e948c>] ? prepare_write_message_footer+0x5c/0xa0 [libceph]
     [<ffffffffc09ecd18>] ? ceph_con_workfn+0x2258/0x2dd0 [libceph]
     [<ffffffffc09e9903>] ? queue_con_delay+0x33/0xd0 [libceph]
     [<ffffffffc09f68ed>] ? __submit_request+0x20d/0x2f0 [libceph]
     [<ffffffffc09f6ef8>] ? ceph_osdc_start_request+0x28/0x30 [libceph]
     [<ffffffffc0b52603>] ? rbd_queue_workfn+0x2f3/0x350 [rbd]
     [<ffffffff94c94ec0>] ? process_one_work+0x160/0x410
     [<ffffffff94c951bd>] ? worker_thread+0x4d/0x480
     [<ffffffff94c95170>] ? process_one_work+0x410/0x410
     [<ffffffff94c9af8d>] ? kthread+0xcd/0xf0
     [<ffffffff951efb2f>] ? ret_from_fork+0x1f/0x40
     [<ffffffff94c9aec0>] ? kthread_create_on_node+0x190/0x190

Allocating the cipher along with the key fixes the issue - as long the
key doesn't change, a single cipher context can be used concurrently in
multiple requests.

We still can't take that GFP_KERNEL allocation though.  Both
ceph_crypto_key_clone() and ceph_crypto_key_decode() are called from
GFP_NOFS context, so resort to memalloc_noio_{save,restore}() here.

Reported-by: Lucas Stach <l.stach@pengutronix.de>
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
Reviewed-by: Sage Weil <sage@redhat.com>
2016-12-12 23:09:20 +01:00
Ilya Dryomov 6db2304aab libceph: uninline ceph_crypto_key_destroy()
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
Reviewed-by: Sage Weil <sage@redhat.com>
2016-12-12 23:09:20 +01:00
Ilya Dryomov 2b1e1a7cd0 libceph: remove now unused ceph_*{en,de}crypt*() functions
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
Reviewed-by: Sage Weil <sage@redhat.com>
2016-12-12 23:09:20 +01:00
Ilya Dryomov e15fd0a11d libceph: switch ceph_x_decrypt() to ceph_crypt()
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
Reviewed-by: Sage Weil <sage@redhat.com>
2016-12-12 23:09:19 +01:00
Ilya Dryomov d03857c63b libceph: switch ceph_x_encrypt() to ceph_crypt()
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
Reviewed-by: Sage Weil <sage@redhat.com>
2016-12-12 23:09:19 +01:00
Ilya Dryomov 4eb4517ce7 libceph: tweak calcu_signature() a little
- replace an ad-hoc array with a struct
- rename to calc_signature() for consistency

Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
Reviewed-by: Sage Weil <sage@redhat.com>
2016-12-12 23:09:19 +01:00
Ilya Dryomov 7882a26d2e libceph: rename and align ceph_x_authorizer::reply_buf
It's going to be used as a temporary buffer for in-place en/decryption
with ceph_crypt() instead of on-stack buffers, so rename to enc_buf.
Ensure alignment to avoid GFP_ATOMIC allocations in the crypto stack.

Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
Reviewed-by: Sage Weil <sage@redhat.com>
2016-12-12 23:09:19 +01:00
Ilya Dryomov a45f795c65 libceph: introduce ceph_crypt() for in-place en/decryption
Starting with 4.9, kernel stacks may be vmalloced and therefore not
guaranteed to be physically contiguous; the new CONFIG_VMAP_STACK
option is enabled by default on x86.  This makes it invalid to use
on-stack buffers with the crypto scatterlist API, as sg_set_buf()
expects a logical address and won't work with vmalloced addresses.

There isn't a different (e.g. kvec-based) crypto API we could switch
net/ceph/crypto.c to and the current scatterlist.h API isn't getting
updated to accommodate this use case.  Allocating a new header and
padding for each operation is a non-starter, so do the en/decryption
in-place on a single pre-assembled (header + data + padding) heap
buffer.  This is explicitly supported by the crypto API:

    "... the caller may provide the same scatter/gather list for the
     plaintext and cipher text. After the completion of the cipher
     operation, the plaintext data is replaced with the ciphertext data
     in case of an encryption and vice versa for a decryption."

Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
Reviewed-by: Sage Weil <sage@redhat.com>
2016-12-12 23:09:19 +01:00
Ilya Dryomov 55d9cc834f libceph: introduce ceph_x_encrypt_offset()
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
Reviewed-by: Sage Weil <sage@redhat.com>
2016-12-12 23:09:19 +01:00
Ilya Dryomov 462e650451 libceph: old_key in process_one_ticket() is redundant
Since commit 0a990e7093 ("ceph: clean up service ticket decoding"),
th->session_key isn't assigned until everything is decoded.

Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
Reviewed-by: Sage Weil <sage@redhat.com>
2016-12-12 23:09:19 +01:00
Ilya Dryomov 36721ece1e libceph: ceph_x_encrypt_buflen() takes in_len
Pass what's going to be encrypted - that's msg_b, not ticket_blob.
ceph_x_encrypt_buflen() returns the upper bound, so this doesn't change
the maxlen calculation, but makes it a bit clearer.

Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
Reviewed-by: Sage Weil <sage@redhat.com>
2016-12-12 23:09:19 +01:00
Linus Torvalds 6cdf89b1ca Merge branch 'locking-core-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull locking updates from Ingo Molnar:
 "The tree got pretty big in this development cycle, but the net effect
  is pretty good:

    115 files changed, 673 insertions(+), 1522 deletions(-)

  The main changes were:

   - Rework and generalize the mutex code to remove per arch mutex
     primitives. (Peter Zijlstra)

   - Add vCPU preemption support: add an interface to query the
     preemption status of vCPUs and use it in locking primitives - this
     optimizes paravirt performance. (Pan Xinhui, Juergen Gross,
     Christian Borntraeger)

   - Introduce cpu_relax_yield() and remov cpu_relax_lowlatency() to
     clean up and improve the s390 lock yielding machinery and its core
     kernel impact. (Christian Borntraeger)

   - Micro-optimize mutexes some more. (Waiman Long)

   - Reluctantly add the to-be-deprecated mutex_trylock_recursive()
     interface on a temporary basis, to give the DRM code more time to
     get rid of its locking hacks. Any other users will be NAK-ed on
     sight. (We turned off the deprecation warning for the time being to
     not pollute the build log.) (Peter Zijlstra)

   - Improve the rtmutex code a bit, in light of recent long lived
     bugs/races. (Thomas Gleixner)

   - Misc fixes, cleanups"

* 'locking-core-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (36 commits)
  x86/paravirt: Fix bool return type for PVOP_CALL()
  x86/paravirt: Fix native_patch()
  locking/ww_mutex: Use relaxed atomics
  locking/rtmutex: Explain locking rules for rt_mutex_proxy_unlock()/init_proxy_locked()
  locking/rtmutex: Get rid of RT_MUTEX_OWNER_MASKALL
  x86/paravirt: Optimize native pv_lock_ops.vcpu_is_preempted()
  locking/mutex: Break out of expensive busy-loop on {mutex,rwsem}_spin_on_owner() when owner vCPU is preempted
  locking/osq: Break out of spin-wait busy waiting loop for a preempted vCPU in osq_lock()
  Documentation/virtual/kvm: Support the vCPU preemption check
  x86/xen: Support the vCPU preemption check
  x86/kvm: Support the vCPU preemption check
  x86/kvm: Support the vCPU preemption check
  kvm: Introduce kvm_write_guest_offset_cached()
  locking/core, x86/paravirt: Implement vcpu_is_preempted(cpu) for KVM and Xen guests
  locking/spinlocks, s390: Implement vcpu_is_preempted(cpu)
  locking/core, powerpc: Implement vcpu_is_preempted(cpu)
  sched/core: Introduce the vcpu_is_preempted(cpu) interface
  sched/wake_q: Rename WAKE_Q to DEFINE_WAKE_Q
  locking/core: Provide common cpu_relax_yield() definition
  locking/mutex: Don't mark mutex_trylock_recursive() as deprecated, temporarily
  ...
2016-12-12 10:48:02 -08:00
Pablo Neira d84701ecbc netfilter: nft_counter: rework atomic dump and reset
Dump and reset doesn't work unless cmpxchg64() is used both from packet
and control plane paths. This approach is going to be slow though.
Instead, use a percpu seqcount to fetch counters consistently, then
subtract bytes and packets in case a reset was requested.

The cpu that running over the reset code is guaranteed to own this stats
exclusively, we have to turn counters into signed 64bit though so stats
update on reset don't get wrong on underflow.

This patch is based on original sketch from Eric Dumazet.

Fixes: 43da04a593 ("netfilter: nf_tables: atomic dump and reset for stateful objects")
Suggested-by: Eric Dumazet <eric.dumazet@gmail.com>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-12-11 10:01:05 -05:00
Asbjørn Sloth Tønnesen fba40c632c net: l2tp: ppp: change PPPOL2TP_MSG_* => L2TP_MSG_*
Signed-off-by: Asbjoern Sloth Toennesen <asbjorn@asbjorn.st>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-12-10 23:29:11 -05:00
Asbjørn Sloth Tønnesen 41c43fbee6 net: l2tp: export debug flags to UAPI
Move the L2TP_MSG_* definitions to UAPI, as it is part of
the netlink API.

Signed-off-by: Asbjoern Sloth Toennesen <asbjorn@asbjorn.st>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-12-10 23:29:11 -05:00
Vivien Didelot 34d8acd8aa net: bridge: shorten ageing time on topology change
802.1D [1] specifies that the bridges must use a short value to age out
dynamic entries in the Filtering Database for a period, once a topology
change has been communicated by the root bridge.

Add a bridge_ageing_time member in the net_bridge structure to store the
bridge ageing time value configured by the user (ioctl/netlink/sysfs).

If we are using in-kernel STP, shorten the ageing time value to twice
the forward delay used by the topology when the topology change flag is
set. When the flag is cleared, restore the configured ageing time.

[1] "8.3.5 Notifying topology changes ",
    http://profesores.elo.utfsm.cl/~agv/elo309/doc/802.1D-1998.pdf

Signed-off-by: Vivien Didelot <vivien.didelot@savoirfairelinux.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-12-10 21:28:28 -05:00
Vivien Didelot 8384b5f5b2 net: bridge: add helper to set topology change
Add a __br_set_topology_change helper to set the topology change value.

This can be later extended to add actions when the topology change flag
is set or cleared.

Signed-off-by: Vivien Didelot <vivien.didelot@savoirfairelinux.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-12-10 21:27:23 -05:00
Vivien Didelot 82dd4332aa net: bridge: add helper to offload ageing time
The SWITCHDEV_ATTR_ID_BRIDGE_AGEING_TIME switchdev attr is actually set
when initializing a bridge port, and when configuring the bridge ageing
time from ioctl/netlink/sysfs.

Add a __set_ageing_time helper to offload the ageing time to physical
switches, and add the SWITCHDEV_F_DEFER flag since it can be called
under bridge lock.

Signed-off-by: Vivien Didelot <vivien.didelot@savoirfairelinux.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-12-10 21:27:23 -05:00
Amit Kushwaha fa1bd57a63 net: socket: removed an unnecessary newline
This patch removes a newline which was added
in socket.c file in net-next

Signed-off-by: Amit Kushwaha <kushwaha.a@samsung.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-12-10 17:27:07 -05:00
WANG Cong efa172f428 netlink: use blocking notifier
netlink_chain is called in ->release(), which is apparently
a process context, so we don't have to use an atomic notifier
here.

Signed-off-by: Cong Wang <xiyou.wangcong@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-12-10 17:25:58 -05:00
David S. Miller 821781a9f4 Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net 2016-12-10 16:21:55 -05:00
Trond Myklebust 2549f307b5 NFS: NFSoRDMA Client Side Changes
New Features:
 - Support for SG_GAP devices
 
 Bugfixes and cleanups:
 - Cap size of callback buffer resources
 - Improve send queue and RPC metric accounting
 - Fix coverity warning
 - Avoid calls to ro_unmap_safe()
 - Refactor FRMR invalidation
 - Error message improvements
 -----BEGIN PGP SIGNATURE-----
 
 iQIzBAABCAAdFiEEnZ5MQTpR7cLU7KEp18tUv7ClQOsFAlhLGYgACgkQ18tUv7Cl
 QOutFQ/+IIFm1QiooE707RkvM2dTCqJT4mON5mmnkEdTdhkxp+lr9cGO7bv4XpkN
 JVd36swOI+kXv9U/RL+zNZpqufwJxEV7Lu0OsI8j7ETtNtcWQCTYvGkQJLtVNo/6
 9ISxydVmEOo187AafNxZRRAPk1Q17yR/xtvKPMRPfRq9Nk+L+caQqJXmtAO/IMwS
 nlRRfuTozkCNNviPyDEPMJywUGmpGYw48+PkaE60lUcEiH3s0kR8WEoh/nVFyL2o
 svuBGYvEGeaHvsLA0a1sBeA1cJaTh5EW2uh6AGAt9H5yyfLdOoQVNJOJvcNDdkIt
 IQZUyUadOY6CHdO4pXF8W32TPO42baLq1xPYTsUdgzjfPOE9WxFhiHYm/NHZlV9D
 2oNGCk8uc+jWlnD3b+Du3asYeg7Va9ad2TdBj6h0FwRfsAlgR8gWi0GA5AoWXStA
 0YGsCO6iFBYP36Bbag0LNhUCzGirDugfim5zq1sVc+Lw8DNIGqLFbpcNUjbBSC89
 CdtRlXWGRHCb5bldim2JVrzj/nDC3dp+KHPC/wUEwS3NWlfTgj/idiyDVjuIKmJg
 mjRI1yVfUL48bq+u5tzhO1rOYSVe1iVsm4q8HVUTW0Q664JvDZ0cy3QapSz4MKmX
 Okk9M64is/iylYHVO9/4iuxOp8mQt95nP4CYCr2k3r3U3DEceVQ=
 =XBzf
 -----END PGP SIGNATURE-----

Merge tag 'nfs-rdma-4.10-1' of git://git.linux-nfs.org/projects/anna/nfs-rdma

NFS: NFSoRDMA Client Side Changes

New Features:
- Support for SG_GAP devices

Bugfixes and cleanups:
- Cap size of callback buffer resources
- Improve send queue and RPC metric accounting
- Fix coverity warning
- Avoid calls to ro_unmap_safe()
- Refactor FRMR invalidation
- Error message improvements
2016-12-10 10:31:44 -05:00
NeilBrown 1cded9d297 SUNRPC: fix refcounting problems with auth_gss messages.
There are two problems with refcounting of auth_gss messages.

First, the reference on the pipe->pipe list (taken by a call
to rpc_queue_upcall()) is not counted.  It seems to be
assumed that a message in pipe->pipe will always also be in
pipe->in_downcall, where it is correctly reference counted.

However there is no guaranty of this.  I have a report of a
NULL dereferences in rpc_pipe_read() which suggests a msg
that has been freed is still on the pipe->pipe list.

One way I imagine this might happen is:
- message is queued for uid=U and auth->service=S1
- rpc.gssd reads this message and starts processing.
  This removes the message from pipe->pipe
- message is queued for uid=U and auth->service=S2
- rpc.gssd replies to the first message. gss_pipe_downcall()
  calls __gss_find_upcall(pipe, U, NULL) and it finds the
  *second* message, as new messages are placed at the head
  of ->in_downcall, and the service type is not checked.
- This second message is removed from ->in_downcall and freed
  by gss_release_msg() (even though it is still on pipe->pipe)
- rpc.gssd tries to read another message, and dereferences a pointer
  to this message that has just been freed.

I fix this by incrementing the reference count before calling
rpc_queue_upcall(), and decrementing it if that fails, or normally in
gss_pipe_destroy_msg().

It seems strange that the reply doesn't target the message more
precisely, but I don't know all the details.  In any case, I think the
reference counting irregularity became a measureable bug when the
extra arg was added to __gss_find_upcall(), hence the Fixes: line
below.

The second problem is that if rpc_queue_upcall() fails, the new
message is not freed. gss_alloc_msg() set the ->count to 1,
gss_add_msg() increments this to 2, gss_unhash_msg() decrements to 1,
then the pointer is discarded so the memory never gets freed.

Fixes: 9130b8dbc6 ("SUNRPC: allow for upcalls for same uid but different gss service")
Cc: stable@vger.kernel.org
Link: https://bugzilla.opensuse.org/show_bug.cgi?id=1011250
Signed-off-by: NeilBrown <neilb@suse.com>
Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
2016-12-10 10:29:29 -05:00
Eric Dumazet 3174fed982 net: skb_condense() can also deal with empty skbs
It seems attackers can also send UDP packets with no payload at all.

skb_condense() can still be a win in this case.

It will be possible to replace the custom code in tcp_add_backlog()
to get full benefit from skb_condense()

Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-12-09 23:06:10 -05:00
David S. Miller 5ac9efbe1c Three fixes:
* fix a logic bug introduced by a previous cleanup
  * fix nl80211 attribute confusing (trying to use
    a single attribute for two purposes)
  * fix a long-standing BSS leak that happens when an
    association attempt is abandoned
 -----BEGIN PGP SIGNATURE-----
 
 iQIcBAABCgAGBQJYSpxnAAoJEGt7eEactAAd3hEP/0RzU5BLTe3FD39i2ESo4fQo
 q2Wnaa+ES1Ul473rCuSmPLGzlSjh0GciltHXRu7UEf5zXAjwuQtilrKsI9DizVR8
 hgTV4Jp0TDLuDudgxEPlpLxcFWALDaK0AlKuL1dY/FSI1BnNnToEeX8Bum6/otqe
 2wLQ11+70HrdNHJjvBEHP/kE/2D55easydmkCS30WYlFrd0BEFtGZ6Leb8deIAzL
 qQpanf26jBYVTm7ls+j0bt4mYbb0RLcsLrOS8EgyIYhCsbJHbaC2OpYGTbGxR6ob
 KKx01PGVnzytaKXCx/m70923V2mwWZWwa7IgDfoj2IzvsTnfmCgekGdSCiY+DJjE
 1jiDYWVK3KgTJQqXRnE1BCbF/FPK6ABKoPgmJBAAiLC48VpmrQwG0OLLQmYVTdp9
 KLrQztvZAVV1adA32fGpJHecDyQMMZ2xp7TZn9YY3qAiP4APU8IUscKuSXALmKN9
 kMBUBhwkk7QuHZXkry0QFBpFXpOgYjX3vt/gBh8EAmGfyRIklTKtGsmftkuQbWR9
 9BN4TbPznEJECqVy/BCL8llHNkfsJgcz3noFOePUjwa4FCAxJst/NFya+IkkqOQ5
 eAOj5cjsDfxsrdJFGxIsxXrtGZI1MjwKZf3w6jmu/VVL6BMryxYwtWnwrwcBsit7
 nXjitThBO0V2l3Iaf09m
 =HvKt
 -----END PGP SIGNATURE-----

Merge tag 'mac80211-next-for-davem-2016-12-09' of git://git.kernel.org/pub/scm/linux/kernel/git/jberg/mac80211-next

Johannes Berg says:

====================
Three fixes:
 * fix a logic bug introduced by a previous cleanup
 * fix nl80211 attribute confusing (trying to use
   a single attribute for two purposes)
 * fix a long-standing BSS leak that happens when an
   association attempt is abandoned
====================

Signed-off-by: David S. Miller <davem@davemloft.net>
2016-12-09 22:59:05 -05:00
Eric Dumazet 02ab0d139c udp: udp_rmem_release() should touch sk_rmem_alloc later
In flood situations, keeping sk_rmem_alloc at a high value
prevents producers from touching the socket.

It makes sense to lower sk_rmem_alloc only at the end
of udp_rmem_release() after the thread draining receive
queue in udp_recvmsg() finished the writes to sk_forward_alloc.

Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-12-09 22:12:21 -05:00
Eric Dumazet 6b229cf77d udp: add batching to udp_rmem_release()
If udp_recvmsg() constantly releases sk_rmem_alloc
for every read packet, it gives opportunity for
producers to immediately grab spinlocks and desperatly
try adding another packet, causing false sharing.

We can add a simple heuristic to give the signal
by batches of ~25 % of the queue capacity.

This patch considerably increases performance under
flood by about 50 %, since the thread draining the queue
is no longer slowed by false sharing.

Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-12-09 22:12:21 -05:00
Eric Dumazet c84d949057 udp: copy skb->truesize in the first cache line
In UDP RX handler, we currently clear skb->dev before skb
is added to receive queue, because device pointer is no longer
available once we exit from RCU section.

Since this first cache line is always hot, lets reuse this space
to store skb->truesize and thus avoid a cache line miss at
udp_recvmsg()/udp_skb_destructor time while receive queue
spinlock is held.

Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-12-09 22:12:21 -05:00
Eric Dumazet 4b272750db udp: add busylocks in RX path
Idea of busylocks is to let producers grab an extra spinlock
to relieve pressure on the receive_queue spinlock shared by consumer.

This behavior is requested only once socket receive queue is above
half occupancy.

Under flood, this means that only one producer can be in line
trying to acquire the receive_queue spinlock.

These busylock can be allocated on a per cpu manner, instead of a
per socket one (that would consume a cache line per socket)

This patch considerably improves UDP behavior under stress,
depending on number of NIC RX queues and/or RPS spread.

Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-12-09 22:12:21 -05:00
Johannes Berg e6f462df9a cfg80211/mac80211: fix BSS leaks when abandoning assoc attempts
When mac80211 abandons an association attempt, it may free
all the data structures, but inform cfg80211 and userspace
about it only by sending the deauth frame it received, in
which case cfg80211 has no link to the BSS struct that was
used and will not cfg80211_unhold_bss() it.

Fix this by providing a way to inform cfg80211 of this with
the BSS entry passed, so that it can clean up properly, and
use this ability in the appropriate places in mac80211.

This isn't ideal: some code is more or less duplicated and
tracing is missing. However, it's a fairly small change and
it's thus easier to backport - cleanups can come later.

Cc: stable@vger.kernel.org
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
2016-12-09 12:57:49 +01:00
Vamsi Krishna 2fa436b3a2 nl80211: Use different attrs for BSSID and random MAC addr in scan req
NL80211_ATTR_MAC was used to set both the specific BSSID to be scanned
and the random MAC address to be used when privacy is enabled. When both
the features are enabled, both the BSSID and the local MAC address were
getting same value causing Probe Request frames to go with unintended
DA. Hence, this has been fixed by using a different NL80211_ATTR_BSSID
attribute to set the specific BSSID (which was the more recent addition
in cfg80211) for a scan.

Backwards compatibility with old userspace software is maintained to
some extent by allowing NL80211_ATTR_MAC to be used to set the specific
BSSID when scanning without enabling random MAC address use.

Scanning with random source MAC address was introduced by commit
ad2b26abc1 ("cfg80211: allow drivers to support random MAC addresses
for scan") and the issue was introduced with the addition of the second
user for the same attribute in commit 818965d391 ("cfg80211: Allow a
scan request for a specific BSSID").

Fixes: 818965d391 ("cfg80211: Allow a scan request for a specific BSSID")
Signed-off-by: Vamsi Krishna <vamsin@qti.qualcomm.com>
Signed-off-by: Jouni Malinen <jouni@qca.qualcomm.com>
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
2016-12-09 12:47:19 +01:00
Johannes Berg eeb04a9688 nl80211: fix logic inversion in start_nan()
Arend inadvertently inverted the logic while converting to
wdev_running(), fix that.

Fixes: 73c7da3dae ("cfg80211: add generic helper to check interface is running")
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
2016-12-09 12:47:18 +01:00
Amit Kushwaha 846cc1231a net: socket: preferred __aligned(size) for control buffer
This patch cleanup checkpatch.pl warning
WARNING: __aligned(size) is preferred over __attribute__((aligned(size)))

Signed-off-by: Amit Kushwaha <kushwaha.a@samsung.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-12-08 18:20:46 -05:00
David S. Miller 107bc0aa95 Merge branch 'for-upstream' of git://git.kernel.org/pub/scm/linux/kernel/git/bluetooth/bluetooth-next
Johan Hedberg says:

====================
pull request: bluetooth-next 2016-12-08

I didn't miss your "net-next is closed" email, but it did come as a bit
of a surprise, and due to time-zone differences I didn't have a chance
to react to it until now. We would have had a couple of patches in
bluetooth-next that we'd still have wanted to get to 4.10.

Out of these the most critical one is the H7/CT2 patch for Bluetooth
Security Manager Protocol, something that couldn't be published before
the Bluetooth 5.0 specification went public (yesterday). If these really
can't go to net-next we'll likely be sending at least this patch through
bluetooth.git to net.git for rc1 inclusion.
====================

Signed-off-by: David S. Miller <davem@davemloft.net>
2016-12-08 14:33:17 -05:00
Martin KaFai Lau 17bedab272 bpf: xdp: Allow head adjustment in XDP prog
This patch allows XDP prog to extend/remove the packet
data at the head (like adding or removing header).  It is
done by adding a new XDP helper bpf_xdp_adjust_head().

It also renames bpf_helper_changes_skb_data() to
bpf_helper_changes_pkt_data() to better reflect
that XDP prog does not work on skb.

This patch adds one "xdp_adjust_head" bit to bpf_prog for the
XDP-capable driver to check if the XDP prog requires
bpf_xdp_adjust_head() support.  The driver can then decide
to error out during XDP_SETUP_PROG.

Signed-off-by: Martin KaFai Lau <kafai@fb.com>
Acked-by: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: Alexei Starovoitov <ast@kernel.org>
Acked-by: John Fastabend <john.r.fastabend@intel.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-12-08 14:25:13 -05:00
Eric Dumazet c8c8b12709 udp: under rx pressure, try to condense skbs
Under UDP flood, many softirq producers try to add packets to
UDP receive queue, and one user thread is burning one cpu trying
to dequeue packets as fast as possible.

Two parts of the per packet cost are :
- copying payload from kernel space to user space,
- freeing memory pieces associated with skb.

If socket is under pressure, softirq handler(s) can try to pull in
skb->head the payload of the packet if it fits.

Meaning the softirq handler(s) can free/reuse the page fragment
immediately, instead of letting udp_recvmsg() do this hundreds of usec
later, possibly from another node.

Additional gains :
- We reduce skb->truesize and thus can store more packets per SO_RCVBUF
- We avoid cache line misses at copyout() time and consume_skb() time,
and avoid one put_page() with potential alien freeing on NUMA hosts.

This comes at the cost of a copy, bounded to available tail room, which
is usually small. (We might have to fix GRO_MAX_HEAD which looks bigger
than necessary)

This patch gave me about 5 % increase in throughput in my tests.

skb_condense() helper could probably used in other contexts.

Signed-off-by: Eric Dumazet <edumazet@google.com>
Cc: Paolo Abeni <pabeni@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-12-08 13:25:07 -05:00
Eric Dumazet 13bfff25c0 net: rfs: add a jump label
RFS is not commonly used, so add a jump label to avoid some conditionals
in fast path.

Signed-off-by: Eric Dumazet <edumazet@google.com>
Cc: Paolo Abeni <pabeni@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-12-08 13:18:35 -05:00
Simon Horman 7b684884fb net/sched: cls_flower: Support matching on ICMP type and code
Support matching on ICMP type and code.

Example usage:

tc qdisc add dev eth0 ingress

tc filter add dev eth0 protocol ip parent ffff: flower \
	indev eth0 ip_proto icmp type 8 code 0 action drop

tc filter add dev eth0 protocol ipv6 parent ffff: flower \
	indev eth0 ip_proto icmpv6 type 128 code 0 action drop

Signed-off-by: Simon Horman <simon.horman@netronome.com>
Acked-by: Jiri Pirko <jiri@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-12-08 11:47:08 -05:00
Simon Horman 972d3876fa flow dissector: ICMP support
Allow dissection of ICMP(V6) type and code. This should only occur
if a packet is ICMP(V6) and the dissector has FLOW_DISSECTOR_KEY_ICMP set.

There are currently no users of FLOW_DISSECTOR_KEY_ICMP.
A follow-up patch will allow FLOW_DISSECTOR_KEY_ICMP to be used by
the flower classifier.

Signed-off-by: Simon Horman <simon.horman@netronome.com>
Acked-by: Jiri Pirko <jiri@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-12-08 11:45:21 -05:00
Or Gerlitz faa3ffce78 net/sched: cls_flower: Add support for matching on flags
Add UAPI to provide set of flags for matching, where the flags
provided from user-space are mapped to flow-dissector flags.

The 1st flag allows to match on whether the packet is an
IP fragment and corresponds to the FLOW_DIS_IS_FRAGMENT flag.

Signed-off-by: Or Gerlitz <ogerlitz@mellanox.com>
Reviewed-by: Paul Blakey <paulb@mellanox.com>
Acked-by: Jiri Pirko <jiri@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-12-08 11:32:50 -05:00
Zhang Shengju f91c58d68b icmp: correct return value of icmp_rcv()
Currently, icmp_rcv() always return zero on a packet delivery upcall.

To make its behavior more compliant with the way this API should be
used, this patch changes this to let it return NET_RX_SUCCESS when the
packet is proper handled, and NET_RX_DROP otherwise.

Signed-off-by: Zhang Shengju <zhangshengju@cmss.chinamobile.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-12-08 11:24:23 -05:00
Johan Hedberg a62da6f14d Bluetooth: SMP: Add support for H7 crypto function and CT2 auth flag
Bluetooth 5.0 introduces a new H7 key generation function that's used
when both sides of the pairing set the CT2 authentication flag to 1.

Signed-off-by: Johan Hedberg <johan.hedberg@intel.com>
Signed-off-by: Marcel Holtmann <marcel@holtmann.org>
2016-12-08 07:50:24 +01:00
David S. Miller 5fccd64aa4 Merge git://git.kernel.org/pub/scm/linux/kernel/git/pablo/nf-next
Pablo Neira Ayuso says:

====================
Netfilter/IPVS updates for net-next

The following patchset contains a large Netfilter update for net-next,
to summarise:

1) Add support for stateful objects. This series provides a nf_tables
   native alternative to the extended accounting infrastructure for
   nf_tables. Two initial stateful objects are supported: counters and
   quotas. Objects are identified by a user-defined name, you can fetch
   and reset them anytime. You can also use a maps to allow fast lookups
   using any arbitrary key combination. More info at:

   http://marc.info/?l=netfilter-devel&m=148029128323837&w=2

2) On-demand registration of nf_conntrack and defrag hooks per netns.
   Register nf_conntrack hooks if we have a stateful ruleset, ie.
   state-based filtering or NAT. The new nf_conntrack_default_on sysctl
   enables this from newly created netnamespaces. Default behaviour is not
   modified. Patches from Florian Westphal.

3) Allocate 4k chunks and then use these for x_tables counter allocation
   requests, this improves ruleset load time and also datapath ruleset
   evaluation, patches from Florian Westphal.

4) Add support for ebpf to the existing x_tables bpf extension.
   From Willem de Bruijn.

5) Update layer 4 checksum if any of the pseudoheader fields is updated.
   This provides a limited form of 1:1 stateless NAT that make sense in
   specific scenario, eg. load balancing.

6) Add support to flush sets in nf_tables. This series comes with a new
   set->ops->deactivate_one() indirection given that we have to walk
   over the list of set elements, then deactivate them one by one.
   The existing set->ops->deactivate() performs an element lookup that
   we don't need.

7) Two patches to avoid cloning packets, thus speed up packet forwarding
   via nft_fwd from ingress. From Florian Westphal.

8) Two IPVS patches via Simon Horman: Decrement ttl in all modes to
   prevent infinite loops, patch from Dwip Banerjee. And one minor
   refactoring from Gao feng.

9) Revisit recent log support for nf_tables netdev families: One patch
   to ensure that we correctly handle non-ethernet packets. Another
   patch to add missing logger definition for netdev. Patches from
   Liping Zhang.

10) Three patches for nft_fib, one to address insufficient register
    initialization and another to solve incorrect (although harmless)
    byteswap operation. Moreover update xt_rpfilter and nft_fib to match
    lbcast packets with zeronet as source, eg. DHCP Discover packets
    (0.0.0.0 -> 255.255.255.255). Also from Liping Zhang.

11) Built-in DCCP, SCTP and UDPlite conntrack and NAT support, from
    Davide Caratti. While DCCP is rather hopeless lately, and UDPlite has
    been broken in many-cast mode for some little time, let's give them a
    chance by placing them at the same level as other existing protocols.
    Thus, users don't explicitly have to modprobe support for this and
    NAT rules work for them. Some people point to the lack of support in
    SOHO Linux-based routers that make deployment of new protocols harder.
    I guess other middleboxes outthere on the Internet are also to blame.
    Anyway, let's see if this has any impact in the midrun.

12) Skip software SCTP software checksum calculation if the NIC comes
    with SCTP checksum offload support. From Davide Caratti.

13) Initial core factoring to prepare conversion to hook array. Three
    patches from Aaron Conole.

14) Gao Feng made a wrong conversion to switch in the xt_multiport
    extension in a patch coming in the previous batch. Fix it in this
    batch.

15) Get vmalloc call in sync with kmalloc flags to avoid a warning
    and likely OOM killer intervention from x_tables. From Marcelo
    Ricardo Leitner.

16) Update Arturo Borrero's email address in all source code headers.
====================

Signed-off-by: David S. Miller <davem@davemloft.net>
2016-12-07 19:16:46 -05:00
Pablo Neira Ayuso 73c25fb139 netfilter: nft_quota: allow to restore consumed quota
Allow to restore consumed quota, this is useful to restore the quota
state across reboots.

Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2016-12-07 14:40:53 +01:00
Willem de Bruijn 2c16d60332 netfilter: xt_bpf: support ebpf
Add support for attaching an eBPF object by file descriptor.

The iptables binary can be called with a path to an elf object or a
pinned bpf object. Also pass the mode and path to the kernel to be
able to return it later for iptables dump and save.

Signed-off-by: Willem de Bruijn <willemb@google.com>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2016-12-07 13:32:35 +01:00
Marcelo Ricardo Leitner 5bad87348c netfilter: x_tables: avoid warn and OOM killer on vmalloc call
Andrey Konovalov reported that this vmalloc call is based on an
userspace request and that it's spewing traces, which may flood the logs
and cause DoS if abused.

Florian Westphal also mentioned that this call should not trigger OOM
killer.

This patch brings the vmalloc call in sync to kmalloc and disables the
warn trace on allocation failure and also disable OOM killer invocation.

Note, however, that under such stress situation, other places may
trigger OOM killer invocation.

Reported-by: Andrey Konovalov <andreyknvl@google.com>
Cc: Florian Westphal <fw@strlen.de>
Signed-off-by: Marcelo Ricardo Leitner <marcelo.leitner@gmail.com>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2016-12-07 13:31:41 +01:00
Pablo Neira Ayuso 8411b6442e netfilter: nf_tables: support for set flushing
This patch adds support for set flushing, that consists of walking over
the set elements if the NFTA_SET_ELEM_LIST_ELEMENTS attribute is set.
This patch requires the following changes:

1) Add set->ops->deactivate_one() operation: This allows us to
   deactivate an element from the set element walk path, given we can
   skip the lookup that happens in ->deactivate().

2) Add a new nft_trans_alloc_gfp() function since we need to allocate
   transactions using GFP_ATOMIC given the set walk path happens with
   held rcu_read_lock.

Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2016-12-07 13:31:40 +01:00
Pablo Neira Ayuso 37df5301a3 netfilter: nft_set: introduce nft_{hash, rbtree}_deactivate_one()
This new function allows us to deactivate one single element, this is
required by the set flush command that comes in a follow up patch.

Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2016-12-07 13:31:02 +01:00
Pablo Neira Ayuso 1a37ef769d netfilter: nf_tables: constify struct nft_ctx * parameter in nft_trans_alloc()
Context is not modified by nft_trans_alloc(), so constify it.

Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2016-12-07 13:22:51 +01:00
Davide Caratti 3189a290f9 netfilter: nat: skip checksum on offload SCTP packets
SCTP GSO and hardware can do CRC32c computation after netfilter processing,
so we can avoid calling sctp_compute_checksum() on skb if skb->ip_summed
is equal to CHECKSUM_PARTIAL. Moreover, set skb->ip_summed to CHECKSUM_NONE
when the NAT code computes the CRC, to prevent offloaders from computing
it again (on ixgbe this resulted in a transmission with wrong L4 checksum).

Signed-off-by: Davide Caratti <dcaratti@redhat.com>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2016-12-07 13:22:50 +01:00
Liping Zhang 3b760dcb0f netfilter: rpfilter: bypass ipv4 lbcast packets with zeronet source
Otherwise, DHCP Discover packets(0.0.0.0->255.255.255.255) may be
dropped incorrectly.

Signed-off-by: Liping Zhang <zlpnobody@gmail.com>
Acked-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2016-12-07 13:22:50 +01:00
Pablo Neira Ayuso a9fea2a3c3 netfilter: nf_tables: allow to filter stateful object dumps by type
This patch adds the netlink code to filter out dump of stateful objects,
through the NFTA_OBJ_TYPE netlink attribute.

Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2016-12-07 13:22:49 +01:00
Pablo Neira Ayuso 63aea29060 netfilter: nft_objref: support for stateful object maps
This patch allows us to refer to stateful object dictionaries, the
source register indicates the key data to be used to look up for the
corresponding state object. We can refer to these maps through names or,
alternatively, the map transaction id. This allows us to refer to both
anonymous and named maps.

Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2016-12-07 13:22:48 +01:00
Pablo Neira Ayuso 8aeff920dc netfilter: nf_tables: add stateful object reference to set elements
This patch allows you to refer to stateful objects from set elements.
This provides the infrastructure to create maps where the right hand
side of the mapping is a stateful object.

This allows us to build dictionaries of stateful objects, that you can
use to perform fast lookups using any arbitrary key combination.

Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2016-12-07 13:22:47 +01:00
Pablo Neira Ayuso 1896531710 netfilter: nft_quota: add depleted flag for objects
Notify on depleted quota objects. The NFT_QUOTA_F_DEPLETED flag
indicates we have reached overquota.

Add pointer to table from nft_object, so we can use it when sending the
depletion notification to userspace.

Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2016-12-07 13:22:12 +01:00
Pablo Neira Ayuso 2599e98934 netfilter: nf_tables: notify internal updates of stateful objects
Introduce nf_tables_obj_notify() to notify internal state changes in
stateful objects. This is used by the quota object to report depletion
in a follow up patch.

Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2016-12-07 12:57:20 +01:00
Pablo Neira Ayuso 43da04a593 netfilter: nf_tables: atomic dump and reset for stateful objects
This patch adds a new NFT_MSG_GETOBJ_RESET command perform an atomic
dump-and-reset of the stateful object. This also comes with add support
for atomic dump and reset for counter and quota objects.

Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2016-12-07 12:56:57 +01:00
Pablo Neira Ayuso 795595f68d netfilter: nft_quota: dump consumed quota
Add a new attribute NFTA_QUOTA_CONSUMED that displays the amount of
quota that has been already consumed. This allows us to restore the
internal state of the quota object between reboots as well as to monitor
how wasted it is.

This patch changes the logic to account for the consumed bytes, instead
of the bytes that remain to be consumed.

Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2016-12-07 12:54:22 +01:00
Marc Kleine-Budde 332b05ca7a can: raw: raw_setsockopt: limit number of can_filter that can be set
This patch adds a check to limit the number of can_filters that can be
set via setsockopt on CAN_RAW sockets. Otherwise allocations > MAX_ORDER
are not prevented resulting in a warning.

Reference: https://lkml.org/lkml/2016/12/2/230

Reported-by: Andrey Konovalov <andreyknvl@google.com>
Tested-by: Andrey Konovalov <andreyknvl@google.com>
Cc: linux-stable <stable@vger.kernel.org>
Signed-off-by: Marc Kleine-Budde <mkl@pengutronix.de>
2016-12-07 10:45:57 +01:00
David S. Miller c63d352f05 Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net 2016-12-06 21:33:19 -05:00
Pablo Neira Ayuso c97d22e68b netfilter: nf_tables: add stateful object reference expression
This new expression allows us to refer to existing stateful objects from
rules.

Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2016-12-06 21:48:25 +01:00
Pablo Neira Ayuso 173705d9a2 netfilter: nft_quota: add stateful object type
Register a new quota stateful object type into the new stateful object
infrastructure.

Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2016-12-06 21:48:24 +01:00
Pablo Neira Ayuso b1ce0ced10 netfilter: nft_counter: add stateful object type
Register a new percpu counter stateful object type into the stateful
object infrastructure.

Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2016-12-06 21:48:23 +01:00
Pablo Neira Ayuso e50092404c netfilter: nf_tables: add stateful objects
This patch augments nf_tables to support stateful objects. This new
infrastructure allows you to create, dump and delete stateful objects,
that are identified by a user-defined name.

This patch adds the generic infrastructure, follow up patches add
support for two stateful objects: counters and quotas.

This patch provides a native infrastructure for nf_tables to replace
nfacct, the extended accounting infrastructure for iptables.

Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2016-12-06 21:48:22 +01:00
Florian Westphal 3bf3276119 netfilter: add and use nf_fwd_netdev_egress
... so we can use current skb instead of working with a clone.

Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2016-12-06 21:48:22 +01:00
Gao Feng 1ed9887ee3 netfilter: xt_multiport: Fix wrong unmatch result with multiple ports
I lost one test case in the last commit for xt_multiport.
For example, the rule is "-m multiport --dports 22,80,443".
When first port is unmatched and the second is matched, the curent codes
could not return the right result.
It would return false directly when the first port is unmatched.

Fixes: dd2602d00f ("netfilter: xt_multiport: Use switch case instead
of multiple condition checks")
Signed-off-by: Gao Feng <fgao@ikuai8.com>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2016-12-06 21:48:20 +01:00
Pablo Neira Ayuso 1814096980 netfilter: nft_payload: layer 4 checksum adjustment for pseudoheader fields
This patch adds a new flag that signals the kernel to update layer 4
checksum if the packet field belongs to the layer 4 pseudoheader. This
implicitly provides stateless NAT 1:1 that is useful under very specific
usecases.

Since rules mangling layer 3 fields that are part of the pseudoheader
may potentially convey any layer 4 packet, we have to deal with the
layer 4 checksum adjustment using protocol specific code.

This patch adds support for TCP, UDP and ICMPv6, since they include the
pseudoheader in the layer 4 checksum calculation. ICMP doesn't, so we
can skip it.

Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2016-12-06 21:47:54 +01:00
Liping Zhang e0ffdbc78d netfilter: nft_fib_ipv4: initialize *dest to zero
Otherwise, if fib lookup fail, *dest will be filled with garbage value,
so reverse path filtering will not work properly:
 # nft add rule x prerouting fib saddr oif eq 0 drop

Fixes: f6d0cbcf09 ("netfilter: nf_tables: add fib expression")
Signed-off-by: Liping Zhang <zlpnobody@gmail.com>
Acked-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2016-12-06 21:42:21 +01:00
Liping Zhang 11583438b7 netfilter: nft_fib: convert htonl to ntohl properly
Acctually ntohl and htonl are identical, so this doesn't affect
anything, but it is conceptually wrong.

Signed-off-by: Liping Zhang <zlpnobody@gmail.com>
Acked-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2016-12-06 21:42:20 +01:00
Florian Westphal ae0ac0ed6f netfilter: x_tables: pack percpu counter allocations
instead of allocating each xt_counter individually, allocate 4k chunks
and then use these for counter allocation requests.

This should speed up rule evaluation by increasing data locality,
also speeds up ruleset loading because we reduce calls to the percpu
allocator.

As Eric points out we can't use PAGE_SIZE, page_allocator would fail on
arches with 64k page size.

Suggested-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: Florian Westphal <fw@strlen.de>
Acked-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2016-12-06 21:42:19 +01:00
Florian Westphal f28e15bace netfilter: x_tables: pass xt_counters struct to counter allocator
Keeps some noise away from a followup patch.

Signed-off-by: Florian Westphal <fw@strlen.de>
Acked-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2016-12-06 21:42:18 +01:00
Florian Westphal 4d31eef517 netfilter: x_tables: pass xt_counters struct instead of packet counter
On SMP we overload the packet counter (unsigned long) to contain
percpu offset.  Hide this from callers and pass xt_counters address
instead.

Preparation patch to allocate the percpu counters in page-sized batch
chunks.

Signed-off-by: Florian Westphal <fw@strlen.de>
Acked-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2016-12-06 21:42:17 +01:00
Aaron Conole 679972f3be netfilter: convert while loops to for loops
This is to facilitate converting from a singly-linked list to an array
of elements.

Signed-off-by: Aaron Conole <aconole@bytheb.org>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2016-12-06 21:42:16 +01:00
Aaron Conole 0aa8c57a04 netfilter: introduce accessor functions for hook entries
This allows easier future refactoring.

Signed-off-by: Aaron Conole <aconole@bytheb.org>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2016-12-06 21:42:15 +01:00
Florian Westphal 834184b1f3 netfilter: defrag: only register defrag functionality if needed
nf_defrag modules for ipv4 and ipv6 export an empty stub function.
Any module that needs the defragmentation hooks registered simply 'calls'
this empty function to create a phony module dependency -- modprobe will
then load the defrag module too.

This extends netfilter ipv4/ipv6 defragmentation modules to delay the hook
registration until the functionality is requested within a network namespace
instead of module load time for all namespaces.

Hooks are only un-registered on module unload or when a namespace that used
such defrag functionality exits.

We have to use struct net for this as the register hooks can be called
before netns initialization here from the ipv4/ipv6 conntrack module
init path.

There is no unregister functionality support, defrag will always be
active once it was requested inside a net namespace.

The reason is that defrag has impact on nft and iptables rulesets
(without defrag we might see framents).

Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2016-12-06 21:42:00 +01:00
Fabian Frederick 3eb15f2828 sunrpc: use DEFINE_SPINLOCK()
Signed-off-by: Fabian Frederick <fabf@skynet.be>
Signed-off-by: J. Bruce Fields <bfields@redhat.com>
2016-12-06 14:18:30 -05:00
Florian Westphal 343dfaa198 Revert "dctcp: update cwnd on congestion event"
Neal Cardwell says:
 If I am reading the code correctly, then I would have two concerns:
 1) Has that been tested? That seems like an extremely dramatic
    decrease in cwnd. For example, if the cwnd is 80, and there are 40
    ACKs, and half the ACKs are ECE marked, then my back-of-the-envelope
    calculations seem to suggest that after just 11 ACKs the cwnd would be
    down to a minimal value of 2 [..]
 2) That seems to contradict another passage in the draft [..] where it
    sazs:
       Just as specified in [RFC3168], DCTCP does not react to congestion
       indications more than once for every window of data.

Neal is right.  Fortunately we don't have to complicate this by testing
vs. current rtt estimate, we can just revert the patch.

Normal stack already handles this for us: receiving ACKs with ECE
set causes a call to tcp_enter_cwr(), from there on the ssthresh gets
adjusted and prr will take care of cwnd adjustment.

Fixes: 4780566784 ("dctcp: update cwnd on congestion event")
Cc: Neal Cardwell <ncardwell@google.com>
Signed-off-by: Florian Westphal <fw@strlen.de>
Acked-by: Neal Cardwell <ncardwell@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-12-06 11:34:24 -05:00
Marcelo Ricardo Leitner dcb17d22e1 tcp: warn on bogus MSS and try to amend it
There have been some reports lately about TCP connection stalls caused
by NIC drivers that aren't setting gso_size on aggregated packets on rx
path. This causes TCP to assume that the MSS is actually the size of the
aggregated packet, which is invalid.

Although the proper fix is to be done at each driver, it's often hard
and cumbersome for one to debug, come to such root cause and report/fix
it.

This patch amends this situation in two ways. First, it adds a warning
on when this situation occurs, so it gives a hint to those trying to
debug this. It also limit the maximum probed MSS to the adverised MSS,
as it should never be any higher than that.

The result is that the connection may not have the best performance ever
but it shouldn't stall, and the admin will have a hint on what to look
for.

Tested with virtio by forcing gso_size to 0.

v2: updated msg per David's suggestion
v3: use skb_iif to find the interface and also log its name, per Eric
    Dumazet's suggestion. As the skb may be backlogged and the interface
    gone by then, we need to check if the number still has a meaning.
v4: use helper tcp_gro_dev_warn() and avoid pr_warn_once inside __once, per
    David's suggestion

Cc: Jonathan Maxwell <jmaxwell37@gmail.com>
Signed-off-by: Marcelo Ricardo Leitner <marcelo.leitner@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-12-06 11:01:19 -05:00
Eric Dumazet a297569fe0 net/udp: do not touch skb->peeked unless really needed
In UDP recvmsg() path we currently access 3 cache lines from an skb
while holding receive queue lock, plus another one if packet is
dequeued, since we need to change skb->next->prev

1st cache line (contains ->next/prev pointers, offsets 0x00 and 0x08)
2nd cache line (skb->len & skb->peeked, offsets 0x80 and 0x8e)
3rd cache line (skb->truesize/users, offsets 0xe0 and 0xe4)

skb->peeked is only needed to make sure 0-length packets are properly
handled while MSG_PEEK is operated.

I had first the intent to remove skb->peeked but the "MSG_PEEK at
non-zero offset" support added by Sam Kumar makes this not possible.

This patch avoids one cache line miss during the locked section, when
skb->len and skb->peeked do not have to be read.

It also avoids the skb_set_peeked() cost for non empty UDP datagrams.

Signed-off-by: Eric Dumazet <edumazet@google.com>
Acked-by: Paolo Abeni <pabeni@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-12-06 10:41:49 -05:00
Herbert Xu ed5d7788a9 netlink: Do not schedule work from sk_destruct
It is wrong to schedule a work from sk_destruct using the socket
as the memory reserve because the socket will be freed immediately
after the return from sk_destruct.

Instead we should do the deferral prior to sk_free.

This patch does just that.

Fixes: 707693c8a4 ("netlink: Call cb->done from a worker thread")
Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
Tested-by: Andrey Konovalov <andreyknvl@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-12-05 19:43:42 -05:00
Daniel Borkmann 7bd509e311 bpf: add prog_digest and expose it via fdinfo/netlink
When loading a BPF program via bpf(2), calculate the digest over
the program's instruction stream and store it in struct bpf_prog's
digest member. This is done at a point in time before any instructions
are rewritten by the verifier. Any unstable map file descriptor
number part of the imm field will be zeroed for the hash.

fdinfo example output for progs:

  # cat /proc/1590/fdinfo/5
  pos:          0
  flags:        02000002
  mnt_id:       11
  prog_type:    1
  prog_jited:   1
  prog_digest:  b27e8b06da22707513aa97363dfb11c7c3675d28
  memlock:      4096

When programs are pinned and retrieved by an ELF loader, the loader
can check the program's digest through fdinfo and compare it against
one that was generated over the ELF file's program section to see
if the program needs to be reloaded. Furthermore, this can also be
exposed through other means such as netlink in case of a tc cls/act
dump (or xdp in future), but also through tracepoints or other
facilities to identify the program. Other than that, the digest can
also serve as a base name for the work in progress kallsyms support
of programs. The digest doesn't depend/select the crypto layer, since
we need to keep dependencies to a minimum. iproute2 will get support
for this facility.

Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: Alexei Starovoitov <ast@kernel.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-12-05 15:33:11 -05:00
Daniel Borkmann 8d829bdb97 bpf, cls: consolidate prog deletion path
Commit 18cdb37ebf ("net: sched: do not use tcf_proto 'tp' argument from
call_rcu") removed the last usage of tp from cls_bpf_delete_prog(), so also
remove it from the function as argument to not give a wrong impression. tp
is illegal to access from this callback, since it could already have been
freed.

Refactor the deletion code a bit, so that cls_bpf_destroy() can call into
the same code for prog deletion as cls_bpf_delete() op, instead of having
it unnecessarily duplicated.

Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: Alexei Starovoitov <ast@kernel.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-12-05 15:33:10 -05:00