Move initialization of statistics buffers from ibmvnic_init function
into ibmvnic_probe. In the current state, ibmvnic_init will be called
again during a device reset, resulting in the allocation of new
buffers without freeing the old ones.
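The leak can be pictured with a small userspace model. This is only a sketch: the class and method names (Adapter, probe, init, reset) are hypothetical stand-ins, not the driver's code, and a counter stands in for buffers that were allocated and never freed.

```python
class Adapter:
    """Toy model of a device whose init path runs again on every reset."""

    def __init__(self, alloc_in_init):
        self.alloc_in_init = alloc_in_init  # True models the old behaviour
        self.stats_buffer = None
        self.live_buffers = 0               # allocated and never freed

    def _alloc_stats_buffer(self):
        # The previous buffer reference is overwritten, not freed.
        self.live_buffers += 1
        self.stats_buffer = bytearray(64)

    def init(self):
        # Runs at probe time and again on every reset.
        if self.alloc_in_init:
            self._alloc_stats_buffer()

    def probe(self):
        # Runs once per device.
        if not self.alloc_in_init:
            self._alloc_stats_buffer()
        self.init()

    def reset(self):
        self.init()


def live_buffers_after(resets, alloc_in_init):
    adapter = Adapter(alloc_in_init)
    adapter.probe()
    for _ in range(resets):
        adapter.reset()
    return adapter.live_buffers


leaky = live_buffers_after(resets=3, alloc_in_init=True)
fixed = live_buffers_after(resets=3, alloc_in_init=False)
print(leaky, fixed)
```

In the model, allocating in the init path leaves one extra buffer behind per reset, while allocating in the probe path keeps the count at one.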
Signed-off-by: Thomas Falcon <tlfalcon@linux.vnet.ibm.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
It is not necessary to disable interrupt lines here during a reset
to handle a non-fatal firmware error. Move that call within the code
block that handles the other cases that do require interrupts to be
disabled and re-enabled.
Signed-off-by: Thomas Falcon <tlfalcon@linux.vnet.ibm.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
If the firmware map fails for whatever reason, remember to free
the memory afterwards.
Signed-off-by: Thomas Falcon <tlfalcon@linux.vnet.ibm.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Updating the FIB tracepoint for the recent change to allow rules using
the protocol and ports exposed a few places where the entries in the flow
struct are not initialized.
For __fib_validate_source, add the call to fib4_rules_early_flow_dissect
since it is invoked for the input path. For netfilter, add the memset on
the flow struct to avoid future problems like this. In ip_route_input_slow,
the fields need to be set if the skb dissection does not happen.
Fixes: bfff486265 ("net: fib_rules: support for match on ip_proto, sport and dport")
Signed-off-by: David Ahern <dsahern@gmail.com>
Acked-by: Roopa Prabhu <roopa@cumulusnetworks.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
scatterlist code expects virt_to_page() to work, which fails with
CONFIG_VMAP_STACK=y.
Fixes: c46234ebb4 ("tls: RX path for ktls")
Signed-off-by: Matt Mullins <mmullins@fb.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
* x86 fixes: PCID, UMIP, locking
* Improved support for recent Windows versions that have a 2048 Hz
APIC timer.
* Rename KVM_HINTS_DEDICATED CPUID bit to KVM_HINTS_REALTIME
* Better behaved selftests.
Merge tag 'for-linus' of git://git.kernel.org/pub/scm/virt/kvm/kvm
Pull kvm fixes from Paolo Bonzini:
- ARM/ARM64 locking fixes
- x86 fixes: PCID, UMIP, locking
- improved support for recent Windows versions that have a 2048 Hz APIC
timer
- rename KVM_HINTS_DEDICATED CPUID bit to KVM_HINTS_REALTIME
- better behaved selftests
* tag 'for-linus' of git://git.kernel.org/pub/scm/virt/kvm/kvm:
kvm: rename KVM_HINTS_DEDICATED to KVM_HINTS_REALTIME
KVM: arm/arm64: VGIC/ITS save/restore: protect kvm_read_guest() calls
KVM: arm/arm64: VGIC/ITS: protect kvm_read_guest() calls with SRCU lock
KVM: arm/arm64: VGIC/ITS: Promote irq_lock() in update_affinity
KVM: arm/arm64: Properly protect VGIC locks from IRQs
KVM: X86: Lower the default timer frequency limit to 200us
KVM: vmx: update sec exec controls for UMIP iff emulating UMIP
kvm: x86: Suppress CR3_PCID_INVD bit only when PCIDs are enabled
KVM: selftests: exit with 0 status code when tests cannot be run
KVM: hyperv: idr_find needs RCU protection
x86: Delay skip of emulated hypercall instruction
KVM: Extend MAX_IRQ_ROUTES to 4096 for all archs
Merge tag 'sound-4.17-rc6' of git://git.kernel.org/pub/scm/linux/kernel/git/tiwai/sound
Pull sound fixes from Takashi Iwai:
"We have a core fix in the compat code for covering a potential race
(double references), but it's a very minor change.
The rest are all small device-specific quirks, as well as a correction
of the new UAC3 support code"
* tag 'sound-4.17-rc6' of git://git.kernel.org/pub/scm/linux/kernel/git/tiwai/sound:
ALSA: usb-audio: Use Class Specific EP for UAC3 devices.
ALSA: hda/realtek - Clevo P950ER ALC1220 Fixup
ALSA: usb: mixer: volume quirk for CM102-A+/102S+
ALSA: hda: Add Lenovo C50 All in one to the power_save blacklist
ALSA: control: fix a redundant-copy issue
KVM_HINTS_DEDICATED seems to be somewhat confusing:
Guest doesn't really care whether it's the only task running on a host
CPU as long as it's not preempted.
And there are more reasons for Guest to be preempted than host CPU
sharing, for example, with memory overcommit it can get preempted on a
memory access, post copy migration can cause preemption, etc.
Let's call it KVM_HINTS_REALTIME which seems to better
match what guests expect.
Also, the flag must be set on all vCPUs - current guests assume this.
Note this in the documentation.
Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Pull s390 fixes from Martin Schwidefsky:
- a fix for the vfio ccw translation code
- update an incorrect email address in the MAINTAINERS file
- fix a division by zero oops in the cpum_sf code found by trinity
- two fixes for the error handling of the qdio code
- several spectre related patches to convert all left-over indirect
branches in the kernel to expoline branches
- update defconfigs to avoid warnings due to the netfilter Kconfig
changes
- avoid several compiler warnings in the kexec_file code for s390
* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/s390/linux:
s390/qdio: don't release memory in qdio_setup_irq()
s390/qdio: fix access to uninitialized qdio_q fields
s390/cpum_sf: ensure sample frequency of perf event attributes is non-zero
s390: use expoline thunks in the BPF JIT
s390: extend expoline to BC instructions
s390: remove indirect branch from do_softirq_own_stack
s390: move spectre sysfs attribute code
s390/kernel: use expoline for indirect branches
s390/ftrace: use expoline for indirect branches
s390/lib: use expoline for indirect branches
s390/crc32-vx: use expoline for indirect branches
s390: move expoline assembler macros to a header
vfio: ccw: fix cleanup if cp_prefetch fails
s390/kexec_file: add declaration of purgatory related globals
s390: update defconfigs
MAINTAINERS: update s390 zcrypt maintainers email address
Merge tag 'selinux-pr-20180516' of git://git.kernel.org/pub/scm/linux/kernel/git/pcmoore/selinux
Pull SELinux fixes from Paul Moore:
"A small pull request to fix a few regressions in the SELinux/SCTP code
with applications that call bind() with AF_UNSPEC/INADDR_ANY.
The individual commit descriptions have more information, but the
commits themselves should be self explanatory"
* tag 'selinux-pr-20180516' of git://git.kernel.org/pub/scm/linux/kernel/git/pcmoore/selinux:
selinux: correctly handle sa_family cases in selinux_sctp_bind_connect()
selinux: fix address family in bind() and connect() to match address/port
selinux: add AF_UNSPEC and INADDR_ANY checks to selinux_socket_bind()
Paolo Abeni says:
====================
sched: refactor NOLOCK qdiscs
With the introduction of NOLOCK qdiscs, pfifo_fast performance in the
uncontended scenario degraded measurably, especially after commit
eb82a99447 ("net: sched, fix OOO packets with pfifo_fast").
This series restores pfifo_fast performance in that scenario to the
previous level, mainly by reducing the number of atomic operations required
to perform the qdisc_run() call. Performance in the contended scenario
also increases measurably.
Note: This series is on top of:
sched: manipulate __QDISC_STATE_RUNNING in qdisc_run_* helpers
====================
Signed-off-by: David S. Miller <davem@davemloft.net>
After the previous patch, for NOLOCK qdiscs, q->seqlock is
always held when dequeue() is invoked, so we can drop
any additional locking to protect that operation.
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
So that we can use lockdep on it.
The newly introduced sequence lock has the same scope as busylock,
so it shares the same lockdep annotation, but it's only used for
NOLOCK qdiscs.
With this changeset we acquire this lock in the control path around
the flushing operation (qdisc reset), to allow more NOLOCK qdisc perf
improvement in the next patch.
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
proc_pid_cmdline_read() and environ_read() directly access the target
process' VM to retrieve the command line and environment. If this
process remaps these areas onto a file via mmap(), the requesting
process may experience various issues such as extra delays if the
underlying device is slow to respond.
Let's simply refuse to access file-backed areas in these functions.
For this we add a new FOLL_ANON gup flag that is passed to all calls
to access_remote_vm(). The code already takes care of such failures
(including unmapped areas). Accesses via /proc/pid/mem were not
changed though.
This was assigned CVE-2018-1120.
Note for stable backports: the patch may apply to kernels prior to 4.11
but silently miss one location; it must be checked that no call to
access_remote_vm() keeps zero as the last argument.
Reported-by: Qualys Security Advisory <qsa@qualys.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Andy Lutomirski <luto@amacapital.net>
Cc: Oleg Nesterov <oleg@redhat.com>
Cc: stable@vger.kernel.org
Signed-off-by: Willy Tarreau <w@1wt.eu>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
This patch updates the NVM read/erase/update AQ commands to align with
the latest specification.
Signed-off-by: Anirudh Venkataramanan <anirudh.venkataramanan@intel.com>
Tested-by: Tony Brelinski <tonyx.brelinski@intel.com>
Signed-off-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com>
Set hw->mac.perm_addr in ixgbevf_set_mac() in order to avoid losing the
custom MAC on reset. This can happen in the following case:
>ip link set $vf address $mac
>ethtool -r $vf
Signed-off-by: Emil Tantilov <emil.s.tantilov@intel.com>
Tested-by: Andrew Bowers <andrewx.bowers@intel.com>
Signed-off-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com>
Do not validate the MAC address during a reset, unless the MAC was set on
the host. This way the VF will get a new MAC address every time it reloads.
Remove the "no MAC address assigned" message since it will get spammed on
reset and it doesn't help much as the MAC on the VF is randomly generated.
Signed-off-by: Emil Tantilov <emil.s.tantilov@intel.com>
Tested-by: Andrew Bowers <andrewx.bowers@intel.com>
Signed-off-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com>
Currently, during device_shutdown() ixgbe holds rtnl_lock for the duration
of the lengthy ixgbe_close_suspend(). On machines with multiple ixgbe cards
this lock prevents scaling if the device_shutdown() function is multi-threaded.
It is not necessary to hold this lock during ixgbe_close_suspend(),
as it is not held when ixgbe_close() is called, which also happens during
shutdown but in the kexec case.
Signed-off-by: Pavel Tatashin <pasha.tatashin@oracle.com>
Tested-by: Andrew Bowers <andrewx.bowers@intel.com>
Signed-off-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com>
Since commit f7f37e7ff2 ("ixgbe: handle close/suspend race with
netif_device_detach/present") ixgbe_close_suspend is called, from
ixgbe_close, only if the device is present, i.e. if it isn't detached.
That exposed a situation where IRQs weren't freed if the PCI error
recovery system opts to remove the device. In that case the pci channel
state is set to pci_channel_io_perm_failure and ixgbe_io_error_detected
returned PCI_ERS_RESULT_DISCONNECT before calling ixgbe_close_suspend,
so the IRQs were not freed. The kernel then crashed when the remove
handler called pci_disable_device, hitting a BUG_ON at free_msi_irqs,
which asserts that there is no non-free IRQ associated with the device
to be removed:
BUG_ON(irq_has_action(entry->irq + i));
The issue is fixed by calling ixgbe_close_suspend before evaluating
the pci channel state.
Reported-by: Naresh Bannoth <nbannoth@in.ibm.com>
Reported-by: Abdul Haleem <abdhalee@in.ibm.com>
Signed-off-by: Mauro S M Rodrigues <maurosr@linux.vnet.ibm.com>
Reviewed-by: Alexander Duyck <alexander.h.duyck@intel.com>
Tested-by: Andrew Bowers <andrewx.bowers@intel.com>
Signed-off-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com>
Commit 539d39eb27 ("bcache: fix wrong return value in bch_debug_init()")
returns the return value of debugfs_create_dir() to bcache_init(). When
CONFIG_DEBUG_FS=n, bch_debug_init() always returns 1 and makes
bcache_init() fail.
This patch makes bch_debug_init() always return 0 if CONFIG_DEBUG_FS=n,
so bcache can continue to work on kernels which don't have debugfs
enabled.
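The return-value problem can be modelled in a few lines. This is a sketch under stated assumptions, not the bcache source: the Python functions below are stand-ins named after the functions in the message, and the way the old code arrives at 1 (an error-or-NULL check on the created directory) is an assumption made for illustration.

```python
def is_err_or_null(ptr):
    """Model of an error-or-NULL check: None and exceptions count as errors."""
    return ptr is None or isinstance(ptr, Exception)


def debugfs_create_dir(name, debug_fs_enabled):
    # Assumption: with debugfs compiled out, directory creation reports an error.
    if not debug_fs_enabled:
        return OSError("debugfs not available")
    return {"name": name}


def bch_debug_init_old(debug_fs_enabled):
    debug = debugfs_create_dir("bcache", debug_fs_enabled)
    return int(is_err_or_null(debug))  # 1 whenever debugfs is disabled


def bch_debug_init_new(debug_fs_enabled):
    if not debug_fs_enabled:
        return 0  # nothing to set up, and not an error
    debug = debugfs_create_dir("bcache", debug_fs_enabled)
    return int(is_err_or_null(debug))


def bcache_init(debug_init, debug_fs_enabled):
    # Module init treats any non-zero return as a failure.
    return "ok" if debug_init(debug_fs_enabled) == 0 else "failed"


print(bcache_init(bch_debug_init_old, debug_fs_enabled=False))
print(bcache_init(bch_debug_init_new, debug_fs_enabled=False))
```

With debugfs enabled both variants behave the same; they differ only when it is compiled out.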
Changelog:
v4: Add Acked-by from Kent Overstreet.
v3: Use IS_ENABLED(CONFIG_DEBUG_FS) to replace #ifdef DEBUG_FS.
v2: Remove a warning information
v1: Initial version.
Fixes: Commit 539d39eb27 ("bcache: fix wrong return value in bch_debug_init()")
Cc: stable@vger.kernel.org
Signed-off-by: Coly Li <colyli@suse.de>
Reported-by: Massimo B. <massimo.b@gmx.net>
Reported-by: Kai Krakow <kai@kaishome.de>
Tested-by: Kai Krakow <kai@kaishome.de>
Acked-by: Kent Overstreet <kent.overstreet@gmail.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Sparse complains about valid conversions between restricted types; the
force attribute is used to avoid those warnings.
Signed-off-by: Cathy Zhou <cathy.zhou@oracle.com>
Reviewed-by: Shannon Nelson <shannon.nelson@oracle.com>
Tested-by: Andrew Bowers <andrewx.bowers@intel.com>
Signed-off-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com>
Similarly to opal_event_shutdown, opal_nvram_write can be called in
the crash path with irqs disabled. Special case the delay to avoid
sleeping in invalid context.
Fixes: 3b8070335f ("powerpc/powernv: Fix OPAL NVRAM driver OPAL_BUSY loops")
Cc: stable@vger.kernel.org # v3.2
Signed-off-by: Nicholas Piggin <npiggin@gmail.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Add I2C/SMBUS Driver entry for STM32 family from ST Microelectronics.
Signed-off-by: Pierre-Yves MORDRET <pierre-yves.mordret@st.com>
Signed-off-by: Wolfram Sang <wsa@the-dreams.de>
We set the BTRFS_BALANCE_RESUME flag only in btrfs_recover_balance(),
which isn't called during the remount. So when resuming from
the paused balance we hit the bug:
kernel: kernel BUG at fs/btrfs/volumes.c:3890!
::
kernel: balance_kthread+0x51/0x60 [btrfs]
kernel: kthread+0x111/0x130
::
kernel: RIP: btrfs_balance+0x12e1/0x1570 [btrfs] RSP: ffffba7d0090bde8
Reproducer:
On a mounted filesystem:
btrfs balance start --full-balance /btrfs
btrfs balance pause /btrfs
mount -o remount,ro /dev/sdb /btrfs
mount -o remount,rw /dev/sdb /btrfs
To fix this set the BTRFS_BALANCE_RESUME flag in
btrfs_resume_balance_async().
CC: stable@vger.kernel.org # 4.4+
Signed-off-by: Anand Jain <anand.jain@oracle.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
When a transaction is aborted, btrfs_cleanup_transaction is called to
clean up all the various in-flight bits and pieces which might be
active. One of those is delalloc inodes - inodes which have dirty
pages which haven't been persisted yet. Currently the process of
freeing such delalloc inodes in exceptional circumstances such as
transaction abort boils down to calling btrfs_invalidate_inodes, whose
sole job is to invalidate the dentries for all inodes related to a
root. This is in fact wrong and insufficient since such delalloc inodes
will likely have pending pages or ordered-extents and will be linked to
the sb->s_inode_list. This means that unmounting a btrfs instance with
an aborted transaction could potentially leave inodes and their pages
visible to the system long after their superblock has been freed. This
in turn leads to a "use-after-free" situation once page shrink is
triggered. This situation can be simulated by running generic/019,
which causes such inodes to be left hanging, followed by
generic/176, which causes memory pressure and page eviction, which leads
to touching the freed super block instance. This situation is
additionally detected by the unmount code of VFS with the following
message:
"VFS: Busy inodes after unmount of Self-destruct in 5 seconds. Have a nice day..."
Additionally btrfs hits WARN_ON(!RB_EMPTY_ROOT(&root->inode_tree));
in free_fs_root for the same reason.
This patch aims to rectify the situation by doing the following:
1. Change btrfs_destroy_delalloc_inodes so that it calls
invalidate_inode_pages2 for every inode on the delalloc list. This
ensures that all the pages of the inode are released. This function
boils down to calling btrfs_releasepage. During testing I observed cases
where inodes on the delalloc list had an i_count of 0, so this
necessitates using igrab to be sure we are working on a non-freed inode.
2. Since calling btrfs_releasepage might queue delayed iputs, move the
call out to btrfs_cleanup_transaction in btrfs_error_commit_super before
calling run_delayed_iputs for the last time. This is necessary to ensure
that delayed iputs are run.
Note: this patch is tagged for 4.14 stable; the fix applies to older
versions too, but needs to be backported manually due to conflicts.
CC: stable@vger.kernel.org # 4.14.x: 2b8773313494: btrfs: Split btrfs_del_delalloc_inode into 2 functions
CC: stable@vger.kernel.org # 4.14.x
Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
[ add comment to igrab ]
Signed-off-by: David Sterba <dsterba@suse.com>
This is in preparation of fixing delalloc inodes leakage on transaction
abort. Also export the new function.
Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Reviewed-by: Anand Jain <anand.jain@oracle.com>
Signed-off-by: David Sterba <dsterba@suse.com>
If a btree block, aka. extent buffer, is not available in the extent
buffer cache, it'll be read out from the disk instead, i.e.
btrfs_search_slot()
  read_block_for_search()  # hold parent and its lock, go to read child
    btrfs_release_path()
    read_tree_block()  # read child
Unfortunately, the parent lock got released before reading child, so
commit 5bdd3536cb ("Btrfs: Fix block generation verification race") had
used 0 as parent transid to read the child block. It forces
read_tree_block() not to check if parent transid is different from the
generation id of the child that it reads out from disk.
A simple PoC is included in btrfs/124,
0. A two-disk raid1 btrfs,
1. Right after mkfs.btrfs, block A is allocated to be device tree's root.
2. Mount this filesystem and put it in use, after a while, device tree's
root got COW but block A hasn't been allocated/overwritten yet.
3. Umount it and reload the btrfs module to remove both disks from the
global @fs_devices list.
4. mount -odegraded dev1 and write some data, so now block A is allocated
to be a leaf in checksum tree. Note that only dev1 has the latest
metadata of this filesystem.
5. Umount it and mount it again normally (with both disks), since raid1
can pick up one disk by the writer task's pid, if btrfs_search_slot()
needs to read block A, dev2 which does NOT have the latest metadata
might be read for block A, then we got a stale block A.
6. As parent transid is not checked, block A is marked as uptodate and
put into the extent buffer cache, so the future search won't bother
to read disk again, which means it'll make changes on this stale
one and make it dirty and flush it onto disk.
To avoid the problem, parent transid needs to be passed to
read_tree_block().
In order to get a valid parent transid, we need to hold the parent's
lock until finishing reading child.
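The effect of passing a real parent transid can be sketched with a toy model of the raid1 case from the PoC. Everything here is hypothetical: the block address, the generation numbers and the Python read_tree_block() are illustrative stand-ins that only mirror the check described above, where 0 means "do not verify".

```python
def read_tree_block(mirrors, bytenr, parent_transid):
    """Return the first copy of the block that passes the generation check.

    parent_transid == 0 models the old behaviour: the check is skipped.
    """
    for mirror in mirrors:
        block = mirror.get(bytenr)
        if block is None:
            continue
        if parent_transid and block["generation"] != parent_transid:
            continue  # stale copy, try the next mirror
        return block
    raise OSError("no valid copy of block %d" % bytenr)


BLOCK_A = 4096  # hypothetical block address

# dev1 has the latest metadata, dev2 still holds the old contents of block A.
dev1 = {BLOCK_A: {"generation": 9, "owner": "checksum tree"}}
dev2 = {BLOCK_A: {"generation": 5, "owner": "device tree"}}

# The mirror order models raid1 happening to pick dev2 first.
stale = read_tree_block([dev2, dev1], BLOCK_A, parent_transid=0)
good = read_tree_block([dev2, dev1], BLOCK_A, parent_transid=9)
print(stale["owner"], "/", good["owner"])
```

With the check skipped, the stale copy from dev2 is accepted; with the parent's expected generation passed in, it is rejected and the copy from dev1 is used.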
This patch needs to be slightly adapted for stable kernels: the
&first_key parameter added to read_tree_block() is from 4.16+
(581c176041). The fix is to replace 0 by 'gen'.
Fixes: 5bdd3536cb ("Btrfs: Fix block generation verification race")
CC: stable@vger.kernel.org # 4.4+
Signed-off-by: Liu Bo <bo.liu@linux.alibaba.com>
Reviewed-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: Qu Wenruo <wqu@suse.com>
[ update changelog ]
Signed-off-by: David Sterba <dsterba@suse.com>
Incompat flag of LZO/ZSTD compression should be set at:
1. mount time (-o compress/compress-force)
2. when defrag is done
3. when property is set
Currently 3. is missing and this commit adds this.
This could lead to a filesystem that uses ZSTD but is not marked as
such. If a kernel without ZSTD support encounters a ZSTD compressed
extent, it will handle that but this could be confusing to the user.
Typically the filesystem is mounted with the ZSTD option, but the
discrepancy can arise when a filesystem is never mounted with ZSTD and
then the property on some file is set (and some new extents are
written). A simple mount with -o compress=zstd will fix that up on an
unpatched kernel.
Same goes for LZO, but this has been around for a very long time
(2.6.37) so it's unlikely that a pre-LZO kernel would be used.
Fixes: 5c1aab1dd5 ("btrfs: Add zstd support")
CC: stable@vger.kernel.org # 4.14+
Signed-off-by: Tomohiro Misono <misono.tomohiro@jp.fujitsu.com>
Reviewed-by: Anand Jain <anand.jain@oracle.com>
Reviewed-by: David Sterba <dsterba@suse.com>
[ add user visible impact ]
Signed-off-by: David Sterba <dsterba@suse.com>
In commit 471d557afe ("Btrfs: fix loss of prealloc extents past i_size
after fsync log replay"), on fsync, we started to always log all prealloc
extents beyond an inode's i_size in order to avoid losing them after a
power failure. However, in some cases this can lead the log replay
code to create duplicate extent items, with different lengths, in the
extent tree. That happens because, as of that commit, we can now log
extent items based on extent maps that are not on the "modified" list
of extent maps of the inode's extent map tree. Logging extent items based
on extent maps is used during the fast fsync path to save time and for
this to work reliably it requires that the extent maps are not merged
with other adjacent extent maps - having the extent maps in the list
of modified extents gives such guarantee.
Consider the following example, captured during a long run of fsstress,
which illustrates this problem.
We have inode 271, in the filesystem tree (root 5), to which all of the
following operations and discussion apply.
A buffered write starts at offset 312391 with a length of 933471 bytes
(end offset at 1245862). At this point we have, for this inode, the
following extent maps with their field values:
em A, start 0, orig_start 0, len 40960, block_start 18446744073709551613,
block_len 0, orig_block_len 0
em B, start 40960, orig_start 40960, len 376832, block_start 1106399232,
block_len 376832, orig_block_len 376832
em C, start 417792, orig_start 417792, len 782336, block_start
18446744073709551613, block_len 0, orig_block_len 0
em D, start 1200128, orig_start 1200128, len 835584, block_start
1106776064, block_len 835584, orig_block_len 835584
em E, start 2035712, orig_start 2035712, len 245760, block_start
1107611648, block_len 245760, orig_block_len 245760
Extent map A corresponds to a hole and extent maps D and E correspond to
preallocated extents.
Extent map D ends where extent map E begins (1106776064 + 835584 =
1107611648), but these extent maps were not merged because they are in
the inode's list of modified extent maps.
An fsync against this inode is made, which triggers the fast path
(BTRFS_INODE_NEEDS_FULL_SYNC is not set). This fsync triggers writeback
of the data previously written using buffered IO, and when the respective
ordered extent finishes, btrfs_drop_extents() is called against the
(aligned) range 311296..1249279. This causes a split of extent map D at
btrfs_drop_extent_cache(), replacing extent map D with a new extent map
D', also added to the list of modified extents, with the following
values:
em D', start 1249280, orig_start of 1200128,
block_start 1106825216 (= 1106776064 + 1249280 - 1200128),
orig_block_len 835584,
block_len 786432 (835584 - (1249280 - 1200128))
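The values of D' can be rechecked with a short calculation. This is a sketch, not btrfs code: split_tail() is a hypothetical helper, and it assumes an uncompressed extent where len and block_len are equal, as they are for extent map D.

```python
def split_tail(start, orig_start, length, block_start, new_start):
    """Fields of the part of an extent map that remains after dropping
    everything before new_start (field names follow the listing above)."""
    shift = new_start - start
    return {
        "start": new_start,
        "orig_start": orig_start,
        "block_start": block_start + shift,
        "block_len": length - shift,
    }


# Extent map D before the split, values taken from the listing above.
d_prime = split_tail(start=1200128, orig_start=1200128, length=835584,
                     block_start=1106776064, new_start=1249280)

# D ends on disk exactly where E begins.
d_block_end = 1106776064 + 835584

print(d_prime)
print(d_block_end)
```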
Then, during the fast fsync, btrfs_log_changed_extents() is called and
extent maps D' and E are removed from the list of modified extents. The
flag EXTENT_FLAG_LOGGING is also set on them. After the extents are logged
clear_em_logging() is called on each of them, and that causes extent map E
to be merged with extent map D' (try_merge_map()), resulting in D' being
deleted and E adjusted to:
em E, start 1249280, orig_start 1200128, len 1032192,
block_start 1106825216, block_len 1032192,
orig_block_len 245760
A direct IO write at offset 1847296 and length of 360448 bytes (end offset
at 2207744) starts, and at that moment the following extent maps exist for
our inode:
em A, start 0, orig_start 0, len 40960, block_start 18446744073709551613,
block_len 0, orig_block_len 0
em B, start 40960, orig_start 40960, len 270336, block_start 1106399232,
block_len 270336, orig_block_len 376832
em C, start 311296, orig_start 311296, len 937984, block_start 1112842240,
block_len 937984, orig_block_len 937984
em E (prealloc), start 1249280, orig_start 1200128, len 1032192,
block_start 1106825216, block_len 1032192, orig_block_len 245760
The dio write results in drop_extent_cache() being called twice. The first
time for a range that starts at offset 1847296 and ends at offset 2035711
(length of 188416), which results in a double split of extent map E,
replacing it with two new extent maps:
em F, start 1249280, orig_start 1200128, block_start 1106825216,
block_len 598016, orig_block_len 598016
em G, start 2035712, orig_start 1200128, block_start 1107611648,
block_len 245760, orig_block_len 1032192
It also creates a new extent map that represents a part of the requested
IO (through create_io_em()):
em H, start 1847296, len 188416, block_start 1107423232, block_len 188416
The second call to drop_extent_cache() has a range with a start offset of
2035712 and end offset of 2207743 (length of 172032). This leads to
replacing extent map G with a new extent map I with the following values:
em I, start 2207744, orig_start 1200128, block_start 1107783680,
block_len 73728, orig_block_len 1032192
It also creates a new extent map that represents the second part of the
requested IO (through create_io_em()):
em J, start 2035712, len 172032, block_start 1107611648, block_len 172032
The dio write set the inode's i_size to 2207744 bytes.
After the dio write the inode has the following extent maps:
em A, start 0, orig_start 0, len 40960, block_start 18446744073709551613,
block_len 0, orig_block_len 0
em B, start 40960, orig_start 40960, len 270336, block_start 1106399232,
block_len 270336, orig_block_len 376832
em C, start 311296, orig_start 311296, len 937984, block_start 1112842240,
block_len 937984, orig_block_len 937984
em F, start 1249280, orig_start 1200128, len 598016,
block_start 1106825216, block_len 598016, orig_block_len 598016
em H, start 1847296, orig_start 1200128, len 188416,
block_start 1107423232, block_len 188416, orig_block_len 835584
em J, start 2035712, orig_start 2035712, len 172032,
block_start 1107611648, block_len 172032, orig_block_len 245760
em I, start 2207744, orig_start 1200128, len 73728,
block_start 1107783680, block_len 73728, orig_block_len 1032192
Now make some change to the file, such as adding an xattr, and then
fsync it again. This triggers a fast fsync path, and as of commit
471d557afe ("Btrfs: fix loss of prealloc extents past i_size after fsync
log replay"), we use the extent map I to log a file extent item because
it's a prealloc extent and it starts at an offset matching the inode's
i_size. However when we log it, we create a file extent item with a value
for the disk byte location that is wrong, as can be seen from the
following output of "btrfs inspect-internal dump-tree":
item 1 key (271 EXTENT_DATA 2207744) itemoff 3782 itemsize 53
generation 22 type 2 (prealloc)
prealloc data disk byte 1106776064 nr 1032192
prealloc data offset 1007616 nr 73728
Here the disk byte value corresponds to a calculation based on some fields
from the extent map I:
1106776064 = block_start (1107783680) - 1007616 (extent_offset)
extent_offset = 2207744 (start) - 1200128 (orig_start) = 1007616
The disk byte value of 1106776064 clashes with disk byte values of the
file extent items at offsets 1249280 and 1847296 in the fs tree:
item 6 key (271 EXTENT_DATA 1249280) itemoff 3568 itemsize 53
generation 20 type 2 (prealloc)
prealloc data disk byte 1106776064 nr 835584
prealloc data offset 49152 nr 598016
item 7 key (271 EXTENT_DATA 1847296) itemoff 3515 itemsize 53
generation 20 type 1 (regular)
extent data disk byte 1106776064 nr 835584
extent data offset 647168 nr 188416 ram 835584
extent compression 0 (none)
item 8 key (271 EXTENT_DATA 2035712) itemoff 3462 itemsize 53
generation 20 type 1 (regular)
extent data disk byte 1107611648 nr 245760
extent data offset 0 nr 172032 ram 245760
extent compression 0 (none)
item 9 key (271 EXTENT_DATA 2207744) itemoff 3409 itemsize 53
generation 20 type 2 (prealloc)
prealloc data disk byte 1107611648 nr 245760
prealloc data offset 172032 nr 73728
Instead of the disk byte value of 1106776064, the value of 1107611648
should have been logged. Also the data offset value should have been
172032 and not 1007616.
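The wrong and the expected values can be compared directly. The sketch below only replays the arithmetic quoted in this message; logged_disk_bytenr() is a hypothetical helper, not a btrfs function.

```python
def logged_disk_bytenr(em):
    """Disk byte and data offset derived from an extent map, following the
    calculation quoted above (block_start minus extent_offset)."""
    extent_offset = em["start"] - em["orig_start"]
    return em["block_start"] - extent_offset, extent_offset


# Extent map I as listed after the dio write.
em_i = {"start": 2207744, "orig_start": 1200128, "block_start": 1107783680}
wrong_bytenr, wrong_offset = logged_disk_bytenr(em_i)

# Values of the file extent item already in the fs tree (item 9).
right_bytenr, right_offset = 1107611648, 172032

print(wrong_bytenr, wrong_offset)
print(wrong_bytenr + wrong_offset == right_bytenr + right_offset)
```

Both pairs resolve to the same physical byte (1107783680), but the logged item attributes it to the extent starting at 1106776064, whose real length is 835584 and not 1032192.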
After a log replay we end up getting two extent items in the extent tree
with different lengths, one of 835584, which is correct and existed
before the log replay, and another one of 1032192 which is wrong and is
based on the logged file extent item:
item 12 key (1106776064 EXTENT_ITEM 835584) itemoff 3406 itemsize 53
refs 2 gen 15 flags DATA
extent data backref root 5 objectid 271 offset 1200128 count 2
item 13 key (1106776064 EXTENT_ITEM 1032192) itemoff 3353 itemsize 53
refs 1 gen 22 flags DATA
extent data backref root 5 objectid 271 offset 1200128 count 1
Obviously this leads to many problems and a filesystem check reports many
errors:
(...)
checking extents
Extent back ref already exists for 1106776064 parent 0 root 5 owner 271 offset 1200128 num_refs 1
extent item 1106776064 has multiple extent items
ref mismatch on [1106776064 835584] extent item 2, found 3
Incorrect local backref count on 1106776064 root 5 owner 271 offset 1200128 found 2 wanted 1 back 0x55b1d0ad7680
Backref 1106776064 root 5 owner 271 offset 1200128 num_refs 0 not found in extent tree
Incorrect local backref count on 1106776064 root 5 owner 271 offset 1200128 found 1 wanted 0 back 0x55b1d0ad4e70
Backref bytes do not match extent backref, bytenr=1106776064, ref bytes=835584, backref bytes=1032192
backpointer mismatch on [1106776064 835584]
checking free space cache
block group 1103101952 has wrong amount of free space
failed to load free space cache for block group 1103101952
checking fs roots
(...)
So fix this by logging the prealloc extents beyond the inode's i_size
based on searches in the subvolume tree instead of the extent maps.
Fixes: 471d557afe ("Btrfs: fix loss of prealloc extents past i_size after fsync log replay")
CC: stable@vger.kernel.org # 4.14+
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Disabling pm runtime at probe is not sufficient to get BAM working
on remotely controlled instances. pm_runtime_get_sync() would return
-EACCES in such cases.
So check if runtime pm is enabled before returning an error from the bam functions.
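A minimal userspace sketch of the idea follows. struct mock_dev and mock_pm_runtime_get_sync() are hypothetical stand-ins for struct device and pm_runtime_get_sync(), and the mock only models the -EACCES return described above. It is not the driver code.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for struct device; tracks only what the sketch needs. */
struct mock_dev {
	bool pm_enabled;   /* models whether runtime PM is enabled */
	int usage_count;   /* models the runtime PM usage counter */
};

/* Mock: fails with -EACCES when runtime PM is disabled, as described above. */
static int mock_pm_runtime_get_sync(struct mock_dev *dev)
{
	if (!dev->pm_enabled)
		return -EACCES;
	dev->usage_count++;
	return 0;
}

/* Take a runtime PM reference only when runtime PM is enabled. */
static int bam_get_sync_checked(struct mock_dev *dev)
{
	assert(dev != NULL);
	if (dev->pm_enabled)
		return mock_pm_runtime_get_sync(dev);
	return 0; /* remotely controlled instance: nothing to resume */
}
```

With pm_enabled set to false, the unchecked call returns -EACCES while the checked wrapper returns 0, so callers that bail out on a negative return keep working.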
Fixes: 5b4a68952a ("dmaengine: qcom: bam_dma: disable runtime pm on remote controlled")
Signed-off-by: Srinivas Kandagatla <srinivas.kandagatla@linaro.org>
Signed-off-by: Vinod Koul <vkoul@kernel.org>
Daniel Borkmann says:
====================
pull-request: bpf-next 2018-05-17
The following pull-request contains BPF updates for your *net-next* tree.
The main changes are:
1) Provide a new BPF helper for doing a FIB and neighbor lookup
in the kernel tables from an XDP or tc BPF program. The helper
provides a fast-path for forwarding packets. The API supports
IPv4, IPv6 and MPLS protocols, but currently only IPv4 and IPv6
are implemented in this initial work, from David (Ahern).
2) Just a tiny diff but huge feature enabled for nfp driver by
extending the BPF offload beyond a pure host processing offload.
Offloaded XDP programs are allowed to set the RX queue index,
which opens the door for defining a fully programmable RSS/n-tuple
filter replacement. Once BPF has decided on a queue, the device
data-path will skip the conventional RSS processing completely,
from Jakub.
3) The original sockmap implementation was array based similar to
devmap. However unlike devmap where an ifindex has a 1:1 mapping
into the map there are use cases with sockets that need to be
referenced using longer keys. Hence, sockhash map is added reusing
as much of the sockmap code as possible, from John.
4) Introduce BTF ID. The ID is allocated through an IDR, as is done
for BPF maps and progs. It also makes BTF accessible to user
space via BPF_BTF_GET_FD_BY_ID and adds exposure of the BTF data
through BPF_OBJ_GET_INFO_BY_FD, from Martin.
5) Enable BPF stackmap with build_id also in NMI context. Due to the
up_read() of current->mm->mmap_sem build_id cannot be parsed.
This work defers the up_read() via a per-cpu irq_work so that
at least limited support can be enabled, from Song.
6) Various BPF JIT follow-up cleanups and fixups after the LD_ABS/LD_IND
JIT conversion as well as implementation of an optimized 32/64 bit
immediate load in the arm64 JIT that reduces the number of
emitted instructions; the tested real-world programs shrank
by three percent, from Daniel.
7) Add ifindex parameter to the libbpf loader in order to enable
BPF offload support. Right now only iproute2 can load offloaded
BPF and this will also enable libbpf for direct integration into
other applications, from David (Beckett).
8) Convert the plain text documentation under Documentation/bpf/ into
RST format since this is the appropriate standard the kernel is
moving to for all documentation. Also add an overview README.rst,
from Jesper.
9) Add __printf verification attribute to the bpf_verifier_vlog()
helper. Though it uses va_list we can still allow gcc to check
the format string, from Mathieu.
10) Fix a bash reference in the BPF selftest's Makefile. The '|& ...'
is a bash 4.0+ feature which is not guaranteed to be available
when calling out to shell, therefore use a more portable variant,
from Joe.
11) Fix a 64 bit division in xdp_umem_reg() by using div_u64()
instead of relying on the gcc built-in, from Björn.
12) Fix a sock hashmap kmalloc warning reported by syzbot when an
overly large key size is used in hashmap then causing overflows
in htab->elem_size. Reject bogus attr->key_size early in the
sock_hash_alloc(), from Yonghong.
13) Ensure in BPF selftests when urandom_read is being linked that
--build-id is always enabled so that test_stacktrace_build_id[_nmi]
won't be failing, from Alexei.
14) Add bitsperlong.h as well as errno.h uapi headers into the tools
header infrastructure which point to one of the arch specific
uapi headers. This was needed in order to fix a build error on
some systems for the BPF selftests, from Sirio.
15) Allow for short options to be used in the xdp_monitor BPF sample
code. And also a bpf.h tools uapi header sync in order to fix a
selftest build failure. Both from Prashant.
16) More formally clarify the meaning of ID in the direct packet access
section of the BPF documentation, from Wang.
====================
Signed-off-by: David S. Miller <davem@davemloft.net>
A single fix for a recent regression.
* 'vmwgfx-fixes-4.17' of git://people.freedesktop.org/~thomash/linux:
drm/vmwgfx: Set dmabuf_size when vmw_dmabuf_init is successful
Merge tag 'drm-misc-fixes-2018-05-16' of git://anongit.freedesktop.org/drm/drm-misc into drm-fixes
- core: Fix regression in dev node offsets (Haneen)
- vc4: Fix memory leak on driver close (Eric)
- dumb-buffers: Prevent overflow in DIV_ROUND_UP() (Dan)
Cc: Haneen Mohammed <hamohammed.sa@gmail.com>
Cc: Eric Anholt <eric@anholt.net>
Cc: Dan Carpenter <dan.carpenter@oracle.com>
* tag 'drm-misc-fixes-2018-05-16' of git://anongit.freedesktop.org/drm/drm-misc:
drm/dumb-buffers: Integer overflow in drm_mode_create_ioctl()
drm/vc4: Fix leak of the file_priv that stored the perfmon.
drm: Match sysfs name in link removal to link creation
When 'kvzalloc()' is used to allocate memory, 'kvfree()' must be used to
free it.
Fixes: 1cbe6fc86c ("IB/mlx5: Add support for CQE compressing")
Signed-off-by: Christophe JAILLET <christophe.jaillet@wanadoo.fr>
Acked-by: Jason Gunthorpe <jgg@mellanox.com>
Signed-off-by: Saeed Mahameed <saeedm@mellanox.com>
When 'kvzalloc()' is used to allocate memory, 'kvfree()' must be used to
free it.
Fixes: fed9ce22bf ("net/mlx5: E-Switch, Add API to create vport rx rules")
Signed-off-by: Christophe JAILLET <christophe.jaillet@wanadoo.fr>
Signed-off-by: Saeed Mahameed <saeedm@mellanox.com>
When 'kvzalloc()' is used to allocate memory, 'kvfree()' must be used to
free it.
Fixes: 9efa752545 ("net/mlx5_core: Introduce access functions to query vport RoCE fields")
Signed-off-by: Christophe JAILLET <christophe.jaillet@wanadoo.fr>
Signed-off-by: Saeed Mahameed <saeedm@mellanox.com>
When an error happens in the update sockmap element logic also pass
the err up to the user.
Fixes: e5cd3abcb3 ("bpf: sockmap, refactor sockmap routines to work with hashmap")
Signed-off-by: John Fastabend <john.fastabend@gmail.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Merge tag 'trace-v4.17-rc4-2' of git://git.kernel.org/pub/scm/linux/kernel/git/rostedt/linux-trace
Pull tracing fix from Steven Rostedt:
"Some of the ftrace internal events use a zero for a data size of a
field event. This is increasingly important for the histogram trigger
work that is being extended.
While auditing trace events, I found that a couple of the xen events
were used as just marking that a function was called, by creating a
static array of size zero. This can play havoc with the tracing
features if these events are used, because a zero size of a static
array is denoted as a special nul terminated dynamic array (this is
what the trace_marker code uses). But since the xen events have no
size, they are not nul terminated, and unexpected results may occur.
As trace events were never intended on being a marker to denote that a
function was hit or not, especially since function tracing and kprobes
can trivially do the same, the best course of action is to simply
remove these events"
* tag 'trace-v4.17-rc4-2' of git://git.kernel.org/pub/scm/linux/kernel/git/rostedt/linux-trace:
tracing/x86/xen: Remove zero data size trace events trace_xen_mmu_flush_tlb{_all}
syzbot reported a kernel warning below:
WARNING: CPU: 0 PID: 4499 at mm/slab_common.c:996 kmalloc_slab+0x56/0x70 mm/slab_common.c:996
Kernel panic - not syncing: panic_on_warn set ...
CPU: 0 PID: 4499 Comm: syz-executor050 Not tainted 4.17.0-rc3+ #9
Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 01/01/2011
Call Trace:
__dump_stack lib/dump_stack.c:77 [inline]
dump_stack+0x1b9/0x294 lib/dump_stack.c:113
panic+0x22f/0x4de kernel/panic.c:184
__warn.cold.8+0x163/0x1b3 kernel/panic.c:536
report_bug+0x252/0x2d0 lib/bug.c:186
fixup_bug arch/x86/kernel/traps.c:178 [inline]
do_error_trap+0x1de/0x490 arch/x86/kernel/traps.c:296
do_invalid_op+0x1b/0x20 arch/x86/kernel/traps.c:315
invalid_op+0x14/0x20 arch/x86/entry/entry_64.S:992
RIP: 0010:kmalloc_slab+0x56/0x70 mm/slab_common.c:996
RSP: 0018:ffff8801d907fc58 EFLAGS: 00010246
RAX: 0000000000000000 RBX: ffff8801aeecb280 RCX: ffffffff8185ebd7
RDX: 0000000000000000 RSI: 0000000000000000 RDI: 00000000ffffffe1
RBP: ffff8801d907fc58 R08: ffff8801adb5e1c0 R09: ffffed0035a84700
R10: ffffed0035a84700 R11: ffff8801ad423803 R12: ffff8801aeecb280
R13: 00000000fffffff4 R14: ffff8801ad891a00 R15: 00000000014200c0
__do_kmalloc mm/slab.c:3713 [inline]
__kmalloc+0x25/0x760 mm/slab.c:3727
kmalloc include/linux/slab.h:517 [inline]
map_get_next_key+0x24a/0x640 kernel/bpf/syscall.c:858
__do_sys_bpf kernel/bpf/syscall.c:2131 [inline]
__se_sys_bpf kernel/bpf/syscall.c:2096 [inline]
__x64_sys_bpf+0x354/0x4f0 kernel/bpf/syscall.c:2096
do_syscall_64+0x1b1/0x800 arch/x86/entry/common.c:287
entry_SYSCALL_64_after_hwframe+0x49/0xbe
The test case is against a sock hashmap with a key size of 0xffffffe1.
Such a large key size causes the code below in sock_hash_alloc()
to overflow and produce a smaller elem_size, so map creation
succeeds.
htab->elem_size = sizeof(struct htab_elem) +
round_up(htab->map.key_size, 8);
Later, when map_get_next_key is called, the kernel tries to
allocate the key, the allocation fails, and the above warning
is issued.
Similar to hashtab, ensure the key size is at most
MAX_BPF_STACK for a successful map creation.
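The wraparound can be reproduced in userspace with 32-bit arithmetic. In the sketch below ELEM_HDR_SIZE is a hypothetical stand-in for sizeof(struct htab_elem), whose real value is not given here, MAX_KEY_SIZE mirrors MAX_BPF_STACK (512 in the kernel), and the helper names are made up for illustration.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <stdint.h>

#define MAX_KEY_SIZE  512u /* mirrors MAX_BPF_STACK */
#define ELEM_HDR_SIZE 48u  /* assumption: stand-in for sizeof(struct htab_elem) */

/* Round x up to a multiple of 8; wraps modulo 2^32 like a u32 would. */
static uint32_t round_up8(uint32_t x)
{
	return (x + 7u) & ~7u;
}

/* Unchecked computation: the sum silently wraps for huge key sizes. */
static uint32_t elem_size_unchecked(uint32_t key_size)
{
	return ELEM_HDR_SIZE + round_up8(key_size);
}

/* Checked variant: reject oversized keys before doing any arithmetic. */
static int elem_size_checked(uint32_t key_size, uint32_t *elem_size)
{
	assert(elem_size != NULL);
	if (key_size > MAX_KEY_SIZE)
		return -E2BIG;
	*elem_size = ELEM_HDR_SIZE + round_up8(key_size);
	return 0;
}
```

With these assumed sizes, a key size of 0xffffffe1 rounds up to 0xffffffe8, and adding 48 wraps to 24, which is smaller than the header alone. The checked variant returns -E2BIG for the same input.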
Fixes: 8111038444 ("bpf: sockmap, add hash map support")
Reported-by: syzbot+e4566d29080e7f3460ff@syzkaller.appspotmail.com
Signed-off-by: Yonghong Song <yhs@fb.com>
Acked-by: John Fastabend <john.fastabend@gmail.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
BPF programs currently can only be offloaded using iproute2. This
patch will allow programs to be offloaded using libbpf calls.
Signed-off-by: David Beckett <david.beckett@netronome.com>
Reviewed-by: Jakub Kicinski <jakub.kicinski@netronome.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
__printf is useful to verify format and arguments. The ‘bpf_verifier_vlog’
function is used twice in verifier.c, and in both cases the caller function
already uses the __printf gcc attribute.
Remove the following warning, triggered with W=1:
kernel/bpf/verifier.c:176:2: warning: function might be possible candidate for ‘gnu_printf’ format attribute [-Wsuggest-attribute=format]
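As a userspace illustration of the attribute: the kernel's __printf(a, b) expands to __attribute__((format(printf, a, b))), and a function that takes a va_list passes 0 as the second index so that gcc checks only the format string. The function names below are hypothetical and are not the verifier's.

```c
#include <assert.h>
#include <stdarg.h>
#include <stddef.h>
#include <stdio.h>
#include <string.h>

/* va_list variant: argument 3 is the format, 0 means "no variadic args to check". */
__attribute__((format(printf, 3, 0)))
static int log_vfmt(char *buf, size_t len, const char *fmt, va_list args)
{
	assert(buf != NULL && fmt != NULL);
	return vsnprintf(buf, len, fmt, args);
}

/* Variadic wrapper: format is argument 3, checked arguments start at 4. */
__attribute__((format(printf, 3, 4)))
static int log_fmt(char *buf, size_t len, const char *fmt, ...)
{
	va_list args;
	int n;

	va_start(args, fmt);
	n = log_vfmt(buf, len, fmt, args);
	va_end(args);
	return n;
}
```

With both annotations in place, a call such as log_fmt(buf, sizeof(buf), "%d", "text") is flagged by -Wformat, and -Wsuggest-attribute=format no longer has a candidate to report for the va_list helper.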
Signed-off-by: Mathieu Malaterre <malat@debian.org>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Only consider forwarding packets if ttl in received packet is > 1 and
decrement ttl before handing off to bpf_redirect_map.
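The check can be sketched as below. struct mock_hdr and prepare_forward() are hypothetical names for illustration; a real IPv4 header would also need its checksum updated when the ttl changes, which this sketch leaves out.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical header carrying only the field this sketch needs. */
struct mock_hdr {
	uint8_t ttl;
};

/*
 * Return true and decrement ttl if the packet may be forwarded.
 * Packets arriving with ttl <= 1 are left untouched and not forwarded.
 */
static bool prepare_forward(struct mock_hdr *hdr)
{
	assert(hdr != NULL);
	if (hdr->ttl <= 1)
		return false;
	hdr->ttl--;
	return true;
}
```

Only when prepare_forward() returns true would the caller go on to redirect the packet; otherwise it would fall back to the regular stack.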
Signed-off-by: David Ahern <dsahern@gmail.com>
Acked-by: Yonghong Song <yhs@fb.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
John Fastabend says:
====================
In the original sockmap implementation we got away with using an
array similar to devmap. However, unlike devmap where an ifindex
has a nice 1:1 function into the map we have found some use cases
with sockets that need to be referenced using longer keys.
This series adds support for a sockhash map reusing as much of
the sockmap code as possible. I made the decision to add sockhash
specific helpers vs trying to generalize the existing helpers
because (a) they have sockmap in the name and (b) the keys are
different types. I prefer to be explicit here rather than play
type games or do something else tricky.
To test this we duplicate all the sockmap testing except swap out
the sockmap with a sockhash.
v2: fix file stats and add v2 tag
v3: move tool updates into test patch, move bpftool updates into
its own patch, and fixup the test patch stats to catch the
renamed file and provide only diffs ± on that.
v4: Add documentation to UAPI bpf.h
v5: Add documentation to tools UAPI bpf.h
v6: 'git add' test_sockhash_kern.c which was previously missing
but was not causing issues because of typo in test script,
noticed by Daniel. After this the git format-patch -M option
no longer tracks the rename of the test_sockmap_kern files for
some reason. I guess the diff has exceeded some threshold.
====================
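To illustrate why longer keys call for a hash rather than an array index, here is a small userspace sketch of a table keyed by a connection 4-tuple. Everything in it (struct conn_key, the fixed-size open-addressing table, the hash function, the integer sock_id) is a hypothetical simplification and does not reflect how the kernel sockhash map is implemented.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define SLOTS 16u /* power of two so the hash can be masked */

/* Hypothetical long key: too wide to be used as an array index. */
struct conn_key {
	uint32_t saddr, daddr;
	uint16_t sport, dport;
};

struct slot {
	int used;
	struct conn_key key;
	int sock_id; /* stand-in for a socket reference */
};

struct sock_table {
	struct slot slots[SLOTS];
};

/* Simple FNV-style mix over the key fields. */
static uint32_t hash_key(const struct conn_key *k)
{
	uint32_t words[3] = { k->saddr, k->daddr,
			      ((uint32_t)k->sport << 16) | k->dport };
	uint32_t h = 2166136261u;

	for (size_t i = 0; i < 3; i++) {
		h ^= words[i];
		h *= 16777619u;
	}
	return h;
}

static int key_eq(const struct conn_key *a, const struct conn_key *b)
{
	return a->saddr == b->saddr && a->daddr == b->daddr &&
	       a->sport == b->sport && a->dport == b->dport;
}

/* Insert or overwrite; returns 0 on success, -1 when the table is full. */
static int table_update(struct sock_table *t, const struct conn_key *k, int sock_id)
{
	uint32_t h = hash_key(k);

	assert(t != NULL);
	for (uint32_t i = 0; i < SLOTS; i++) {
		struct slot *s = &t->slots[(h + i) & (SLOTS - 1)];

		if (!s->used || key_eq(&s->key, k)) {
			s->used = 1;
			s->key = *k;
			s->sock_id = sock_id;
			return 0;
		}
	}
	return -1;
}

/* Returns the stored sock_id, or -1 when the key is absent. */
static int table_lookup(const struct sock_table *t, const struct conn_key *k)
{
	uint32_t h = hash_key(k);

	assert(t != NULL);
	for (uint32_t i = 0; i < SLOTS; i++) {
		const struct slot *s = &t->slots[(h + i) & (SLOTS - 1)];

		if (!s->used)
			return -1; /* no deletions, so an empty slot ends the probe */
		if (key_eq(&s->key, k))
			return s->sock_id;
	}
	return -1;
}
```

An array-based map would need the key itself to be a small integer index. Here the 12-byte key is hashed down to a slot, and collisions are resolved by linear probing.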
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
This adds the SOCKHASH map type to bpftool so that we get correct
pretty printing.
Signed-off-by: John Fastabend <john.fastabend@gmail.com>
Acked-by: David S. Miller <davem@davemloft.net>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
This runs the existing SOCKMAP tests with the SOCKHASH map type. To do
this we push the programs into an include file and build two BPF
programs, one for SOCKHASH and one for SOCKMAP.
We then run the entire test suite with each type.
Signed-off-by: John Fastabend <john.fastabend@gmail.com>
Acked-by: David S. Miller <davem@davemloft.net>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
For T6, the clip table is separate from the main TCAM. So, update the
LE-TCAM collection logic to collect the clip table TCAM as well. IPv6
takes 4 entries in the clip table TCAM compared to 2 entries in the main
TCAM. Also, in case of errors, keep the LE-TCAM entries collected so far
and set the status to partial dump.
Signed-off-by: Rahul Lakkireddy <rahul.lakkireddy@chelsio.com>
Signed-off-by: Ganesh Goudar <ganeshgr@chelsio.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
After commit b196d88aba ("tun: fix use after free for ptr_ring") we
need to clean up the tx ring during release(). Unfortunately, it does
the cleanup blindly after the socket has been destroyed, which leads to
another use-after-free. Fix this by doing the cleanup before dropping
the last reference of the socket in __tun_detach().
Reported-by: Andrei Vagin <avagin@virtuozzo.com>
Acked-by: Andrei Vagin <avagin@virtuozzo.com>
Fixes: b196d88aba ("tun: fix use after free for ptr_ring")
Signed-off-by: Jason Wang <jasowang@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Michal Kalderon says:
====================
qed: LL2 fixes
This series fixes some issues in ll2 related to synchronization
and resource freeing.
====================
Signed-off-by: Ariel Elior <Ariel.Elior@cavium.com>
Signed-off-by: Michal Kalderon <Michal.Kalderon@cavium.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Stress testing qedi/qedr load/unload led to list_del corruption.
This is due to the ll2 connection terminate flow freeing resources
without verifying that no more ll2 processing will occur.
This patch unregisters the ll2 status block before terminating
the connection to ensure this race does not occur.
Fixes: 1d6cff4fca ("qed: Add iSCSI out of order packet handling")
Signed-off-by: Ariel Elior <Ariel.Elior@cavium.com>
Signed-off-by: Michal Kalderon <Michal.Kalderon@cavium.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
The ll2 flows that flush the txq/rxq need to be synchronized with the
regular fp processing. The missing synchronization caused list
corruption during load/unload stress tests.
Fixes: 0a7fb11c23 ("qed: Add Light L2 support")
Signed-off-by: Ariel Elior <Ariel.Elior@cavium.com>
Signed-off-by: Michal Kalderon <Michal.Kalderon@cavium.com>
Signed-off-by: David S. Miller <davem@davemloft.net>