Nadav Amit identified a theoritical race between page reclaim and
mprotect due to TLB flushes being batched outside of the PTL being held.
He described the race as follows:
CPU0 CPU1
---- ----
user accesses memory using RW PTE
[PTE now cached in TLB]
try_to_unmap_one()
==> ptep_get_and_clear()
==> set_tlb_ubc_flush_pending()
mprotect(addr, PROT_READ)
==> change_pte_range()
==> [ PTE non-present - no flush ]
user writes using cached RW PTE
...
try_to_unmap_flush()
The same type of race exists for reads when protecting for PROT_NONE and
also exists for operations that can leave an old TLB entry behind such
as munmap, mremap and madvise.
For some operations like mprotect, it's not necessarily a data integrity
issue but it is a correctness issue as there is a window where an
mprotect that limits access still allows access. For munmap, it's
potentially a data integrity issue although the race is massive as an
munmap, mmap and return to userspace must all complete between the
window when reclaim drops the PTL and flushes the TLB. However, it's
theoritically possible so handle this issue by flushing the mm if
reclaim is potentially currently batching TLB flushes.
Other instances where a flush is required for a present pte should be ok
as either the page lock is held preventing parallel reclaim or a page
reference count is elevated preventing a parallel free leading to
corruption. In the case of page_mkclean there isn't an obvious path
that userspace could take advantage of without using the operations that
are guarded by this patch. Other users such as gup as a race with
reclaim looks just at PTEs. huge page variants should be ok as they
don't race with reclaim. mincore only looks at PTEs. userfault also
should be ok as if a parallel reclaim takes place, it will either fault
the page back in or read some of the data before the flush occurs
triggering a fault.
Note that a variant of this patch was acked by Andy Lutomirski but this
was for the x86 parts on top of his PCID work which didn't make the 4.13
merge window as expected. His ack is dropped from this version and
there will be a follow-on patch on top of PCID that will include his
ack.
[akpm@linux-foundation.org: tweak comments]
[akpm@linux-foundation.org: fix spello]
Link: http://lkml.kernel.org/r/20170717155523.emckq2esjro6hf3z@suse.de
Reported-by: Nadav Amit <nadav.amit@gmail.com>
Signed-off-by: Mel Gorman <mgorman@suse.de>
Cc: Andy Lutomirski <luto@kernel.org>
Cc: <stable@vger.kernel.org> [v4.4+]
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
After commit 3d375d7859 ("mm: update callers to use HASH_ZERO flag"),
drop unused pidhash_size in pidhash_init().
Link: http://lkml.kernel.org/r/1500389267-49222-1-git-send-email-wangkefeng.wang@huawei.com
Signed-off-by: Kefeng Wang <wangkefeng.wang@huawei.com>
Reviewed-by: Pavel Tatashin <Pasha.Tatashin@Oracle.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Commit 9a291a7c94 ("mm/hugetlb: report -EHWPOISON not -EFAULT when
FOLL_HWPOISON is specified") causes __get_user_pages to ignore certain
errors from follow_hugetlb_page. After such error, __get_user_pages
subsequently calls faultin_page on the same VMA and start address that
follow_hugetlb_page failed on instead of returning the error immediately
as it should.
In follow_hugetlb_page, when hugetlb_fault returns a value covered under
VM_FAULT_ERROR, follow_hugetlb_page returns it without setting nr_pages
to 0 as __get_user_pages expects in this case, which causes the
following to happen in __get_user_pages: the "while (nr_pages)" check
succeeds, we skip the "if (!vma..." check because we got a VMA the last
time around, we find no page with follow_page_mask, and we call
faultin_page, which calls hugetlb_fault for the second time.
This issue also slightly changes how __get_user_pages works. Before, it
only returned error if it had made no progress (i = 0). But now,
follow_hugetlb_page can clobber "i" with an error code since its new
return path doesn't check for progress. So if "i" is nonzero before a
failing call to follow_hugetlb_page, that indication of progress is lost
and __get_user_pages can return error even if some pages were
successfully pinned.
To fix this, change follow_hugetlb_page so that it updates nr_pages,
allowing __get_user_pages to fail immediately and restoring the "error
only if no progress" behavior to __get_user_pages.
Tested that __get_user_pages returns when expected on error from
hugetlb_fault in follow_hugetlb_page.
Fixes: 9a291a7c94 ("mm/hugetlb: report -EHWPOISON not -EFAULT when FOLL_HWPOISON is specified")
Link: http://lkml.kernel.org/r/1500406795-58462-1-git-send-email-daniel.m.jordan@oracle.com
Signed-off-by: Daniel Jordan <daniel.m.jordan@oracle.com>
Acked-by: Punit Agrawal <punit.agrawal@arm.com>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: "Aneesh Kumar K.V" <aneesh.kumar@linux.vnet.ibm.com>
Cc: Gerald Schaefer <gerald.schaefer@de.ibm.com>
Cc: James Morse <james.morse@arm.com>
Cc: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>
Cc: Michal Hocko <mhocko@suse.com>
Cc: Mike Kravetz <mike.kravetz@oracle.com>
Cc: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
Cc: zhong jiang <zhongjiang@huawei.com>
Cc: <stable@vger.kernel.org> [4.12.x]
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
The host physical addresses of L1's Virtual APIC Page and Posted
Interrupt descriptor are loaded into the VMCS02. The CPU may write
to these pages via their host physical address while L2 is running,
bypassing address-translation-based dirty tracking (e.g. EPT write
protection). Mark them dirty on every exit from L2 to prevent them
from getting out of sync with dirty tracking.
Also mark the virtual APIC page and the posted interrupt descriptor
dirty when KVM is virtualizing posted interrupt processing.
Signed-off-by: David Matlack <dmatlack@google.com>
Reviewed-by: Paolo Bonzini <pbonzini@redhat.com>
Signed-off-by: Radim Krčmář <rkrcmar@redhat.com>
According to the Intel SDM, software cannot rely on the current VMCS to be
coherent after a VMXOFF or shutdown. So this is a valid way to handle VMCS12
flushes.
24.11.1 Software Use of Virtual-Machine Control Structures
...
If a logical processor leaves VMX operation, any VMCSs active on
that logical processor may be corrupted (see below). To prevent
such corruption of a VMCS that may be used either after a return
to VMX operation or on another logical processor, software should
execute VMCLEAR for that VMCS before executing the VMXOFF instruction
or removing power from the processor (e.g., as part of a transition
to the S3 and S4 power states).
...
This fixes a "suspicious rcu_dereference_check() usage!" warning during
kvm_vm_release() because nested_release_vmcs12() calls
kvm_vcpu_write_guest_page() without holding kvm->srcu.
Signed-off-by: David Matlack <dmatlack@google.com>
Reviewed-by: Paolo Bonzini <pbonzini@redhat.com>
Signed-off-by: Radim Krčmář <rkrcmar@redhat.com>
Since the current implementation of VMCS12 does a memcpy in and out
of guest memory, we do not need current_vmcs12 and current_vmcs12_page
anymore. current_vmptr is enough to read and write the VMCS12.
And David Matlack noted:
This patch also fixes dirty tracking (memslot->dirty_bitmap) of the
VMCS12 page by using kvm_write_guest. nested_release_page() only marks
the struct page dirty.
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Reviewed-by: David Hildenbrand <david@redhat.com>
[Added David Matlack's note and nested_release_page_clean() fix.]
Signed-off-by: Radim Krčmář <rkrcmar@redhat.com>
During teardown, accesses to memslots and buses are using
rcu_dereference_protected with an always-true condition because
these accesses are done outside the usual mutexes. This
is because the last reference is gone and there cannot be any
concurrent modifications, but rcu_dereference_protected is
ugly and unobvious.
Instead, check the refcount in kvm_get_bus and __kvm_memslots.
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Signed-off-by: Radim Krčmář <rkrcmar@redhat.com>
'lapic_irq' is a local variable and its 'level' field isn't
initialized, so 'level' is random, it doesn't matter but
makes UBSAN unhappy:
UBSAN: Undefined behaviour in .../lapic.c:...
load of value 10 is not a valid value for type '_Bool'
...
Call Trace:
[<ffffffff81f030b6>] dump_stack+0x1e/0x20
[<ffffffff81f03173>] ubsan_epilogue+0x12/0x55
[<ffffffff81f03b96>] __ubsan_handle_load_invalid_value+0x118/0x162
[<ffffffffa1575173>] kvm_apic_set_irq+0xc3/0xf0 [kvm]
[<ffffffffa1575b20>] kvm_irq_delivery_to_apic_fast+0x450/0x910 [kvm]
[<ffffffffa15858ea>] kvm_irq_delivery_to_apic+0xfa/0x7a0 [kvm]
[<ffffffffa1517f4e>] kvm_emulate_hypercall+0x62e/0x760 [kvm]
[<ffffffffa113141a>] handle_vmcall+0x1a/0x30 [kvm_intel]
[<ffffffffa114e592>] vmx_handle_exit+0x7a2/0x1fa0 [kvm_intel]
...
Signed-off-by: Longpeng(Mike) <longpeng2@huawei.com>
Signed-off-by: Radim Krčmář <rkrcmar@redhat.com>
When SMP VM start, AP may lost INIT because of receiving INIT between
kvm_vcpu_ioctl_x86_get/set_vcpu_events.
vcpu 0 vcpu 1
kvm_vcpu_ioctl_x86_get_vcpu_events
events->smi.latched_init = 0
send INIT to vcpu1
set vcpu1's pending_events
kvm_vcpu_ioctl_x86_set_vcpu_events
if (events->smi.latched_init == 0)
clear INIT in pending_events
This patch fixes it by just update SMM related flags if we are in SMM.
Thanks Peng Hao for the report and original commit message.
Reported-by: Peng Hao <peng.hao2@zte.com.cn>
Cc: Paolo Bonzini <pbonzini@redhat.com>
Cc: Radim Krčmář <rkrcmar@redhat.com>
Signed-off-by: Wanpeng Li <wanpeng.li@hotmail.com>
Reviewed-by: Paolo Bonzini <pbonzini@redhat.com>
Signed-off-by: Radim Krčmář <rkrcmar@redhat.com>
Otherwise bo->shadow_list (which is aliased by bo->mn_list) will not
appear empty in amdgpu_ttm_bo_destroy and cause an oops when freeing
former userptr BOs.
Signed-off-by: Felix Kuehling <Felix.Kuehling@amd.com>
Reviewed-by: Christian König <christian.koenig@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
As I was staring at the si_init_golden_registers code, I noticed that
the Pitcairn initialization silently falls through the Cape Verde
initialization, and the Oland initialization falls through the Hainan
initialization. However there is no comment stating that this is
intentional, and the radeon driver doesn't have any such fallthrough,
so I suspect this is not supposed to happen.
Signed-off-by: Jean Delvare <jdelvare@suse.de>
Fixes: 62a3755341 ("drm/amdgpu: add si implementation v10")
Cc: Ken Wang <Qingqing.Wang@amd.com>
Cc: Alex Deucher <alexander.deucher@amd.com>
Cc: "Marek Olšák" <maraeo@gmail.com>
Cc: "Christian König" <christian.koenig@amd.com>
Cc: Flora Cui <Flora.Cui@amd.com>
Reviewed-by: Marek Olšák <marek.olsak@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
Cc: stable@vger.kernel.org
One critical clock fix for sun5i (A10s/A13/R8) which enables propogation
of clock rate changes from the "cpu" clock to it's parent PLL clock.
This fixes cpufreq related crashes that have been observed on KernelCI
with the C.H.I.P. and multi_v7_defconfig.
-----BEGIN PGP SIGNATURE-----
iQJCBAABCgAsFiEE2nN1m/hhnkhOWjtHOJpUIZwPJDAFAlmBWB8OHHdlbnNAY3Np
ZS5vcmcACgkQOJpUIZwPJDBGvRAAkwO1p1Aw62i5QmrGox3iyWabH/Rbon0M+tLQ
jP+kVukWrLR8REclkpwXqSK07Blg5m3o1MVMJXBKKOD6NAKCPOJoN75K/aX44fzZ
msrTuuzGkixovkG7p8ln253v/TOoBMwwK+QDtTOk7zBNEZuRKevKAUOglQTJYTmy
um7cVsisv5Yr89vtG767o0UyP82TUbnga6yLYY+72PbEKP1MnJtR0gzQp6Y9zxmW
Oyt7MaHH7km6SDGrQUWgQ8SvZ36s3uLxPfrtPx0zbTE1sM31sgdHf4kZmciP7Yap
eiLoHxV+twVxrgYh/Fl7BxOMfSqdIpZrzL3z2N3Hc53oARjOSl9c3N12UVzC0D0K
lRldFpmW26U21xMp22033j6uW9daU2PmyybNyJu892DRfY8RcX2N7otwIgAtxEtY
SKvVaG/uin2WVXs5FsxbpBUOm5xEJ04547TXuEegGjKyEv8+d1lCO6yU8i6xdUE0
qBzNm5rNuuSTD5zKu41YpEWhMWxEFEJE+Cx/tsCOhaySDuKKOrV7Bbd15ec3PjRD
llv2TvJNiBAkm4pakOsLZcDKk6RWqBEoN+6GXuC/66Dm+k+rR0nsRthDk418pDF2
T8YgG1p5wETdh/DfnUnwg63xN/dVs4On8IJkGunznTTGMxHiYduQsmqAxkgaRq6h
k4insAE=
=uoCy
-----END PGP SIGNATURE-----
Merge tag 'sunxi-clk-fixes-for-4.13' of https://git.kernel.org/pub/scm/linux/kernel/git/sunxi/linux into clk-fixes
Pull one Allwinner clock fix from Chen-Yu Tsai:
One critical clock fix for sun5i (A10s/A13/R8) which enables propagation
of clock rate changes from the "cpu" clock to it's parent PLL clock.
This fixes cpufreq related crashes that have been observed on KernelCI
with the C.H.I.P. and multi_v7_defconfig.
* tag 'sunxi-clk-fixes-for-4.13' of https://git.kernel.org/pub/scm/linux/kernel/git/sunxi/linux:
clk: sunxi-ng: sun5i: Add clk_set_rate_parent to the CPU clock
This seven is mostly minor build, Kconfig and error leg fixes.
Signed-off-by: James E.J. Bottomley <jejb@linux.vnet.ibm.com>
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v2
iQIcBAABAgAGBQJZgeACAAoJEAVr7HOZEZN4vxwP/0GZc1EWDubS8DZZOr4X25WA
jofQFCrWYf9jBcCA64pv7kF5wCbQk0MFXMvs2u5YHGMV8U4VyquNT2zEYm+R5ZD8
C3sIdgZyxb7x+KkUJ5NxvDBIzAQuQcPEUPZspgi809veoATQ28qoMO1NkOJzbJGE
nUs/6TZcpA5xGdqXtI3G6IsbvuAFO1hmI2aIUAj5UVKV+B9ILGsxWHPzYPErgtra
afTJ8jO4vYW/wcyY7/DG/V8UAvU/0pe9NZqgt0Gn+XiSq5C9HnqS9+BWjU9nx98g
EiPuvP6UBq7zGmjFQdUTwYuysbPDwyVfB1l7RJFVeBTPj6sINFrqTQPU8qwy96S8
gxfObnfbEkx4zJmv4iP8HtAgRQMVsVN64ICD6oeS4JQWmXoz90S0vCLwrTQv36ll
UHmBaSKMqPHTt+emEmup3rcOBU+4PIVq40jQjUyBCJXd/kg5GUz7Og++/NHX8JDZ
qhVJ/VhkBacnDD9fZLQ2cMfcB/E5pUmEgx3MycZty8pXx0AkttWyB0lIddln2yiL
qpZwbJ4nbhS7TO1/ScZlHEqp/rVY3aaIvLovWrhrmy4N1KBk41EmSxCokxkv+Kmd
Twy35eHLDT9Hn0emT+y12Ul5y0cTxy1JMOcMf6LLXAHQsKgg2khm3JfHcXC7H4mO
H/FpxwQT7z47yU3FF2mZ
=/+Ot
-----END PGP SIGNATURE-----
Merge tag 'scsi-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/jejb/scsi
Pull SCSI fixes from James Bottomley:
"These seven patches are mostly minor build, Kconfig and error leg
fixes"
* tag 'scsi-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/jejb/scsi:
scsi: qedi: Fix return code in qedi_ep_connect()
scsi: lpfc: fix linking against modular NVMe support
scsi: scsi_transport_fc: return -EBUSY for deleted vport
scsi: libcxgbi: add check for valid cxgbi_task_data
scsi: aic7xxx: fix firmware build with O=path
scsi: megaraid_sas: fix memleak in megasas_alloc_cmdlist_fusion
scsi: qedi: Add ISCSI_BOOT_SYSFS to Kconfig
Quite a few fixes here that have been sent since the merge window, the
biggest one is the fix from Tony for some confusion with the device
property API which was causing issues with the of-graph card. This is
fixed with some changes in the graph API itself as it seemed very likely
to be error prone.
-----BEGIN PGP SIGNATURE-----
iQFHBAABCAAxFiEEreZoqmdXGLWf4p/qJNaLcl1Uh9AFAlmB51ATHGJyb29uaWVA
a2VybmVsLm9yZwAKCRAk1otyXVSH0G39B/90DuMEckKIUVTPyIpyHetKl46eFRDJ
9t3EnovAZJX6zZ7bdlkcCg33wGmcIkn9PFfLpPcnbxYeKP0xGah/GGKec1K/aeTn
FiD2LjmO5Cc0adk5tEwqiVrL0cOVrUdXuMpvhoMQqD1EBzsgDQtUpG6EL8SGbrtd
dfSLFDrYHIBO2b3iZOZPJvqMPnh9jJ4zEyBwVpQGOCoQdD/AP5S84McxwGhlJfl4
UqAY8keZYlW/Gl1DSnc8BDDCuLrexdCEUAR/0Vz3QFQk+sSnKVlkE+XXo0L4AVZl
XoDGjB5DL8FjXPG7yKO0rAytFFFzSJCWwBRnESAIGKz9o5HIUfOZtE8Q
=10Px
-----END PGP SIGNATURE-----
Merge tag 'asoc-fix-v4.13-rc3' of git://git.kernel.org/pub/scm/linux/kernel/git/broonie/sound into for-linus
ASoC: Fixes for v4.13
Quite a few fixes here that have been sent since the merge window, the
biggest one is the fix from Tony for some confusion with the device
property API which was causing issues with the of-graph card. This is
fixed with some changes in the graph API itself as it seemed very likely
to be error prone.
The number of pins in South Bridge is 30 and not 29. There is a fix for
the driver for the pinctrl, but a fix is also need at device tree level
for the GPIO.
Fixes: afda007fed ("ARM64: dts: marvell: Add pinctrl nodes for Armada
3700")
Cc: <stable@vger.kernel.org>
Signed-off-by: Gregory CLEMENT <gregory.clement@free-electrons.com>
rpc_clnt_add_xprt() expects the callback function to be synchronous, and
expects to release the transport and switch references itself.
Fixes: 04fa2c6bb5 ("NFS pnfs data server multipath session trunking")
Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
Sony VAIO VPCL14M1R needs the quirk to make the speaker working properly.
Tested-by: Dmitriy <mexx400@yandex.ru>
Cc: <stable@vger.kernel.org>
Signed-off-by: Sergei A. Trusov <sergei.a.trusov@ya.ru>
Signed-off-by: Takashi Iwai <tiwai@suse.de>
of_irq_to_resource() has recently been fixed to return negative error #'s
along with 0 in case of failure, however the Freescale MPC832x RDB board
code still only regards 0 as a failure indication -- fix it up.
Fixes: 7a4228bbff ("of: irq: use of_irq_get() in of_irq_to_resource()")
Signed-off-by: Sergei Shtylyov <sergei.shtylyov@cogentembedded.com>
Acked-by: Scott Wood <oss@buserror.net>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
When more than one GPIO IRQs are triggered simultaneously,
tegra_gpio_irq_handler() called chained_irq_exit() multiple
times for one chained_irq_enter().
Fixes: 3c92db9ac0
Signed-off-by: Michał Mirosław <mirq-linux@rere.qmqm.pl>
[Also changed the variable to a bool]
Signed-off-by: Linus Walleij <linus.walleij@linaro.org>
When I converted dell-wmi to the new bus infrastructure, I left the
call to dell_wmi_check_descriptor_buffer() in dell_wmi_init(). This
could cause two problems:
- An error message when loading the driver on a system without
dell-wmi. We'd try to read the event descriptor even if the WMI
GUID wasn't there.
- A possible race if dell-wmi was loaded manually before wmi was
fully initialized.
Fix it by moving the call to the probe function where it belongs.
Fixes: bff589be59 ("platform/x86: dell-wmi: Convert to the WMI bus infrastructure")
Signed-off-by: Andy Lutomirski <luto@kernel.org>
Reviewed-by: Pali Rohár <pali.rohar@gmail.com>
Signed-off-by: Darren Hart (VMware) <dvhart@infradead.org>
The verifier is allocated on the stack, but the EXCHANGE_ID RPC call was
changed to be asynchronous by commit 8d89bd70bc. If we interrrupt
the call to rpc_wait_for_completion_task(), we can therefore end up
transmitting random stack contents in lieu of the verifier.
Fixes: 8d89bd70bc ("NFS setup async exchange_id")
Cc: stable@vger.kernel.org # v4.9+
Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
WARNING: CPU: 5 PID: 1242 at kernel/rcu/tree_plugin.h:323 rcu_note_context_switch+0x207/0x6b0
CPU: 5 PID: 1242 Comm: unity-settings- Not tainted 4.13.0-rc2+ #1
RIP: 0010:rcu_note_context_switch+0x207/0x6b0
Call Trace:
__schedule+0xda/0xba0
? kvm_async_pf_task_wait+0x1b2/0x270
schedule+0x40/0x90
kvm_async_pf_task_wait+0x1cc/0x270
? prepare_to_swait+0x22/0x70
do_async_page_fault+0x77/0xb0
? do_async_page_fault+0x77/0xb0
async_page_fault+0x28/0x30
RIP: 0010:__d_lookup_rcu+0x90/0x1e0
I encounter this when trying to stress the async page fault in L1 guest w/
L2 guests running.
Commit 9b132fbe54 (Add rcu user eqs exception hooks for async page
fault) adds rcu_irq_enter/exit() to kvm_async_pf_task_wait() to exit cpu
idle eqs when needed, to protect the code that needs use rcu. However,
we need to call the pair even if the function calls schedule(), as seen
from the above backtrace.
This patch fixes it by informing the RCU subsystem exit/enter the irq
towards/away from idle for both n.halted and !n.halted.
Cc: Paolo Bonzini <pbonzini@redhat.com>
Cc: Radim Krčmář <rkrcmar@redhat.com>
Cc: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Cc: stable@vger.kernel.org
Signed-off-by: Wanpeng Li <wanpeng.li@hotmail.com>
Reviewed-by: Paolo Bonzini <pbonzini@redhat.com>
Signed-off-by: Radim Krčmář <rkrcmar@redhat.com>
There are three issues in nested_vmx_check_exception:
1) it is not taking PFEC_MATCH/PFEC_MASK into account, as reported
by Wanpeng Li;
2) it should rebuild the interruption info and exit qualification fields
from scratch, as reported by Jim Mattson, because the values from the
L2->L0 vmexit may be invalid (e.g. if an emulated instruction causes
a page fault, the EPT misconfig's exit qualification is incorrect).
3) CR2 and DR6 should not be written for exception intercept vmexits
(CR2 only for AMD).
This patch fixes the first two and adds a comment about the last,
outlining the fix.
Cc: Jim Mattson <jmattson@google.com>
Cc: Wanpeng Li <wanpeng.li@hotmail.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Do this in the caller of nested_vmx_vmexit instead.
nested_vmx_check_exception was doing a vmwrite to the vmcs02's
VM_EXIT_INTR_ERROR_CODE field, so that prepare_vmcs12 would move
the field to vmcs12->vm_exit_intr_error_code. However that isn't
possible on pre-Haswell machines. Moving the vmcs12 write to the
callers fixes it.
Reported-by: Jim Mattson <jmattson@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
[Changed nested_vmx_reflect_vmexit() return type to (int)1 from (bool)1,
thanks to fengguang.wu@intel.com]
Signed-off-by: Radim Krčmář <rkrcmar@redhat.com>
Pull parsic fixes from Helge Deller:
- Our cache flushing code ran into a BUG in case context is not
current. Fix it by flushing the whole cache in such rare situations
(by Dave Anglin).
- Fix a "sleeping function called from invalid context BUG" in our
pdc_stable driver by rearranging our locks (by James Bottomley)
- The thread and irq stacks require more than 16 KB since kernel 4.11.
Increase both to 32 KB.
- Define CONFIG_CPU_BIG_ENDIAN unconditionally on parisc to avoid wrong
behaviour in qrwlock functions (by Babu Moger).
* 'parisc-4.13-4' of git://git.kernel.org/pub/scm/linux/kernel/git/deller/parisc-linux:
parisc: Define CONFIG_CPU_BIG_ENDIAN
parisc: pdc_stable: Fix locking when creating sysfs links
parisc: Increase thread and stack size to 32kb
parisc: Handle vma's whose context is not current in flush_cache_range
This is needed so that the OSDs can regenerate the missing set at the
start of a new interval where support for recovery deletes changed.
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
Reviewed-by: Sage Weil <sage@redhat.com>
- apply both pg_upmap and pg_upmap_items
- allow bidirectional swap of pg-upmap-items
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
Reviewed-by: Sage Weil <sage@redhat.com>
There is now a fallback to a choose_arg index of -1 if there isn't
a pool-specific choose_arg set. If you create a per-pool weight-set,
that works for that pool. Otherwise we try the compat/default one. If
that doesn't exist either, then we use the normal CRUSH weights.
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
Reviewed-by: Sage Weil <sage@redhat.com>
Reencoding an already reencoded message is a bad idea. This could
happen on Policy::stateful_server connections (!CEPH_MSG_CONNECT_LOSSY),
such as MDS sessions.
This didn't pop up in testing because currently only OSD requests are
reencoded and OSD sessions are always lossy.
Fixes: 98ad5ebd15 ("libceph: ceph_connection_operations::reencode_message() method")
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
Reviewed-by: "Yan, Zheng" <zyan@redhat.com>
Messages allocated out of ceph_msgpool have a fixed front length
(pool->front_len). Asserting that the entire front has been filled
while encoding is thus wrong.
Fixes: 8cb441c054 ("libceph: MOSDOp v8 encoding (actual spgid + full hash)")
Reported-by: "Yan, Zheng" <zyan@redhat.com>
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
Reviewed-by: "Yan, Zheng" <zyan@redhat.com>
Fix inconsistent use of of_graph_get_port_parent() where
asoc_simple_card_parse_graph_dai() does of_node_get() before
calling it while other callers do not. We can fix this by
not trashing the node passed to of_graph_get_port_parent().
Let's also make sure the callers have correct refcounts and remove
related incorrect of_node_put() calls for of_for_each_phandle
as that's done by of_phandle_iterator_next() except when
we break out of the loop early.
Let's fix both issues with a single patch to avoid kobject
refcounts getting messed up more if two patches are merged
separately.
Otherwise strange issues can happen caused by memory corruption
caused by too many kobject_del() calls such as:
BUG: sleeping function called from invalid context at
kernel/locking/mutex.c:747
...
(___might_sleep)
(__mutex_lock)
(mutex_lock_nested)
(kernfs_remove)
(kobject_del)
(kobject_put)
(of_get_next_parent)
(of_graph_get_port_parent)
(asoc_simple_card_parse_graph_dai [snd_soc_simple_card_utils])
(asoc_graph_card_probe [snd_soc_audio_graph_card])
Fixes: 0ef472a973 ("of_graph: add of_graph_get_port_parent()")
Fixes: 2692c1c63c ("ASoC: add audio-graph-card support")
Fixes: 1689333f83 ("ASoC: simple-card-utils: add asoc_simple_card_parse_graph_dai()")
Signed-off-by: Tony Lindgren <tony@atomide.com>
Reviewed-by: Rob Herring <robh@kernel.org>
Tested-by: Antonio Borneo <borneo.antonio@gmail.com>
Tested-by: Kuninori Morimoto <kuninori.morimoto.gx@renesas.com>
Signed-off-by: Mark Brown <broonie@kernel.org>
For e.g. HZ=100, timer being 430 jiffies in the future, and 32 bit
unsigned int, there is an overflow on unsigned int right-hand side
of the expression which results with wrong values being returned.
Type cast the multiplier to 64bit to avoid that issue.
Fixes: 46c8f0b077 ("timers: Fix get_next_timer_interrupt() computation")
Signed-off-by: Matija Glavinic Pecotic <matija.glavinic-pecotic.ext@nokia.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Reviewed-by: Alexander Sverdlin <alexander.sverdlin@nokia.com>
Cc: khilman@baylibre.com
Cc: akpm@linux-foundation.org
Cc: stable@vger.kernel.org
Link: http://lkml.kernel.org/r/a7900f04-2a21-c9fd-67be-ab334d459ee5@nokia.com
mpll0 clock is special compared to the other mplls. It needs another
bit (ssen) to be set to activate the fractional part the mpll divider
Fixes: 007e6e5c5f ("clk: meson: mpll: add rw operation")
Signed-off-by: Jerome Brunet <jbrunet@baylibre.com>
Signed-off-by: Neil Armstrong <narmstrong@baylibre.com>
The previous fix for filtering out of unwatched events was not entirely
correct. Instead of skipping the events we don't want, they are now
interpreted as events with opposing edge.
In order to fix it: always read the GPIO line value on interrupt and
only emit the event if it corresponds with the event type we requested.
Cc: stable@vger.kernel.org
Fixes: ad537b8225 ("gpiolib: fix filtering out unwanted events")
Signed-off-by: Bartosz Golaszewski <brgl@bgdev.pl>
Signed-off-by: Linus Walleij <linus.walleij@linaro.org>
The device-specific property should be prefixed with the vendor name,
not "linux,", as Linus Walleij pointed out. Change this and document the
bindings of this platform device.
We didn't ship the old binding in a release yet. So we can still change
it without breaking an official API.
Fixes: 380b1e2f3a ("gpio-exar/8250-exar: Make set of exported GPIOs configurable")
Signed-off-by: Jan Kiszka <jan.kiszka@siemens.com>
Acked-by: Rob Herring <robh@kernel.org>
Signed-off-by: Linus Walleij <linus.walleij@linaro.org>
The register of setting back ratio should be RT5665_ADDA_CLK_2
instead of RT5665_ADDA_CLK_1.
Signed-off-by: Bard Liao <bardliao@realtek.com>
Signed-off-by: Mark Brown <broonie@kernel.org>
In an ideal world, CNTFRQ_EL0 always contains the timer frequency
for the kernel to use. Sadly, we get quite a few broken systems
where the firmware authors cannot be bothered to program that
register on all CPUs, and rely on DT to provide that frequency.
So when trapping CNTFRQ_EL0, make sure to return the actual rate
(as known by the kernel), and not CNTFRQ_EL0.
Acked-by: Mark Rutland <mark.rutland@arm.com>
Signed-off-by: Marc Zyngier <marc.zyngier@arm.com>
Signed-off-by: Will Deacon <will.deacon@arm.com>
The HPET resume path abuses irq_domain_[de]activate_irq() to restore the
MSI message in the HPET chip for the boot CPU on resume and it relies on an
implementation detail of the interrupt core code, which magically makes the
HPET unmask call invoked via a irq_disable/enable pair. This worked as long
as the irq code did unconditionally invoke the unmask() callback. With the
recent changes which keep track of the masked state to avoid expensive
hardware access, this does not longer work. As a consequence the HPET timer
interrupts are not unmasked which breaks resume as the boot CPU waits
forever that a timer interrupt arrives.
Make the restore of the MSI message explicit and invoke the unmask()
function directly. While at it get rid of the pointless affinity setting as
nothing can change the affinity of the interrupt and the vector across
suspend/resume. The restore of the MSI message reestablishes the previous
affinity setting which is the correct one.
Fixes: bf22ff45be ("genirq: Avoid unnecessary low level irq function calls")
Reported-and-tested-by: Tomi Sarvela <tomi.p.sarvela@intel.com>
Reported-by: Martin Peres <martin.peres@linux.intel.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Acked-by: "Rafael J. Wysocki" <rafael.j.wysocki@intel.com>
Cc: jeffy.chen@rock-chips.com
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Marc Zyngier <marc.zyngier@arm.com>
Link: http://lkml.kernel.org/r/alpine.DEB.2.20.1707312158590.2287@nanos
In the function gpio_set_wake_irq(), port->irq_high is only checked for
zero. As platform_get_irq() returns a value less then zero if no interrupt
was found, any gpio >= 16 was handled like an irq_high interrupt was
available. On iMX27 for example no high interrupt is available. This lead
to the problem that only some gpios (the lower 16) were useable as wake
sources.
Signed-off-by: Philipp Rosenberger <p.rosenberger@linutronix.de>
Signed-off-by: Linus Walleij <linus.walleij@linaro.org>
This driver calls irq_domain_hierarchy() and irq_chip_*_parent().
They are available only when CONFIG_IRQ_DOMAIN_HIERARCHY is enabled.
Signed-off-by: Masahiro Yamada <yamada.masahiro@socionext.com>
Signed-off-by: Linus Walleij <linus.walleij@linaro.org>
Drivers that need IRQ_DOMAIN_HIERARCHY should "select" it, but
drivers/pinctrl/stm32/Kconfig is the only exception that uses
"depends on" syntax. This prevents GPIO drivers from select'ing
IRQ_DOMAIN_HIERARCHY.
For example, if I add "select IRQ_DOMAIN_HIERARCHY" to GPIO_XGENE_SB,
I get the following recursive dependency error.
drivers/gpio/Kconfig:13:error: recursive dependency detected!
For a resolution refer to Documentation/kbuild/kconfig-language.txt
subsection "Kconfig recursive dependency limitations"
drivers/gpio/Kconfig:13: symbol GPIOLIB is selected by PINCTRL_STM32
For a resolution refer to Documentation/kbuild/kconfig-language.txt
subsection "Kconfig recursive dependency limitations"
drivers/pinctrl/stm32/Kconfig:3: symbol PINCTRL_STM32 is selected by PINCTRL_STM32F429
For a resolution refer to Documentation/kbuild/kconfig-language.txt
subsection "Kconfig recursive dependency limitations"
drivers/pinctrl/stm32/Kconfig:11: symbol PINCTRL_STM32F429 depends on IRQ_DOMAIN_HIERARCHY
For a resolution refer to Documentation/kbuild/kconfig-language.txt
subsection "Kconfig recursive dependency limitations"
kernel/irq/Kconfig:67: symbol IRQ_DOMAIN_HIERARCHY is selected by GPIO_XGENE_SB
For a resolution refer to Documentation/kbuild/kconfig-language.txt
subsection "Kconfig recursive dependency limitations"
drivers/gpio/Kconfig:502: symbol GPIO_XGENE_SB depends on GPIOLIB
Signed-off-by: Masahiro Yamada <yamada.masahiro@socionext.com>
Tested-by: Alexandre TORGUE <alexandre.torgue@st.com>
Signed-off-by: Linus Walleij <linus.walleij@linaro.org>
Pull networking fixes from David Miller:
1) Handle notifier registry failures properly in tun/tap driver, from
Tonghao Zhang.
2) Fix bpf verifier handling of subtraction bounds and add a testcase
for this, from Edward Cree.
3) Increase reset timeout in ftgmac100 driver, from Ben Herrenschmidt.
4) Fix use after free in prd_retire_rx_blk_timer_exired() in AF_PACKET,
from Cong Wang.
5) Fix SElinux regression due to recent UDP optimizations, from Paolo
Abeni.
6) We accidently increment IPSTATS_MIB_FRAGFAILS in the ipv6 code
paths, fix from Stefano Brivio.
7) Fix some mem leaks in dccp, from Xin Long.
8) Adjust MDIO_BUS kconfig deps to avoid build errors, from Arnd
Bergmann.
9) Mac address length check and buffer size fixes from Cong Wang.
10) Don't leak sockets in ipv6 udp early demux, from Paolo Abeni.
11) Fix return value when copy_from_user() fails in
bpf_prog_get_info_by_fd(), from Daniel Borkmann.
12) Handle PHY_HALTED properly in phy library state machine, from
Florian Fainelli.
13) Fix OOPS in fib_sync_down_dev(), from Ido Schimmel.
14) Fix truesize calculation in virtio_net which led to performance
regressions, from Michael S Tsirkin.
* git://git.kernel.org/pub/scm/linux/kernel/git/davem/net: (76 commits)
samples/bpf: fix bpf tunnel cleanup
udp6: fix jumbogram reception
ppp: Fix a scheduling-while-atomic bug in del_chan
Revert "net: bcmgenet: Remove init parameter from bcmgenet_mii_config"
virtio_net: fix truesize for mergeable buffers
mv643xx_eth: fix of_irq_to_resource() error check
MAINTAINERS: Add more files to the PHY LIBRARY section
ipv4: fib: Fix NULL pointer deref during fib_sync_down_dev()
net: phy: Correctly process PHY_HALTED in phy_stop_machine()
sunhme: fix up GREG_STAT and GREG_IMASK register offsets
bpf: fix bpf_prog_get_info_by_fd to dump correct xlated_prog_len
tcp: avoid bogus gcc-7 array-bounds warning
net: tc35815: fix spelling mistake: "Intterrupt" -> "Interrupt"
bpf: don't indicate success when copy_from_user fails
udp6: fix socket leak on early demux
net: thunderx: Fix BGX transmit stall due to underflow
Revert "vhost: cache used event for better performance"
team: use a larger struct for mac address
net: check dev->addr_len for dev_set_mac_address()
phy: bcm-ns-usb3: fix MDIO_BUS dependency
...
test_tunnel_bpf.sh fails to remove the vxlan11 tunnel device, causing the
next geneve tunnelling test case fails. In addition, the geneve reserved bit
in tcbpf2_kern.c should be zero, according to the RFC.
Signed-off-by: William Tu <u9012063@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Since commit 67a51780ae ("ipv6: udp: leverage scratch area
helpers") udp6_recvmsg() read the skb len from the scratch area,
to avoid a cache miss.
But the UDP6 rx path support RFC 2675 UDPv6 jumbograms, and their
length exceeds the 16 bits available in the scratch area. As a side
effect the length returned by recvmsg() is:
<ingress datagram len> % (1<<16)
This commit addresses the issue allocating one more bit in the
IP6CB flags field and setting it for incoming jumbograms.
Such field is still in the first cacheline, so at recvmsg()
time we can check it and fallback to access skb->len if
required, without a measurable overhead.
Fixes: 67a51780ae ("ipv6: udp: leverage scratch area helpers")
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>