This is a follow-up to the y2038 syscall patches already merged in the tip
tree. As the final 32-bit RISC-V syscall ABI is still being decided on,
this is the last chance to make a few corrections to leave out interfaces
based on 32-bit time_t along with the old off_t and rlimit types.
The series achieves this in a few steps:
- A couple of bug fixes for minor regressions I introduced
in the original series
- A couple of older patches from Yury Norov that I had never
merged in the past, these fix up the openat/open_by_handle_at and
getrlimit/setrlimit syscalls to disallow the old versions of off_t
and rlimit.
- Hiding the deprecated system calls behind an #ifdef in
include/uapi/asm-generic/unistd.h
- Change arch/riscv to drop all these ABIs.
Originally, the plan was to also leave these out on C-Sky, but that now
has a glibc port that uses the older interfaces, so we need to leave
them in place.
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v2
iQIcBAABCAAGBQJcdEhGAAoJEGCrR//JCVInQuUQAN+mRFzRXAqhbpb63/vYGJei
nmDqB+SoxzaIKAIGAVIdMGUoFxBrY1oyS4m6/a9lzQ9G4aSkr0PruZnUID+vIo2h
rj+3FBlB/c9nvW+NG8iEtVadlRbTmoRILCWpvgIuLNd6fwvNzP3V4uu6a1QRIMx4
aUCWQfhzv18kW1EAPIroPA1gEL2HKbhDdEuN2V0SKnsKNiWkHQeswWQFAYpLgT36
eZ+L52lh+miEdtBxycxJ5lh3KsWO4dPImh+QHONZgeB9iS8v47K0R6ONKm4NMeQV
5KW55pepUq1uQUdEU9KRrh2krMih2IJbOQoN2lvb2ao5UG6erHbj0N55RQym5gSC
+TrvP3dnqfohh9hWdHDwME+5OTeOM+8SUMRnaZBJKuywzo7W1ceLpf+KZjwlk2s5
AgEX67fKrUbtBfTgVhzlYhJLWcgSD1yt64ed5SF15c5M3JZhkK8cd50dB9pM2/YB
o9VbijkYwb2KyCNUiV3nghgiiqcROvOIO7PK6z3XFFiRm/Gn2CgNZyZa7c4+Vgrr
PM/DmDvCdFqYnqBOlV2ilCLigKGN0JgwzMXnbQU77d71Yg7Bco8e/yqSucSilp2d
lEv44extu9FINWXIqvWEjRqdSq+sNgj21VSp6Zu/GaTgNCQKac2wsAZtnQgnslko
knKwwp525fjqnJEDd1aH
=/iFA
-----END PGP SIGNATURE-----
Merge tag 'y2038-syscall-abi' of git://git.kernel.org/pub/scm/linux/kernel/git/arnd/playground into timers/2038
Pull additional syscall ABI cleanup for y2038 from Arnd Bergmann:
This is a follow-up to the y2038 syscall patches already merged in the tip
tree. As the final 32-bit RISC-V syscall ABI is still being decided on,
this is the last chance to make a few corrections to leave out interfaces
based on 32-bit time_t along with the old off_t and rlimit types.
The series achieves this in a few steps:
- A couple of bug fixes for minor regressions I introduced
in the original series
- A couple of older patches from Yury Norov that I had never
merged in the past, these fix up the openat/open_by_handle_at and
getrlimit/setrlimit syscalls to disallow the old versions of off_t
and rlimit.
- Hiding the deprecated system calls behind an #ifdef in
include/uapi/asm-generic/unistd.h
- Change arch/riscv to drop all these ABIs.
Originally, the plan was to also leave these out on C-Sky, but that now
has a glibc port that uses the older interfaces, so we need to leave
them in place.
In
c7d606f560 ("x86/mce: Improve error message when kernel cannot recover")
a case was added for a machine check caused by a DATA access to poison
memory from the kernel. A case should have been added also for an
uncorrectable error during an instruction fetch in the kernel.
Add that extra case so the error message now reads:
mce: [Hardware Error]: Machine check: Instruction fetch error in kernel
Fixes: c7d606f560 ("x86/mce: Improve error message when kernel cannot recover")
Signed-off-by: Tony Luck <tony.luck@intel.com>
Signed-off-by: Borislav Petkov <bp@suse.de>
Cc: "H. Peter Anvin" <hpa@zytor.com>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Pu Wen <puwen@hygon.cn>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: x86-ml <x86@kernel.org>
Link: https://lkml.kernel.org/r/20190225205940.15226-1-tony.luck@intel.com
This was caught while staring at the whole {set,get}_fs() machinery.
It's last user, the 32-bit version of strnlen_user() went away with
5723aa993d ("x86: use the new generic strnlen_user() function")
so drop it.
No functional changes.
Signed-off-by: Borislav Petkov <bp@suse.de>
Acked-by: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Andy Lutomirski <luto@kernel.org>
Cc: Jann Horn <jannh@google.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: the arch/x86 maintainers <x86@kernel.org>
Cc: "Tobin C. Harding" <tobin@kernel.org>
Link: https://lkml.kernel.org/r/20190225191109.7671-1-bp@alien8.de
When calling __put_user(foo(), ptr), the __put_user() macro would call
foo() in between __uaccess_begin() and __uaccess_end(). If that code
were buggy, then those bugs would be run without SMAP protection.
Fortunately, there seem to be few instances of the problem in the
kernel. Nevertheless, __put_user() should be fixed to avoid doing this.
Therefore, evaluate __put_user()'s argument before setting AC.
This issue was noticed when an objtool hack by Peter Zijlstra complained
about genregs_get() and I compared the assembly output to the C source.
[ bp: Massage commit message and fixed up whitespace. ]
Fixes: 11f1a4b975 ("x86: reorganize SMAP handling in user space accesses")
Signed-off-by: Andy Lutomirski <luto@kernel.org>
Signed-off-by: Borislav Petkov <bp@suse.de>
Acked-by: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Brian Gerst <brgerst@gmail.com>
Cc: Josh Poimboeuf <jpoimboe@redhat.com>
Cc: Denys Vlasenko <dvlasenk@redhat.com>
Cc: stable@vger.kernel.org
Link: http://lkml.kernel.org/r/20190225125231.845656645@infradead.org
This reverts commit 9da3f2b740.
It was well-intentioned, but wrong. Overriding the exception tables for
instructions for random reasons is just wrong, and that is what the new
code did.
It caused problems for tracing, and it caused problems for strncpy_from_user(),
because the new checks made perfectly valid use cases break, rather than
catch things that did bad things.
Unchecked user space accesses are a problem, but that's not a reason to
add invalid checks that then people have to work around with silly flags
(in this case, that 'kernel_uaccess_faults_ok' flag, which is just an
odd way to say "this commit was wrong" and was sprinked into random
places to hide the wrongness).
The real fix to unchecked user space accesses is to get rid of the
special "let's not check __get_user() and __put_user() at all" logic.
Make __{get|put}_user() be just aliases to the regular {get|put}_user()
functions, and make it impossible to access user space without having
the proper checks in places.
The raison d'être of the special double-underscore versions used to be
that the range check was expensive, and if you did multiple user
accesses, you'd do the range check up front (like the signal frame
handling code, for example). But SMAP (on x86) and PAN (on ARM) have
made that optimization pointless, because the _real_ expense is the "set
CPU flag to allow user space access".
Do let's not break the valid cases to catch invalid cases that shouldn't
even exist.
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Kees Cook <keescook@chromium.org>
Cc: Tobin C. Harding <tobin@kernel.org>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Andy Lutomirski <luto@kernel.org>
Cc: Jann Horn <jannh@google.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Three conflicts, one of which, for marvell10g.c is non-trivial and
requires some follow-up from Heiner or someone else.
The issue is that Heiner converted the marvell10g driver over to
use the generic c45 code as much as possible.
However, in 'net' a bug fix appeared which makes sure that a new
local mask (MDIO_AN_10GBT_CTRL_ADV_NBT_MASK) with value 0x01e0
is cleared.
Signed-off-by: David S. Miller <davem@davemloft.net>
Currently, the address range calculation for file-based filters works as
long as the vma that maps the matching part of the object file starts
from offset zero into the file (vm_pgoff==0). Otherwise, the resulting
filter range would be off by vm_pgoff pages. Another related problem is
that in case of a partially matching vma, that is, a vma that matches
part of a filter region, the filter range size wouldn't be adjusted.
Fix the arithmetics around address filter range calculations, taking
into account vma offset, so that the entire calculation is done before
the filter configuration is passed to the PMU drivers instead of having
those drivers do the final bit of arithmetics.
Based on the patch by Adrian Hunter <adrian.hunter.intel.com>.
Reported-by: Adrian Hunter <adrian.hunter@intel.com>
Signed-off-by: Alexander Shishkin <alexander.shishkin@linux.intel.com>
Tested-by: Mathieu Poirier <mathieu.poirier@linaro.org>
Acked-by: Peter Zijlstra <peterz@infradead.org>
Cc: Jiri Olsa <jolsa@redhat.com>
Fixes: 375637bc52 ("perf/core: Introduce address range filtering")
Link: http://lkml.kernel.org/r/20190215115655.63469-3-alexander.shishkin@linux.intel.com
Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com>
Previously, commit 7dcd575520 ("x86/kvm/mmu: check if tdp/shadow
MMU reconfiguration is needed") offered some optimization to avoid
the unnecessary reconfiguration. Yet one scenario is broken - when
cpuid changes VM's maximum physical address width, reconfiguration
is needed to reset the reserved bits. Also, the TDP may need to
reset its shadow_root_level when this value is changed.
To fix this, a new field, maxphyaddr, is introduced in the extended
role structure to keep track of the configured guest physical address
width.
Signed-off-by: Yu Zhang <yu.c.zhang@linux.intel.com>
Cc: stable@vger.kernel.org
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Previously, 'commit 372fddf709 ("x86/mm: Introduce the 'no5lvl' kernel
parameter")' cleared X86_FEATURE_LA57 in boot_cpu_data, if Linux chooses
to not run in 5-level paging mode. Yet boot_cpu_data is queried by
do_cpuid_ent() as the host capability later when creating vcpus, and Qemu
will not be able to detect this feature and create VMs with LA57 feature.
As discussed earlier, VMs can still benefit from extended linear address
width, e.g. to enhance features like ASLR. So we would like to fix this,
by return the true hardware capability when Qemu queries.
Signed-off-by: Yu Zhang <yu.c.zhang@linux.intel.com>
Cc: stable@vger.kernel.org
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Commit 14c07ad89f ("x86/kvm/mmu: introduce guest_mmu") brought one subtle
change: previously, when switching back from L2 to L1, we were resetting
MMU hooks (like mmu->get_cr3()) in kvm_init_mmu() called from
nested_vmx_load_cr3() and now we do that in nested_ept_uninit_mmu_context()
when we re-target vcpu->arch.mmu pointer.
The change itself looks logical: if nested_ept_init_mmu_context() changes
something than nested_ept_uninit_mmu_context() restores it back. There is,
however, one thing: the following call chain:
nested_vmx_load_cr3()
kvm_mmu_new_cr3()
__kvm_mmu_new_cr3()
fast_cr3_switch()
cached_root_available()
now happens with MMU hooks pointing to the new MMU (root MMU in our case)
while previously it was happening with the old one. cached_root_available()
tries to stash current root but it is incorrect to read current CR3 with
mmu->get_cr3(), we need to use old_mmu->get_cr3() which in case we're
switching from L2 to L1 is guest_mmu. (BTW, in shadow page tables case this
is a non-issue because we don't switch MMU).
While we could've tried to guess that we're switching between MMUs and call
the right ->get_cr3() from cached_root_available() this seems to be overly
complicated. Instead, just stash the corresponding CR3 when setting
root_hpa and make cached_root_available() use the stashed value.
Fixes: 14c07ad89f ("x86/kvm/mmu: introduce guest_mmu")
Signed-off-by: Vitaly Kuznetsov <vkuznets@redhat.com>
Cc: stable@vger.kernel.org
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
The invariant TSC bit has the following meaning:
"The time stamp counter in newer processors may support an enhancement,
referred to as invariant TSC. Processor's support for invariant TSC
is indicated by CPUID.80000007H:EDX[8]. The invariant TSC will run
at a constant rate in all ACPI P-, C-. and T-states. This is the
architectural behavior moving forward. On processors with invariant TSC
support, the OS may use the TSC for wall clock timer services (instead
of ACPI or HPET timers). TSC reads are much more efficient and do not
incur the overhead associated with a ring transition or access to a
platform resource."
IOW, TSC does not change frequency. In such case, and with
TSC scaling hardware available to handle migration, it is possible
to use the TSC clocksource directly, whose system calls are
faster.
Reduce the rating of kvmclock clocksource to allow TSC clocksource
to be the default if invariant TSC is exposed.
Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>
v2: Use feature bits and tsc_unstable() check (Sean Christopherson)
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
...via a new helper, __kvm_mmu_zap_all(). An alternative to passing a
'bool mmio_only' would be to pass a callback function to filter the
shadow page, i.e. to make __kvm_mmu_zap_all() generic and reusable, but
zapping all shadow pages is a last resort, i.e. making the helper less
extensible is a feature of sorts. And the explicit MMIO parameter makes
it easy to preserve the WARN_ON_ONCE() if a restart is triggered when
zapping MMIO sptes.
Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Paolo expressed a concern that kvm_mmu_zap_mmio_sptes() could have a
quadratic runtime[1], i.e. restarting the spte walk while zapping only
MMIO sptes could result in re-walking large portions of the list over
and over due to the non-MMIO sptes encountered before the restart not
being removed.
At the time, the concern was legitimate as the walk was restarted when
any spte was zapped. But that is no longer the case as the walk is now
restarted iff one or more children have been zapped, which is necessary
because zapping children makes the active_mmu_pages list unstable.
Furthermore, it should be impossible for an MMIO spte to have children,
i.e. zapping an MMIO spte should never result in zapping children. In
other words, kvm_mmu_zap_mmio_sptes() should never restart its walk, and
so should always execute in linear time. WARN if this assertion fails.
Although it should never be needed, leave the restart logic in place.
In normal operation, the cost is at worst an extra CMP+Jcc, and if for
some reason the list does become unstable, not restarting would likely
crash KVM, or worse, the kernel.
[1] https://patchwork.kernel.org/patch/10756589/#22452085
Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
The return value of kvm_mmu_prepare_zap_page() has evolved to become
overloaded to convey two separate pieces of information. 1) was at
least one page zapped and 2) has the list of MMU pages become unstable.
In it's original incarnation (as kvm_mmu_zap_page()), there was no
return value at all. Commit 0738541396 ("KVM: MMU: awareness of new
kvm_mmu_zap_page behaviour") added a return value in preparation for
commit 4731d4c7a0 ("KVM: MMU: out of sync shadow core"). Although
the return value was of type 'int', it was actually used as a boolean
to indicate whether or not active_mmu_pages may have become unstable due
to zapping children. Walking a list with list_for_each_entry_safe()
only protects against deleting/moving the current entry, i.e. zapping a
child page would break iteration due to modifying any number of entries.
Later, commit 60c8aec6e2 ("KVM: MMU: use page array in unsync walk")
modified mmu_zap_unsync_children() to return an approximation of the
number of children zapped. This was not intentional, it was simply a
side effect of how the code was written.
The unintented side affect was then morphed into an actual feature by
commit 77662e0028 ("KVM: MMU: fix kvm_mmu_zap_page() and its calling
path"), which modified kvm_mmu_change_mmu_pages() to use the number of
zapped pages when determining the number of MMU pages in use by the VM.
Finally, commit 54a4f0239f ("KVM: MMU: make kvm_mmu_zap_page() return
the number of pages it actually freed") added the initial page to the
return value to make its behavior more consistent with what most users
would expect. Incorporating the initial parent page in the return value
of kvm_mmu_zap_page() breaks the original usage of restarting a list
walk on a non-zero return value to handle a potentially unstable list,
i.e. walks will unnecessarily restart when any page is zapped.
Fix this by restoring the original behavior of kvm_mmu_zap_page(), i.e.
return a boolean to indicate that the list may be unstable and move the
number of zapped children to a dedicated parameter. Since the majority
of callers to kvm_mmu_prepare_zap_page() don't care about either return
value, preserve the current definition of kvm_mmu_prepare_zap_page() by
making it a wrapper of a new helper, __kvm_mmu_prepare_zap_page(). This
avoids having to update every call site and also provides cleaner code
for functions that only care about the number of pages zapped.
Fixes: 54a4f0239f ("KVM: MMU: make kvm_mmu_zap_page() return
the number of pages it actually freed")
Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Remove x86 KVM's fast invalidate mechanism, i.e. revert all patches
from the original series[1], now that all users of the fast invalidate
mechanism are gone.
This reverts commit 5304b8d37c.
[1] https://lkml.kernel.org/r/1369960590-14138-1-git-send-email-xiaoguangrong@linux.vnet.ibm.com
Cc: Xiao Guangrong <guangrong.xiao@gmail.com>
Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Call cond_resched_lock() when zapping all sptes to reschedule if needed
or to release and reacquire mmu_lock in case of contention. There is no
need to flush or zap when temporarily dropping mmu_lock as zapping all
sptes is done only when the owning userspace VMM has exited or when the
VM is being destroyed, i.e. there is no interplay with memslots or MMIO
generations to worry about.
Be paranoid and restart the walk if mmu_lock is dropped to avoid any
potential issues with consuming a stale iterator. The overhead in doing
so is negligible as at worst there will be a few root shadow pages at
the head of the list, i.e. the iterator is essentially the head of the
list already.
Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
...to guarantee forward progress. When zapped, root pages are marked
invalid and moved to the head of the active pages list until they are
explicitly freed. Theoretically, having unzappable root pages at the
head of the list could prevent kvm_mmu_zap_all() from making forward
progress were a future patch to add a loop restart after processing a
page, e.g. to drop mmu_lock on contention.
Although kvm_mmu_prepare_zap_page() can theoretically take action on
invalid pages, e.g. to zap unsync children, functionally it's not
necessary (root pages will be re-zapped when freed) and practically
speaking the odds of e.g. @unsync or @unsync_children becoming %true
while zapping all pages is basically nil.
Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Revert to a slow kvm_mmu_zap_all() for kvm_arch_flush_shadow_all().
Flushing all shadow entries is only done during VM teardown, i.e.
kvm_arch_flush_shadow_all() is only called when the associated MM struct
is being released or when the VM instance is being freed.
Although the performance of teardown itself isn't critical, KVM should
still voluntarily schedule to play nice with the rest of the kernel;
but that can be done without the fast invalidate mechanism in a future
patch.
This reverts commit 6ca18b6950.
Cc: Xiao Guangrong <guangrong.xiao@gmail.com>
Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
...as part of removing x86 KVM's fast invalidate mechanism, i.e. this
is one part of a revert all patches from the series that introduced the
mechanism[1].
This reverts commit 2248b02321.
[1] https://lkml.kernel.org/r/1369960590-14138-1-git-send-email-xiaoguangrong@linux.vnet.ibm.com
Cc: Xiao Guangrong <guangrong.xiao@gmail.com>
Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
...as part of removing x86 KVM's fast invalidate mechanism, i.e. this
is one part of a revert all patches from the series that introduced the
mechanism[1].
This reverts commit 35006126f0.
[1] https://lkml.kernel.org/r/1369960590-14138-1-git-send-email-xiaoguangrong@linux.vnet.ibm.com
Cc: Xiao Guangrong <guangrong.xiao@gmail.com>
Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Unwinding optimizations related to obsolete pages is a step towards
removing x86 KVM's fast invalidate mechanism, i.e. this is one part of
a revert all patches from the series that introduced the mechanism[1].
This reverts commit e7d11c7a89.
[1] https://lkml.kernel.org/r/1369960590-14138-1-git-send-email-xiaoguangrong@linux.vnet.ibm.com
Cc: Xiao Guangrong <guangrong.xiao@gmail.com>
Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Unwinding optimizations related to obsolete pages is a step towards
removing x86 KVM's fast invalidate mechanism, i.e. this is one part of
a revert all patches from the series that introduced the mechanism[1].
This reverts commit f34d251d66.
[1] https://lkml.kernel.org/r/1369960590-14138-1-git-send-email-xiaoguangrong@linux.vnet.ibm.com
Cc: Xiao Guangrong <guangrong.xiao@gmail.com>
Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Unwinding optimizations related to obsolete pages is a step towards
removing x86 KVM's fast invalidate mechanism, i.e. this is one part of
a revert all patches from the series that introduced the mechanism[1].
This reverts commit 365c886860.
[1] https://lkml.kernel.org/r/1369960590-14138-1-git-send-email-xiaoguangrong@linux.vnet.ibm.com
Cc: Xiao Guangrong <guangrong.xiao@gmail.com>
Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Unwinding usage of is_obsolete() is a step towards removing x86's fast
invalidate mechanism, i.e. this is one part of a revert all patches from
the series that introduced the mechanism[1].
This is a partial revert of commit 05988d728d ("KVM: MMU: reduce
KVM_REQ_MMU_RELOAD when root page is zapped").
[1] https://lkml.kernel.org/r/1369960590-14138-1-git-send-email-xiaoguangrong@linux.vnet.ibm.com
Cc: Xiao Guangrong <guangrong.xiao@gmail.com>
Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Call cond_resched_lock() when zapping MMIO to reschedule if needed or to
release and reacquire mmu_lock in case of contention. There is no need
to flush or zap when temporarily dropping mmu_lock as zapping MMIO sptes
is done when holding the memslots lock and with the "update in-progress"
bit set in the memslots generation, which disables MMIO spte caching.
The walk does need to be restarted if mmu_lock is dropped as the active
pages list may be modified.
Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Revert back to a dedicated (and slower) mechanism for handling the
scenario where all MMIO shadow PTEs need to be zapped due to overflowing
the MMIO generation number. The MMIO generation scenario is almost
literally a one-in-a-million occurrence, i.e. is not a performance
sensitive scenario.
Restoring kvm_mmu_zap_mmio_sptes() leaves VM teardown as the only user
of kvm_mmu_invalidate_zap_all_pages() and paves the way for removing
the fast invalidate mechanism altogether.
This reverts commit a8eca9dcc6.
Cc: Xiao Guangrong <guangrong.xiao@gmail.com>
Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Remove x86 KVM's fast invalidate mechanism, i.e. revert all patches
from the original series[1].
Though not explicitly stated, for all intents and purposes the fast
invalidate mechanism was added to speed up the scenario where removing
a memslot, e.g. as part of accessing reading PCI ROM, caused KVM to
flush all shadow entries[1]. Now that the memslot case flushes only
shadow entries belonging to the memslot, i.e. doesn't use the fast
invalidate mechanism, the only remaining usage of the mechanism are
when the VM is being destroyed and when the MMIO generation rolls
over.
When a VM is being destroyed, either there are no active vcpus, i.e.
there's no lock contention, or the VM has ungracefully terminated, in
which case we want to reclaim its pages as quickly as possible, i.e.
not release the MMU lock if there are still CPUs executing in the VM.
The MMIO generation scenario is almost literally a one-in-a-million
occurrence, i.e. is not a performance sensitive scenario.
Given that lock-breaking is not desirable (VM teardown) or irrelevant
(MMIO generation overflow), remove the fast invalidate mechanism to
simplify the code (a small amount) and to discourage future code from
zapping all pages as using such a big hammer should be a last restort.
This reverts commit f6f8adeef5.
[1] https://lkml.kernel.org/r/1369960590-14138-1-git-send-email-xiaoguangrong@linux.vnet.ibm.com
Cc: Xiao Guangrong <guangrong.xiao@gmail.com>
Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Modify kvm_mmu_invalidate_zap_pages_in_memslot(), a.k.a. the x86 MMU's
handler for kvm_arch_flush_shadow_memslot(), to zap only the pages/PTEs
that actually belong to the memslot being removed. This improves
performance, especially why the deleted memslot has only a few shadow
entries, or even no entries. E.g. a microbenchmark to access regular
memory while concurrently reading PCI ROM to trigger memslot deletion
showed a 5% improvement in throughput.
Cc: Xiao Guangrong <guangrong.xiao@gmail.com>
Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
...and into a separate helper, kvm_mmu_remote_flush_or_zap(), that does
not require a vcpu so that the code can be (re)used by
kvm_mmu_invalidate_zap_pages_in_memslot().
Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
...so that kvm_mmu_invalidate_zap_pages_in_memslot() can utilize the
helpers in future patches.
Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
...now that KVM won't explode by moving it out of bit 0. Using bit 63
eliminates the need to jump over bit 0, e.g. when calculating a new
memslots generation or when propagating the memslots generation to an
MMIO spte.
Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
The code to propagate the memslots generation number into MMIO sptes is
a bit convoluted. The "what" is relatively straightfoward, e.g. the
comment explaining which bits go where is quite readable, but the "how"
requires a lot of staring to understand what is happening. For example,
'MMIO_GEN_LOW_SHIFT' is actually used to calculate the high bits of the
spte, while 'MMIO_SPTE_GEN_LOW_SHIFT' is used to calculate the low bits.
Refactor the code to:
- use #defines whose values align with the bits defined in the comment
- use consistent code for both the high and low mask
- explicitly highlight the handling of bit 0 (update in-progress flag)
- explicitly call out that the defines are for MMIO sptes (to avoid
confusion with the per-vCPU MMIO cache, which uses the full memslots
generation)
In addition to making the code a little less magical, this paves the way
for moving the update in-progress flag to bit 63 without having to
simultaneously rewrite all of the MMIO spte code.
Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
KVM currently uses an 'unsigned int' for the MMIO generation number
despite it being derived from the 64-bit memslots generation and
being propagated to (potentially) 64-bit sptes. There is no hidden
agenda behind using an 'unsigned int', it's done simply because the
MMIO generation will never set bits above bit 19.
Passing a u64 will allow the "update in-progress" flag to be relocated
from bit 0 to bit 63 and removes the need to cast the generation back
to a u64 when propagating it to a spte.
Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
KVM uses bit 0 of the memslots generation as an "update in-progress"
flag, which is used by x86 to prevent caching MMIO access while the
memslots are changing. Although the intended behavior is flag-like,
e.g. MMIO sptes intentionally drop the in-progress bit so as to avoid
caching data from in-flux memslots, the implementation oftentimes treats
the bit as part of the generation number itself, e.g. incrementing the
generation increments twice, once to set the flag and once to clear it.
Prior to commit 4bd518f159 ("KVM: use separate generations for
each address space"), incorporating the "update in-progress" bit into
the generation number largely made sense, e.g. "real" generations are
even, "bogus" generations are odd, most code doesn't need to be aware of
the bit, etc...
Now that unique memslots generation numbers are assigned to each address
space, stealthing the in-progress status into the generation number
results in a wide variety of subtle code, e.g. kvm_create_vm() jumps
over bit 0 when initializing the memslots generation without any hint as
to why.
Explicitly define the flag and convert as much code as possible (which
isn't much) to actually treat it like a flag. This paves the way for
eventually using a different bit for "update in-progress" so that it can
be a flag in truth instead of a awkward extension to the generation
number.
Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
When installing new memslots, KVM sets bit 0 of the generation number to
indicate that an update is in-progress. Until the update is complete,
there are no guarantees as to whether a vCPU will see the old or the new
memslots. Explicity prevent caching MMIO accesses so as to avoid using
an access cached from the old memslots after the new memslots have been
installed.
Note that it is unclear whether or not disabling caching during the
update window is strictly necessary as there is no definitive
documentation as to what ordering guarantees KVM provides with respect
to updating memslots. That being said, the MMIO spte code does not
allow reusing sptes created while an update is in-progress, and the
associated documentation explicitly states:
We do not want to use an MMIO sptes created with an odd generation
number, ... If KVM is unlucky and creates an MMIO spte while the
low bit is 1, the next access to the spte will always be a cache miss.
At the very least, disabling the per-vCPU MMIO cache during updates will
make its behavior consistent with the MMIO spte behavior and
documentation.
Fixes: 56f17dd3fb ("kvm: x86: fix stale mmio cache bug")
Cc: <stable@vger.kernel.org>
Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
The check to detect a wrap of the MMIO generation explicitly looks for a
generation number of zero. Now that unique memslots generation numbers
are assigned to each address space, only address space 0 will get a
generation number of exactly zero when wrapping. E.g. when address
space 1 goes from 0x7fffe to 0x80002, the MMIO generation number will
wrap to 0x2. Adjust the MMIO generation to strip the address space
modifier prior to checking for a wrap.
Fixes: 4bd518f159 ("KVM: use separate generations for each address space")
Cc: <stable@vger.kernel.org>
Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
kvm_arch_memslots_updated() is at this point in time an x86-specific
hook for handling MMIO generation wraparound. x86 stashes 19 bits of
the memslots generation number in its MMIO sptes in order to avoid
full page fault walks for repeat faults on emulated MMIO addresses.
Because only 19 bits are used, wrapping the MMIO generation number is
possible, if unlikely. kvm_arch_memslots_updated() alerts x86 that
the generation has changed so that it can invalidate all MMIO sptes in
case the effective MMIO generation has wrapped so as to avoid using a
stale spte, e.g. a (very) old spte that was created with generation==0.
Given that the purpose of kvm_arch_memslots_updated() is to prevent
consuming stale entries, it needs to be called before the new generation
is propagated to memslots. Invalidating the MMIO sptes after updating
memslots means that there is a window where a vCPU could dereference
the new memslots generation, e.g. 0, and incorrectly reuse an old MMIO
spte that was created with (pre-wrap) generation==0.
Fixes: e59dbe09f8 ("KVM: Introduce kvm_arch_memslots_updated()")
Cc: <stable@vger.kernel.org>
Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
There are many KVM kernel memory allocations which are tied to the life of
the VM process and should be charged to the VM process's cgroup. If the
allocations aren't tied to the process, the OOM killer will not know
that killing the process will free the associated kernel memory.
Add __GFP_ACCOUNT flags to many of the allocations which are not yet being
charged to the VM process's cgroup.
Tested:
Ran all kvm-unit-tests on a 64 bit Haswell machine, the patch
introduced no new failures.
Ran a kernel memory accounting test which creates a VM to touch
memory and then checks that the kernel memory allocated for the
process is within certain bounds.
With this patch we account for much more of the vmalloc and slab memory
allocated for the VM.
Signed-off-by: Ben Gardon <bgardon@google.com>
Reviewed-by: Shakeel Butt <shakeelb@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
There are many KVM kernel memory allocations which are tied to the life of
the VM process and should be charged to the VM process's cgroup. If the
allocations aren't tied to the process, the OOM killer will not know
that killing the process will free the associated kernel memory.
Add __GFP_ACCOUNT flags to many of the allocations which are not yet being
charged to the VM process's cgroup.
Tested:
Ran all kvm-unit-tests on a 64 bit Haswell machine, the patch
introduced no new failures.
Ran a kernel memory accounting test which creates a VM to touch
memory and then checks that the kernel memory allocated for the
process is within certain bounds.
With this patch we account for much more of the vmalloc and slab memory
allocated for the VM.
Signed-off-by: Ben Gardon <bgardon@google.com>
Reviewed-by: Shakeel Butt <shakeelb@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
There are many KVM kernel memory allocations which are tied to the life of
the VM process and should be charged to the VM process's cgroup. If the
allocations aren't tied to the process, the OOM killer will not know
that killing the process will free the associated kernel memory.
Add __GFP_ACCOUNT flags to many of the allocations which are not yet being
charged to the VM process's cgroup.
Tested:
Ran all kvm-unit-tests on a 64 bit Haswell machine, the patch
introduced no new failures.
Ran a kernel memory accounting test which creates a VM to touch
memory and then checks that the kernel memory allocated for the
process is within certain bounds.
With this patch we account for much more of the vmalloc and slab memory
allocated for the VM.
There remain a few allocations which should be charged to the VM's
cgroup but are not. In x86, they include:
vcpu->arch.pio_data
There allocations are unaccounted in this patch because they are mapped
to userspace, and accounting them to a cgroup causes problems. This
should be addressed in a future patch.
Signed-off-by: Ben Gardon <bgardon@google.com>
Reviewed-by: Shakeel Butt <shakeelb@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
The preemption timer can be started even if there is a vmentry
failure during or after loading guest state. That is pointless,
move the call after all conditions have been checked.
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Previously, 'commit f99e3daf94 ("KVM: x86: Add Intel PT
virtualization work mode")' work mode' offered framework
to support Intel PT virtualization. However, the patch has
some typos in vmx_vmentry_ctrl() and vmx_vmexit_ctrl(), e.g.
used wrong flags and wrong variable, which will cause the
VM entry failure later.
Fixes: 'commit f99e3daf94 ("KVM: x86: Add Intel PT virtualization work mode")'
Signed-off-by: Yu Zhang <yu.c.zhang@linux.intel.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Ensure that the VCPU free path goes through vmx_leave_nested and
thus nested_vmx_vmexit, so that the cancellation of the timer does
not have to be in free_nested. In addition, because some paths through
nested_vmx_vmexit do not go through sync_vmcs12, the cancellation of
the timer is moved there.
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Some Posted-Interrupts from passthrough devices may be lost or
overwritten when the vCPU is in runnable state.
The SN (Suppress Notification) of PID (Posted Interrupt Descriptor) will
be set when the vCPU is preempted (vCPU in KVM_MP_STATE_RUNNABLE state
but not running on physical CPU). If a posted interrupt coming at this
time, the irq remmaping facility will set the bit of PIR (Posted
Interrupt Requests) without ON (Outstanding Notification).
So this interrupt can't be sync to APIC virtualization register and
will not be handled by Guest because ON is zero.
Signed-off-by: Luwei Kang <luwei.kang@intel.com>
[Eliminate the pi_clear_sn fast path. - Paolo]
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
MOVDIR64B moves 64-bytes as direct-store with 64-bytes write atomicity.
Direct store is implemented by using write combining (WC) for writing
data directly into memory without caching the data.
Availability of the MOVDIR64B instruction is indicated by the presence
of the CPUID feature flag MOVDIR64B (CPUID.0x07.0x0:ECX[bit 28]).
This patch exposes the movdir64b feature to the guest.
The release document ref below link:
https://software.intel.com/sites/default/files/managed/c5/15/\
architecture-instruction-set-extensions-programming-reference.pdf
Signed-off-by: Liu Jingqi <jingqi.liu@intel.com>
Cc: Xu Tao <tao3.xu@intel.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
MOVDIRI moves doubleword or quadword from register to memory through
direct store which is implemented by using write combining (WC) for
writing data directly into memory without caching the data.
Availability of the MOVDIRI instruction is indicated by the presence of
the CPUID feature flag MOVDIRI(CPUID.0x07.0x0:ECX[bit 27]).
This patch exposes the movdiri feature to the guest.
The release document ref below link:
https://software.intel.com/sites/default/files/managed/c5/15/\
architecture-instruction-set-extensions-programming-reference.pdf
Signed-off-by: Liu Jingqi <jingqi.liu@intel.com>
Cc: Xu Tao <tao3.xu@intel.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
AMD's SME/SEV is no longer the only case which reduces supported
physical address bits, since Intel introduced Multi-key Total Memory
Encryption (MKTME), which repurposes high bits of physical address as
keyID, thus effectively shrinks supported physical address bits. To
cover both cases (and potential similar future features), kernel MM
introduced generic dynamaic physical address mask instead of hard-coded
__PHYSICAL_MASK in 'commit 94d49eb30e ("x86/mm: Decouple dynamic
__PHYSICAL_MASK from AMD SME")'. KVM should use that too.
Change PT64_BASE_ADDR_MASK to use kernel dynamic physical address mask
when it is enabled, instead of sme_clr. PT64_DIR_BASE_ADDR_MASK is also
deleted since it is not used at all.
Signed-off-by: Kai Huang <kai.huang@linux.intel.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
VMX is only accessible in protected mode, remove a confusing check
that causes the conditional to lack a final "else" branch.
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Regarding segments with a limit==0xffffffff, the SDM officially states:
When the effective limit is FFFFFFFFH (4 GBytes), these accesses may
or may not cause the indicated exceptions. Behavior is
implementation-specific and may vary from one execution to another.
In practice, all CPUs that support VMX ignore limit checks for "flat
segments", i.e. an expand-up data or code segment with base=0 and
limit=0xffffffff. This is subtly different than wrapping the effective
address calculation based on the address size, as the flat segment
behavior also applies to accesses that would wrap the 4g boundary, e.g.
a 4-byte access starting at 0xffffffff will access linear addresses
0xffffffff, 0x0, 0x1 and 0x2.
Fixes: f9eb4af67c ("KVM: nVMX: VMX instructions: add checks for #GP/#SS exceptions")
Cc: stable@vger.kernel.org
Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
The address size of an instruction affects the effective address, not
the virtual/linear address. The final address may still be truncated,
e.g. to 32-bits outside of long mode, but that happens irrespective of
the address size, e.g. a 32-bit address size can yield a 64-bit virtual
address when using FS/GS with a non-zero base.
Fixes: 064aea7747 ("KVM: nVMX: Decoding memory operands of VMX instructions")
Cc: stable@vger.kernel.org
Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
The VMCS.EXIT_QUALIFCATION field reports the displacements of memory
operands for various instructions, including VMX instructions, as a
naturally sized unsigned value, but masks the value by the addr size,
e.g. given a ModRM encoded as -0x28(%ebp), the -0x28 displacement is
reported as 0xffffffd8 for a 32-bit address size. Despite some weird
wording regarding sign extension, the SDM explicitly states that bits
beyond the instructions address size are undefined:
In all cases, bits of this field beyond the instruction’s address
size are undefined.
Failure to sign extend the displacement results in KVM incorrectly
treating a negative displacement as a large positive displacement when
the address size of the VMX instruction is smaller than KVM's native
size, e.g. a 32-bit address size on a 64-bit KVM.
The very original decoding, added by commit 064aea7747 ("KVM: nVMX:
Decoding memory operands of VMX instructions"), sort of modeled sign
extension by truncating the final virtual/linear address for a 32-bit
address size. I.e. it messed up the effective address but made it work
by adjusting the final address.
When segmentation checks were added, the truncation logic was kept
as-is and no sign extension logic was introduced. In other words, it
kept calculating the wrong effective address while mostly generating
the correct virtual/linear address. As the effective address is what's
used in the segment limit checks, this results in KVM incorreclty
injecting #GP/#SS faults due to non-existent segment violations when
a nested VMM uses negative displacements with an address size smaller
than KVM's native address size.
Using the -0x28(%ebp) example, an EBP value of 0x1000 will result in
KVM using 0x100000fd8 as the effective address when checking for a
segment limit violation. This causes a 100% failure rate when running
a 32-bit KVM build as L1 on top of a 64-bit KVM L0.
Fixes: f9eb4af67c ("KVM: nVMX: VMX instructions: add checks for #GP/#SS exceptions")
Cc: stable@vger.kernel.org
Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
The function svm_refresh_apicv_exec_ctrl() always returning prematurely
as kvm_vcpu_apicv_active() always return false when calling from
the function arch/x86/kvm/x86.c:kvm_vcpu_deactivate_apicv().
This is because the apicv_active is set to false just before calling
refresh_apicv_exec_ctrl().
Also, we need to mark VMCB_AVIC bit as dirty instead of VMCB_INTR.
So, fix svm_refresh_apicv_exec_ctrl() to properly deactivate AVIC.
Fixes: 67034bb9dd ('KVM: SVM: Add irqchip_split() checks before enabling AVIC')
Cc: Radim Krčmář <rkrcmar@redhat.com>
Cc: Paolo Bonzini <pbonzini@redhat.com>
Signed-off-by: Suravee Suthikulpanit <suravee.suthikulpanit@amd.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Currently apicv_active can be true even if in-kernel LAPIC
emulation is disabled. Avoid this by properly initializing
it in kvm_arch_vcpu_init, and then do not do anything to
deactivate APICv when it is actually not used
(Currently APICv is only deactivated by SynIC code that in turn
is only reachable when in-kernel LAPIC is in use. However, it is
cleaner if kvm_vcpu_deactivate_apicv avoids relying on this.
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Current SVM AVIC driver makes two incorrect assumptions:
1. APIC LDR register cannot be zero
2. APIC DFR for all vCPUs must be the same
LDR=0 means the local APIC does not support logical destination mode.
Therefore, the driver should mark any previously assigned logical APIC ID
table entry as invalid, and return success. Also, DFR is specific to
a particular local APIC, and can be different among all vCPUs
(as observed on Windows 10).
These incorrect assumptions cause Windows 10 and FreeBSD VMs to fail
to boot with AVIC enabled. So, instead of flush the whole logical APIC ID
table, handle DFR and LDR for each vCPU independently.
Fixes: 18f40c53e1 ('svm: Add VMEXIT handlers for AVIC')
Cc: Radim Krčmář <rkrcmar@redhat.com>
Cc: Paolo Bonzini <pbonzini@redhat.com>
Reported-by: Julian Stecklina <jsteckli@amazon.de>
Signed-off-by: Suravee Suthikulpanit <suravee.suthikulpanit@amd.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
VMs may show incorrect uptime and dmesg printk offsets on hypervisors with
unstable clock. The problem is produced when VM is rebooted without exiting
from qemu.
The fix is to calculate clock offset not only for stable clock but for
unstable clock as well, and use kvm_sched_clock_read() which substracts
the offset for both clocks.
This is safe, because pvclock_clocksource_read() does the right thing and
makes sure that clock always goes forward, so once offset is calculated
with unstable clock, we won't get new reads that are smaller than offset,
and thus won't get negative results.
Thank you Jon DeVree for helping to reproduce this issue.
Fixes: 857baa87b6 ("sched/clock: Enable sched clock early")
Cc: stable@vger.kernel.org
Reported-by: Dominique Martinet <asmadeus@codewreck.org>
Signed-off-by: Pavel Tatashin <pasha.tatashin@soleen.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Move the clearing of the common registers (not 64-bit-only) to the start
of the flow that clears registers holding guest state. This is
purely a cosmetic change so that the label doesn't point at a blank line
and a #define.
No functional change intended.
Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
...now that the sub-routine follows standard calling conventions.
Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
...to make it callable from C code.
Note that because KVM chooses to be ultra paranoid about guest register
values, all callee-save registers are still cleared after VM-Exit even
though the host's values are now reloaded from the stack.
Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
...to prepare for making the assembly sub-routine callable from C code.
Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
...to prepare for making the sub-routine callable from C code.
Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
...to prepare for making the sub-routine callable from C code. That
means returning the result in RAX. Since RAX will be used to return the
result, use it as the scratch register as well to make the code readable
and to document that the scratch register is more or less arbitrary.
Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
...now that the name is no longer usurped by a defunct helper function.
Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
...now that the code is no longer tagged with STACK_FRAME_NON_STANDARD.
Arguably, providing __vmx_vcpu_run() to break up vmx_vcpu_run() is
valuable on its own, but the previous split was purposely made as small
as possible to limit the effects STACK_FRAME_NON_STANDARD. In other
words, the current split is now completely arbitrary and likely not the
most logical.
This also allows renaming ____vmx_vcpu_run() to __vmx_vcpu_run() in a
future patch.
Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
As evidenced by the myriad patches leading up to this moment, using
an inline asm blob for vCPU-run is nothing short of horrific. It's also
been called "unholy", "an abomination" and likely a whole host of other
names that would violate the Code of Conduct if recorded here and now.
The code is relocated nearly verbatim, e.g. quotes, newlines, tabs and
__stringify need to be dropped, but other than those cosmetic changes
the only functional changees are to add the "call" and replace the final
"jmp" with a "ret".
Note that STACK_FRAME_NON_STANDARD is also dropped from __vmx_vcpu_run().
Suggested-by: Andi Kleen <ak@linux.intel.com>
Suggested-by: Josh Poimboeuf <jpoimboe@redhat.com>
Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
...in preparation for moving to a proper assembly sub-routnine.
vCPU-run isn't a leaf function since it calls vmx_update_host_rsp()
and vmx_vmenter(). And since we need to save/restore RBP anyways,
unconditionally creating the frame costs a single MOV, i.e. don't
bother keying off CONFIG_FRAME_POINTER or using FRAME_BEGIN, etc...
Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
...to prepare for moving the inline asm to a proper asm sub-routine.
Eliminating the immediates allows a nearly verbatim move, e.g. quotes,
newlines, tabs and __stringify need to be dropped, but other than those
cosmetic changes the only function change will be to replace the final
"jmp" with a "ret".
Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Declaring the VCPU_REGS_* as enums allows for more robust C code, but it
prevents using the values in assembly files. Expliciting #define the
indices in an asm-friendly file to prepare for VMX moving its transition
code to a proper assembly file, but keep the enums for general usage.
Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
This API is primarily used through DT entries, but two architectures
and two drivers call it directly. So instead of selecting the config
symbol for random architectures pull it in implicitly for the actual
users. Also rename the Kconfig option to describe the feature better.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Acked-by: Paul Burton <paul.burton@mips.com> # MIPS
Acked-by: Lee Jones <lee.jones@linaro.org>
Reviewed-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
All new 32-bit architectures should have 64-bit userspace off_t type, but
existing architectures has 32-bit ones.
To enforce the rule, new config option is added to arch/Kconfig that defaults
ARCH_32BIT_OFF_T to be disabled for new 32-bit architectures. All existing
32-bit architectures enable it explicitly.
New option affects force_o_largefile() behaviour. Namely, if userspace
off_t is 64-bits long, we have no reason to reject user to open big files.
Note that even if architectures has only 64-bit off_t in the kernel
(arc, c6x, h8300, hexagon, nios2, openrisc, and unicore32),
a libc may use 32-bit off_t, and therefore want to limit the file size
to 4GB unless specified differently in the open flags.
Signed-off-by: Yury Norov <ynorov@caviumnetworks.com>
Acked-by: Arnd Bergmann <arnd@arndb.de>
Signed-off-by: Yury Norov <ynorov@marvell.com>
Signed-off-by: Arnd Bergmann <arnd@arndb.de>
Use linux/mman.h to make sure we get all mmap flags we need.
Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
Signed-off-by: Arnd Bergmann <arnd@arndb.de>
The 'phys_id' local variable became unused after commit
ce4b1b1650 ("x86/smpboot: Initialize secondary CPU only if master CPU will wait for it").
Remove it.
Signed-off-by: Shaokun Zhang <zhangshaokun@hisilicon.com>
Signed-off-by: Borislav Petkov <bp@suse.de>
Cc: Alison Schofield <alison.schofield@intel.com>
Cc: "H. Peter Anvin" <hpa@zytor.com>
Cc: Igor Mammedov <imammedo@redhat.com>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
Cc: Mike Rapoport <rppt@linux.vnet.ibm.com>
Cc: Pu Wen <puwen@hygon.cn>
Cc: Suravee Suthikulpanit <suravee.suthikulpanit@amd.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: x86-ml <x86@kernel.org>
Cc: Yazen Ghannam <yazen.ghannam@amd.com>
Cc: Zhenzhong Duan <zhenzhong.duan@oracle.com>
Link: https://lkml.kernel.org/r/1550495101-41755-1-git-send-email-zhangshaokun@hisilicon.com
When limiting memory size via kernel parameter "mem=" this should be
respected even in case of memory made accessible via a PCI card.
Today this kind of memory won't be made usable in initial memory
setup as the memory won't be visible in E820 map, but it might be
added when adding PCI devices due to corresponding ACPI table entries.
Not respecting "mem=" can be corrected by adding a global max_mem_size
variable set by parse_memopt() which will result in rejecting adding
memory areas resulting in a memory size above the allowed limit.
Signed-off-by: Juergen Gross <jgross@suse.com>
Acked-by: Ingo Molnar <mingo@kernel.org>
Reviewed-by: William Kucharski <william.kucharski@oracle.com>
Signed-off-by: Juergen Gross <jgross@suse.com>
Pull x86 fixes from Ingo Molnar:
"Three changes:
- An UV fix/quirk to pull UV BIOS calls into the efi_runtime_lock
locking regime. (This done by aliasing __efi_uv_runtime_lock to
efi_runtime_lock, which should make the quirk nature obvious and
maintain the general policy that the EFI lock (name...) isn't
exposed to drivers.)
- Our version of MAGA: Make a.out Great Again.
- Add a new Intel model name enumerator to an upstream header to help
reduce dependencies going forward"
* 'x86-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
x86/platform/UV: Use efi_runtime_lock to serialise BIOS calls
x86/CPU: Add Icelake model number
x86/a.out: Clear the dump structure initially
Pull perf fixes from Ingo Molnar:
"Two fixes on the kernel side: fix an over-eager condition that failed
larger perf ring-buffer sizes, plus fix crashes in the Intel BTS code
for a corner case, found by fuzzing"
* 'perf-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
perf/core: Fix impossible ring-buffer sizes warning
perf/x86: Add check_period PMU callback
The following commit:
a893ea15d764 ("tpm: move tpm_chip definition to include/linux/tpm.h")
introduced a build error when both IMA and EFI are enabled:
In file included from ../security/integrity/ima/ima_fs.c:30:
../security/integrity/ima/ima.h:176:7: error: redeclaration of enumerator "NONE"
What happens is that both headers (ima.h and efi.h) defines the same
'NONE' constant, and it broke when they started getting included from
the same file:
Rework to prefix the EFI enum with 'EFI_*'.
Signed-off-by: Anders Roxell <anders.roxell@linaro.org>
Signed-off-by: Ard Biesheuvel <ard.biesheuvel@linaro.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: linux-efi@vger.kernel.org
Link: http://lkml.kernel.org/r/20190215165551.12220-2-ard.biesheuvel@linaro.org
[ Cleaned up the changelog a bit. ]
Signed-off-by: Ingo Molnar <mingo@kernel.org>
The netfilter conflicts were rather simple overlapping
changes.
However, the cls_tcindex.c stuff was a bit more complex.
On the 'net' side, Cong is fixing several races and memory
leaks. Whilst on the 'net-next' side we have Vlad adding
the rtnl-ness support.
What I've decided to do, in order to resolve this, is revert the
conversion over to using a workqueue that Cong did, bringing us back
to pure RCU. I did it this way because I believe that either Cong's
races don't apply with have Vlad did things, or Cong will have to
implement the race fix slightly differently.
Signed-off-by: David S. Miller <davem@davemloft.net>
Previous AMD systems have had a bit in MCA_STATUS to indicate that an
error was detected on a scrub operation. However, this bit was defined
differently within different banks and families/models.
Starting with Family 17h, MCA_STATUS[40] is either Reserved/Read-as-Zero
or defined as "Scrub", for all MCA banks and CPU models. Therefore, this
bit can be defined as the "Scrub" bit.
Define MCA_STATUS[40] as "Scrub" and decode it in the AMD MCE decoding
module for Family 17h and newer systems.
Signed-off-by: Yazen Ghannam <yazen.ghannam@amd.com>
Signed-off-by: Borislav Petkov <bp@suse.de>
Cc: "H. Peter Anvin" <hpa@zytor.com>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: James Morse <james.morse@arm.com>
Cc: linux-edac <linux-edac@vger.kernel.org>
Cc: Mauro Carvalho Chehab <mchehab@kernel.org>
Cc: Pu Wen <puwen@hygon.cn>
Cc: Qiuxu Zhuo <qiuxu.zhuo@intel.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Tony Luck <tony.luck@intel.com>
Cc: Vishal Verma <vishal.l.verma@intel.com>
Cc: x86-ml <x86@kernel.org>
Link: https://lkml.kernel.org/r/20190212212417.107049-1-Yazen.Ghannam@amd.com
Commit ca83b4a7f2 ("x86/KVM/VMX: Add find_msr() helper function")
introduces the helper function find_msr(), which returns -ENOENT when
not find the msr in vmx->msr_autoload.guest/host. Correct checking contion
of no more available entry in vmx->msr_autoload.
Fixes: ca83b4a7f2 ("x86/KVM/VMX: Add find_msr() helper function")
Cc: stable@vger.kernel.org
Signed-off-by: Xiaoyao Li <xiaoyao.li@linux.intel.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Some Posted-Interrupts from passthrough devices may be lost or
overwritten when the vCPU is in runnable state.
The SN (Suppress Notification) of PID (Posted Interrupt Descriptor) will
be set when the vCPU is preempted (vCPU in KVM_MP_STATE_RUNNABLE state but
not running on physical CPU). If a posted interrupt comes at this time,
the irq remapping facility will set the bit of PIR (Posted Interrupt
Requests) but not ON (Outstanding Notification). Then, the interrupt
will not be seen by KVM, which always expects PID.ON=1 if PID.PIR=1
as documented in the Intel processor SDM but not in the VT-d specification.
To fix this, restore the invariant after PID.SN is cleared.
Signed-off-by: Luwei Kang <luwei.kang@intel.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Add the CPUID model number of Icelake (ICL) mobile processors to the
Intel family list. Icelake U/Y series uses model number 0x7E.
Signed-off-by: Rajneesh Bhardwaj <rajneesh.bhardwaj@linux.intel.com>
Signed-off-by: Borislav Petkov <bp@suse.de>
Cc: Andy Shevchenko <andriy.shevchenko@linux.intel.com>
Cc: Dave Hansen <dave.hansen@linux.intel.com>
Cc: "David E. Box" <david.e.box@intel.com>
Cc: dvhart@infradead.org
Cc: "H. Peter Anvin" <hpa@zytor.com>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Kan Liang <kan.liang@linux.intel.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: platform-driver-x86@vger.kernel.org
Cc: Qiuxu Zhuo <qiuxu.zhuo@intel.com>
Cc: Srinivas Pandruvada <srinivas.pandruvada@linux.intel.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: x86-ml <x86@kernel.org>
Link: https://lkml.kernel.org/r/20190214115712.19642-2-rajneesh.bhardwaj@linux.intel.com
A recently added preemption timer consistency check was unintentionally
dropped when the consistency checks were being reorganized to match the
SDM's ordering.
Fixes: 461b4ba4c7 ("KVM: nVMX: Move the checks for VM-Execution Control Fields to a separate helper function")
Cc: Krish Sadhukhan <krish.sadhukhan@oracle.com>
Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Guenter Roeck reported triple faults of a 64-bit VM using a 32-bit OVMF
EFI image. After some singlestepping of the image in gdb, it turned out
that some of the EFI config tables were at bogus addresses. Which, as
Ard pointed out, results from using the wrong efi_config_table typedef.
So switch all EFI table pointers to unsigned longs and convert them to
the proper typedef only when accessing them. This way, the proper table
type is being used.
Shorten variable names, while at it.
Fixes: 33f0df8d84 ("x86/boot: Search for RSDP in the EFI tables")
Reported-by: Guenter Roeck <linux@roeck-us.net>
Signed-off-by: Borislav Petkov <bp@suse.de>
Tested-by: Chao Fan <fanc.fnst@cn.fujitsu.com>
Cc: Ard Biesheuvel <ard.biesheuvel@linaro.org>
Cc: bhe@redhat.com
Cc: caoj.fnst@cn.fujitsu.com
Cc: "H. Peter Anvin" <hpa@zytor.com>
Cc: indou.takao@jp.fujitsu.com
Cc: Ingo Molnar <mingo@redhat.com>
Cc: kasong@redhat.com
Cc: Kees Cook <keescook@chromium.org>
Cc: msys.mizuma@gmail.com
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: x86-ml <x86@kernel.org>
Link: https://lkml.kernel.org/r/20190208190248.GA10854@roeck-us.net
dump_thread32() in aout_core_dump() does not clear the user32 structure
allocated on the stack as the first thing on function entry.
As a result, the dump.u_comm, dump.u_ar0 and dump.signal which get
assigned before the clearing, get overwritten.
Rename that function to fill_dump() to make it clear what it does and
call it first thing.
This was caught while staring at a patch by Derek Robson
<robsonde@gmail.com>.
Signed-off-by: Borislav Petkov <bp@suse.de>
Cc: Derek Robson <robsonde@gmail.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Michael Matz <matz@suse.de>
Cc: x86@kernel.org
Cc: <stable@vger.kernel.org>
Link: https://lkml.kernel.org/r/20190202005512.3144-1-robsonde@gmail.com
Prohibit probing on the functions called before kprobe_int3_handler()
in do_int3(). More specifically, ftrace_int3_handler(),
poke_int3_handler(), and ist_enter(). And since rcu_nmi_enter() is
called by ist_enter(), it also should be marked as NOKPROBE_SYMBOL.
Since those are handled before kprobe_int3_handler(), probing those
functions can cause a breakpoint recursion and crash the kernel.
Signed-off-by: Masami Hiramatsu <mhiramat@kernel.org>
Cc: Alexander Shishkin <alexander.shishkin@linux.intel.com>
Cc: Andrea Righi <righi.andrea@gmail.com>
Cc: Arnaldo Carvalho de Melo <acme@redhat.com>
Cc: Jiri Olsa <jolsa@redhat.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Link: http://lkml.kernel.org/r/154998793571.31052.11301258949601150994.stgit@devbox
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Move optprobe trampoline code into RODATA since it is
not executed, but copied and modified to be used on
a trampoline buffer.
Signed-off-by: Masami Hiramatsu <mhiramat@kernel.org>
Cc: Alexander Shishkin <alexander.shishkin@linux.intel.com>
Cc: Andrea Righi <righi.andrea@gmail.com>
Cc: Arnaldo Carvalho de Melo <acme@redhat.com>
Cc: Jiri Olsa <jolsa@redhat.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Link: http://lkml.kernel.org/r/154998790744.31052.3016106262944915510.stgit@devbox
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Prohibit probing on optprobe template code, since it is not
a code but a template instruction sequence. If we modify
this template, copied template must be broken.
Signed-off-by: Masami Hiramatsu <mhiramat@kernel.org>
Cc: Alexander Shishkin <alexander.shishkin@linux.intel.com>
Cc: Andrea Righi <righi.andrea@gmail.com>
Cc: Arnaldo Carvalho de Melo <acme@redhat.com>
Cc: Jiri Olsa <jolsa@redhat.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: stable@vger.kernel.org
Fixes: 9326638cbe ("kprobes, x86: Use NOKPROBE_SYMBOL() instead of __kprobes annotation")
Link: http://lkml.kernel.org/r/154998787911.31052.15274376330136234452.stgit@devbox
Signed-off-by: Ingo Molnar <mingo@kernel.org>
SDM says MSR_IA32_VMX_PROCBASED_CTLS2 is only available "If
(CPUID.01H:ECX.[5] && IA32_VMX_PROCBASED_CTLS[63])". It was found that
some old cpus (namely "Intel(R) Core(TM)2 CPU 6600 @ 2.40GHz (family: 0x6,
model: 0xf, stepping: 0x6") don't have it. Add the missing check.
Reported-by: Zdenek Kaspar <zkaspar82@gmail.com>
Tested-by: Zdenek Kaspar <zkaspar82@gmail.com>
Signed-off-by: Vitaly Kuznetsov <vkuznets@redhat.com>
Reviewed-by: Jim Mattson <jmattson@google.com>
Cc: stable@vger.kernel.org
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
...now that all other references to struct vcpu_vmx have been removed.
Note that 'vmx' still needs to be passed into the asm blob in _ASM_ARG1
as it is consumed by vmx_update_host_rsp(). And similar to that code,
use _ASM_ARG2 in the assembly code to prepare for moving to proper asm,
while explicitly referencing the exact registers in the clobber list for
clarity in the short term and to avoid additional precompiler games.
Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
A failed VM-Enter (obviously) didn't succeed, meaning the CPU never
executed an instrunction in guest mode and so can't have changed the
general purpose registers.
In addition to saving some instructions in the VM-Fail case, this also
provides a separate path entirely and thus an opportunity to propagate
the fail condition to vmx->fail via register without introducing undue
pain. Using a register, as opposed to directly referencing vmx->fail,
eliminates the need to pass the offset of 'fail', which will simplify
moving the code to proper assembly in future patches.
Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Switching the ordering allows for an out-of-line path for VM-Fail
that elides saving guest state but still shares the register clearing
with the VM-Exit path.
Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
...and remove struct vcpu_vmx's temporary __launched variable.
Eliminating __launched is a bonus, the real motivation is to get to the
point where the only reference to struct vcpu_vmx in the asm code is
to vcpu.arch.regs, which will simplify moving the blob to a proper asm
file. Note that also means this approach is deliberately different than
what is used in nested_vmx_check_vmentry_hw().
Use BL as it is a callee-save register in both 32-bit and 64-bit ABIs,
i.e. it can't be modified by vmx_update_host_rsp(), to avoid having to
temporarily save/restore the launched flag.
Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com>
Reviewed-by: Jim Mattson <jmattson@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Providing a helper function to update HOST_RSP is visibly easier to
read, and more importantly (for the future) eliminates two arguments to
the VM-Enter assembly blob. Reducing the number of arguments to the asm
blob is for all intents and purposes a prerequisite to moving the code
to a proper assembly routine. It's not truly mandatory, but it greatly
simplifies the future code, and the cost of the extra CALL+RET is
negligible in the grand scheme.
Note that although _ASM_ARG[1-3] can be used in the inline asm itself,
the intput/output constraints need to be manually defined. gcc will
actually compile with _ASM_ARG[1-3] specified as constraints, but what
it actually ends up doing with the bogus constraint is unknown.
Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
...to eliminate its parameter and struct vcpu_vmx offset definition
from the assembly blob. Accessing CR2 from C versus assembly doesn't
change the likelihood of taking a page fault (and modifying CR2) while
it's loaded with the guest's value, so long as we don't do anything
silly between accessing CR2 and VM-Enter/VM-Exit.
Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com>
Reviewed-by: Jim Mattson <jmattson@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Currently, host_rsp is cached on a per-vCPU basis, i.e. it's stored in
struct vcpu_vmx. In non-nested usage the caching is for all intents
and purposes 100% effective, e.g. only the first VMLAUNCH needs to
synchronize VMCS.HOST_RSP since the call stack to vmx_vcpu_run() is
identical each and every time. But when running a nested guest, KVM
must invalidate the cache when switching the current VMCS as it can't
guarantee the new VMCS has the same HOST_RSP as the previous VMCS. In
other words, the cache loses almost all of its efficacy when running a
nested VM.
Move host_rsp to struct vmcs_host_state, which is per-VMCS, so that it
is cached on a per-VMCS basis and restores its 100% hit rate when
nested VMs are in play.
Note that the host_rsp cache for vmcs02 essentially "breaks" when
nested early checks are enabled as nested_vmx_check_vmentry_hw() will
see a different RSP at the time of its VM-Enter. While it's possible
to avoid even that VMCS.HOST_RSP synchronization, e.g. by employing a
dedicated VM-Exit stack, there is little motivation for doing so as
the overhead of two VMWRITEs (~55 cycles) is dwarfed by the overhead
of the extra VMX transition (600+ cycles) and is a proverbial drop in
the ocean relative to the total cost of a nested transtion (10s of
thousands of cycles).
Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com>
Reviewed-by: Jim Mattson <jmattson@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
...and provide an explicit name for the constraint. Naming the input
constraint makes the code self-documenting and also avoids the fragility
of numerically referring to constraints, e.g. %4 breaks badly whenever
the constraints are modified.
Explicitly using RDX was inherited from vCPU-run, i.e. completely
arbitrary. Even vCPU-run doesn't truly need to explicitly use RDX, but
doing so is more robust as vCPU-run needs tight control over its
register usage.
Note that while the naming "conflict" between host_rsp and HOST_RSP
is slightly confusing, the former will be renamed slightly in a
future patch, at which point HOST_RSP is absolutely what is desired.
Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com>
Reviewed-by: Jim Mattson <jmattson@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Temporarily propagating vmx->loaded_vmcs->launched to vmx->__launched
is not functionally necessary, but rather was done historically to
avoid passing both 'vmx' and 'loaded_vmcs' to the vCPU-run asm blob.
Nested early checks inherited this behavior by virtue of copy+paste.
A future patch will move HOST_RSP caching to be per-VMCS, i.e. store
'host_rsp' in loaded VMCS. Now that the reference to 'vmx->fail' is
also gone from nested early checks, referencing 'loaded_vmcs' directly
means we can drop the 'vmx' reference when introducing per-VMCS RSP
caching. And it means __launched can be dropped from struct vcpu_vmx
if/when vCPU-run receives similar treatment.
Note the use of a named register constraint for 'loaded_vmcs'. Using
RCX to hold 'vmx' was inherited from vCPU-run. In the vCPU-run case,
the scratch register needs to be explicitly defined as it is crushed
when loading guest state, i.e. deferring to the compiler would corrupt
the pointer. Since nested early checks never loads guests state, it's
a-ok to let the compiler pick any register. Naming the constraint
avoids the fragility of referencing constraints via %1, %2, etc.., which
breaks horribly when modifying constraints, and generally makes the asm
blob more readable.
Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com>
Reviewed-by: Jim Mattson <jmattson@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
...to take advantage of __GCC_ASM_FLAG_OUTPUTS__ when possible.
Reviewed-by: Jim Mattson <jmattson@google.com>
Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Unlike the primary vCPU-run flow, the nested early checks code doesn't
actually want to propagate VM-Fail back to 'vmx'. Yay copy+paste.
In additional to eliminating the need to clear vmx->fail before
returning, using a local boolean also drops a reference to 'vmx' in the
asm blob. Dropping the reference to 'vmx' will save a register in the
long run as future patches will shift all pointer references from 'vmx'
to 'vmx->loaded_vmcs'.
Fixes: 52017608da ("KVM: nVMX: add option to perform early consistency checks via H/W")
Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com>
Reviewed-by: Jim Mattson <jmattson@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Using %1 to reference RCX, i.e. the 'vmx' pointer', is obtuse and
fragile, e.g. it results in cryptic and infurating compile errors if the
output constraints are touched by anything more than a gentle breeze.
Reviewed-by: Jim Mattson <jmattson@google.com>
Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
...as it doesn't technically actually do anything non-standard with the
stack even though it modifies RSP in a weird way. E.g. RSP is loaded
with VMCS.HOST_RSP if the VM-Enter gets far enough to trigger VM-Exit,
but it's simply reloaded with the current value.
Reviewed-by: Jim Mattson <jmattson@google.com>
Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
RAX is not touched by nested_vmx_check_vmentry_hw(), directly or
indirectly (e.g. vmx_vmenter()). Remove it from the clobber list.
Fixes: 52017608da ("KVM: nVMX: add option to perform early consistency checks via H/W")
Reviewed-by: Jim Mattson <jmattson@google.com>
Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Per commit c20363006a ("KVM: VMX: Let gcc to choose which registers
to save (x86_64)"), the only reason RDX is saved/loaded to/from the
stack is because it was specified as an input, i.e. couldn't be marked
as clobbered (ignoring the fact that "saving" it to a dummy output
would indirectly mark it as clobbered).
Now that RDX is no longer an input, clobber it.
Reviewed-by: Jim Mattson <jmattson@google.com>
Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Load RDX with the VMCS.HOST_RSP field encoding on-demand instead of
delegating to the compiler via an input constraint. In addition to
saving one whole MOV instruction, this allows RDX to be properly
clobbered (in a future patch) instead of being saved/loaded to/from
the stack.
Despite nested_vmx_check_vmentry_hw() having similar code, leave it
alone, for now. In that case, RDX is unconditionally used and isn't
clobbered, i.e. sending in HOST_RSP as an input is simpler.
Note that because HOST_RSP is an enum and not a define, it must be
redefined as an immediate instead of using __stringify(HOST_RSP). The
naming "conflict" between host_rsp and HOST_RSP is slightly confusing,
but the former will be removed in a future patch, at which point
HOST_RSP is absolutely what is desired.
Reviewed-by: Jim Mattson <jmattson@google.com>
Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
RSI is clobbered by the vCPU-run asm blob, but it's not marked as such,
probably because GCC doesn't let you mark inputs as clobbered. "Save"
RSI to a dummy output so that GCC recognizes it as being clobbered.
Fixes: 773e8a0425 ("x86/kvm: use Enlightened VMCS when running on Hyper-V")
Reviewed-by: Jim Mattson <jmattson@google.com>
Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
In the vCPU-run asm blob, the guest's RCX is temporarily saved onto the
stack after VM-Exit as the exit flow must first load a register with a
pointer to the vCPU's save area in order to save the guest's registers.
RCX is arbitrarily designated as the scratch register.
Since the stack usage is to (1)save host, (2)save guest, (3)load host
and (4)load guest, the code can't conform to the stack's natural FIFO
semantics, i.e. it can't simply do PUSH/POP. Regardless of whether it
is done for the host's value or guest's value, at some point the code
needs to access the stack using a non-traditional method, e.g. MOV
instead of POP. vCPU-run opts to create a placeholder on the stack for
guest's RCX (by adjusting RSP) and saves RCX to its place immediately
after VM-Exit (via MOV).
In other words, the purpose of the first 'PUSH RCX' at the start of
the vCPU-run asm blob is to adjust RSP down, i.e. there's no need to
actually access memory. Use 'SUB $wordsize, RSP' instead of 'PUSH RCX'
to make it more obvious that the intent is simply to create a gap on
the stack for the guest's RCX.
Reviewed-by: Jim Mattson <jmattson@google.com>
Reviewed-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
...except RSP, which is restored by hardware as part of VM-Exit.
Paolo theorized that restoring registers from the stack after a VM-Exit
in lieu of zeroing them could lead to speculative execution with the
guest's values, e.g. if the stack accesses miss the L1 cache[1].
Zeroing XORs are dirt cheap, so just be ultra-paranoid.
Note that the scratch register (currently RCX) used to save/restore the
guest state is also zeroed as its host-defined value is loaded via the
stack, just with a MOV instead of a POP.
[1] https://patchwork.kernel.org/patch/10771539/#22441255
Fixes: 0cb5b30698 ("kvm: vmx: Scrub hardware GPRs at VM-exit")
Cc: <stable@vger.kernel.org>
Cc: Jim Mattson <jmattson@google.com>
Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Nested early checks does a manual comparison of a VMCS' launched status
in its asm blob to execute the correct VM-Enter instruction, i.e.
VMLAUNCH vs. VMRESUME. The launched flag is a bool, which is a typedef
of _Bool. C99 does not define an exact size for _Bool, stating only
that is must be large enough to hold '0' and '1'. Most, if not all,
compilers use a single byte for _Bool, including gcc[1].
The use of 'cmpl' instead of 'cmpb' was not deliberate, but rather the
result of a copy-paste as the asm blob was directly derived from the asm
blob for vCPU-run.
This has not caused any known problems, likely due to compilers aligning
variables to 4-byte or 8-byte boundaries and KVM zeroing out struct
vcpu_vmx during allocation. I.e. vCPU-run accesses "junk" data, it just
happens to always be zero and so doesn't affect the result.
[1] https://gcc.gnu.org/ml/gcc-patches/2000-10/msg01127.html
Fixes: 52017608da ("KVM: nVMX: add option to perform early consistency checks via H/W")
Cc: <stable@vger.kernel.org>
Reviewed-by: Jim Mattson <jmattson@google.com>
Reviewed-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
The vCPU-run asm blob does a manual comparison of a VMCS' launched
status to execute the correct VM-Enter instruction, i.e. VMLAUNCH vs.
VMRESUME. The launched flag is a bool, which is a typedef of _Bool.
C99 does not define an exact size for _Bool, stating only that is must
be large enough to hold '0' and '1'. Most, if not all, compilers use
a single byte for _Bool, including gcc[1].
Originally, 'launched' was of type 'int' and so the asm blob used 'cmpl'
to check the launch status. When 'launched' was moved to be stored on a
per-VMCS basis, struct vcpu_vmx's "temporary" __launched flag was added
in order to avoid having to pass the current VMCS into the asm blob.
The new '__launched' was defined as a 'bool' and not an 'int', but the
'cmp' instruction was not updated.
This has not caused any known problems, likely due to compilers aligning
variables to 4-byte or 8-byte boundaries and KVM zeroing out struct
vcpu_vmx during allocation. I.e. vCPU-run accesses "junk" data, it just
happens to always be zero and so doesn't affect the result.
[1] https://gcc.gnu.org/ml/gcc-patches/2000-10/msg01127.html
Fixes: d462b81923 ("KVM: VMX: Keep list of loaded VMCSs, instead of vcpus")
Cc: <stable@vger.kernel.org>
Reviewed-by: Jim Mattson <jmattson@google.com>
Reviewed-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
User space tools which do automated task placement need information
about AVX-512 usage of tasks, because AVX-512 usage could cause core
turbo frequency drop and impact the running task on the sibling CPU.
The XSAVE hardware structure has bits that indicate when valid state
is present in registers unique to AVX-512 use. Use these bits to
indicate when AVX-512 has been in use and add per-task AVX-512 state
timestamp tracking to context switch.
Well-written AVX-512 applications are expected to clear the AVX-512
state when not actively using AVX-512 registers, so the tracking
mechanism is imprecise and can theoretically miss AVX-512 usage during
context switch. But it has been measured to be precise enough to be
useful under real-world workloads like tensorflow and linpack.
If higher precision is required, suggest user space tools to use the
PMU-based mechanisms in combination.
Signed-off-by: Aubrey Li <aubrey.li@linux.intel.com>
Reviewed-by: Thomas Gleixner <tglx@linutronix.de>
Cc: Andy Lutomirski <luto@kernel.org>
Cc: Arjan van de Ven <arjan@linux.intel.com>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Dave Hansen <dave.hansen@intel.com>
Cc: Dave Hansen <dave.hansen@linux.intel.com>
Cc: Fenghua Yu <fenghua.yu@intel.com>
Cc: H. Peter Anvin <hpa@zytor.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Oleg Nesterov <oleg@redhat.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Tim Chen <tim.c.chen@linux.intel.com>
Cc: aubrey.li@intel.com
Link: http://lkml.kernel.org/r/20190117183822.31333-1-aubrey.li@intel.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Thomas noticed that the new arch/x86/include/asm/cpu_device_id.h header is
a train-wreck that didn't incorporate review feedback like not using __u8
in kernel-only headers.
While at it also fix all the *other* problems this header has:
- Use canonical names for the header guards. It's inexplicable why a non-standard
guard was used.
- Don't define the header guard to 1. Plus annotate the closing #endif as done
absolutely every other header. Again, an inexplicable source of noise.
- Move the kernel API calls provided by this header next to each other, there's
absolutely no reason to have them spread apart in the header.
- Align the INTEL_CPU_DESC() macro initializations vertically, this is easier to
read and it's also the canonical style.
- Actually name the macro arguments properly: instead of 'mod, step, rev',
spell out 'model, stepping, revision' - it's not like we have a lack of
characters in this header.
- Actually make arguments macro-safe - again it's inexplicable why it wasn't
done properly to begin with.
Quite amazing how many problems a 41 lines header can contain.
This kind of code quality is unacceptable, and it slipped through the
review net of 2 developers and 2 maintainers, including myself, until
Thomas noticed it. :-/
Reported-by: Thomas Gleixner <tglx@linutronix.de>
Cc: Andi Kleen <ak@linux.intel.com>
Cc: Kan Liang <kan.liang@linux.intel.com>
Cc: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: linux-kernel@vger.kernel.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Vince (and later on Ravi) reported crashes in the BTS code during
fuzzing with the following backtrace:
general protection fault: 0000 [#1] SMP PTI
...
RIP: 0010:perf_prepare_sample+0x8f/0x510
...
Call Trace:
<IRQ>
? intel_pmu_drain_bts_buffer+0x194/0x230
intel_pmu_drain_bts_buffer+0x160/0x230
? tick_nohz_irq_exit+0x31/0x40
? smp_call_function_single_interrupt+0x48/0xe0
? call_function_single_interrupt+0xf/0x20
? call_function_single_interrupt+0xa/0x20
? x86_schedule_events+0x1a0/0x2f0
? x86_pmu_commit_txn+0xb4/0x100
? find_busiest_group+0x47/0x5d0
? perf_event_set_state.part.42+0x12/0x50
? perf_mux_hrtimer_restart+0x40/0xb0
intel_pmu_disable_event+0xae/0x100
? intel_pmu_disable_event+0xae/0x100
x86_pmu_stop+0x7a/0xb0
x86_pmu_del+0x57/0x120
event_sched_out.isra.101+0x83/0x180
group_sched_out.part.103+0x57/0xe0
ctx_sched_out+0x188/0x240
ctx_resched+0xa8/0xd0
__perf_event_enable+0x193/0x1e0
event_function+0x8e/0xc0
remote_function+0x41/0x50
flush_smp_call_function_queue+0x68/0x100
generic_smp_call_function_single_interrupt+0x13/0x30
smp_call_function_single_interrupt+0x3e/0xe0
call_function_single_interrupt+0xf/0x20
</IRQ>
The reason is that while event init code does several checks
for BTS events and prevents several unwanted config bits for
BTS event (like precise_ip), the PERF_EVENT_IOC_PERIOD allows
to create BTS event without those checks being done.
Following sequence will cause the crash:
If we create an 'almost' BTS event with precise_ip and callchains,
and it into a BTS event it will crash the perf_prepare_sample()
function because precise_ip events are expected to come
in with callchain data initialized, but that's not the
case for intel_pmu_drain_bts_buffer() caller.
Adding a check_period callback to be called before the period
is changed via PERF_EVENT_IOC_PERIOD. It will deny the change
if the event would become BTS. Plus adding also the limit_period
check as well.
Reported-by: Vince Weaver <vincent.weaver@maine.edu>
Signed-off-by: Jiri Olsa <jolsa@kernel.org>
Acked-by: Peter Zijlstra <peterz@infradead.org>
Cc: <stable@vger.kernel.org>
Cc: Alexander Shishkin <alexander.shishkin@linux.intel.com>
Cc: Arnaldo Carvalho de Melo <acme@kernel.org>
Cc: Arnaldo Carvalho de Melo <acme@redhat.com>
Cc: Jiri Olsa <jolsa@redhat.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Naveen N. Rao <naveen.n.rao@linux.vnet.ibm.com>
Cc: Ravi Bangoria <ravi.bangoria@linux.ibm.com>
Cc: Stephane Eranian <eranian@google.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Link: http://lkml.kernel.org/r/20190204123532.GA4794@krava
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Enable various options in the defconfigs of both i386 and x86_64
that could do with some test coverage in automated testing. Note
that these options are typically enabled by distros as well.
Signed-off-by: Ard Biesheuvel <ard.biesheuvel@linaro.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Matt Fleming <matt@codeblueprint.co.uk>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: linux-efi@vger.kernel.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
The 'write' parameter is unused in gup_fast_permitted() so remove it.
Signed-off-by: Ira Weiny <ira.weiny@intel.com>
Acked-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Reviewed-by: Thomas Gleixner <tglx@linutronix.de>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Dan Williams <dan.j.williams@intel.com>
Cc: Dave Hansen <dave.hansen@intel.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: linux-mm@kvack.org
Link: http://lkml.kernel.org/r/20190210223424.13934-1-ira.weiny@intel.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
A microcode patch is also needed for Goldmont while counter freezing
feature is enabled. Otherwise, there will be some issues, e.g. PMI lost.
Signed-off-by: Kan Liang <kan.liang@linux.intel.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Alexander Shishkin <alexander.shishkin@linux.intel.com>
Cc: Arnaldo Carvalho de Melo <acme@redhat.com>
Cc: David Ahern <dsahern@gmail.com>
Cc: Jiri Olsa <jolsa@redhat.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Namhyung Kim <namhyung@kernel.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Stephane Eranian <eranian@google.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Vince Weaver <vincent.weaver@maine.edu>
Cc: bp@alien8.de
Link: https://lkml.kernel.org/r/1549319013-4522-5-git-send-email-kan.liang@linux.intel.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Clean up counter freezing quirk to use the new facility to check for
min microcode revisions.
Rename the counter freezing quirk related functions. Because other
platforms, e.g. Goldmont, also needs to call the quirk.
Only check the boot CPU, assuming models and features are consistent
over all CPUs.
Signed-off-by: Kan Liang <kan.liang@linux.intel.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Alexander Shishkin <alexander.shishkin@linux.intel.com>
Cc: Arnaldo Carvalho de Melo <acme@redhat.com>
Cc: David Ahern <dsahern@gmail.com>
Cc: Jiri Olsa <jolsa@redhat.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Namhyung Kim <namhyung@kernel.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Stephane Eranian <eranian@google.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Vince Weaver <vincent.weaver@maine.edu>
Cc: bp@alien8.de
Link: https://lkml.kernel.org/r/1549319013-4522-4-git-send-email-kan.liang@linux.intel.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Clean up SNB PEBS quirk to use the new facility to check for min
microcode revisions.
Only check the boot CPU, assuming models and features are consistent
over all CPUs.
Signed-off-by: Kan Liang <kan.liang@linux.intel.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Alexander Shishkin <alexander.shishkin@linux.intel.com>
Cc: Arnaldo Carvalho de Melo <acme@redhat.com>
Cc: David Ahern <dsahern@gmail.com>
Cc: Jiri Olsa <jolsa@redhat.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Namhyung Kim <namhyung@kernel.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Stephane Eranian <eranian@google.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Vince Weaver <vincent.weaver@maine.edu>
Cc: bp@alien8.de
Link: https://lkml.kernel.org/r/1549319013-4522-3-git-send-email-kan.liang@linux.intel.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
KVM added a workaround for PEBS events leaking into guests with
commit:
26a4f3c08d ("perf/x86: disable PEBS on a guest entry.")
This uses the VT entry/exit list to add an extra disable of the
PEBS_ENABLE MSR.
Intel also added a fix for this issue to microcode updates on
Haswell/Broadwell/Skylake.
It turns out using the MSR entry/exit list makes VM exits
significantly slower. The list is only needed for disabling
PEBS, because the GLOBAL_CTRL change gets optimized by
KVM into changing the VMCS.
Check for the microcode updates that have the microcode
fix for leaking PEBS, and disable the extra entry/exit list
entry for PEBS_ENABLE. In addition we always clear the
GLOBAL_CTRL for the PEBS counter while running in the guest,
which is enough to make them never fire at the wrong
side of the host/guest transition.
The overhead for VM exits with the filtering active with the patch is
reduced from 8% to 4%.
The microcode patch has already been merged into future platforms.
This patch is one-off thing. The quirks is used here.
For other old platforms which doesn't have microcode patch and quirks,
extra disable of the PEBS_ENABLE MSR is still required.
Signed-off-by: Andi Kleen <ak@linux.intel.com>
Signed-off-by: Kan Liang <kan.liang@linux.intel.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Alexander Shishkin <alexander.shishkin@linux.intel.com>
Cc: Arnaldo Carvalho de Melo <acme@redhat.com>
Cc: David Ahern <dsahern@gmail.com>
Cc: Jiri Olsa <jolsa@redhat.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Namhyung Kim <namhyung@kernel.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Stephane Eranian <eranian@google.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Vince Weaver <vincent.weaver@maine.edu>
Cc: bp@alien8.de
Link: https://lkml.kernel.org/r/1549319013-4522-2-git-send-email-kan.liang@linux.intel.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
For bug workarounds or checks, it is useful to check for specific
microcode revisions.
Add a new generic function to match the CPU with stepping.
Add the other function to check the min microcode revisions for
the matched CPU.
A new table format is introduced to facilitate the quirk to
fill the related information.
This does not change the existing x86_cpu_id because it's an ABI
shared with modules, and also has quite different requirements,
as in no wildcards, but everything has to be matched exactly.
Originally-by: Andi Kleen <ak@linux.intel.com>
Suggested-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Kan Liang <kan.liang@linux.intel.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Acked-by: Borislav Petkov <bp@alien8.de>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: eranian@google.com
Link: https://lkml.kernel.org/r/1549319013-4522-1-git-send-email-kan.liang@linux.intel.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
This series finally gets us to the point of having system calls with
64-bit time_t on all architectures, after a long time of incremental
preparation patches.
There was actually one conversion that I missed during the summer,
i.e. Deepa's timex series, which I now updated based the 5.0-rc1 changes
and review comments.
The following system calls are now added on all 32-bit architectures
using the same system call numbers:
403 clock_gettime64
404 clock_settime64
405 clock_adjtime64
406 clock_getres_time64
407 clock_nanosleep_time64
408 timer_gettime64
409 timer_settime64
410 timerfd_gettime64
411 timerfd_settime64
412 utimensat_time64
413 pselect6_time64
414 ppoll_time64
416 io_pgetevents_time64
417 recvmmsg_time64
418 mq_timedsend_time64
419 mq_timedreceiv_time64
420 semtimedop_time64
421 rt_sigtimedwait_time64
422 futex_time64
423 sched_rr_get_interval_time64
Each one of these corresponds directly to an existing system call
that includes a 'struct timespec' argument, or a structure containing
a timespec or (in case of clock_adjtime) timeval. Not included here
are new versions of getitimer/setitimer and getrusage/waitid, which
are planned for the future but only needed to make a consistent API
rather than for correct operation beyond y2038. These four system
calls are based on 'timeval', and it has not been finally decided
what the replacement kernel interface will use instead.
So far, I have done a lot of build testing across most architectures,
which has found a number of bugs. Runtime testing so far included
testing LTP on 32-bit ARM with the existing system calls, to ensure
we do not regress for existing binaries, and a test with a 32-bit
x86 build of LTP against a modified version of the musl C library
that has been adapted to the new system call interface [3].
This library can be used for testing on all architectures supported
by musl-1.1.21, but it is not how the support is getting integrated
into the official musl release. Official musl support is planned
but will require more invasive changes to the library.
Link: https://lore.kernel.org/lkml/20190110162435.309262-1-arnd@arndb.de/T/
Link: https://lore.kernel.org/lkml/20190118161835.2259170-1-arnd@arndb.de/
Link: https://git.linaro.org/people/arnd/musl-y2038.git/ [2]
Signed-off-by: Arnd Bergmann <arnd@arndb.de>
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v2
iQIcBAABCAAGBQJcXf7/AAoJEGCrR//JCVInPSUP/RhsQSCKMGtONB/vVICQhwep
PybhzBSpHWFxszzTi6BEPN1zS9B069G9mDollRBYZCckyPqL/Bv6sI/vzQZdNk01
Q6Nw92OnNE1QP8owZ5TjrZhpbtopWdqIXjsbGZlloUemvuJP2JwvKovQUcn5CPTQ
jbnqU04CVyFFJYVxAnGJ+VSeWNrjW/cm/m+rhLFjUcwW7Y3aodxsPqPP6+K9hY9P
yIWfcH42WBeEWGm1RSBOZOScQl4SGCPUAhFydl/TqyEQagyegJMIyMOv9wZ5AuTT
xK644bDVmNsrtJDZDpx+J8hytXCk1LrnKzkHR/uK80iUIraF/8D7PlaPgTmEEjko
XcrywEkvkXTVU3owCm2/sbV+8fyFKzSPipnNfN1JNxEX71A98kvMRtPjDueQq/GA
Yh81rr2YLF2sUiArkc2fNpENT7EGhrh1q6gviK3FB8YDgj1kSgPK5wC/X0uolC35
E7iC2kg4NaNEIjhKP/WKluCaTvjRbvV+0IrlJLlhLTnsqbA57ZKCCteiBrlm7wQN
4csUtCyxchR9Ac2o/lj+Mf53z68Zv74haIROp18K2dL7ZpVcOPnA3XHeauSAdoyp
wy2Ek6ilNvlNB+4x+mRntPoOsyuOUGv7JXzB9JvweLWUd9G7tvYeDJQp/0YpDppb
K4UWcKnhtEom0DgK08vY
=IZVb
-----END PGP SIGNATURE-----
Merge tag 'y2038-new-syscalls' of git://git.kernel.org:/pub/scm/linux/kernel/git/arnd/playground into timers/2038
Pull y2038 - time64 system calls from Arnd Bergmann:
This series finally gets us to the point of having system calls with 64-bit
time_t on all architectures, after a long time of incremental preparation
patches.
There was actually one conversion that I missed during the summer,
i.e. Deepa's timex series, which I now updated based the 5.0-rc1 changes
and review comments.
The following system calls are now added on all 32-bit architectures using
the same system call numbers:
403 clock_gettime64
404 clock_settime64
405 clock_adjtime64
406 clock_getres_time64
407 clock_nanosleep_time64
408 timer_gettime64
409 timer_settime64
410 timerfd_gettime64
411 timerfd_settime64
412 utimensat_time64
413 pselect6_time64
414 ppoll_time64
416 io_pgetevents_time64
417 recvmmsg_time64
418 mq_timedsend_time64
419 mq_timedreceiv_time64
420 semtimedop_time64
421 rt_sigtimedwait_time64
422 futex_time64
423 sched_rr_get_interval_time64
Each one of these corresponds directly to an existing system call that
includes a 'struct timespec' argument, or a structure containing a timespec
or (in case of clock_adjtime) timeval. Not included here are new versions
of getitimer/setitimer and getrusage/waitid, which are planned for the
future but only needed to make a consistent API rather than for correct
operation beyond y2038. These four system calls are based on 'timeval', and
it has not been finally decided what the replacement kernel interface will
use instead.
So far, I have done a lot of build testing across most architectures, which
has found a number of bugs. Runtime testing so far included testing LTP on
32-bit ARM with the existing system calls, to ensure we do not regress for
existing binaries, and a test with a 32-bit x86 build of LTP against a
modified version of the musl C library that has been adapted to the new
system call interface [3]. This library can be used for testing on all
architectures supported by musl-1.1.21, but it is not how the support is
getting integrated into the official musl release. Official musl support is
planned but will require more invasive changes to the library.
Link: https://lore.kernel.org/lkml/20190110162435.309262-1-arnd@arndb.de/T/
Link: https://lore.kernel.org/lkml/20190118161835.2259170-1-arnd@arndb.de/
Link: https://git.linaro.org/people/arnd/musl-y2038.git/ [2]
The system call tables have diverged a bit over the years, and a number
of the recent additions never made it into all architectures, for one
reason or another.
This is an attempt to clean it up as far as we can without breaking
compatibility, doing a number of steps:
- Add system calls that have not yet been integrated into all
architectures but that we definitely want there. This includes
{,f}statfs64() and get{eg,eu,g,p,u,pp}id() on alpha, which have
been missing traditionally.
- The s390 compat syscall handling is cleaned up to be more like
what we do on other architectures, while keeping the 31-bit
pointer extension. This was merged as a shared branch by the
s390 maintainers and is included here in order to base the other
patches on top.
- Add the separate ipc syscalls on all architectures that
traditionally only had sys_ipc(). This version is done without
support for IPC_OLD that is we have in sys_ipc. The
new semtimedop_time64 syscall will only be added here, not
in sys_ipc
- Add syscall numbers for a couple of syscalls that we probably
don't need everywhere, in particular pkey_* and rseq,
for the purpose of symmetry: if it's in asm-generic/unistd.h,
it makes sense to have it everywhere. I expect that any future
system calls will get assigned on all platforms together, even
when they appear to be specific to a single architecture.
- Prepare for having the same system call numbers for any future
calls. In combination with the generated tables, this hopefully
makes it easier to add new calls across all architectures
together.
All of the above are technically separate from the y2038 work,
but are done as preparation before we add the new 64-bit time_t
system calls everywhere, providing a common baseline set of system
calls.
I expect that glibc and other libraries that want to use 64-bit
time_t will require linux-5.1 kernel headers for building in
the future, and at a much later point may also require linux-5.1
or a later version as the minimum kernel at runtime. Having a
common baseline then allows the removal of many architecture or
kernel version specific workarounds.
Signed-off-by: Arnd Bergmann <arnd@arndb.de>
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v2
iQIcBAABCAAGBQJcXf6XAAoJEGCrR//JCVInIm4P/AlkMmQRa/B2ziWMW6PifPoI
v18r44017rA1BPENyZvumJUdM5mDvNofOW8F2DYQ7Uiys2YtXenwe/Cf8LHn2n6c
TMXGQryQpvNmfDCyU+0UjF8m2+poFMrL4aRTXtjODh1YTsPNgeDC+KFMCAAtZmZd
cVbXFudtbdYKD/pgCX4SI1CWAMBiXe2e+ukPdJVr+iqusCMTApf+GOuyvDBZY9s/
vURb+tIS87HZ/jehWfZFSuZt+Gu7b3ijUXNC8v9qSIxNYekw62vBNl6F09HE79uB
Bv4OujAODqKvI9gGyydBzLJNzaMo0ryQdusyqcJHT7MY/8s+FwcYAXyTlQ3DbbB4
2u/c+58OwJ9Zk12p4LXZRA47U+vRhQt2rO4+zZWs2txNNJY89ZvCm/Z04KOiu5Xz
1Nnj607KGzthYRs2gs68AwzGGyf0uykIQ3RcaJLIBlX1Nd8BWO0ZgAguCvkXbQMX
XNXJTd92HmeuKKpiO0n/M4/mCeP0cafBRPCZbKlHyTl0Jeqd/HBQEO9Z8Ifwyju3
mXz9JCR9VlPCkX605keATbjtPGZf3XQtaXlQnezitDudXk8RJ33EpPcbhx76wX7M
Rux37ByqEOzk4wMGX9YQyNU7z7xuVg4sJAa2LlJqYeKXHtym+u3gG7SGP5AsYjmk
6mg2+9O2yZuLhQtOtrwm
=s4wf
-----END PGP SIGNATURE-----
Merge tag 'y2038-syscall-cleanup' of git://git.kernel.org:/pub/scm/linux/kernel/git/arnd/playground into timers/2038
Pull preparatory work for y2038 changes from Arnd Bergmann:
System call unification and cleanup
The system call tables have diverged a bit over the years, and a number of
the recent additions never made it into all architectures, for one reason
or another.
This is an attempt to clean it up as far as we can without breaking
compatibility, doing a number of steps:
- Add system calls that have not yet been integrated into all architectures
but that we definitely want there. This includes {,f}statfs64() and
get{eg,eu,g,p,u,pp}id() on alpha, which have been missing traditionally.
- The s390 compat syscall handling is cleaned up to be more like what we
do on other architectures, while keeping the 31-bit pointer
extension. This was merged as a shared branch by the s390 maintainers
and is included here in order to base the other patches on top.
- Add the separate ipc syscalls on all architectures that traditionally
only had sys_ipc(). This version is done without support for IPC_OLD
that is we have in sys_ipc. The new semtimedop_time64 syscall will only
be added here, not in sys_ipc
- Add syscall numbers for a couple of syscalls that we probably don't need
everywhere, in particular pkey_* and rseq, for the purpose of symmetry:
if it's in asm-generic/unistd.h, it makes sense to have it everywhere. I
expect that any future system calls will get assigned on all platforms
together, even when they appear to be specific to a single architecture.
- Prepare for having the same system call numbers for any future calls. In
combination with the generated tables, this hopefully makes it easier to
add new calls across all architectures together.
All of the above are technically separate from the y2038 work, but are done
as preparation before we add the new 64-bit time_t system calls everywhere,
providing a common baseline set of system calls.
I expect that glibc and other libraries that want to use 64-bit time_t will
require linux-5.1 kernel headers for building in the future, and at a much
later point may also require linux-5.1 or a later version as the minimum
kernel at runtime. Having a common baseline then allows the removal of many
architecture or kernel version specific workarounds.
Pull x86 fixes from Ingo Molnar:
"A handful of fixes:
- Fix an MCE corner case bug/crash found via MCE injection testing
- Fix 5-level paging boot crash
- Fix MCE recovery cache invalidation bug
- Fix regression on Xen guests caused by a recent PMD level mremap
speedup optimization"
* 'x86-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
x86/mm: Make set_pmd_at() paravirt aware
x86/mm/cpa: Fix set_mce_nospec()
x86/boot/compressed/64: Do not corrupt EDX on EFER.LME=1 setting
x86/MCE: Initialize mce.bank in the case of a fatal error in mce_no_way_out()
Pull perf fixes from Ingo Molnar:
"A couple of kernel side fixes:
- Fix the Intel uncore driver on certain hardware configurations
- Fix a CPU hotplug related memory allocation bug
- Remove a spurious WARN()
... plus also a handful of perf tooling fixes"
* 'perf-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
perf script python: Add Python3 support to tests/attr.py
perf trace: Support multiple "vfs_getname" probes
perf symbols: Filter out hidden symbols from labels
perf symbols: Add fallback definitions for GELF_ST_VISIBILITY()
tools headers uapi: Sync linux/in.h copy from the kernel sources
perf clang: Do not use 'return std::move(something)'
perf mem/c2c: Fix perf_mem_events to support powerpc
perf tests evsel-tp-sched: Fix bitwise operator
perf/core: Don't WARN() for impossible ring-buffer sizes
perf/x86/intel: Delay memory deallocation until x86_pmu_dead_cpu()
perf/x86/intel/uncore: Add Node ID mask
set_pmd_at() calls native_set_pmd() unconditionally on x86. This was
fine as long as only huge page entries were written via set_pmd_at(),
as Xen pv guests don't support those.
Commit 2c91bd4a4e ("mm: speed up mremap by 20x on large regions")
introduced a usage of set_pmd_at() possible on pv guests, leading to
failures like:
BUG: unable to handle kernel paging request at ffff888023e26778
#PF error: [PROT] [WRITE]
RIP: e030:move_page_tables+0x7c1/0xae0
move_vma.isra.3+0xd1/0x2d0
__se_sys_mremap+0x3c6/0x5b0
do_syscall_64+0x49/0x100
entry_SYSCALL_64_after_hwframe+0x44/0xa9
Make set_pmd_at() paravirt aware by just letting it use set_pmd().
Fixes: 2c91bd4a4e ("mm: speed up mremap by 20x on large regions")
Reported-by: Sander Eikelenboom <linux@eikelenboom.it>
Signed-off-by: Juergen Gross <jgross@suse.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Cc: xen-devel@lists.xenproject.org
Cc: boris.ostrovsky@oracle.com
Cc: sstabellini@kernel.org
Cc: hpa@zytor.com
Cc: bp@alien8.de
Cc: torvalds@linux-foundation.org
Link: https://lkml.kernel.org/r/20190210074056.11842-1-jgross@suse.com
An ipvlan bug fix in 'net' conflicted with the abstraction away
of the IPV6 specific support in 'net-next'.
Similarly, a bug fix for mlx5 in 'net' conflicted with the flow
action conversion in 'net-next'.
Signed-off-by: David S. Miller <davem@davemloft.net>
WARNING: vmlinux.o(.text.unlikely+0x1c05): Section mismatch in
reference from the function init_xstate_size() to the
function .init.text:get_xsave_size()
WARNING: vmlinux.o(.text.unlikely+0x1c19): Section mismatch in
reference from the function init_xstate_size() to the
function .init.text:get_xsaves_size()
Signed-off-by: Sergey Senozhatsky <sergey.senozhatsky@gmail.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Andy Lutomirski <luto@kernel.org>
Cc: "H . Peter Anvin" <hpa@zytor.com>
Cc: Sergey Senozhatsky <sergey.senozhatsky.work@gmail.com>
Link: https://lkml.kernel.org/r/20190108130225.5066-2-sergey.senozhatsky@gmail.com
WARNING: vmlinux.o(.text.unlikely+0x498d): Section mismatch in reference
from the function percpu_setup_debug_store() to the function
.init.text:cea_map_percpu_pages()
Signed-off-by: Sergey Senozhatsky <sergey.senozhatsky@gmail.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Andy Lutomirski <luto@kernel.org>
Cc: "H . Peter Anvin" <hpa@zytor.com>
Cc: Sergey Senozhatsky <sergey.senozhatsky.work@gmail.com>
Link: https://lkml.kernel.org/r/20190108130225.5066-1-sergey.senozhatsky@gmail.com
Compiling the kernel with W=1 results in the following warning:
arch/x86/kernel/cpu/mtrr/cleanup.c:299:16: warning: variable ‘second_basek’ set but not used [-Wunused-but-set-variable]
unsigned long second_basek, second_sizek;
Remove the unused variable.
[ tglx: Massaged changelog ]
Signed-off-by: Bo Yu <tsu.yubo@gmail.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Cc: bp@alien8.de
Cc: hpa@zytor.com
Cc: puwen@hygon.cn
Link: https://lkml.kernel.org/r/20190208125343.11451-1-tsu.yubo@gmail.com
The recent commit fe0937b24f ("x86/mm/cpa: Fold cpa_flush_range() and
cpa_flush_array() into a single cpa_flush() function") accidentally made
the call to make_addr_canonical_again() go away, which breaks
set_mce_nospec().
Re-instate the call to convert the address back into canonical form right
before invoking either CLFLUSH or INVLPG. Rename the function while at it
to be shorter (and less MAGA).
Fixes: fe0937b24f ("x86/mm/cpa: Fold cpa_flush_range() and cpa_flush_array() into a single cpa_flush() function")
Reported-by: Tony Luck <tony.luck@intel.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Tested-by: Tony Luck <tony.luck@intel.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Dan Williams <dan.j.williams@intel.com>
Cc: Dave Hansen <dave.hansen@linux.intel.com>
Cc: Andy Lutomirski <luto@kernel.org>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Rik van Riel <riel@surriel.com>
Link: https://lkml.kernel.org/r/20190208120859.GH32511@hirez.programming.kicks-ass.net
gcmaes_crypt_by_sg() dereferences the NULL pointer returned by
scatterwalk_ffwd() when encrypting an empty plaintext and the source
scatterlist ends immediately after the associated data.
Fix it by only fast-forwarding to the src/dst data scatterlists if the
data length is nonzero.
This bug is reproduced by the "rfc4543(gcm(aes))" test vectors when run
with the new AEAD test manager.
Fixes: e845520707 ("crypto: aesni - Update aesni-intel_glue to use scatter/gather")
Cc: <stable@vger.kernel.org> # v4.17+
Cc: Dave Watson <davejwatson@fb.com>
Signed-off-by: Eric Biggers <ebiggers@google.com>
Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
The x86 MORUS implementations all fail the improved AEAD tests because
they produce the wrong result with some data layouts. The issue is that
they assume that if the skcipher_walk API gives 'nbytes' not aligned to
the walksize (a.k.a. walk.stride), then it is the end of the data. In
fact, this can happen before the end.
Also, when the CRYPTO_TFM_REQ_MAY_SLEEP flag is given, they can
incorrectly sleep in the skcipher_walk_*() functions while preemption
has been disabled by kernel_fpu_begin().
Fix these bugs.
Fixes: 56e8e57fc3 ("crypto: morus - Add common SIMD glue code for MORUS")
Cc: <stable@vger.kernel.org> # v4.18+
Cc: Ondrej Mosnacek <omosnace@redhat.com>
Signed-off-by: Eric Biggers <ebiggers@google.com>
Reviewed-by: Ondrej Mosnacek <omosnace@redhat.com>
Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
The x86 AEGIS implementations all fail the improved AEAD tests because
they produce the wrong result with some data layouts. The issue is that
they assume that if the skcipher_walk API gives 'nbytes' not aligned to
the walksize (a.k.a. walk.stride), then it is the end of the data. In
fact, this can happen before the end.
Also, when the CRYPTO_TFM_REQ_MAY_SLEEP flag is given, they can
incorrectly sleep in the skcipher_walk_*() functions while preemption
has been disabled by kernel_fpu_begin().
Fix these bugs.
Fixes: 1d373d4e8e ("crypto: x86 - Add optimized AEGIS implementations")
Cc: <stable@vger.kernel.org> # v4.18+
Cc: Ondrej Mosnacek <omosnace@redhat.com>
Signed-off-by: Eric Biggers <ebiggers@google.com>
Reviewed-by: Ondrej Mosnacek <omosnace@redhat.com>
Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
The x86, arm, and arm64 asm implementations of crct10dif are very
difficult to understand partly because many of the comments, labels, and
macros are named incorrectly: the lengths mentioned are usually off by a
factor of two from the actual code. Many other things are unnecessarily
convoluted as well, e.g. there are many more fold constants than
actually needed and some aren't fully reduced.
This series therefore cleans up all these implementations to be much
more maintainable. I also made some small optimizations where I saw
opportunities, resulting in slightly better performance.
This patch cleans up the x86 version.
As part of this, I removed support for len < 16 from the x86 assembly;
now the glue code falls back to the generic table-based implementation
in this case. Due to the overhead of kernel_fpu_begin(), this actually
significantly improves performance on these lengths. (And even if
kernel_fpu_begin() were free, the generic code is still faster for about
len < 11.) This removal also eliminates error-prone special cases and
makes the x86, arm32, and arm64 ports of the code match more closely.
Acked-by: Ard Biesheuvel <ard.biesheuvel@linaro.org>
Signed-off-by: Eric Biggers <ebiggers@google.com>
Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
Bugzilla: 1671904
There are multiple code paths where an hrtimer may have been started to
emulate an L1 VMX preemption timer that can result in a call to free_nested
without an intervening L2 exit where the hrtimer is normally
cancelled. Unconditionally cancel in free_nested to cover all cases.
Embargoed until Feb 7th 2019.
Signed-off-by: Peter Shier <pshier@google.com>
Reported-by: Jim Mattson <jmattson@google.com>
Reviewed-by: Jim Mattson <jmattson@google.com>
Reported-by: Felix Wilhelm <fwilhelm@google.com>
Cc: stable@kernel.org
Message-Id: <20181011184646.154065-1-pshier@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Bugzilla: 1671930
Emulation of certain instructions (VMXON, VMCLEAR, VMPTRLD, VMWRITE with
memory operand, INVEPT, INVVPID) can incorrectly inject a page fault
when passed an operand that points to an MMIO address. The page fault
will use uninitialized kernel stack memory as the CR2 and error code.
The right behavior would be to abort the VM with a KVM_EXIT_INTERNAL_ERROR
exit to userspace; however, it is not an easy fix, so for now just
ensure that the error code and CR2 are zero.
Embargoed until Feb 7th 2019.
Reported-by: Felix Wilhelm <fwilhelm@google.com>
Cc: stable@kernel.org
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
On Denverton's integration of the Intel(R) Trace Hub (for a reference and
overview see Documentation/trace/intel_th.rst) the reported size of one of
its resources (RTIT_BAR) doesn't match its actual size, which leads to
overlaps with other devices' resources.
In practice, it overlaps with XHCI MMIO space, which results in the xhci
driver bailing out after seeing its registers as 0xffffffff, and perceived
disappearance of all USB devices:
intel_th_pci 0000:00:1f.7: enabling device (0004 -> 0006)
xhci_hcd 0000:00:15.0: xHCI host controller not responding, assume dead
xhci_hcd 0000:00:15.0: xHC not responding in xhci_irq, assume controller is dead
xhci_hcd 0000:00:15.0: HC died; cleaning up
usb 1-1: USB disconnect, device number 2
For this reason, we need to resize the RTIT_BAR on Denverton to its actual
size, which in this case is 4MB. The corresponding erratum is DNV36 at the
link below:
DNV36. Processor Host Root Complex May Incorrectly Route Memory
Accesses to Intel® Trace Hub
Problem: The Intel® Trace Hub RTIT_BAR (B0:D31:F7 offset 20h) is
reported as a 2KB memory range. Due to this erratum, the
processor Host Root Complex will forward addresses from
RTIT_BAR to RTIT_BAR + 4MB -1 to Intel® Trace Hub.
Implication: Devices assigned within the RTIT_BAR to RTIT_BAR + 4MB -1
space may not function correctly.
Workaround: A BIOS code change has been identified and may be
implemented as a workaround for this erratum.
Status: No Fix.
Note that 5118ccd347 ("intel_th: pci: Add Denverton SOC support") updates
the Trace Hub driver so it claims the Denverton device, but the resource
overlap exists regardless of whether that driver is loaded or that commit
is included.
Link: https://www.intel.com/content/dam/www/public/us/en/documents/specification-updates/atom-c3000-family-spec-update.pdf
Signed-off-by: Alexander Shishkin <alexander.shishkin@linux.intel.com>
[bhelgaas: include erratum text, clarify relationship with 5118ccd347]
Signed-off-by: Bjorn Helgaas <bhelgaas@google.com>
Cc: stable@vger.kernel.org
This adds 21 new system calls on each ABI that has 32-bit time_t
today. All of these have the exact same semantics as their existing
counterparts, and the new ones all have macro names that end in 'time64'
for clarification.
This gets us to the point of being able to safely use a C library
that has 64-bit time_t in user space. There are still a couple of
loose ends to tie up in various areas of the code, but this is the
big one, and should be entirely uncontroversial at this point.
In particular, there are four system calls (getitimer, setitimer,
waitid, and getrusage) that don't have a 64-bit counterpart yet,
but these can all be safely implemented in the C library by wrapping
around the existing system calls because the 32-bit time_t they
pass only counts elapsed time, not time since the epoch. They
will be dealt with later.
Signed-off-by: Arnd Bergmann <arnd@arndb.de>
Acked-by: Heiko Carstens <heiko.carstens@de.ibm.com>
Acked-by: Geert Uytterhoeven <geert@linux-m68k.org>
Acked-by: Catalin Marinas <catalin.marinas@arm.com>
The time, stime, utime, utimes, and futimesat system calls are only
used on older architectures, and we do not provide y2038 safe variants
of them, as they are replaced by clock_gettime64, clock_settime64,
and utimensat_time64.
However, for consistency it seems better to have the 32-bit architectures
that still use them call the "time32" entry points (leaving the
traditional handlers for the 64-bit architectures), like we do for system
calls that now require two versions.
Note: We used to always define __ARCH_WANT_SYS_TIME and
__ARCH_WANT_SYS_UTIME and only set __ARCH_WANT_COMPAT_SYS_TIME and
__ARCH_WANT_SYS_UTIME32 for compat mode on 64-bit kernels. Now this is
reversed: only 64-bit architectures set __ARCH_WANT_SYS_TIME/UTIME, while
we need __ARCH_WANT_SYS_TIME32/UTIME32 for 32-bit architectures and compat
mode. The resulting asm/unistd.h changes look a bit counterintuitive.
This is only a cleanup patch and it should not change any behavior.
Signed-off-by: Arnd Bergmann <arnd@arndb.de>
Acked-by: Geert Uytterhoeven <geert@linux-m68k.org>
Acked-by: Heiko Carstens <heiko.carstens@de.ibm.com>
This is the big flip, where all 32-bit architectures set COMPAT_32BIT_TIME
and use the _time32 system calls from the former compat layer instead
of the system calls that take __kernel_timespec and similar arguments.
The temporary redirects for __kernel_timespec, __kernel_itimerspec
and __kernel_timex can get removed with this.
It would be easy to split this commit by architecture, but with the new
generated system call tables, it's easy enough to do it all at once,
which makes it a little easier to check that the changes are the same
in each table.
Acked-by: Geert Uytterhoeven <geert@linux-m68k.org>
Signed-off-by: Arnd Bergmann <arnd@arndb.de>
A lot of system calls that pass a time_t somewhere have an implementation
using a COMPAT_SYSCALL_DEFINEx() on 64-bit architectures, and have
been reworked so that this implementation can now be used on 32-bit
architectures as well.
The missing step is to redefine them using the regular SYSCALL_DEFINEx()
to get them out of the compat namespace and make it possible to build them
on 32-bit architectures.
Any system call that ends in 'time' gets a '32' suffix on its name for
that version, while the others get a '_time32' suffix, to distinguish
them from the normal version, which takes a 64-bit time argument in the
future.
In this step, only 64-bit architectures are changed, doing this rename
first lets us avoid touching the 32-bit architectures twice.
Acked-by: Catalin Marinas <catalin.marinas@arm.com>
Signed-off-by: Arnd Bergmann <arnd@arndb.de>
x32 has always followed the time64 calling conventions of these
syscalls, which required a special hack in compat_get_timespec
aka get_old_timespec32 to continue working.
Since we now have the time64 syscalls, use those explicitly.
Signed-off-by: Arnd Bergmann <arnd@arndb.de>
paging_prepare() returns a two-quadword structure which lands
into RDX:RAX:
- Address of the trampoline is returned in RAX.
- Non zero RDX means trampoline needs to enable 5-level paging.
Document that explicitly.
Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Signed-off-by: Borislav Petkov <bp@suse.de>
Cc: dave.hansen@linux.intel.com
Cc: "H. Peter Anvin" <hpa@zytor.com>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Kyle D Pelton <kyle.d.pelton@intel.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Wei Huang <wei@redhat.com>
Cc: x86-ml <x86@kernel.org>
Link: https://lkml.kernel.org/r/20190206154756.matwldebbxkmlnae@black.fi.intel.com
RDMSR in the trampoline code overwrites EDX but that register is used
to indicate whether 5-level paging has to be enabled and if clobbered,
leads to failure to boot on a 5-level paging machine.
Preserve EDX on the stack while we are dealing with EFER.
Fixes: b677dfae5a ("x86/boot/compressed/64: Set EFER.LME=1 in 32-bit trampoline before returning to long mode")
Reported-by: Kyle D Pelton <kyle.d.pelton@intel.com>
Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Signed-off-by: Borislav Petkov <bp@suse.de>
Cc: dave.hansen@linux.intel.com
Cc: "H. Peter Anvin" <hpa@zytor.com>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Wei Huang <wei@redhat.com>
Cc: x86-ml <x86@kernel.org>
Link: https://lkml.kernel.org/r/20190206115253.1907-1-kirill.shutemov@linux.intel.com
This pushes the handling of inversion semantics and open drain
settings to the GPIO descriptor and gpiolib. All affected board
files are also augmented.
This is especially nice since we don't have to have any
confusing flags passed around to the left and right littering
the fixed and GPIO regulator drivers and the regulator core.
It is all just very straight-forward: the core asks the GPIO
line to be asserted or deasserted and gpiolib deals with the
rest depending on how the platform is configured: if the line
is active low, it deals with that, if the line is open drain,
it deals with that too.
Cc: Alexander Shiyan <shc_work@mail.ru> # i.MX boards user
Cc: Haojian Zhuang <haojian.zhuang@gmail.com> # MMP2 maintainer
Cc: Aaro Koskinen <aaro.koskinen@iki.fi> # OMAP1 maintainer
Cc: Tony Lindgren <tony@atomide.com> # OMAP1,2,3 maintainer
Cc: Mike Rapoport <rppt@linux.vnet.ibm.com> # EM-X270 maintainer
Cc: Robert Jarzmik <robert.jarzmik@free.fr> # EZX maintainer
Cc: Philipp Zabel <philipp.zabel@gmail.com> # Magician maintainer
Cc: Petr Cvek <petr.cvek@tul.cz> # Magician
Cc: Robert Jarzmik <robert.jarzmik@free.fr> # PXA
Cc: Paul Parsons <lost.distance@yahoo.com> # hx4700
Cc: Daniel Mack <zonque@gmail.com> # Raumfeld maintainer
Cc: Marc Zyngier <marc.zyngier@arm.com> # Zeus maintainer
Cc: Geert Uytterhoeven <geert+renesas@glider.be> # SuperH pinctrl/GPIO maintainer
Cc: Russell King <rmk+kernel@armlinux.org.uk> # SA1100
Tested-by: Marek Szyprowski <m.szyprowski@samsung.com>
Tested-by: Janusz Krzysztofik <jmkrzyszt@gmail.com> #OMAP1 Amstrad Delta
Signed-off-by: Linus Walleij <linus.walleij@linaro.org>
Signed-off-by: Mark Brown <broonie@kernel.org>
When efi=noruntime or efi=oldmap is used on the kernel command line, EFI
services won't be available in the second kernel, therefore the second
kernel will not be able to get the ACPI RSDP address from firmware by
calling EFI services and so it won't boot.
Commit
e6e094e053 ("x86/acpi, x86/boot: Take RSDP address from boot params if available")
added an acpi_rsdp_addr field to boot_params which stores the RSDP
address for other kernel users.
Recently, after
3a63f70bf4 ("x86/boot: Early parse RSDP and save it in boot_params")
the acpi_rsdp_addr will always be filled with a valid RSDP address.
So fill in that value into the second kernel's boot_params thus ensuring
that the second kernel receives the RSDP value from the first kernel.
[ bp: massage commit message. ]
Signed-off-by: Kairui Song <kasong@redhat.com>
Signed-off-by: Borislav Petkov <bp@suse.de>
Cc: AKASHI Takahiro <takahiro.akashi@linaro.org>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Baoquan He <bhe@redhat.com>
Cc: Chao Fan <fanc.fnst@cn.fujitsu.com>
Cc: Dave Young <dyoung@redhat.com>
Cc: David Howells <dhowells@redhat.com>
Cc: "H. Peter Anvin" <hpa@zytor.com>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: kexec@lists.infradead.org
Cc: Philipp Rudo <prudo@linux.vnet.ibm.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: x86-ml <x86@kernel.org>
Cc: Yannik Sembritzki <yannik@sembritzki.me>
Link: https://lkml.kernel.org/r/20190204173852.4863-1-kasong@redhat.com
When pmu::setup_aux() is called the coresight PMU needs to know which
sink to use for the session by looking up the information in the
event's attr::config2 field.
As such simply replace the cpu information by the complete perf_event
structure and change all affected customers.
Signed-off-by: Mathieu Poirier <mathieu.poirier@linaro.org>
Reviewed-by: Suzuki Poulouse <suzuki.poulose@arm.com>
Acked-by: Peter Zijlstra <peterz@infradead.org>
Cc: Adrian Hunter <adrian.hunter@intel.com>
Cc: Alexander Shishkin <alexander.shishkin@linux.intel.com>
Cc: Alexei Starovoitov <ast@kernel.org>
Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Cc: H. Peter Anvin <hpa@zytor.com>
Cc: Heiko Carstens <heiko.carstens@de.ibm.com>
Cc: Jiri Olsa <jolsa@redhat.com>
Cc: Mark Rutland <mark.rutland@arm.com>
Cc: Martin Schwidefsky <schwidefsky@de.ibm.com>
Cc: Namhyung Kim <namhyung@kernel.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Will Deacon <will.deacon@arm.com>
Cc: linux-arm-kernel@lists.infradead.org
Cc: linux-s390@vger.kernel.org
Link: http://lkml.kernel.org/r/20190131184714.20388-2-mathieu.poirier@linaro.org
Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com>
When building randconfigs, one of the failures is:
ld: arch/x86/boot/compressed/kaslr.o: in function `choose_random_location':
kaslr.c:(.text+0xbf7): undefined reference to `count_immovable_mem_regions'
ld: kaslr.c:(.text+0xcbe): undefined reference to `immovable_mem'
make[2]: *** [arch/x86/boot/compressed/vmlinux] Error 1
because CONFIG_ACPI is not enabled in this particular .config but
CONFIG_MEMORY_HOTREMOVE is and count_immovable_mem_regions() is
unresolvable because it is defined in compressed/acpi.c which is the
compilation unit that depends on CONFIG_ACPI.
Add CONFIG_ACPI to the explicit dependencies for MEMORY_HOTREMOVE.
Signed-off-by: Borislav Petkov <bp@suse.de>
Cc: Chao Fan <fanc.fnst@cn.fujitsu.com>
Cc: x86@kernel.org
Link: https://lkml.kernel.org/r/20190205131033.9564-1-bp@alien8.de
ac09c5f43c ("x86/boot: Build the command line parsing code unconditionally")
enabled building the command line parsing code unconditionally but it
forgot to remove the respective ifdeffery around the prototypes in the
misc.h header, leading to
arch/x86/boot/compressed/acpi.c: In function ‘get_acpi_rsdp’:
arch/x86/boot/compressed/acpi.c:37:8: warning: implicit declaration of function \
‘cmdline_find_option’ [-Wimplicit-function-declaration]
ret = cmdline_find_option("acpi_rsdp", val, MAX_ADDR_LEN);
^~~~~~~~~~~~~~~~~~~
for configs where neither CONFIG_EARLY_PRINTK nor CONFIG_RANDOMIZE_BASE
was defined.
Drop the ifdeffery in the header too.
Fixes: ac09c5f43c ("x86/boot: Build the command line parsing code unconditionally")
Reported-by: kbuild test robot <lkp@intel.com>
Signed-off-by: Borislav Petkov <bp@suse.de>
Cc: Chao Fan <fanc.fnst@cn.fujitsu.com>
Cc: x86@kernel.org
Link: https://lkml.kernel.org/r/5c51daf0.83pQEkvDZILqoSYW%lkp@intel.com
Link: https://lkml.kernel.org/r/20190205131352.GA27396@zn.tnic
The definition of MSR_MISC_FEATURE_CONTROL was first introduced in
98af74599e ("x86 msr_index.h: Define MSR_MISC_FEATURE_CONTROL")
and present in Linux since v4.11.
The Cache Pseudo-Locking code added this duplicate definition in more
recent
f2a177292b ("x86/intel_rdt: Discover supported platforms via prefetch disable bits"),
available since v4.19.
Remove the duplicate definition from the resctrl subsystem and let that
code obtain the needed definition from the core architecture msr-index.h
instead.
Fixes: f2a177292b ("x86/intel_rdt: Discover supported platforms via prefetch disable bits")
Signed-off-by: Reinette Chatre <reinette.chatre@intel.com>
Signed-off-by: Borislav Petkov <bp@suse.de>
Cc: Fenghua Yu <fenghua.yu@intel.com>
Cc: gavin.hindman@intel.com
Cc: "H. Peter Anvin" <hpa@zytor.com>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: jithu.joseph@intel.com
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Tony Luck <tony.luck@intel.com>
Cc: x86-ml <x86@kernel.org>
Link: https://lkml.kernel.org/r/ff6b95d9b6ef6f4ac96267f130719ba1af09614b.1549312475.git.reinette.chatre@intel.com
This patch allows the kexec_file_load syscall to verify the PE signed
kernel image signature based on the preboot keys stored in the .platform
keyring, as fall back, if the signature verification failed due to not
finding the public key in the secondary or builtin keyrings.
This commit adds a VERIFY_USE_PLATFORM_KEYRING similar to previous
VERIFY_USE_SECONDARY_KEYRING indicating that verify_pkcs7_signature
should verify the signature using platform keyring. Also, decrease
the error message log level when verification failed with -ENOKEY,
so that if called tried multiple time with different keyring it
won't generate extra noises.
Signed-off-by: Kairui Song <kasong@redhat.com>
Cc: David Howells <dhowells@redhat.com>
Acked-by: Dave Young <dyoung@redhat.com> (for kexec_file_load part)
[zohar@linux.ibm.com: tweaked the first paragraph of the patch description,
and fixed checkpatch warning.]
Signed-off-by: Mimi Zohar <zohar@linux.ibm.com>
This adds an smp_acquire__after_ctrl_dep() barrier on successful
decrease of refcounter value from 1 to 0 for refcount_dec(sub)_and_test
variants and therefore gives stronger memory ordering guarantees than
prior versions of these functions.
Co-developed-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Signed-off-by: Elena Reshetova <elena.reshetova@intel.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Andrea Parri <andrea.parri@amarulasolutions.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Will Deacon <will.deacon@arm.com>
Cc: dvyukov@google.com
Cc: keescook@chromium.org
Cc: stern@rowland.harvard.edu
Link: https://lkml.kernel.org/r/1548847131-27854-2-git-send-email-elena.reshetova@intel.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
intel_pmu_cpu_prepare() allocated memory for ->shared_regs among other
members of struct cpu_hw_events. This memory is released in
intel_pmu_cpu_dying() which is wrong. The counterpart of the
intel_pmu_cpu_prepare() callback is x86_pmu_dead_cpu().
Otherwise if the CPU fails on the UP path between CPUHP_PERF_X86_PREPARE
and CPUHP_AP_PERF_X86_STARTING then it won't release the memory but
allocate new memory on the next attempt to online the CPU (leaking the
old memory).
Also, if the CPU down path fails between CPUHP_AP_PERF_X86_STARTING and
CPUHP_PERF_X86_PREPARE then the CPU will go back online but never
allocate the memory that was released in x86_pmu_dying_cpu().
Make the memory allocation/free symmetrical in regard to the CPU hotplug
notifier by moving the deallocation to intel_pmu_cpu_dead().
This started in commit:
a7e3ed1e47 ("perf: Add support for supplementary event registers").
In principle the bug was introduced in v2.6.39 (!), but it will almost
certainly not backport cleanly across the big CPU hotplug rewrite between v4.7-v4.15...
[ bigeasy: Added patch description. ]
[ mingo: Added backporting guidance. ]
Reported-by: He Zhe <zhe.he@windriver.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> # With developer hat on
Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> # With maintainer hat on
Cc: Alexander Shishkin <alexander.shishkin@linux.intel.com>
Cc: Arnaldo Carvalho de Melo <acme@redhat.com>
Cc: Jiri Olsa <jolsa@redhat.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: acme@kernel.org
Cc: bp@alien8.de
Cc: hpa@zytor.com
Cc: jolsa@kernel.org
Cc: kan.liang@linux.intel.com
Cc: namhyung@kernel.org
Cc: <stable@vger.kernel.org>
Fixes: a7e3ed1e47 ("perf: Add support for supplementary event registers").
Link: https://lkml.kernel.org/r/20181219165350.6s3jvyxbibpvlhtq@linutronix.de
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Some PCI uncore PMUs cannot be registered on an 8-socket system (HPE
Superdome Flex).
To understand which Socket the PCI uncore PMUs belongs to, perf retrieves
the local Node ID of the uncore device from CPUNODEID(0xC0) of the PCI
configuration space, and the mapping between Socket ID and Node ID from
GIDNIDMAP(0xD4). The Socket ID can be calculated accordingly.
The local Node ID is only available at bit 2:0, but current code doesn't
mask it. If a BIOS doesn't clear the rest of the bits, an incorrect Node ID
will be fetched.
Filter the Node ID by adding a mask.
Reported-by: Song Liu <songliubraving@fb.com>
Tested-by: Song Liu <songliubraving@fb.com>
Signed-off-by: Kan Liang <kan.liang@linux.intel.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Alexander Shishkin <alexander.shishkin@linux.intel.com>
Cc: Arnaldo Carvalho de Melo <acme@redhat.com>
Cc: Jiri Olsa <jolsa@redhat.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: <stable@vger.kernel.org> # v3.7+
Fixes: 7c94ee2e09 ("perf/x86: Add Intel Nehalem and Sandy Bridge-EP uncore support")
Link: https://lkml.kernel.org/r/1548600794-33162-1-git-send-email-kan.liang@linux.intel.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Move the x86 EFI earlyprintk implementation to a shared location under
drivers/firmware and tweak it slightly so we can expose it as an earlycon
implementation (which is generic) rather than earlyprintk (which is only
implemented for a few architectures)
This also involves switching to write-combine mappings by default (which
is required on ARM since device mappings lack memory semantics, and so
memcpy/memset may not be used on them), and adding support for shared
memory framebuffers on cache coherent non-x86 systems (which do not
tolerate mismatched attributes).
Note that 32-bit ARM does not populate its struct screen_info early
enough for earlycon=efifb to work, so it is disabled there.
Signed-off-by: Ard Biesheuvel <ard.biesheuvel@linaro.org>
Reviewed-by: Alexander Graf <agraf@suse.de>
Cc: AKASHI Takahiro <takahiro.akashi@linaro.org>
Cc: Bjorn Andersson <bjorn.andersson@linaro.org>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Heinrich Schuchardt <xypron.glpk@gmx.de>
Cc: Jeffrey Hugo <jhugo@codeaurora.org>
Cc: Lee Jones <lee.jones@linaro.org>
Cc: Leif Lindholm <leif.lindholm@linaro.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Matt Fleming <matt@codeblueprint.co.uk>
Cc: Peter Jones <pjones@redhat.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Sai Praneeth Prakhya <sai.praneeth.prakhya@intel.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: linux-efi@vger.kernel.org
Link: http://lkml.kernel.org/r/20190202094119.13230-10-ard.biesheuvel@linaro.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Turn ARCH_USE_MEMREMAP_PROT into a generic Kconfig symbol, and fix the
dependency expression to reflect that AMD_MEM_ENCRYPT depends on it,
instead of the other way around. This will permit ARCH_USE_MEMREMAP_PROT
to be selected by other architectures.
Note that the encryption related early memremap routines in
arch/x86/mm/ioremap.c cannot be built for 32-bit x86 without triggering
the following warning:
arch/x86//mm/ioremap.c: In function 'early_memremap_encrypted':
>> arch/x86/include/asm/pgtable_types.h:193:27: warning: conversion from
'long long unsigned int' to 'long unsigned int' changes
value from '9223372036854776163' to '355' [-Woverflow]
#define __PAGE_KERNEL_ENC (__PAGE_KERNEL | _PAGE_ENC)
^~~~~~~~~~~~~~~~~~~~~~~~~~~
arch/x86//mm/ioremap.c:713:46: note: in expansion of macro '__PAGE_KERNEL_ENC'
return early_memremap_prot(phys_addr, size, __PAGE_KERNEL_ENC);
which essentially means they are 64-bit only anyway. However, we cannot
make them dependent on CONFIG_ARCH_HAS_MEM_ENCRYPT, since that is always
defined, even for i386 (and changing that results in a slew of build errors)
So instead, build those routines only if CONFIG_AMD_MEM_ENCRYPT is
defined.
Signed-off-by: Ard Biesheuvel <ard.biesheuvel@linaro.org>
Cc: AKASHI Takahiro <takahiro.akashi@linaro.org>
Cc: Alexander Graf <agraf@suse.de>
Cc: Bjorn Andersson <bjorn.andersson@linaro.org>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Heinrich Schuchardt <xypron.glpk@gmx.de>
Cc: Jeffrey Hugo <jhugo@codeaurora.org>
Cc: Lee Jones <lee.jones@linaro.org>
Cc: Leif Lindholm <leif.lindholm@linaro.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Matt Fleming <matt@codeblueprint.co.uk>
Cc: Peter Jones <pjones@redhat.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Sai Praneeth Prakhya <sai.praneeth.prakhya@intel.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: linux-efi@vger.kernel.org
Link: http://lkml.kernel.org/r/20190202094119.13230-9-ard.biesheuvel@linaro.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
can_free_region() is called only once during boot, by
efi_reserve_boot_services().
Hence, mark it as an __init function.
Signed-off-by: Sai Praneeth Prakhya <sai.praneeth.prakhya@intel.com>
Signed-off-by: Ard Biesheuvel <ard.biesheuvel@linaro.org>
Cc: AKASHI Takahiro <takahiro.akashi@linaro.org>
Cc: Alexander Graf <agraf@suse.de>
Cc: Bjorn Andersson <bjorn.andersson@linaro.org>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Heinrich Schuchardt <xypron.glpk@gmx.de>
Cc: Jeffrey Hugo <jhugo@codeaurora.org>
Cc: Lee Jones <lee.jones@linaro.org>
Cc: Leif Lindholm <leif.lindholm@linaro.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Matt Fleming <matt@codeblueprint.co.uk>
Cc: Peter Jones <pjones@redhat.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: linux-efi@vger.kernel.org
Link: http://lkml.kernel.org/r/20190202094119.13230-2-ard.biesheuvel@linaro.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Pull x86 fixes from Thomas Gleixner:
"A few updates for x86:
- Fix an unintended sign extension issue in the fault handling code
- Rename the new resource control config switch so it's less
confusing
- Avoid setting up EFI info in kexec when the EFI runtime is
disabled.
- Fix the microcode version check in the AMD microcode loader so it
only loads higher version numbers and never downgrades
- Set EFER.LME in the 32bit trampoline before returning to long mode
to handle older AMD/KVM behaviour properly.
- Add Darren and Andy as x86/platform reviewers"
* 'x86-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
x86/resctrl: Avoid confusion over the new X86_RESCTRL config
x86/kexec: Don't setup EFI info if EFI runtime is not enabled
x86/microcode/amd: Don't falsely trick the late loading mechanism
MAINTAINERS: Add Andy and Darren as arch/x86/platform/ reviewers
x86/fault: Fix sign-extend unintended sign extension
x86/boot/compressed/64: Set EFER.LME=1 in 32-bit trampoline before returning to long mode
x86/cpu: Add Atom Tremont (Jacobsville)
Pull cpu hotplug fixes from Thomas Gleixner:
"Two fixes for the cpu hotplug machinery:
- Replace the overly clever 'SMT disabled by BIOS' detection logic as
it breaks KVM scenarios and prevents speculation control updates
when the Hyperthreads are brought online late after boot.
- Remove a redundant invocation of the speculation control update
function"
* 'smp-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
cpu/hotplug: Fix "SMT disabled by BIOS" detection for KVM
x86/speculation: Remove redundant arch_smt_update() invocation
Internal injection testing crashed with a console log that said:
mce: [Hardware Error]: CPU 7: Machine Check Exception: f Bank 0: bd80000000100134
This caused a lot of head scratching because the MCACOD (bits 15:0) of
that status is a signature from an L1 data cache error. But Linux says
that it found it in "Bank 0", which on this model CPU only reports L1
instruction cache errors.
The answer was that Linux doesn't initialize "m->bank" in the case that
it finds a fatal error in the mce_no_way_out() pre-scan of banks. If
this was a local machine check, then this partially initialized struct
mce is being passed to mce_panic().
Fix is simple: just initialize m->bank in the case of a fatal error.
Fixes: 40c36e2741 ("x86/mce: Fix incorrect "Machine check from unknown source" message")
Signed-off-by: Tony Luck <tony.luck@intel.com>
Signed-off-by: Borislav Petkov <bp@suse.de>
Cc: "H. Peter Anvin" <hpa@zytor.com>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Vishal Verma <vishal.l.verma@intel.com>
Cc: x86-ml <x86@kernel.org>
Cc: stable@vger.kernel.org # v4.18 Note pre-v5.0 arch/x86/kernel/cpu/mce/core.c was called arch/x86/kernel/cpu/mcheck/mce.c
Link: https://lkml.kernel.org/r/20190201003341.10638-1-tony.luck@intel.com
Some SMCA bank types on future systems will report new error types even
though the bank type is not treated as a new version. These new error
types will reported by bits that are reserved in past systems.
Add the new error descriptions to the lists in edac_mce_amd.
Signed-off-by: Yazen Ghannam <yazen.ghannam@amd.com>
Signed-off-by: Borislav Petkov <bp@suse.de>
Cc: "H. Peter Anvin" <hpa@zytor.com>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Kees Cook <keescook@chromium.org>
Cc: linux-edac <linux-edac@vger.kernel.org>
Cc: Mauro Carvalho Chehab <mchehab@kernel.org>
Cc: Shirish S <Shirish.S@amd.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Tony Luck <tony.luck@intel.com>
Cc: x86-ml <x86@kernel.org>
Link: https://lkml.kernel.org/r/20190201225534.8177-4-Yazen.Ghannam@amd.com
The existing CS, PSP, and SMU SMCA bank types will see new versions (as
indicated by their McaTypes) in future SMCA systems.
Add the new (HWID, MCATYPE) tuples for these new versions. Reuse the
same names as the older versions, since they are logically the same to
the user. SMCA systems won't mix and match IP blocks with different
McaType versions in the same system, so there isn't a need to
distinguish them. The MCA_IPID register is saved when logging an MCA
error, and that can be used to triage the error.
Also, add the new error descriptions to edac_mce_amd. Some error types
(positions in the list) are overloaded compared to the previous
McaTypes. Therefore, just create new lists of the error descriptions to
keep things simple even if some of the error descriptions are the same
between versions.
Signed-off-by: Yazen Ghannam <yazen.ghannam@amd.com>
Signed-off-by: Borislav Petkov <bp@suse.de>
Cc: Arnd Bergmann <arnd@arndb.de>
Cc: "H. Peter Anvin" <hpa@zytor.com>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Kees Cook <keescook@chromium.org>
Cc: linux-edac <linux-edac@vger.kernel.org>
Cc: Mauro Carvalho Chehab <mchehab@kernel.org>
Cc: Pu Wen <puwen@hygon.cn>
Cc: Qiuxu Zhuo <qiuxu.zhuo@intel.com>
Cc: Shirish S <Shirish.S@amd.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Tony Luck <tony.luck@intel.com>
Cc: Vishal Verma <vishal.l.verma@intel.com>
Cc: x86-ml <x86@kernel.org>
Link: https://lkml.kernel.org/r/20190201225534.8177-3-Yazen.Ghannam@amd.com
"Resource Control" is a very broad term for this CPU feature, and a term
that is also associated with containers, cgroups etc. This can easily
cause confusion.
Make the user prompt more specific. Match the config symbol name.
[ bp: In the future, the corresponding ARM arch-specific code will be
under ARM_CPU_RESCTRL and the arch-agnostic bits will be carved out
under the CPU_RESCTRL umbrella symbol. ]
Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
Signed-off-by: Borislav Petkov <bp@suse.de>
Cc: Babu Moger <Babu.Moger@amd.com>
Cc: Fenghua Yu <fenghua.yu@intel.com>
Cc: "H. Peter Anvin" <hpa@zytor.com>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: James Morse <james.morse@arm.com>
Cc: Jonathan Corbet <corbet@lwn.net>
Cc: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>
Cc: linux-doc@vger.kernel.org
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Pu Wen <puwen@hygon.cn>
Cc: Reinette Chatre <reinette.chatre@intel.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Tony Luck <tony.luck@intel.com>
Cc: x86-ml <x86@kernel.org>
Link: https://lkml.kernel.org/r/20190130195621.GA30653@cmpxchg.org
If the kernel is configured with KASAN_EXTRA, the stack size is
increasted significantly because this option sets "-fstack-reuse" to
"none" in GCC [1]. As a result, it triggers stack overrun quite often
with 32k stack size compiled using GCC 8. For example, this reproducer
https://github.com/linux-test-project/ltp/blob/master/testcases/kernel/syscalls/madvise/madvise06.c
triggers a "corrupted stack end detected inside scheduler" very reliably
with CONFIG_SCHED_STACK_END_CHECK enabled.
There are just too many functions that could have a large stack with
KASAN_EXTRA due to large local variables that have been called over and
over again without being able to reuse the stacks. Some noticiable ones
are
size
7648 shrink_page_list
3584 xfs_rmap_convert
3312 migrate_page_move_mapping
3312 dev_ethtool
3200 migrate_misplaced_transhuge_page
3168 copy_process
There are other 49 functions are over 2k in size while compiling kernel
with "-Wframe-larger-than=" even with a related minimal config on this
machine. Hence, it is too much work to change Makefiles for each object
to compile without "-fsanitize-address-use-after-scope" individually.
[1] https://gcc.gnu.org/bugzilla/show_bug.cgi?id=81715#c23
Although there is a patch in GCC 9 to help the situation, GCC 9 probably
won't be released in a few months and then it probably take another
6-month to 1-year for all major distros to include it as a default.
Hence, the stack usage with KASAN_EXTRA can be revisited again in 2020
when GCC 9 is everywhere. Until then, this patch will help users avoid
stack overrun.
This has already been fixed for arm64 for the same reason via
6e8830674e ("arm64: kasan: Increase stack size for KASAN_EXTRA").
Link: http://lkml.kernel.org/r/20190109215209.2903-1-cai@lca.pw
Signed-off-by: Qian Cai <cai@lca.pw>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Borislav Petkov <bp@alien8.de>
Cc: "H. Peter Anvin" <hpa@zytor.com>
Cc: Andrey Ryabinin <aryabinin@virtuozzo.com>
Cc: Alexander Potapenko <glider@google.com>
Cc: Dmitry Vyukov <dvyukov@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
In commit 170d13ca3a ("x86: re-introduce non-generic memcpy_{to,from}io")
I made our copy from IO space use a separate copy routine rather than
rely on the generic memcpy. I did that because our generic memory copy
isn't actually well-defined when it comes to internal access ordering or
alignment, and will in fact depend on various CPUID flags.
In particular, the default memcpy() for a modern Intel CPU will
generally be just a "rep movsb", which works reasonably well for
medium-sized memory copies of regular RAM, since the CPU will turn it
into fairly optimized microcode.
However, for non-cached memory and IO, "rep movs" ends up being
horrendously slow and will just do the architectural "one byte at a
time" accesses implied by the movsb.
At the other end of the spectrum, if you _don't_ end up using the "rep
movsb" code, you'd likely fall back to the software copy, which does
overlapping accesses for the tail, and may copy things backwards.
Again, for regular memory that's fine, for IO memory not so much.
The thinking was that clearly nobody really cared (because things
worked), but some people had seen horrible performance due to the byte
accesses, so let's just revert back to our long ago version that dod
"rep movsl" for the bulk of the copy, and then fixed up the potentially
last few bytes of the tail with "movsw/b".
Interestingly (and perhaps not entirely surprisingly), while that was
our original memory copy implementation, and had been used before for
IO, in the meantime many new users of memcpy_*io() had come about. And
while the access patterns for the memory copy weren't well-defined (so
arguably _any_ access pattern should work), in practice the "rep movsb"
case had been very common for the last several years.
In particular Jarkko Sakkinen reported that the memcpy_*io() change
resuled in weird errors from his Geminilake NUC TPM module.
And it turns out that the TPM TCG accesses according to spec require
that the accesses be
(a) done strictly sequentially
(b) be naturally aligned
otherwise the TPM chip will abort the PCI transaction.
And, in fact, the tpm_crb.c driver did this:
memcpy_fromio(buf, priv->rsp, 6);
...
memcpy_fromio(&buf[6], &priv->rsp[6], expected - 6);
which really should never have worked in the first place, but back
before commit 170d13ca3a it *happened* to work, because the
memcpy_fromio() would be expanded to a regular memcpy, and
(a) gcc would expand the first memcpy in-line, and turn it into a
4-byte and a 2-byte read, and they happened to be in the right
order, and the alignment was right.
(b) gcc would call "memcpy()" for the second one, and the machines that
had this TPM chip also apparently ended up always having ERMS
("Enhanced REP MOVSB/STOSB instructions"), so we'd use the "rep
movbs" for that copy.
In other words, basically by pure luck, the code happened to use the
right access sizes in the (two different!) memcpy() implementations to
make it all work.
But after commit 170d13ca3a, both of the memcpy_fromio() calls
resulted in a call to the routine with the consistent memory accesses,
and in both cases it started out transferring with 4-byte accesses.
Which worked for the first copy, but resulted in the second copy doing a
32-bit read at an address that was only 2-byte aligned.
Jarkko is actually fixing the fragile code in the TPM driver, but since
this is an excellent example of why we absolutely must not use a generic
memcpy for IO accesses, _and_ an IO-specific one really should strive to
align the IO accesses, let's do exactly that.
Side note: Jarkko also noted that the driver had been used on ARM
platforms, and had worked. That was because on 32-bit ARM, memcpy_*io()
ends up always doing byte accesses, and on 64-bit ARM it first does byte
accesses to align to 8-byte boundaries, and then does 8-byte accesses
for the bulk.
So ARM actually worked by design, and the x86 case worked by pure luck.
We *might* want to make x86-64 do the 8-byte case too. That should be a
pretty straightforward extension, but let's do one thing at a time. And
generally MMIO accesses aren't really all that performance-critical, as
shown by the fact that for a long time we just did them a byte at a
time, and very few people ever noticed.
Reported-and-tested-by: Jarkko Sakkinen <jarkko.sakkinen@linux.intel.com>
Tested-by: Jerry Snitselaar <jsnitsel@redhat.com>
Cc: David Laight <David.Laight@aculab.com>
Fixes: 170d13ca3a ("x86: re-introduce non-generic memcpy_{to,from}io")
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
KASLR may randomly choose a range which is located in movable memory
regions. As a result, this will break memory hotplug and make the
movable memory chosen by KASLR immovable.
Therefore, limit KASLR to choose memory regions in the immovable range
after consulting the SRAT table.
[ bp:
- Rewrite commit message.
- Trim comments.
]
Signed-off-by: Chao Fan <fanc.fnst@cn.fujitsu.com>
Signed-off-by: Borislav Petkov <bp@suse.de>
Cc: Ard Biesheuvel <ard.biesheuvel@linaro.org>
Cc: Baoquan He <bhe@redhat.com>
Cc: caoj.fnst@cn.fujitsu.com
Cc: Dave Hansen <dave.hansen@linux.intel.com>
Cc: "H. Peter Anvin" <hpa@zytor.com>
Cc: indou.takao@jp.fujitsu.com
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Juergen Gross <jgross@suse.com>
Cc: kasong@redhat.com
Cc: Kees Cook <keescook@chromium.org>
Cc: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>
Cc: msys.mizuma@gmail.com
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Tom Lendacky <thomas.lendacky@amd.com>
Cc: x86-ml <x86@kernel.org>
Link: https://lkml.kernel.org/r/20190123110850.12433-8-fanc.fnst@cn.fujitsu.com
Parse SRAT for the immovable memory regions and use that information to
control which offset KASLR selects so that it doesn't overlap with any
movable region.
[ bp:
- Move struct mem_vector where it is visible so that it builds.
- Correct comments.
- Rewrite commit message.
]
Signed-off-by: Chao Fan <fanc.fnst@cn.fujitsu.com>
Signed-off-by: Borislav Petkov <bp@suse.de>
Cc: Ard Biesheuvel <ard.biesheuvel@linaro.org>
Cc: Baoquan He <bhe@redhat.com>
Cc: <caoj.fnst@cn.fujitsu.com>
Cc: Dave Hansen <dave.hansen@linux.intel.com>
Cc: "H. Peter Anvin" <hpa@zytor.com>
Cc: <indou.takao@jp.fujitsu.com>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Juergen Gross <jgross@suse.com>
Cc: <kasong@redhat.com>
Cc: <keescook@chromium.org>
Cc: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>
Cc: <msys.mizuma@gmail.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Tom Lendacky <thomas.lendacky@amd.com>
Cc: x86-ml <x86@kernel.org>
Link: https://lkml.kernel.org/r/20190123110850.12433-7-fanc.fnst@cn.fujitsu.com
The RSDP is needed by KASLR so parse it early and save it in
boot_params.acpi_rsdp_addr, before KASLR setup runs.
RSDP is needed by other kernel facilities so have the parsing code
built-in instead of a long "depends on" line in Kconfig.
[ bp:
- Trim commit message and comments
- Add CONFIG_ACPI dependency in the Makefile
- Move ->acpi_rsdp_addr assignment with the rest of boot_params massaging in extract_kernel().
]
Signed-off-by: Chao Fan <fanc.fnst@cn.fujitsu.com>
Signed-off-by: Borislav Petkov <bp@suse.de>
Cc: bhe@redhat.com
Cc: Cao jin <caoj.fnst@cn.fujitsu.com>
Cc: "H. Peter Anvin" <hpa@zytor.com>
Cc: indou.takao@jp.fujitsu.com
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Juergen Gross <jgross@suse.com>
Cc: kasong@redhat.com
Cc: Kees Cook <keescook@chromium.org>
Cc: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>
Cc: Masahiro Yamada <yamada.masahiro@socionext.com>
Cc: msys.mizuma@gmail.com
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Tom Lendacky <thomas.lendacky@amd.com>
Cc: x86-ml <x86@kernel.org>
Link: https://lkml.kernel.org/r/20190123110850.12433-6-fanc.fnst@cn.fujitsu.com
The immovable memory ranges information in the SRAT table is necessary
to fix the issue of KASLR not paying attention to movable memory regions
when selecting the offset. Therefore, SRAT needs to be parsed.
Depending on the boot: KEXEC/EFI/BIOS, the methods to compute RSDP are
different. When booting from EFI, the EFI table points to the RSDP. So
iterate over the EFI system tables in order to find the RSDP.
[ bp:
- Heavily massage commit message
- Trim comments
- Move the CONFIG_ACPI ifdeffery into the Makefile.
]
Signed-off-by: Chao Fan <fanc.fnst@cn.fujitsu.com>
Signed-off-by: Borislav Petkov <bp@suse.de>
Cc: bhe@redhat.com
Cc: caoj.fnst@cn.fujitsu.com
Cc: "H. Peter Anvin" <hpa@zytor.com>
Cc: indou.takao@jp.fujitsu.com
Cc: Ingo Molnar <mingo@redhat.com>
Cc: kasong@redhat.com
Cc: Kees Cook <keescook@chromium.org>
Cc: msys.mizuma@gmail.com
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: x86-ml <x86@kernel.org>
Link: https://lkml.kernel.org/r/20190123110850.12433-4-fanc.fnst@cn.fujitsu.com
KASLR may randomly choose offsets which are located in movable memory
regions resulting in the movable memory becoming immovable.
The ACPI SRAT (System/Static Resource Affinity Table) describes memory
ranges including ranges of memory provided by hot-added memory devices.
In order to access SRAT, one needs the Root System Description Pointer
(RSDP) with which to find the Root/Extended System Description Table
(R/XSDT) which then contains the system description tables of which SRAT
is one of.
In case the RSDP address has been passed on the command line (kexec-ing
a second kernel) parse it from there.
[ bp: Rewrite the commit message and cleanup the code. ]
Signed-off-by: Chao Fan <fanc.fnst@cn.fujitsu.com>
Signed-off-by: Borislav Petkov <bp@suse.de>
Cc: bhe@redhat.com
Cc: caoj.fnst@cn.fujitsu.com
Cc: "H. Peter Anvin" <hpa@zytor.com>
Cc: indou.takao@jp.fujitsu.com
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Juergen Gross <jgross@suse.com>
Cc: kasong@redhat.com
Cc: Kees Cook <keescook@chromium.org>
Cc: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>
Cc: msys.mizuma@gmail.com
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Tom Lendacky <thomas.lendacky@amd.com>
Cc: x86-ml <x86@kernel.org>
Link: https://lkml.kernel.org/r/20190123110850.12433-3-fanc.fnst@cn.fujitsu.com
Copy kstrtoull() and the other necessary functions from lib/kstrtox.c
to boot/string.c so that code in boot/ can use kstrtoull() and the old
simple_strtoull() can gradually be phased out.
Using div_u64() from math64.h directly will cause the dividend to be
handled as a 64-bit value and cause the infamous __divdi3 linker error
due to gcc trying to use its library function for the 64-bit division.
Therefore, separate the dividend into an upper and lower part.
[ bp: Rewrite commit message. ]
Signed-off-by: Chao Fan <fanc.fnst@cn.fujitsu.com>
Signed-off-by: Borislav Petkov <bp@suse.de>
Cc: bhe@redhat.com
Cc: caoj.fnst@cn.fujitsu.com
Cc: "H. Peter Anvin" <hpa@zytor.com>
Cc: indou.takao@jp.fujitsu.com
Cc: Ingo Molnar <mingo@redhat.com>
Cc: kasong@redhat.com
Cc: Kees Cook <keescook@chromium.org>
Cc: msys.mizuma@gmail.com
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: x86-ml <x86@kernel.org>
Link: https://lkml.kernel.org/r/20190123110850.12433-2-fanc.fnst@cn.fujitsu.com
The load_microcode_amd() function searches for microcode patches and
attempts to apply a microcode patch if it is of different level than the
currently installed level.
While the processor won't actually load a level that is less than
what is already installed, the logic wrongly returns UCODE_NEW thus
signaling to its caller reload_store() that a late loading should be
attempted.
If the file-system contains an older microcode revision than what is
currently running, such a late microcode reload can result in these
misleading messages:
x86/CPU: CPU features have changed after loading microcode, but might not take effect.
x86/CPU: Please consider either early loading through initrd/built-in or a potential BIOS update.
These messages were issued on a system where SME/SEV are not
enabled by the BIOS (MSR C001_0010[23] = 0b) because during boot,
early_detect_mem_encrypt() is called and cleared the SME and SEV
features in this case.
However, after the wrong late load attempt, get_cpu_cap() is called and
reloads the SME and SEV feature bits, resulting in the messages.
Update the microcode level check to not attempt microcode loading if the
current level is greater than(!) and not only equal to the current patch
level.
[ bp: massage commit message. ]
Fixes: 2613f36ed9 ("x86/microcode: Attempt late loading only when new microcode is present")
Signed-off-by: Tom Lendacky <thomas.lendacky@amd.com>
Signed-off-by: Borislav Petkov <bp@suse.de>
Cc: "H. Peter Anvin" <hpa@zytor.com>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: x86-ml <x86@kernel.org>
Link: https://lkml.kernel.org/r/154894518427.9406.8246222496874202773.stgit@tlendack-t1.amdoffice.net
With the following commit:
73d5e2b472 ("cpu/hotplug: detect SMT disabled by BIOS")
... the hotplug code attempted to detect when SMT was disabled by BIOS,
in which case it reported SMT as permanently disabled. However, that
code broke a virt hotplug scenario, where the guest is booted with only
primary CPU threads, and a sibling is brought online later.
The problem is that there doesn't seem to be a way to reliably
distinguish between the HW "SMT disabled by BIOS" case and the virt
"sibling not yet brought online" case. So the above-mentioned commit
was a bit misguided, as it permanently disabled SMT for both cases,
preventing future virt sibling hotplugs.
Going back and reviewing the original problems which were attempted to
be solved by that commit, when SMT was disabled in BIOS:
1) /sys/devices/system/cpu/smt/control showed "on" instead of
"notsupported"; and
2) vmx_vm_init() was incorrectly showing the L1TF_MSG_SMT warning.
I'd propose that we instead consider #1 above to not actually be a
problem. Because, at least in the virt case, it's possible that SMT
wasn't disabled by BIOS and a sibling thread could be brought online
later. So it makes sense to just always default the smt control to "on"
to allow for that possibility (assuming cpuid indicates that the CPU
supports SMT).
The real problem is #2, which has a simple fix: change vmx_vm_init() to
query the actual current SMT state -- i.e., whether any siblings are
currently online -- instead of looking at the SMT "control" sysfs value.
So fix it by:
a) reverting the original "fix" and its followup fix:
73d5e2b472 ("cpu/hotplug: detect SMT disabled by BIOS")
bc2d8d262c ("cpu/hotplug: Fix SMT supported evaluation")
and
b) changing vmx_vm_init() to query the actual current SMT state --
instead of the sysfs control value -- to determine whether the L1TF
warning is needed. This also requires the 'sched_smt_present'
variable to exported, instead of 'cpu_smt_control'.
Fixes: 73d5e2b472 ("cpu/hotplug: detect SMT disabled by BIOS")
Reported-by: Igor Mammedov <imammedo@redhat.com>
Signed-off-by: Josh Poimboeuf <jpoimboe@redhat.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Cc: Joe Mario <jmario@redhat.com>
Cc: Jiri Kosina <jikos@kernel.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: kvm@vger.kernel.org
Cc: stable@vger.kernel.org
Link: https://lkml.kernel.org/r/e3a85d585da28cc333ecbc1e78ee9216e6da9396.1548794349.git.jpoimboe@redhat.com
ENTRY is intended for functions and shall be paired with ENDPROC. ENTRY
also aligns symbols which creates unnecessary holes between data.
So drop ENTRY from saved_eip in wakeup_32 and many saved_* in wakeup_64,
as these symbols are local only.
One could've used SYM_DATA_LOCAL for these symbols, but it was
discouraged earlier:
https://lkml.kernel.org/r/20170427124310.GC23352@amd
Signed-off-by: Jiri Slaby <jslaby@suse.cz>
Signed-off-by: Borislav Petkov <bp@suse.de>
Acked-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
Acked-by: Pavel Machek <pavel@ucw.cz>
Cc: "H. Peter Anvin" <hpa@zytor.com>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Len Brown <len.brown@intel.com>
Cc: linux-arch@vger.kernel.org
Cc: linux-pm@vger.kernel.org
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: x86-ml <x86@kernel.org>
Link: https://lkml.kernel.org/r/20190130124711.12463-3-jslaby@suse.cz
Remove the ifdeffery in the breakpoint parsing arch_build_bp_info() by
adding a within_kprobe_blacklist() stub for the !CONFIG_KPROBES case.
It is returning true when kprobes are not enabled to mean that any
address is within the kprobes blacklist on such kernels and thus not
allow kernel breakpoints on non-kprobes kernels.
Signed-off-by: Borislav Petkov <bp@suse.de>
Acked-by: Masami Hiramatsu <mhiramat@kernel.org>
Cc: Anil S Keshavamurthy <anil.s.keshavamurthy@intel.com>
Cc: "David S. Miller" <davem@davemloft.net>
Cc: Frederic Weisbecker <frederic@kernel.org>
Cc: "H. Peter Anvin" <hpa@zytor.com>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: "Naveen N. Rao" <naveen.n.rao@linux.ibm.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Link: https://lkml.kernel.org/r/20190127131237.4557-1-bp@alien8.de
With the default SPEC_STORE_BYPASS_SECCOMP/SPEC_STORE_BYPASS_PRCTL mode,
the TIF_SSBD bit will be inherited when a new task is fork'ed or cloned.
It will also remain when a new program is execve'ed.
Only certain class of applications (like Java) that can run on behalf of
multiple users on a single thread will require disabling speculative store
bypass for security purposes. Those applications will call prctl(2) at
startup time to disable SSB. They won't rely on the fact the SSB might have
been disabled. Other applications that don't need SSBD will just move on
without checking if SSBD has been turned on or not.
The fact that the TIF_SSBD is inherited across execve(2) boundary will
cause performance of applications that don't need SSBD but their
predecessors have SSBD on to be unwittingly impacted especially if they
write to memory a lot.
To remedy this problem, a new PR_SPEC_DISABLE_NOEXEC argument for the
PR_SET_SPECULATION_CTRL option of prctl(2) is added to allow applications
to specify that the SSBD feature bit on the task structure should be
cleared whenever a new program is being execve'ed.
Suggested-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Waiman Long <longman@redhat.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Jonathan Corbet <corbet@lwn.net>
Cc: linux-doc@vger.kernel.org
Cc: "H. Peter Anvin" <hpa@zytor.com>
Cc: Andi Kleen <ak@linux.intel.com>
Cc: David Woodhouse <dwmw@amazon.co.uk>
Cc: Jiri Kosina <jikos@kernel.org>
Cc: Josh Poimboeuf <jpoimboe@redhat.com>
Cc: Tim Chen <tim.c.chen@linux.intel.com>
Cc: KarimAllah Ahmed <karahmed@amazon.de>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
Link: https://lkml.kernel.org/r/1547676096-3281-1-git-send-email-longman@redhat.com
gdt64 represents the content of GDTR under x86-64, which actually needs
10 bytes only, ".long" & ".word" is superfluous.
Signed-off-by: Cao jin <caoj.fnst@cn.fujitsu.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Acked-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Cc: <bp@alien8.de>
Cc: <hpa@zytor.com>
Link: https://lkml.kernel.org/r/20190123100014.23721-1-caoj.fnst@cn.fujitsu.com
There is no early_trap_pf_init() implementation, hence remove this useless
declaration.
Signed-off-by: Pingfan Liu <kernelfans@gmail.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Cc: Borislav Petkov <bp@alien8.de>
Cc: "H. Peter Anvin" <hpa@zytor.com>
Link: https://lkml.kernel.org/r/1546591579-23502-1-git-send-email-kernelfans@gmail.com
show_ldttss() shifts desc.base2 by 24 bit, but base2 is 8 bits of a
bitfield in a u16.
Due to the really great idea of integer promotion in C99 base2 is promoted
to an int, because that's the standard defined behaviour when all values
which can be represented by base2 fit into an int.
Now if bit 7 is set in desc.base2 the result of the shift left by 24 makes
the resulting integer negative and the following conversion to unsigned
long legitmately sign extends first causing the upper bits 32 bits to be
set in the result.
Fix this by casting desc.base2 to unsigned long before the shift.
Detected by CoverityScan, CID#1475635 ("Unintended sign extension")
[ tglx: Reworded the changelog a bit as I actually had to lookup
the standard (again) to decode the original one. ]
Fixes: a1a371c468 ("x86/fault: Decode page fault OOPSes better")
Signed-off-by: Colin Ian King <colin.king@canonical.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Cc: Dave Hansen <dave.hansen@linux.intel.com>
Cc: Andy Lutomirski <luto@kernel.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Borislav Petkov <bp@alien8.de>
Cc: "H . Peter Anvin" <hpa@zytor.com>
Cc: kernel-janitors@vger.kernel.org
Link: https://lkml.kernel.org/r/20181222191116.21831-1-colin.king@canonical.com
In some old AMD KVM implementation, guest's EFER.LME bit is cleared by KVM
when the hypervsior detects that the guest sets CR0.PG to 0. This causes
the guest OS to reboot when it tries to return from 32-bit trampoline code
because the CPU is in incorrect state: CR4.PAE=1, CR0.PG=1, CS.L=1, but
EFER.LME=0. As a precaution, set EFER.LME=1 as part of long mode
activation procedure. This extra step won't cause any harm when Linux is
booted on a bare-metal machine.
Signed-off-by: Wei Huang <wei@redhat.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Acked-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Cc: bp@alien8.de
Cc: hpa@zytor.com
Link: https://lkml.kernel.org/r/20190104054411.12489-1-wei@redhat.com
The "cpu" local variable became unused after
a2055abe9c ("x86/mm: Pass flush_tlb_info to flush_tlb_others() etc").
Remove it.
Signed-off-by: Shaokun Zhang <zhangshaokun@hisilicon.com>
Signed-off-by: Borislav Petkov <bp@suse.de>
Cc: Andy Lutomirski <luto@kernel.org>
Cc: Dave Hansen <dave.hansen@linux.intel.com>
Cc: "H. Peter Anvin" <hpa@zytor.com>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: x86-ml <x86@kernel.org>
Link: https://lkml.kernel.org/r/1548747417-33551-1-git-send-email-zhangshaokun@hisilicon.com
Add the Atom Tremont model number to the Intel family list.
[ Tony: Also update comment at head of file to say "_X" suffix is
also used for microserver parts. ]
Signed-off-by: Kan Liang <kan.liang@linux.intel.com>
Signed-off-by: Qiuxu Zhuo <qiuxu.zhuo@intel.com>
Signed-off-by: Tony Luck <tony.luck@intel.com>
Signed-off-by: Borislav Petkov <bp@suse.de>
Cc: Andy Shevchenko <andriy.shevchenko@linux.intel.com>
Cc: Aristeu Rozanski <aris@redhat.com>
Cc: "H. Peter Anvin" <hpa@zytor.com>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: linux-edac <linux-edac@vger.kernel.org>
Cc: Mauro Carvalho Chehab <mchehab@s-opensource.com>
Cc: Megha Dey <megha.dey@linux.intel.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Qiuxu Zhuo <qiuxu.zhuo@intel.com>
Cc: Rajneesh Bhardwaj <rajneesh.bhardwaj@intel.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: x86-ml <x86@kernel.org>
Link: https://lkml.kernel.org/r/20190125195902.17109-4-tony.luck@intel.com
In preparation to enable -Wimplicit-fallthrough by default, mark
switch-case statements where fall-through is intentional, explicitly in
order to fix a couple of -Wimplicit-fallthrough warnings.
Warning level 3 was used: -Wimplicit-fallthrough=3.
[ bp: Massasge and trim commit message. ]
Signed-off-by: Gustavo A. R. Silva <gustavo@embeddedor.com>
Signed-off-by: Borislav Petkov <bp@suse.de>
Acked-by: Peter Zijlstra <peterz@infradead.org>
Cc: Andi Kleen <ak@linux.intel.com>
Cc: "H. Peter Anvin" <hpa@zytor.com>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Jacek Tomaka <jacek.tomaka@poczta.fm>
Cc: Jia Zhang <qianyue.zj@alibaba-inc.com>
Cc: Jiri Olsa <jolsa@kernel.org>
Cc: Kan Liang <kan.liang@linux.intel.com>
Cc: Kees Cook <keescook@chromium.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: x86-ml <x86@kernel.org>
Link: https://lkml.kernel.org/r/20190125184917.GA7289@embeddedor
Daniel Borkmann says:
====================
pull-request: bpf-next 2019-01-29
The following pull-request contains BPF updates for your *net-next* tree.
The main changes are:
1) Teach verifier dead code removal, this also allows for optimizing /
removing conditional branches around dead code and to shrink the
resulting image. Code store constrained architectures like nfp would
have hard time doing this at JIT level, from Jakub.
2) Add JMP32 instructions to BPF ISA in order to allow for optimizing
code generation for 32-bit sub-registers. Evaluation shows that this
can result in code reduction of ~5-20% compared to 64 bit-only code
generation. Also add implementation for most JITs, from Jiong.
3) Add support for __int128 types in BTF which is also needed for
vmlinux's BTF conversion to work, from Yonghong.
4) Add a new command to bpftool in order to dump a list of BPF-related
parameters from the system or for a specific network device e.g. in
terms of available prog/map types or helper functions, from Quentin.
5) Add AF_XDP sock_diag interface for querying sockets from user
space which provides information about the RX/TX/fill/completion
rings, umem, memory usage etc, from Björn.
6) Add skb context access for skb_shared_info->gso_segs field, from Eric.
7) Add support for testing flow dissector BPF programs by extending
existing BPF_PROG_TEST_RUN infrastructure, from Stanislav.
8) Split BPF kselftest's test_verifier into various subgroups of tests
in order better deal with merge conflicts in this area, from Jakub.
9) Add support for queue/stack manipulations in bpftool, from Stanislav.
10) Document BTF, from Yonghong.
11) Dump supported ELF section names in libbpf on program load
failure, from Taeung.
12) Silence a false positive compiler warning in verifier's BTF
handling, from Peter.
13) Fix help string in bpftool's feature probing, from Prashant.
14) Remove duplicate includes in BPF kselftests, from Yue.
====================
Signed-off-by: David S. Miller <davem@davemloft.net>