Commit Graph

29358 Commits

Author SHA1 Message Date
Tom Lendacky 07344b15a9 x86/boot: Fix SEV boot failure from change to __PHYSICAL_MASK_SHIFT
In arch/x86/boot/compressed/kaslr_64.c, CONFIG_AMD_MEM_ENCRYPT support was
initially #undef'd to support SME with minimal effort.  When support for
SEV was added, the #undef remained and some minimal support for setting the
encryption bit was added for building identity mapped pagetable entries.

Commit b83ce5ee91 ("x86/mm/64: Make __PHYSICAL_MASK_SHIFT always 52")
changed __PHYSICAL_MASK_SHIFT from 46 to 52 in support of 5-level paging.
This change resulted in SEV guests failing to boot because the encryption
bit was no longer being automatically masked out.  The compressed boot
path now requires sme_me_mask to be defined in order for the pagetable
functions, such as pud_present(), to properly mask out the encryption bit
(currently bit 47) when evaluating pagetable entries.
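
The masking effect described above can be sketched in plain userspace C.
This is only an illustration under stated assumptions: phys_page_mask(),
entry_to_phys() and SEV_C_BIT are hypothetical names and not the kernel's
own helpers. The bit position 47 and the shift values 46 and 52 are taken
from the text above.

```c
#include <assert.h>
#include <stdint.h>

#define PAGE_SHIFT	12
#define SEV_C_BIT	47	/* encryption bit position named in the commit message */

/* Mask selecting the physical-address bits of an entry for a given shift. */
static uint64_t phys_page_mask(unsigned int physical_mask_shift)
{
	assert(physical_mask_shift > PAGE_SHIFT && physical_mask_shift < 64);
	return ((1ULL << physical_mask_shift) - 1) & ~((1ULL << PAGE_SHIFT) - 1);
}

/*
 * Extract the physical address from a pagetable entry. me_mask plays the
 * role of sme_me_mask: it must be removed explicitly once the shift is
 * large enough to cover the encryption bit.
 */
static uint64_t entry_to_phys(uint64_t entry, unsigned int shift, uint64_t me_mask)
{
	return entry & phys_page_mask(shift) & ~me_mask;
}
```

With a shift of 46 the encryption bit falls outside the mask and disappears
on its own. With a shift of 52 it survives unless me_mask is supplied, which
is the situation the compressed boot path ran into.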

Add an sme_me_mask variable in arch/x86/boot/compressed/mem_encrypt.S,
which is set when SEV is active, delete the #undef CONFIG_AMD_MEM_ENCRYPT
from arch/x86/boot/compressed/kaslr_64.c and use sme_me_mask when building
the identity mapped pagetable entries.

Fixes: b83ce5ee91 ("x86/mm/64: Make __PHYSICAL_MASK_SHIFT always 52")
Signed-off-by: Tom Lendacky <thomas.lendacky@amd.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Brijesh Singh <brijesh.singh@amd.com>
Cc: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>
Cc: "H. Peter Anvin" <hpa@zytor.com>
Link: https://lkml.kernel.org/r/20180327220711.8702.55842.stgit@tlendack-t1.amdoffice.net
2018-03-28 10:42:57 +02:00
Andrew Banman 151ad17fbe x86/platform/uv/BAU: Add APIC idt entry
BAU uses the old alloc_intr_gate() method to set up its interrupt. This
fails silently as the BAU vector is in the range of APIC vectors that are
registered to the spurious interrupt handler. As a consequence BAU
broadcasts are not handled, and the broadcast source CPU hangs.

Update BAU to use the new IDT structure.

Fixes: dc20b2d526 ("x86/idt: Move interrupt gate initialization to IDT code")
Signed-off-by: Andrew Banman <abanman@hpe.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Acked-by: Mike Travis <mike.travis@hpe.com>
Cc: Dimitri Sivanich <sivanich@hpe.com>
Cc: Russ Anderson <rja@hpe.com>
Cc: stable@vger.kernel.org
Cc: "H. Peter Anvin" <hpa@zytor.com>
Link: https://lkml.kernel.org/r/1522188546-196177-1-git-send-email-abanman@hpe.com
2018-03-28 10:40:55 +02:00
Eric Dumazet 9b9a51354c x86/msr: Make rdmsrl_safe_on_cpu() scheduling safe as well
When changing rdmsr_safe_on_cpu() to schedule, it was missed that
__rdmsr_safe_on_cpu() was also used by rdmsrl_safe_on_cpu().

Make rdmsrl_safe_on_cpu() a wrapper instead of copy/pasting the code which
was added for the completion handling.

Fixes: 07cde313b2 ("x86/msr: Allow rdmsr_safe_on_cpu() to schedule")
Reported-by: kbuild test robot <fengguang.wu@intel.com>
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Eric Dumazet <eric.dumazet@gmail.com>
Cc: "H. Peter Anvin" <hpa@zytor.com>
Link: https://lkml.kernel.org/r/20180328032233.153055-1-edumazet@google.com
2018-03-28 10:34:13 +02:00
Dave Airlie 2b4f44eec2 Linux 4.16-rc7
-----BEGIN PGP SIGNATURE-----
 
 iQEcBAABAgAGBQJauCZfAAoJEHm+PkMAQRiGWGUH/2rhdQDkoJpYWnjaQkolECG8
 MxpGE7nmIIHxQcbSDdHTGJ8IhVm6Z5wZ7ym/PwCDTT043Y1y341sJrIwL2/nTG6d
 HVidk8hFvgN6QzlzVAHT3ZZMII/V9Zt+VV5SUYLGnPAVuJNHo/6uzWlTU5g+NTFo
 IquFDdQUaGBlkKqby+NoAFnkV1UAIkW0g22cfvPnlO5GMer0gusGyVNvVp7TNj3C
 sqj4Hvt3RMDLMNe9RZ2pFTiOD096n8FWpYftZneUTxFImhRV3Jg5MaaYZm9SI3HW
 tXrv/LChT/F1mi5Pkx6tkT5Hr8WvcrwDMJ4It1kom10RqWAgjxIR3CMm448ileY=
 =YKUG
 -----END PGP SIGNATURE-----

Backmerge tag 'v4.16-rc7' into drm-next

Linux 4.16-rc7

This was requested by Daniel, and things were getting
a bit hard to reconcile, although most of the conflicts
were trivial.
2018-03-28 14:30:41 +10:00
Eric Dumazet 67bbd7a8d6 x86/cpuid: Allow cpuid_read() to schedule
High latencies can be observed caused by a daemon periodically reading
CPUID on all cpus. On KASAN enabled kernels ~10ms latencies can be
observed. Even without KASAN, sending an IPI to a CPU, which is in a deep
sleep state or in a long hard IRQ disabled section, waiting for the answer
can consume hundreds of microseconds.

cpuid_read() is invoked in preemptible context, so it can be converted to
sleep instead of busy wait.

Switching to smp_call_function_single_async() and a completion allows the
caller to reschedule, which reduces CPU usage and latencies.

Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Acked-by: Ingo Molnar <mingo@kernel.org>
Cc: Hugh Dickins <hughd@google.com>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Eric Dumazet <eric.dumazet@gmail.com>
Link: https://lkml.kernel.org/r/20180323215818.127774-2-edumazet@google.com
2018-03-27 12:01:48 +02:00
Eric Dumazet 07cde313b2 x86/msr: Allow rdmsr_safe_on_cpu() to schedule
High latencies can be observed caused by a daemon periodically reading
various MSR on all cpus. On KASAN enabled kernels ~10ms latencies can be
observed simply reading one MSR. Even without KASAN, sending an IPI to a
CPU, which is in a deep sleep state or in a long hard IRQ disabled section,
waiting for the answer can consume hundreds of microseconds.

All usage sites are in preemptible context, so convert rdmsr_safe_on_cpu()
to use a completion instead of busy polling.

Overall daemon cpu usage was reduced by 35 %, and latencies caused by
msr_read() disappeared.

Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Acked-by: Ingo Molnar <mingo@kernel.org>
Cc: Hugh Dickins <hughd@google.com>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Eric Dumazet <eric.dumazet@gmail.com>
Link: https://lkml.kernel.org/r/20180323215818.127774-1-edumazet@google.com
2018-03-27 12:01:47 +02:00
Kirill A. Shutemov 547edaca24 x86/mm: Update comment in detect_tme() regarding x86_phys_bits
As Kai pointed out, the primary reason for adjusting x86_phys_bits is to
reflect that the address space is reduced and not the ability to
communicate the available physical address space to virtual machines.

Suggested-by: Kai Huang <kai.huang@linux.intel.com>
Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Cc: Tom Lendacky <thomas.lendacky@amd.com>
Cc: Dave Hansen <dave.hansen@intel.com>
Cc: linux-mm@kvack.org
Link: https://lkml.kernel.org/r/20180315134907.9311-2-kirill.shutemov@linux.intel.com
2018-03-27 11:49:58 +02:00
Andy Shevchenko 47a9973d3e x86/PCI: Fix a potential regression when using dmi_get_bios_year()
dmi_get_bios_year() may return 0 when it cannot parse
the BIOS date string. Previously this has been checked in
pci_acpi_crs_quirks().
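
A minimal sketch of the pattern at issue, in userspace C. Both
bios_year_from_date() and is_known_old_bios() are hypothetical stand-ins
and not the kernel's functions. The "MM/DD/YYYY" date layout and the cutoff
parameter are assumptions made for illustration. The point is only that a
return value of 0 means "unknown" and must not be compared against a cutoff
year as if it were a real year.

```c
#include <assert.h>
#include <stdio.h>

/* Hypothetical stand-in for dmi_get_bios_year(): 0 means "could not parse". */
static int bios_year_from_date(const char *date)
{
	int month, day, year;

	if (!date || sscanf(date, "%d/%d/%d", &month, &day, &year) != 3)
		return 0;
	return year > 0 ? year : 0;
}

/* Apply an "old BIOS" quirk only when the year is actually known. */
static int is_known_old_bios(int year, int cutoff)
{
	assert(cutoff > 0);
	return year > 0 && year < cutoff;
}
```

Without the year > 0 test, an unparsable date would compare as older than
any cutoff and the quirk would be applied to every such machine.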

Update the code to restore old behaviour.

Reported-by: Jean Delvare <jdelvare@suse.de>
Signed-off-by: Andy Shevchenko <andriy.shevchenko@linux.intel.com>
Reviewed-by: Jean Delvare <jdelvare@suse.de>
Reviewed-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
Acked-by: Thomas Gleixner <tglx@linutronix.de>
Cc: Bjorn Helgaas <bhelgaas@google.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Lukas Wunner <lukas@wunner.de>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Rafael J . Wysocki <rjw@rjwysocki.net>
Cc: linux-acpi@vger.kernel.org
Cc: linux-pci@vger.kernel.org
Fixes: 69c42d493d ("x86/pci: Simplify code by using the new dmi_get_bios_year() helper")
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2018-03-27 11:33:16 +02:00
Alexey Dobriyan bd6271039e x86/alternatives: Fixup alternative_call_2
The following pattern fails to compile while the same pattern
with alternative_call() does:

	if (...)
		alternative_call_2(...);
	else
		alternative_call_2(...);

as it expands into

	if (...)
	{
	};	<===
	else
	{
	};
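
The commit message does not show the fix itself. A common way to make a
multi-statement macro behave as a single statement is the do/while(0)
idiom, sketched here under that assumption. BAD_CALL, GOOD_CALL, calls and
pick() are hypothetical names used only for illustration.

```c
#include <assert.h>

static int calls;	/* accumulates the values passed to the macro */

/*
 * Brace-only form: "BAD_CALL(1);" expands to "{ ... };" and the stray ';'
 * ends the if statement, so a following else no longer compiles.
 */
#define BAD_CALL(n)	{ calls += (n); }

/* do/while(0) form: the expansion is one statement that wants the ';'. */
#define GOOD_CALL(n)	do { calls += (n); } while (0)

static int pick(int cond)
{
	assert(cond == 0 || cond == 1);
	calls = 0;
	if (cond)
		GOOD_CALL(1);
	else
		GOOD_CALL(2);
	return calls;
}
```

Replacing GOOD_CALL with BAD_CALL in pick() reproduces the compile failure
described above.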

Signed-off-by: Alexey Dobriyan <adobriyan@gmail.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Acked-by: Borislav Petkov <bp@suse.de>
Link: https://lkml.kernel.org/r/20180114120504.GA11368@avx2
2018-03-27 09:47:53 +02:00
David Rientjes fc5d1073ca x86/mm/32: Remove unused node_memmap_size_bytes() & CONFIG_NEED_NODE_MEMMAP_SIZE logic
node_memmap_size_bytes() has been unused since the v3.9 kernel, so remove it.

Signed-off-by: David Rientjes <rientjes@google.com>
Cc: Dave Hansen <dave.hansen@linux.intel.com>
Cc: Laura Abbott <labbott@redhat.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: linux-mm@kvack.org
Fixes: f03574f2d5 ("x86-32, mm: Rip out x86_32 NUMA remapping code")
Link: http://lkml.kernel.org/r/alpine.DEB.2.20.1803262325540.256524@chino.kir.corp.google.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2018-03-27 08:45:02 +02:00
Ingo Molnar 0bc91d4ba7 Linux 4.16-rc7

Merge tag 'v4.16-rc7' into x86/mm, to fix up conflict

 Conflicts:
	arch/x86/mm/init_64.c

Signed-off-by: Ingo Molnar <mingo@kernel.org>
2018-03-27 08:43:39 +02:00
Stephane Eranian 71eb9ee959 perf/x86/intel: Fix linear IP of PEBS real_ip on Haswell and later CPUs
This patch fixes a bug in how pebs->real_ip is handled in the PEBS
handler. real_ip only exists on Haswell and later processors. It is
actually the eventing IP, i.e., where the event occurred, as opposed
to pebs->ip, which is the PEBS interrupt IP and is always off
by one.

The problem is that the real_ip just like the IP needs to be fixed up
because PEBS does not record all the machine state registers, and
in particular the code segment (cs). This is why we have the set_linear_ip()
function. The problem was that set_linear_ip() was only used on the pebs->ip
and not the pebs->real_ip.

We have profiles which ran into invalid callstacks because of this.
Here is an example:

 .....  0: ffffffffffffff80 recent entry, marker kernel v
 .....  1: 000000000040044d <= user address in kernel space!
 .....  2: fffffffffffffe00 marker enter user v
 .....  3: 000000000040044d
 .....  4: 00000000004004b6 oldest entry

Debugging output in get_perf_callchain():

 [  857.769909] CALLCHAIN: CPU8 ip=40044d regs->cs=10 user_mode(regs)=0

The problem is that the kernel entry in 1: points to a user level
address. How can that be?

The reason is that with PEBS sampling the instruction that caused the event
to occur and the instruction where the CPU was when the interrupt was posted
may be far apart. And sometime during that time window, the privilege level may
change. This happens, for instance, when the PEBS sample is taken close to a
kernel entry point. Here PEBS, eventing IP (real_ip) captured a user level
instruction. But by the time the PMU interrupt fired, the processor had already
entered kernel space. This is why the debug output shows a user address with
user_mode() false.

The problem comes from PEBS not recording the code segment (cs) register.
The register is used in x86_64 to determine if executing in kernel vs user
space. This is okay because the kernel has a software workaround called
set_linear_ip(). But the issue in setup_pebs_sample_data() is that
set_linear_ip() is never called on the real_ip value when it is available
(Haswell and later) and precise_ip > 1.
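
The idea behind the fixup can be sketched in userspace C. This is a
simplified model and not the kernel's set_linear_ip(): struct sample_regs,
fixup_linear_ip(), is_kernel_ip() and sample_user_mode() are hypothetical
names, and the selector values 0x10 and 0x33 are assumed x86_64 values (the
debug output above shows cs=10 for the kernel side). The sketch derives cs
from the address itself, because PEBS does not record it.

```c
#include <assert.h>
#include <stdint.h>

#define KERNEL_CS	0x10u	/* assumed x86_64 kernel code selector */
#define USER_CS		0x33u	/* assumed x86_64 user code selector */

struct sample_regs {
	uint64_t ip;
	uint16_t cs;
};

/* Treat addresses in the upper half of the address space as kernel addresses. */
static int is_kernel_ip(uint64_t ip)
{
	return (ip >> 63) != 0;
}

/* Set ip and make cs consistent with it, since the sample carries no cs. */
static void fixup_linear_ip(struct sample_regs *regs, uint64_t ip)
{
	assert(regs != 0);
	regs->cs = is_kernel_ip(ip) ? KERNEL_CS : USER_CS;
	regs->ip = ip;
}

/* The low two bits of cs hold the privilege level; 3 means user mode. */
static int sample_user_mode(const struct sample_regs *regs)
{
	return (regs->cs & 3) == 3;
}
```

Applied to the example above, assigning ip=0x40044d through
fixup_linear_ip() leaves cs at the user selector, so the callchain code no
longer sees a user address paired with a kernel cs.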

This patch fixes this problem and eliminates the callchain discrepancy.

The patch restructures the code around set_linear_ip() to minimize the number
of times the IP has to be set.

Signed-off-by: Stephane Eranian <eranian@google.com>
Cc: Alexander Shishkin <alexander.shishkin@linux.intel.com>
Cc: Arnaldo Carvalho de Melo <acme@redhat.com>
Cc: Jiri Olsa <jolsa@redhat.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Vince Weaver <vincent.weaver@maine.edu>
Cc: kan.liang@intel.com
Link: http://lkml.kernel.org/r/1521788507-10231-1-git-send-email-eranian@google.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2018-03-27 08:27:27 +02:00
Davidlohr Bueso 631fe154ed perf/x86: Update rdpmc_always_available static key to the modern API
No changes in refcount semantics -- use DEFINE_STATIC_KEY_FALSE()
for initialization and replace:

  static_key_slow_inc|dec()   =>   static_branch_inc|dec()
  static_key_false()          =>   static_branch_unlikely()

Added a '_key' suffix to rdpmc_always_available, for better self-documentation.

Signed-off-by: Davidlohr Bueso <dbueso@suse.de>
Cc: Davidlohr Bueso <dave@stgolabs.net>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: akpm@linux-foundation.org
Link: http://lkml.kernel.org/r/20180326210929.5244-5-dave@stgolabs.net
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2018-03-27 07:53:00 +02:00
Ivan Gorinov 4e07db9c8d x86/devicetree: Use CPU description from Device Tree
Current x86 Device Tree implementation does not support multiprocessing.
Use new DT bindings to describe the processors.

Signed-off-by: Ivan Gorinov <ivan.gorinov@intel.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Reviewed-by: Andy Shevchenko <andy.shevchenko@gmail.com>
Cc: Mark Rutland <mark.rutland@arm.com>
Cc: Rob Herring <robh+dt@kernel.org>
Cc: Frank Rowand <frowand.list@gmail.com>
Link: https://lkml.kernel.org/r/c291fb2cef51b730b59916d7745be0eaa4378c6c.1521753738.git.ivan.gorinov@intel.com
2018-03-26 15:13:32 +02:00
David Rientjes e25283bf83 x86/apic: Finish removing unused callbacks
The ->cpu_mask_to_apicid() and ->vector_allocation_domain() callbacks are
now unused, so remove them.

Signed-off-by: David Rientjes <rientjes@google.com>
Cc: Juergen Gross <jgross@suse.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Fixes: baab1e84b1 ("x86/apic: Remove unused callbacks")
Link: http://lkml.kernel.org/r/alpine.DEB.2.20.1803251403540.80485@chino.kir.corp.google.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2018-03-26 08:49:38 +02:00
Linus Torvalds d2862360bf Merge branch 'x86-pti-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull x86 and PTI fixes from Ingo Molnar:
 "Misc fixes:

   - fix EFI pagetables freeing

   - fix vsyscall pagetable setting on Xen PV guests

   - remove ancient CONFIG_X86_PPRO_FENCE=y - x86 is TSO again

   - fix two binutils (ld) development version related incompatibilities

   - clean up breakpoint handling

   - fix an x86 self-test"

* 'x86-pti-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
  x86/entry/64: Don't use IST entry for #BP stack
  x86/efi: Free efi_pgd with free_pages()
  x86/vsyscall/64: Use proper accessor to update P4D entry
  x86/cpu: Remove the CONFIG_X86_PPRO_FENCE=y quirk
  x86/boot/64: Verify alignment of the LOAD segment
  x86/build/64: Force the linker to use 2MB page size
  selftests/x86/ptrace_syscall: Fix for yet more glibc interference
2018-03-25 07:36:02 -10:00
Linus Torvalds eaf67993f5 Merge branch 'perf-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull perf fixes from Ingo Molnar:
 "Misc kernel side fixes.

  Generic:
   - cgroup events counting fix

  x86:
   - Intel PMU truncated-parameter fix

   - RDPMC fix

   - API naming fix/rename

   - uncore driver big-hardware PCI enumeration fix

   - uncore driver filter constraint fix"

* 'perf-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
  perf/cgroup: Fix child event counting bug
  perf/x86/intel/uncore: Fix multi-domain PCI CHA enumeration bug on Skylake servers
  perf/x86/intel: Rename confusing 'freerunning PEBS' API and implementation to 'large PEBS'
  perf/x86/intel/uncore: Add missing filter constraint for SKX CHA event
  perf/x86/intel: Don't accidentally clear high bits in bdw_limit_period()
  perf/x86/intel: Disable userspace RDPMC usage for large PEBS
2018-03-25 07:27:32 -10:00
Sven Wegener e847f6aaf6 x86/purgatory: Avoid creating stray .<pid>.d files, remove -MD from KBUILD_CFLAGS
The kernel build system already takes care of generating the dependency
files. Having the additional -MD in KBUILD_CFLAGS leads to stray
.<pid>.d files in the build directory when we call the cc-option macro.

Signed-off-by: Sven Wegener <sven.wegener@stealer.net>
Cc: H. Peter Anvin <hpa@zytor.com>
Cc: Kees Cook <keescook@chromium.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Matthias Kaehlcke <mka@chromium.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Sam Ravnborg <sam@ravnborg.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Vivek Goyal <vgoyal@redhat.com>
Link: http://lkml.kernel.org/r/alpine.LNX.2.21.1803242219380.30139@titan.int.lan.stealer.net
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2018-03-25 11:04:02 +02:00
Ingo Molnar ea2301b622 Merge branch 'linus' into x86/dma, to resolve a conflict with upstream
Conflicts:
	arch/x86/mm/init_64.c

Signed-off-by: Ingo Molnar <mingo@kernel.org>
2018-03-24 09:25:26 +01:00
Ingo Molnar 7054e4e0b1 Merge branch 'perf/urgent' into perf/core, to pick up fixes
With the cherry-picked perf/urgent commit merged separately we can now
merge all the fixes without conflicts.

Signed-off-by: Ingo Molnar <mingo@kernel.org>
2018-03-24 09:21:47 +01:00
Andy Lutomirski d8ba61ba58 x86/entry/64: Don't use IST entry for #BP stack
There's nothing IST-worthy about #BP/int3.  We don't allow kprobes
in the small handful of places in the kernel that run at CPL0 with
an invalid stack, and 32-bit kernels have used normal interrupt
gates for #BP forever.

Furthermore, we don't allow kprobes in places that have usergs while
in kernel mode, so "paranoid" is also unnecessary.

Signed-off-by: Andy Lutomirski <luto@kernel.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Cc: stable@vger.kernel.org
2018-03-23 21:10:36 +01:00
Waiman Long 06ace26f4e x86/efi: Free efi_pgd with free_pages()
The efi_pgd is allocated as PGD_ALLOCATION_ORDER pages and therefore must
also be freed as PGD_ALLOCATION_ORDER pages with free_pages().

Fixes: d9e9a64180 ("x86/mm/pti: Allocate a separate user PGD")
Signed-off-by: Waiman Long <longman@redhat.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Cc: linux-efi@vger.kernel.org
Cc: Dave Hansen <dave.hansen@linux.intel.com>
Cc: Ard Biesheuvel <ard.biesheuvel@linaro.org>
Cc: stable@vger.kernel.org
Link: https://lkml.kernel.org/r/1521746333-19593-1-git-send-email-longman@redhat.com
2018-03-23 20:18:31 +01:00
Thomas Gleixner ea89c06548 x86/tsc: Get rid of rdtscll()
Commit 99770737ca ("x86/asm/tsc: Add rdtscll() merge helper") added
rdtscll() in August 2015 along with the comment:

 /* Deprecated, keep it for a cycle for easier merging: */

12 cycles later it's really overdue for removal.

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
2018-03-23 20:07:54 +01:00
Sean Christopherson 9d1887ef32 KVM: nVMX: sync vmcs02 segment regs prior to vmx_set_cr0
Segment registers must be synchronized prior to any code that may
trigger a call to emulation_required()/guest_state_valid(), e.g.
vmx_set_cr0().  Because preparing vmcs02 writes segmentation fields
directly, i.e. doesn't use vmx_set_segment(), emulation_required
will not be re-evaluated when synchronizing the segment registers,
which can result in L0 incorrectly starting emulation of L2.

Fixes: 8665c3f973 ("KVM: nVMX: initialize descriptor cache fields in prepare_vmcs02_full")
Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com>
[Move all of prepare_vmcs02_full earlier, not just segment registers. - Paolo]
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2018-03-23 18:26:21 +01:00
David S. Miller 03fe2debbb Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net
Fun set of conflict resolutions here...

For the mac80211 stuff, these were fortunately just parallel
adds.  Trivially resolved.

In drivers/net/phy/phy.c we had a bug fix in 'net' that moved the
function phy_disable_interrupts() earlier in the file, whilst in
'net-next' the phy_error() call from this function was removed.

In net/ipv4/xfrm4_policy.c, David Ahern's changes to remove the
'rt_table_id' member of rtable collided with a bug fix in 'net' that
added a new struct member "rt_mtu_locked" which needs to be copied
over here.

The mlxsw driver conflict consisted of net-next separating
the span code and definitions into separate files, whilst
a 'net' bug fix made some changes to that moved code.

The mlx5 infiniband conflict resolution was quite non-trivial,
the RDMA tree's merge commit was used as a guide here, and
here are their notes:

====================

    Due to bug fixes found by the syzkaller bot and taken into the for-rc
    branch after development for the 4.17 merge window had already started
    being taken into the for-next branch, there were fairly non-trivial
    merge issues that would need to be resolved between the for-rc branch
    and the for-next branch.  This merge resolves those conflicts and
    provides a unified base upon which ongoing development for 4.17 can
    be based.

    Conflicts:
            drivers/infiniband/hw/mlx5/main.c - Commit 42cea83f95
            (IB/mlx5: Fix cleanup order on unload) added to for-rc and
            commit b5ca15ad7e (IB/mlx5: Add proper representors support)
            added as part of the devel cycle both needed to modify the
            init/de-init functions used by mlx5.  To support the new
            representors, the new functions added by the cleanup patch
            needed to be made non-static, and the init/de-init list
            added by the representors patch needed to be modified to
            match the init/de-init list changes made by the cleanup
            patch.
    Updates:
            drivers/infiniband/hw/mlx5/mlx5_ib.h - Update function
            prototypes added by representors patch to reflect new function
            names as changed by cleanup patch
            drivers/infiniband/hw/mlx5/ib_rep.c - Update init/de-init
            stage list to match new order from cleanup patch
====================

Signed-off-by: David S. Miller <davem@davemloft.net>
2018-03-23 11:31:58 -04:00
Linus Torvalds f36b7534b8 Merge branch 'akpm' (patches from Andrew)
Merge misc fixes from Andrew Morton:
 "13 fixes"

* emailed patches from Andrew Morton <akpm@linux-foundation.org>:
  mm, thp: do not cause memcg oom for thp
  mm/vmscan: wake up flushers for legacy cgroups too
  Revert "mm: page_alloc: skip over regions of invalid pfns where possible"
  mm/shmem: do not wait for lock_page() in shmem_unused_huge_shrink()
  mm/thp: do not wait for lock_page() in deferred_split_scan()
  mm/khugepaged.c: convert VM_BUG_ON() to collapse fail
  x86/mm: implement free pmd/pte page interfaces
  mm/vmalloc: add interfaces to free unmapped page table
  h8300: remove extraneous __BIG_ENDIAN definition
  hugetlbfs: check for pgoff value overflow
  lockdep: fix fs_reclaim warning
  MAINTAINERS: update Mark Fasheh's e-mail
  mm/mempolicy.c: avoid use uninitialized preferred_node
2018-03-22 18:48:43 -07:00
Linus Torvalds 8401c72c59 Merge branch 'libnvdimm-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/nvdimm/nvdimm
Pull libnvdimm fixes from Dan Williams:
 "Two regression fixes, two bug fixes for older issues, two fixes for
  new functionality added this cycle that have userspace ABI concerns,
  and a small cleanup. These have appeared in a linux-next release and
  have a build success report from the 0day robot.

   * The 4.16 rework of altmap handling led to some configurations
     leaking page table allocations due to freeing from the altmap
     reservation rather than the page allocator.

     The impact without the fix is leaked memory and a WARN() message
     when tearing down libnvdimm namespaces. The rework also missed a
     place where error handling code needed to be removed that can lead
     to a crash if devm_memremap_pages() fails.

   * acpi_map_pxm_to_node() had a latent bug whereby it could
     misidentify the closest online node to a given proximity domain.

   * Block integrity handling was reworked several kernels back to allow
     calling add_disk() after setting up the integrity profile.

     The nd_btt and nd_blk drivers are just now catching up to fix
     automatic partition detection at driver load time.

   * The new persistence_domain attribute, a platform indicator of
     whether cpu caches are powerfail protected for example, is meant to
     be a single value enum and not a set of flags.

     This oversight was caught while reviewing new userspace code in
     libndctl to communicate the attribute.

     Fix this new enabling up so that we are not stuck with an unwanted
     userspace ABI"

* 'libnvdimm-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/nvdimm/nvdimm:
  libnvdimm, nfit: fix persistence domain reporting
  libnvdimm, region: hide persistence_domain when unknown
  acpi, numa: fix pxm to online numa node associations
  x86, memremap: fix altmap accounting at free
  libnvdimm: remove redundant assignment to pointer 'dev'
  libnvdimm, {btt, blk}: do integrity setup before add_disk()
  kernel/memremap: Remove stale devres_free() call
2018-03-22 18:37:49 -07:00
Toshi Kani 28ee90fe60 x86/mm: implement free pmd/pte page interfaces
Implement pud_free_pmd_page() and pmd_free_pte_page() on x86, which
clear a given pud/pmd entry and free up lower level page table(s).

The address range associated with the pud/pmd entry must have been
purged by INVLPG.

Link: http://lkml.kernel.org/r/20180314180155.19492-3-toshi.kani@hpe.com
Fixes: e61ce6ade4 ("mm: change ioremap to set up huge I/O mappings")
Signed-off-by: Toshi Kani <toshi.kani@hpe.com>
Reported-by: Lei Li <lious.lilei@hisilicon.com>
Cc: Michal Hocko <mhocko@suse.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: "H. Peter Anvin" <hpa@zytor.com>
Cc: Borislav Petkov <bp@suse.de>
Cc: Matthew Wilcox <willy@infradead.org>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2018-03-22 17:07:01 -07:00
Toshi Kani b6bdb7517c mm/vmalloc: add interfaces to free unmapped page table
On architectures with CONFIG_HAVE_ARCH_HUGE_VMAP set, ioremap() may
create pud/pmd mappings.  A kernel panic was observed on arm64 systems
with Cortex-A75 in the following steps as described by Hanjun Guo.

 1. ioremap a 4K size, valid page table will build,
 2. iounmap it, pte0 will set to 0;
 3. ioremap the same address with 2M size, pgd/pmd is unchanged,
    then set a new value for pmd;
 4. pte0 is leaked;
 5. CPU may meet exception because the old pmd is still in TLB,
    which will lead to kernel panic.

This panic is not reproducible on x86.  INVLPG, called from iounmap,
purges all levels of entries associated with the purged address on x86.
However, x86 still has the memory leak.

The patch changes the ioremap path to free unmapped page table(s) since
doing so in the unmap path has the following issues:

 - The iounmap() path is shared with vunmap(). Since vmap() only
   supports pte mappings, making vunmap() free a pte page is
   overhead for regular vmap users, as they do not need a pte page
   freed up.

 - Checking if all entries in a pte page are cleared in the unmap path
   is racy, and serializing this check is expensive.

 - The unmap path calls free_vmap_area_noflush() to do lazy TLB purges.
   Clearing a pud/pmd entry before the lazy TLB purges needs extra TLB
   purge.

Add two interfaces, pud_free_pmd_page() and pmd_free_pte_page(), which
clear a given pud/pmd entry and free up a page for the lower level
entries.
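
The leak in steps 1 to 4 above, and the effect of freeing the pte page
before a pmd mapping is installed, can be modeled with a toy page table in
userspace C. Everything here is a hypothetical simplification: struct
toy_pmd, the toy_* functions and the single statically allocated pte page
are illustrative only, and no TLB behavior is modeled.

```c
#include <assert.h>
#include <stddef.h>

#define PTES_PER_PAGE	512

static unsigned long pte_page_storage[PTES_PER_PAGE];	/* the one toy pte page */
static int live_pte_pages;				/* pte pages not yet freed */

struct toy_pmd {
	unsigned long *pte_page;	/* non-NULL: points to a pte page */
	unsigned long huge_pa;		/* non-zero: 2M mapping installed */
};

/* Step 1: a 4K mapping builds the pte page on demand. */
static void toy_map_4k(struct toy_pmd *pmd, unsigned int idx, unsigned long pa)
{
	assert(idx < PTES_PER_PAGE);
	if (!pmd->pte_page) {
		pmd->pte_page = pte_page_storage;
		live_pte_pages++;
	}
	pmd->pte_page[idx] = pa;
}

/* Step 2: unmapping clears the pte but leaves the pte page in place. */
static void toy_unmap_4k(struct toy_pmd *pmd, unsigned int idx)
{
	pmd->pte_page[idx] = 0;
}

/* Modeled on the role of pmd_free_pte_page(): clear the pmd, free the page. */
static void toy_pmd_free_pte_page(struct toy_pmd *pmd)
{
	if (pmd->pte_page) {
		pmd->pte_page = NULL;
		live_pte_pages--;
	}
}

/* Step 3: a 2M mapping overwrites the pmd entry. */
static void toy_map_2m(struct toy_pmd *pmd, unsigned long pa, int free_first)
{
	if (free_first)
		toy_pmd_free_pte_page(pmd);
	pmd->pte_page = NULL;	/* without free_first the last reference is lost */
	pmd->huge_pa = pa;
}

/* Returns the number of pte pages still allocated after the sequence. */
static int toy_run_sequence(int free_first)
{
	struct toy_pmd pmd = { NULL, 0 };

	live_pte_pages = 0;
	toy_map_4k(&pmd, 0, 0x1000UL);
	toy_unmap_4k(&pmd, 0);
	toy_map_2m(&pmd, 0x200000UL, free_first);
	return live_pte_pages;
}
```

In this model toy_run_sequence(0) leaves one pte page unaccounted for
(step 4), while toy_run_sequence(1) leaves none.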

This patch implements their stub functions on x86 and arm64, which work
as a workaround.

[akpm@linux-foundation.org: fix typo in pmd_free_pte_page() stub]
Link: http://lkml.kernel.org/r/20180314180155.19492-2-toshi.kani@hpe.com
Fixes: e61ce6ade4 ("mm: change ioremap to set up huge I/O mappings")
Reported-by: Lei Li <lious.lilei@hisilicon.com>
Signed-off-by: Toshi Kani <toshi.kani@hpe.com>
Cc: Catalin Marinas <catalin.marinas@arm.com>
Cc: Wang Xuefeng <wxf.wang@hisilicon.com>
Cc: Will Deacon <will.deacon@arm.com>
Cc: Hanjun Guo <guohanjun@huawei.com>
Cc: Michal Hocko <mhocko@suse.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: "H. Peter Anvin" <hpa@zytor.com>
Cc: Borislav Petkov <bp@suse.de>
Cc: Matthew Wilcox <willy@infradead.org>
Cc: Chintan Pandya <cpandya@codeaurora.org>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2018-03-22 17:07:01 -07:00
Linus Torvalds c4f4d2f917 Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net
Pull networking fixes from David Miller:

 1) Always validate XFRM esn replay attribute, from Florian Westphal.

 2) Fix RCU read lock imbalance in xfrm_get_tos(), from Xin Long.

 3) Don't try to get firmware dump if not loaded in iwlwifi, from Shaul
    Triebitz.

 4) Fix BPF helpers to deal with SCTP GSO SKBs properly, from Daniel
    Axtens.

 5) Fix some interrupt handling issues in e1000e driver, from Benjamin
    Poitier.

 6) Use strlcpy() in several ethtool get_strings methods, from Florian
    Fainelli.

 7) Fix rhlist dup insertion, from Paul Blakey.

 8) Fix SKB leak in netem packet scheduler, from Alexey Kodanev.

 9) Fix driver unload crash when link is up in smsc911x, from Jeremy
    Linton.

10) Purge out invalid socket types in l2tp_tunnel_create(), from Eric
    Dumazet.

11) Need to purge the write queue when TCP connections are aborted,
    otherwise userspace using MSG_ZEROCOPY can't close the fd. From
    Soheil Hassas Yeganeh.

12) Fix double free in error path of team driver, from Arkadi
    Sharshevsky.

13) Filter fixes for hv_netvsc driver, from Stephen Hemminger.

14) Fix non-linear packet access in ipv6 ndisc code, from Lorenzo
    Bianconi.

15) Properly filter out unsupported feature flags in macvlan driver,
    from Shannon Nelson.

16) Don't request loading the diag module for a protocol if the protocol
    itself is not even registered. From Xin Long.

17) If datagram connect fails in ipv6, make sure the socket state is
    consistent afterwards. From Paolo Abeni.

18) Use after free in qed driver, from Dan Carpenter.

19) If received ipv4 PMTU is less than the min pmtu, lock the mtu in the
    entry. From Sabrina Dubroca.

20) Fix sleep in atomic in tg3 driver, from Jonathan Toppins.

21) Fix vlan in vlan untagging in some situations, from Toshiaki Makita.

22) Fix double SKB free in genlmsg_mcast(). From Nicolas Dichtel.

23) Fix NULL derefs in error paths of tcf_*_init(), from Davide Caratti.

24) Unbalanced PM runtime calls in FEC driver, from Florian Fainelli.

25) Memory leak in gemini driver, from Igor Pylypiv.

26) IDR leaks in error paths of tcf_*_init() functions, from Davide
    Caratti.

27) Need to use GFP_ATOMIC in seg6_build_state(), from David Lebrun.

28) Missing dev_put() in error path of macsec_newlink(), from Dan
    Carpenter.

* git://git.kernel.org/pub/scm/linux/kernel/git/davem/net: (201 commits)
  macsec: missing dev_put() on error in macsec_newlink()
  net: dsa: Fix functional dsa-loop dependency on FIXED_PHY
  hv_netvsc: common detach logic
  hv_netvsc: change GPAD teardown order on older versions
  hv_netvsc: use RCU to fix concurrent rx and queue changes
  hv_netvsc: disable NAPI before channel close
  net/ipv6: Handle onlink flag with multipath routes
  ppp: avoid loop in xmit recursion detection code
  ipv6: sr: fix NULL pointer dereference when setting encap source address
  ipv6: sr: fix scheduling in RCU when creating seg6 lwtunnel state
  net: aquantia: driver version bump
  net: aquantia: Implement pci shutdown callback
  net: aquantia: Allow live mac address changes
  net: aquantia: Add tx clean budget and valid budget handling logic
  net: aquantia: Change inefficient wait loop on fw data reads
  net: aquantia: Fix a regression with reset on old firmware
  net: aquantia: Fix hardware reset when SPI may rarely hangup
  s390/qeth: on channel error, reject further cmd requests
  s390/qeth: lock read device while queueing next buffer
  s390/qeth: when thread completes, wake up all waiters
  ...
2018-03-22 14:10:29 -07:00
Linus Torvalds 32d43cd391 kvm/x86: fix icebp instruction handling
The undocumented 'icebp' instruction (aka 'int1') works pretty much like
'int3' in the absence of in-circuit probing equipment (except,
obviously, that it raises #DB instead of raising #BP), and is used by
some validation test-suites as such.

But Andy Lutomirski noticed that his test suite acted differently in kvm
than on bare hardware.

The reason is that kvm used an inexact test for the icebp instruction:
it just assumed that an all-zero VM exit qualification value meant that
the VM exit was due to icebp.

That is not unlike the guess that do_debug() does for the actual
exception handling case, but it's purely a heuristic, not an absolute
rule.  do_debug() does it because it wants to ascribe _some_ reasons to
the #DB that happened, and an empty %dr6 value means that 'icebp' is the
most likely cause and we have no better information.

But kvm can just do it right, because unlike the do_debug() case, kvm
actually sees the real reason for the #DB in the VM-exit interruption
information field.

So instead of relying on an inexact heuristic, just use the actual VM
exit information that says "it was 'icebp'".
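
The difference between the two tests can be modelled roughly as follows.
This is a sketch in Python, not the kvm code: the field layout (vector in
bits 7:0, type in bits 10:8, valid in bit 31, type 5 for a privileged
software exception) follows the Intel SDM's description of the VM-exit
interruption information field, and all function and constant names are
illustrative assumptions.

```python
# Layout of the VM-exit interruption-information field (per the Intel SDM).
INTR_INFO_VECTOR_MASK = 0xFF          # bits 7:0, exception vector
INTR_INFO_TYPE_MASK = 0x7 << 8        # bits 10:8, interruption type
INTR_INFO_VALID = 1 << 31             # bit 31, field is valid

INTR_TYPE_HARD_EXCEPTION = 3 << 8     # ordinary hardware exception
INTR_TYPE_PRIV_SW_EXCEPTION = 5 << 8  # privileged software exception

DB_VECTOR = 1                         # vector number of #DB


def looks_like_icebp_heuristic(exit_qualification: int) -> bool:
    """Old guess: an all-zero exit qualification is taken to mean icebp."""
    return exit_qualification == 0


def is_icebp(intr_info: int) -> bool:
    """Exact test: a valid field whose type is 'privileged software exception'."""
    mask = INTR_INFO_TYPE_MASK | INTR_INFO_VALID
    return (intr_info & mask) == (INTR_TYPE_PRIV_SW_EXCEPTION | INTR_INFO_VALID)


# Two hypothetical #DB exits: one flagged as a privileged software
# exception, one flagged as a plain hardware exception.
icebp_db = INTR_INFO_VALID | INTR_TYPE_PRIV_SW_EXCEPTION | DB_VECTOR
other_db = INTR_INFO_VALID | INTR_TYPE_HARD_EXCEPTION | DB_VECTOR
```

If both exits happened to carry an exit qualification of zero, the
heuristic would treat them alike, while is_icebp() accepts only the first.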

Right now the 'icebp' instruction isn't technically documented by Intel,
but that will hopefully change.  The special "privileged software
exception" information _is_ actually mentioned in the Intel SDM, even
though the cause of it isn't enumerated.

Reported-by: Andy Lutomirski <luto@kernel.org>
Tested-by: Paolo Bonzini <pbonzini@redhat.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2018-03-20 14:58:34 -07:00
Boris Ostrovsky 31ad7f8e7d x86/vsyscall/64: Use proper accessor to update P4D entry
Writing to it directly does not work for Xen PV guests.

Fixes: 49275fef98 ("x86/vsyscall/64: Explicitly set _PAGE_USER in the pagetable hierarchy")
Signed-off-by: Boris Ostrovsky <boris.ostrovsky@oracle.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Reviewed-by: Juergen Gross <jgross@suse.com>
Acked-by: Andy Lutomirski <luto@kernel.org>
Cc: stable@vger.kernel.org
Link: https://lkml.kernel.org/r/20180319143154.3742-1-boris.ostrovsky@oracle.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2018-03-20 12:00:53 +01:00
Peter Zijlstra d0266046ad x86: Remove FAST_FEATURE_TESTS
Since we want to rely on static branches to avoid speculation, remove
any possible fallback code for static_cpu_has.

Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Cc: torvalds@linux-foundation.org
Link: https://lkml.kernel.org/r/20180319154717.705383007@infradead.org
2018-03-20 10:58:03 +01:00
Peter Zijlstra e501ce957a x86: Force asm-goto
We want to start using asm-goto to guarantee the absence of dynamic
branches (and thus speculation).

A primary prerequisite for this is of course that the compiler
supports asm-goto. This effectively lifts the minimum GCC version to
build an x86 kernel to gcc-4.5.

Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Cc: torvalds@linux-foundation.org
Link: https://lkml.kernel.org/r/20180319201327.GJ4043@hirez.programming.kicks-ass.net
2018-03-20 10:58:02 +01:00
Will Deacon 4c0ca49e6d Merge branch 'siginfo-next' of git://git.kernel.org/pub/scm/linux/kernel/git/ebiederm/user-namespace into aarch64/for-next/core
Pull in pending siginfo changes from Eric Biederman as we depend on
the definition of FPE_FLTUNK for cleaning up our floating-point exception
signal delivery (which is currently broken and using FPE_FIXME).
2018-03-20 09:57:15 +00:00
Thomas Gleixner edc39c9b45 Merge branch 'linus' into x86/build to pick up dependencies 2018-03-20 10:56:18 +01:00
Christoph Hellwig c10f07aa27 dma/direct: Handle force decryption for DMA coherent buffers in common code
With that in place the generic DMA-direct routines can be used to
allocate non-encrypted bounce buffers, and the x86 SEV case can use
the generic swiotlb ops including nice features such as using CMA
allocations.

Note that I'm not too happy about using sev_active() in DMA-direct, but
I couldn't come up with a good enough name for a wrapper to make it
worth adding.

Tested-by: Tom Lendacky <thomas.lendacky@amd.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Thomas Gleixner <tglx@linutronix.de>
Cc: David Woodhouse <dwmw2@infradead.org>
Cc: Joerg Roedel <joro@8bytes.org>
Cc: Jon Mason <jdmason@kudzu.us>
Cc: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Muli Ben-Yehuda <mulix@mulix.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: iommu@lists.linux-foundation.org
Link: http://lkml.kernel.org/r/20180319103826.12853-14-hch@lst.de
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2018-03-20 10:01:59 +01:00
Christoph Hellwig b6e05477c1 dma/direct: Handle the memory encryption bit in common code
Give the basic phys_to_dma() and dma_to_phys() helpers a __-prefix and add
the memory encryption mask to the non-prefixed versions.  Use the
__-prefixed versions directly instead of clearing the mask again in
various places.
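
A minimal sketch of the resulting split, in Python rather than kernel C.
The raw_* functions stand in for the __-prefixed helpers; the mask
position (bit 47) and the zero bus offset are assumptions made for
illustration only.

```python
SME_ME_MASK = 1 << 47   # assumed position of the memory encryption bit
DMA_OFFSET = 0          # hypothetical per-device bus address offset


def raw_phys_to_dma(paddr: int) -> int:
    """Stand-in for __phys_to_dma(): plain translation, no encryption bit."""
    return paddr + DMA_OFFSET


def raw_dma_to_phys(daddr: int) -> int:
    """Stand-in for __dma_to_phys(): plain translation, no encryption bit."""
    return daddr - DMA_OFFSET


def phys_to_dma(paddr: int) -> int:
    """Non-prefixed version: translate, then set the encryption mask."""
    return raw_phys_to_dma(paddr) | SME_ME_MASK


def dma_to_phys(daddr: int) -> int:
    """Non-prefixed version: translate back, then clear the encryption mask."""
    return raw_dma_to_phys(daddr) & ~SME_ME_MASK
```

A caller that needs an unencrypted address uses the raw variant directly
instead of calling phys_to_dma() and masking the bit off again.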

Tested-by: Tom Lendacky <thomas.lendacky@amd.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Thomas Gleixner <tglx@linutronix.de>
Cc: David Woodhouse <dwmw2@infradead.org>
Cc: Joerg Roedel <joro@8bytes.org>
Cc: Jon Mason <jdmason@kudzu.us>
Cc: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Muli Ben-Yehuda <mulix@mulix.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: iommu@lists.linux-foundation.org
Link: http://lkml.kernel.org/r/20180319103826.12853-13-hch@lst.de
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2018-03-20 10:01:59 +01:00
Christoph Hellwig e7de6c7cc2 dma/swiotlb: Remove swiotlb_set_mem_attributes()
Now that set_memory_decrypted() is always available we can just call it
directly.

Tested-by: Tom Lendacky <thomas.lendacky@amd.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Thomas Gleixner <tglx@linutronix.de>
Reviewed-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
Reviewed-by: Tom Lendacky <thomas.lendacky@amd.com>
Cc: David Woodhouse <dwmw2@infradead.org>
Cc: Joerg Roedel <joro@8bytes.org>
Cc: Jon Mason <jdmason@kudzu.us>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Muli Ben-Yehuda <mulix@mulix.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: iommu@lists.linux-foundation.org
Link: http://lkml.kernel.org/r/20180319103826.12853-12-hch@lst.de
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2018-03-20 10:01:58 +01:00
Christoph Hellwig 178c568244 x86/dma: Remove dma_alloc_coherent_gfp_flags()
All dma_ops implementations used on x86 now take care of setting their own
required GFP_ masks for the allocation.  And given that the common code
now clears harmful flags itself, that means we can stop clearing the flags in all
the IOMMU implementations as well.

Tested-by: Tom Lendacky <thomas.lendacky@amd.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Thomas Gleixner <tglx@linutronix.de>
Cc: David Woodhouse <dwmw2@infradead.org>
Cc: Joerg Roedel <joro@8bytes.org>
Cc: Jon Mason <jdmason@kudzu.us>
Cc: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Muli Ben-Yehuda <mulix@mulix.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: iommu@lists.linux-foundation.org
Link: http://lkml.kernel.org/r/20180319103826.12853-10-hch@lst.de
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2018-03-20 10:01:58 +01:00
Christoph Hellwig 51c7eeba79 x86/dma/amd_gart: Use dma_direct_{alloc,free}()
This gains support for CMA allocations for the force_iommu case, and
cleans up the code a bit.

Tested-by: Tom Lendacky <thomas.lendacky@amd.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Thomas Gleixner <tglx@linutronix.de>
Cc: David Woodhouse <dwmw2@infradead.org>
Cc: Joerg Roedel <joro@8bytes.org>
Cc: Jon Mason <jdmason@kudzu.us>
Cc: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Muli Ben-Yehuda <mulix@mulix.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: iommu@lists.linux-foundation.org
Link: http://lkml.kernel.org/r/20180319103826.12853-7-hch@lst.de
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2018-03-20 10:01:57 +01:00
Christoph Hellwig f3c39d5104 x86/dma/amd_gart: Look at dev->coherent_dma_mask instead of GFP_DMA
We want to phase out looking at the magic GFP_DMA flag in the DMA mapping
routines, so switch the gart driver to use the dev->coherent_dma_mask
instead, which is used to select the GFP_DMA flag in the caller.

Tested-by: Tom Lendacky <thomas.lendacky@amd.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Thomas Gleixner <tglx@linutronix.de>
Reviewed-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
Cc: David Woodhouse <dwmw2@infradead.org>
Cc: Joerg Roedel <joro@8bytes.org>
Cc: Jon Mason <jdmason@kudzu.us>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Muli Ben-Yehuda <mulix@mulix.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: iommu@lists.linux-foundation.org
Link: http://lkml.kernel.org/r/20180319103826.12853-6-hch@lst.de
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2018-03-20 10:01:57 +01:00
Christoph Hellwig 6e4bf58677 x86/dma: Use generic swiotlb_ops
The generic swiotlb DMA ops were based on the x86 ones and provide
equivalent functionality, so use them.

Also fix the sta2x11 case.  For that SOC the DMA map ops need an
additional physical to DMA address translation.  For swiotlb buffers
that is done through the phys_to_dma helper, but the sta2x11_dma_ops
also added an additional translation on the return value from
x86_swiotlb_alloc_coherent, which is only correct if that function
returns a direct allocation and not a swiotlb buffer.  With the
generic swiotlb and DMA-direct code phys_to_dma is now always used
and the separate sta2x11_dma_ops can be replaced with a simple
bit that marks if the additional physical to DMA address translation
is needed.

Tested-by: Tom Lendacky <thomas.lendacky@amd.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Thomas Gleixner <tglx@linutronix.de>
Cc: David Woodhouse <dwmw2@infradead.org>
Cc: Joerg Roedel <joro@8bytes.org>
Cc: Jon Mason <jdmason@kudzu.us>
Cc: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Muli Ben-Yehuda <mulix@mulix.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: iommu@lists.linux-foundation.org
Link: http://lkml.kernel.org/r/20180319103826.12853-5-hch@lst.de
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2018-03-20 10:01:57 +01:00
Christoph Hellwig fec777c385 x86/dma: Use DMA-direct (CONFIG_DMA_DIRECT_OPS=y)
The generic DMA-direct (CONFIG_DMA_DIRECT_OPS=y) implementation is now
functionally equivalent to the x86 nommu dma_map implementation, so
switch over to using it.

That includes switching from using x86_dma_supported in various IOMMU
drivers to use dma_direct_supported instead, which provides the same
functionality.

Tested-by: Tom Lendacky <thomas.lendacky@amd.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Thomas Gleixner <tglx@linutronix.de>
Cc: David Woodhouse <dwmw2@infradead.org>
Cc: Joerg Roedel <joro@8bytes.org>
Cc: Jon Mason <jdmason@kudzu.us>
Cc: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Muli Ben-Yehuda <mulix@mulix.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: iommu@lists.linux-foundation.org
Link: http://lkml.kernel.org/r/20180319103826.12853-4-hch@lst.de
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2018-03-20 10:01:56 +01:00
Christoph Hellwig 038d07a283 x86/dma: Remove dma_alloc_coherent_mask()
These days all devices (including the ISA fallback device) have a coherent
DMA mask set, so remove the workaround.

Tested-by: Tom Lendacky <thomas.lendacky@amd.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Thomas Gleixner <tglx@linutronix.de>
Reviewed-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
Cc: David Woodhouse <dwmw2@infradead.org>
Cc: Joerg Roedel <joro@8bytes.org>
Cc: Jon Mason <jdmason@kudzu.us>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Muli Ben-Yehuda <mulix@mulix.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: iommu@lists.linux-foundation.org
Link: http://lkml.kernel.org/r/20180319103826.12853-3-hch@lst.de
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2018-03-20 10:01:56 +01:00
Ingo Molnar 3eb93ea327 Merge branch 'x86/mm' into x86/dma, to pick up dependencies
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2018-03-20 10:01:37 +01:00
Christoph Hellwig 5927145efd x86/cpu: Remove the CONFIG_X86_PPRO_FENCE=y quirk
There were only a few Pentium Pro multiprocessor systems where this
erratum applied. They are more than 20 years old now, and we've slowly
dropped places which put the workarounds in and discouraged anyone
from enabling the workaround.

Get rid of it for good.

Tested-by: Tom Lendacky <thomas.lendacky@amd.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Thomas Gleixner <tglx@linutronix.de>
Reviewed-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
Cc: David Woodhouse <dwmw2@infradead.org>
Cc: Joerg Roedel <joro@8bytes.org>
Cc: Jon Mason <jdmason@kudzu.us>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Muli Ben-Yehuda <mulix@mulix.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: iommu@lists.linux-foundation.org
Link: http://lkml.kernel.org/r/20180319103826.12853-2-hch@lst.de
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2018-03-20 10:01:05 +01:00
Kan Liang 320b0651f3 perf/x86/intel/uncore: Fix multi-domain PCI CHA enumeration bug on Skylake servers
The number of CHAs is miscalculated on multi-domain PCI Skylake server systems,
resulting in an uncore driver initialization error.

Gary Kroening explains:

 "For systems with a single PCI segment, it is sufficient to look for the
  bus number to change in order to determine that all of the CHa's have
  been counted for a single socket.

  However, for multi PCI segment systems, each socket is given a new
  segment and the bus number does NOT change.  So looking only for the
  bus number to change ends up counting all of the CHa's on all sockets
  in the system.  This leads to writing CPU MSRs beyond a valid range and
  causes an error in ivbep_uncore_msr_init_box()."

To fix this bug, query the number of CHAs from the CAPID6 register:
it should read bits 27:0 in the CAPID6 register located at
Device 30, Function 3, Offset 0x9C. These 28 bits form a bit vector
of available LLC slices and the CHAs that manage those slices.
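
As a rough illustration (Python, not the driver code), counting the CHAs
then amounts to a population count over the low 28 bits. The function
name is hypothetical and the register value is assumed to have been read
from PCI config space already.

```python
CAPID6_CHA_MASK = (1 << 28) - 1   # bits 27:0 of CAPID6


def count_chas(capid6: int) -> int:
    """Count the set bits in bits 27:0; each set bit is one available CHA."""
    return bin(capid6 & CAPID6_CHA_MASK).count("1")
```

Bits 31:28 are masked off first, so they never contribute to the count.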

Reported-by: Kroening, Gary <gary.kroening@hpe.com>
Tested-by: Kroening, Gary <gary.kroening@hpe.com>
Signed-off-by: Kan Liang <kan.liang@linux.intel.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Andy Shevchenko <andy.shevchenko@gmail.com>
Cc: Alexander Shishkin <alexander.shishkin@linux.intel.com>
Cc: Arnaldo Carvalho de Melo <acme@redhat.com>
Cc: Jiri Olsa <jolsa@redhat.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Stephane Eranian <eranian@google.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Vince Weaver <vincent.weaver@maine.edu>
Cc: abanman@hpe.com
Cc: dimitri.sivanich@hpe.com
Cc: hpa@zytor.com
Cc: mike.travis@hpe.com
Cc: russ.anderson@hpe.com
Fixes: cd34cd97b7 ("perf/x86/intel/uncore: Add Skylake server uncore support")
Link: http://lkml.kernel.org/r/1520967094-13219-1-git-send-email-kan.liang@linux.intel.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2018-03-20 08:58:47 +01:00
Kan Liang 174afc3e7d perf/x86/intel: Rename confusing 'freerunning PEBS' API and implementation to 'large PEBS'
The 'freerunning PEBS' and 'large PEBS' are the same thing. Both of these
names appear in the code and in the API, which causes confusion.

Rename 'freerunning PEBS' to 'large PEBS' to unify the code,
which eliminates the confusion.

No functional change.

Reported-by: Vince Weaver <vincent.weaver@maine.edu>
Signed-off-by: Kan Liang <kan.liang@linux.intel.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Alexander Shishkin <alexander.shishkin@linux.intel.com>
Cc: Arnaldo Carvalho de Melo <acme@redhat.com>
Cc: Jiri Olsa <jolsa@redhat.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Stephane Eranian <eranian@google.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Link: http://lkml.kernel.org/r/1520865937-22910-1-git-send-email-kan.liang@linux.intel.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2018-03-20 08:58:29 +01:00
Stephane Eranian e340895c9e perf/x86/intel/uncore: Add missing filter constraint for SKX CHA event
Adding a filter constraint for Intel Skylake CHA event
UNC_CHA_UPI_CREDITS_ACQUIRED (0x38).

The event supports core-id/thread-id and link filtering.

Signed-off-by: Stephane Eranian <eranian@google.com>
Signed-off-by: Kan Liang <kan.liang@linux.intel.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Alexander Shishkin <alexander.shishkin@linux.intel.com>
Cc: Arnaldo Carvalho de Melo <acme@redhat.com>
Cc: Jiri Olsa <jolsa@redhat.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Vince Weaver <vincent.weaver@maine.edu>
Link: http://lkml.kernel.org/r/1520869294-14176-1-git-send-email-kan.liang@linux.intel.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2018-03-20 08:53:32 +01:00
Dan Carpenter e5ea9b54a0 perf/x86/intel: Don't accidentally clear high bits in bdw_limit_period()
We intended to clear the lowest 6 bits but because of a type bug we
clear the high 32 bits as well.  Andi says that periods are rarely more
than U32_MAX so this bug probably doesn't have a huge runtime impact.
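
The effect can be modelled as follows. This is a Python sketch that
assumes the mask was built from a 32-bit unsigned constant and then
zero-extended to 64 bits; the function names are illustrative.

```python
U32_MAX = 0xFFFFFFFF
U64_MAX = 0xFFFFFFFFFFFFFFFF


def clear_low6_buggy(period: int) -> int:
    """Mask computed in 32 bits: 0xFFFFFFC0, so bits 63:32 are zero too."""
    mask32 = ~0x3F & U32_MAX
    return period & mask32


def clear_low6_fixed(period: int) -> int:
    """Mask computed in 64 bits: only the lowest 6 bits are cleared."""
    mask64 = ~0x3F & U64_MAX
    return period & mask64
```

For periods up to U32_MAX both versions agree, which matches the remark
that the runtime impact is probably small.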

Signed-off-by: Dan Carpenter <dan.carpenter@oracle.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Alexander Shishkin <alexander.shishkin@linux.intel.com>
Cc: Arnaldo Carvalho de Melo <acme@redhat.com>
Cc: H. Peter Anvin <hpa@zytor.com>
Cc: Jiri Olsa <jolsa@redhat.com>
Cc: Kan Liang <kan.liang@linux.intel.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
Cc: Stephane Eranian <eranian@google.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Vince Weaver <vincent.weaver@maine.edu>
Fixes: 294fe0f52a ("perf/x86/intel: Add INST_RETIRED.ALL workarounds")
Link: http://lkml.kernel.org/r/20180317115216.GB4035@mwanda
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2018-03-20 08:53:31 +01:00
Kan Liang 2c2a9bbe7f perf/x86/intel: Disable userspace RDPMC usage for large PEBS
Userspace RDPMC cannot possibly work for large PEBS, which was introduced in:

  b8241d2069 ("perf/x86/intel: Implement batched PEBS interrupt handling (large PEBS interrupt threshold)")

When the PEBS interrupt threshold is larger than one, there is no way
to get exact auto-reload times and value for userspace RDPMC.  Disable
the userspace RDPMC usage when large PEBS is enabled.

The only exception is when the PEBS interrupt threshold is 1, in which
case user-space RDPMC works well even with auto-reload events.

Signed-off-by: Kan Liang <kan.liang@linux.intel.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Alexander Shishkin <alexander.shishkin@linux.intel.com>
Cc: Arnaldo Carvalho de Melo <acme@redhat.com>
Cc: Jiri Olsa <jolsa@redhat.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Stephane Eranian <eranian@google.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Vince Weaver <vincent.weaver@maine.edu>
Cc: acme@kernel.org
Fixes: b8241d2069 ("perf/x86/intel: Implement batched PEBS interrupt handling (large PEBS interrupt threshold)")
Link: http://lkml.kernel.org/r/1518474035-21006-6-git-send-email-kan.liang@linux.intel.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
(cherry picked from commit 1af22eba24)
2018-03-20 08:52:58 +01:00
H.J. Lu c55b8550fa x86/boot/64: Verify alignment of the LOAD segment
Since the x86-64 kernel must be aligned to 2MB, refuse to boot the
kernel if the alignment of the LOAD segment isn't a multiple of 2MB.
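
The check itself is a single modulo test. A Python sketch, assuming the
alignment is taken from the p_align field of the LOAD program header (the
function name is hypothetical):

```python
MIN_KERNEL_ALIGN = 2 * 1024 * 1024   # 2MB, i.e. 0x200000


def load_segment_alignment_ok(p_align: int) -> bool:
    """True if the LOAD segment alignment is a multiple of 2MB."""
    return p_align % MIN_KERNEL_ALIGN == 0
```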

Signed-off-by: H.J. Lu <hjl.tools@gmail.com>
Cc: Andy Shevchenko <andy.shevchenko@gmail.com>
Cc: Eric Biederman <ebiederm@xmission.com>
Cc: H. Peter Anvin <hpa@zytor.com>
Cc: Juergen Gross <jgross@suse.com>
Cc: Kees Cook <keescook@chromium.org>
Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Link: http://lkml.kernel.org/r/CAMe9rOrR7xSJgUfiCoZLuqWUwymRxXPoGBW38%2BpN%3D9g%2ByKNhZw@mail.gmail.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2018-03-20 08:03:03 +01:00
H.J. Lu e3d03598e8 x86/build/64: Force the linker to use 2MB page size
Binutils 2.31 will enable -z separate-code by default for x86 to avoid
mixing code pages with data to improve cache performance as well as
security.  To reduce x86-64 executable and shared object sizes, the
maximum page size is reduced from 2MB to 4KB.  But the x86-64 kernel must
be aligned to 2MB.  Pass -z max-page-size=0x200000 to the linker to force
2MB page size regardless of the default page size used by the linker.

Tested with Linux kernel 4.15.6 on x86-64.

Signed-off-by: H.J. Lu <hjl.tools@gmail.com>
Cc: Andy Shevchenko <andy.shevchenko@gmail.com>
Cc: Eric Biederman <ebiederm@xmission.com>
Cc: H. Peter Anvin <hpa@zytor.com>
Cc: Juergen Gross <jgross@suse.com>
Cc: Kees Cook <keescook@chromium.org>
Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Link: http://lkml.kernel.org/r/CAMe9rOp4_%3D_8twdpTyAP2DhONOCeaTOsniJLoppzhoNptL8xzA@mail.gmail.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2018-03-20 08:03:03 +01:00
Ingo Molnar 134933e557 Linux 4.16-rc6
-----BEGIN PGP SIGNATURE-----
 
 iQFSBAABCAA8FiEEq68RxlopcLEwq+PEeb4+QwBBGIYFAlqvCPYeHHRvcnZhbGRz
 QGxpbnV4LWZvdW5kYXRpb24ub3JnAAoJEHm+PkMAQRiGOaAH/171cgZGFEXSONxK
 3O1AAv61wN5K/ISMt6mnelWR6fZg195FarOx0Rnq7Ot8OWuVa8CGcyT4vX4Z7nb9
 SVMQKNMPCVQE4WCDOv6S0njChmRC0BxBoVJtTN9fhywdYgX1KcaTS/drMRHACF5n
 rB9eouMQScfMzKGAW08gp5NvEGJ6W1SLX7La3/u0751dYisdJSP7+vFZNxUrGXEA
 yIPOQjFu0Tfo8GXz/BwC678RZVzVLN0sE6+/vM7zNnoDlsRVkdDIVMo3UiVqm/NK
 B37/TlZz8CYoapoKnRRB5giXnSPDSXtsikbGy3mcy0u5imGe+ZgdjrdYSaLk31cR
 NVZY08k=
 =pu3X
 -----END PGP SIGNATURE-----

Merge tag 'v4.16-rc6' into perf/core, to pick up fixes

Signed-off-by: Ingo Molnar <mingo@kernel.org>
2018-03-19 20:37:35 +01:00
Linus Torvalds 9e1909b9da Merge branch 'x86-pti-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull x86/pti updates from Thomas Gleixner:
 "Another set of melted spectrum updates:

   - Iron out the last late microcode loading issues by actually
     checking whether new microcode is present and preventing the CPU
     synchronization from running into a timeout-induced hang.

   - Remove Skylake C2 from the microcode blacklist according to the
     latest Intel documentation

   - Fix the VM86 POPF emulation which traps if VIP is set, but VIF is
     not. Enhance the selftests to catch that kind of issue

   - Annotate indirect calls/jumps for objtool on 32bit. This is not a
     functional issue, but for consistency's sake it's the right thing to
     do.

   - Fix a jump label build warning observed on SPARC64, which uses 32-bit
     storage for the code location; that value is cast to a 64-bit pointer
     without first being extended to 64 bits.

   - Add two new cpufeature bits. Not really an urgent issue, but
     provides them for both x86 and x86/kvm work. No impact on the
     current kernel"

* 'x86-pti-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
  x86/microcode: Fix CPU synchronization routine
  x86/microcode: Attempt late loading only when new microcode is present
  x86/speculation: Remove Skylake C2 from Speculation Control microcode blacklist
  jump_label: Fix sparc64 warning
  x86/speculation, objtool: Annotate indirect calls/jumps for objtool on 32-bit kernels
  x86/vm86/32: Fix POPF emulation
  selftests/x86/entry_from_vm86: Add test cases for POPF
  selftests/x86/entry_from_vm86: Exit with 1 if we fail
  x86/cpufeatures: Add Intel PCONFIG cpufeature
  x86/cpufeatures: Add Intel Total Memory Encryption cpufeature
2018-03-18 12:03:15 -07:00
Linus Torvalds df4fe17802 Merge branch 'x86-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull x86 fix from Thomas Gleixner:
 "A single fix for vmalloc_fault() which uses p*d_huge() unconditionally
  whether CONFIG_HUGETLBFS is set or not. In case of CONFIG_HUGETLBFS=n
  this results in a crash as p*d_huge() returns 0 in that case"

* 'x86-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
  x86/mm: Fix vmalloc_fault to use pXd_large
2018-03-18 12:01:14 -07:00
Khalid Aziz d84bb709aa signals, sparc: Add signal codes for ADI violations
The SPARC M7 processor introduces a new feature, Application Data
Integrity (ADI). ADI allows the MMU to catch rogue accesses to memory.
When a rogue access occurs, the MMU blocks the access and raises an
exception. In response to the exception, the kernel sends the offending
task a SIGSEGV with an si_code that indicates the nature of the exception.
This patch adds three new signal codes specific to the ADI feature:

1. ADI is not enabled for the address and task attempted to access
   memory using ADI
2. Task attempted to access memory using wrong ADI tag and caused
   a deferred exception.
3. Task attempted to access memory using wrong ADI tag and caused
   a precise exception.

Signed-off-by: Khalid Aziz <khalid.aziz@oracle.com>
Cc: Khalid Aziz <khalid@gonehiking.org>
Reviewed-by: Anthony Yznaga <anthony.yznaga@oracle.com>
Acked-by: "Eric W. Biederman" <ebiederm@xmission.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2018-03-18 07:38:45 -07:00
Borislav Petkov bb8c13d61a x86/microcode: Fix CPU synchronization routine
Emanuel reported an issue with a hang during microcode update because my
dumb idea to use one atomic synchronization variable for both rendezvous
- before and after update - was simply bollocks:

  microcode: microcode_reload_late: late_cpus: 4
  microcode: __reload_late: cpu 2 entered
  microcode: __reload_late: cpu 1 entered
  microcode: __reload_late: cpu 3 entered
  microcode: __reload_late: cpu 0 entered
  microcode: __reload_late: cpu 1 left
  microcode: Timeout while waiting for CPUs rendezvous, remaining: 1

CPU1 above would finish, leave and the others will still spin waiting for
it to join.

So do two synchronization atomics instead, which makes the code a lot more
straightforward.

Also, since the update is serialized and it also takes quite some time per
microcode engine, increase the exit timeout by the number of CPUs on the
system.

That's ok because the moment all CPUs are done, that timeout will be cut
short.

Furthermore, panic when some of the CPUs time out when returning from a
microcode update: we can't allow a system with not all cores updated.

Also, as an optimization, do not do the exit sync if microcode wasn't
updated.
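
A deterministic toy model of the problem, in Python. It assumes a scheme
in which each CPU decrements the counter on entry and waits for zero, then
increments it on exit; it replays one particular interleaving and is not
the kernel code.

```python
def single_counter_rendezvous(num_cpus: int) -> bool:
    """One shared counter for both rendezvous; True if all CPUs get through."""
    counter = num_cpus
    for _ in range(num_cpus):      # every CPU enters and decrements
        counter -= 1
    passed = 0
    for _ in range(num_cpus):      # CPUs re-read the counter one after another
        if counter == 0:           # entry rendezvous: wait for zero
            passed += 1
            counter += 1           # exit signal lands on the same counter
    return passed == num_cpus


def two_counter_rendezvous(num_cpus: int) -> bool:
    """Separate entry and exit counters; True if all CPUs get through."""
    enter = num_cpus
    exited = 0
    for _ in range(num_cpus):      # every CPU enters and decrements
        enter -= 1
    passed = 0
    for _ in range(num_cpus):
        if enter == 0:             # entry rendezvous is never disturbed
            passed += 1
            exited += 1            # exit rendezvous has its own counter
    return passed == num_cpus and exited == num_cpus
```

With four CPUs the single-counter version lets only the first CPU through:
its exit increment makes the counter non-zero before the others re-read it.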

Reported-by: Emanuel Czirai <xftroxgpx@protonmail.com>
Signed-off-by: Borislav Petkov <bp@suse.de>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Tested-by: Emanuel Czirai <xftroxgpx@protonmail.com>
Tested-by: Ashok Raj <ashok.raj@intel.com>
Tested-by: Tom Lendacky <thomas.lendacky@amd.com>
Link: https://lkml.kernel.org/r/20180314183615.17629-2-bp@alien8.de
2018-03-16 20:55:51 +01:00
Borislav Petkov 2613f36ed9 x86/microcode: Attempt late loading only when new microcode is present
Return UCODE_NEW from the scanning functions to denote that new microcode
was found and only then attempt the expensive synchronization dance.

Reported-by: Emanuel Czirai <xftroxgpx@protonmail.com>
Signed-off-by: Borislav Petkov <bp@suse.de>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Tested-by: Emanuel Czirai <xftroxgpx@protonmail.com>
Tested-by: Ashok Raj <ashok.raj@intel.com>
Tested-by: Tom Lendacky <thomas.lendacky@amd.com>
Link: https://lkml.kernel.org/r/20180314183615.17629-1-bp@alien8.de
2018-03-16 20:55:51 +01:00
Peter Zijlstra edb39592a5 perf: Fix sibling iteration
Mark noticed that the change to sibling_list changed some iteration
semantics; because previously we used group_list as list entry,
sibling events would always have an empty sibling_list.

But because we now use sibling_list for both list head and list entry,
siblings will report as having siblings.

Fix this with a custom for_each_sibling_event() iterator.

Fixes: 8343aae661 ("perf/core: Remove perf_event::group_entry")
Reported-by: Mark Rutland <mark.rutland@arm.com>
Suggested-by: Mark Rutland <mark.rutland@arm.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Cc: vincent.weaver@maine.edu
Cc: alexander.shishkin@linux.intel.com
Cc: torvalds@linux-foundation.org
Cc: alexey.budankov@linux.intel.com
Cc: valery.cherepennikov@intel.com
Cc: eranian@google.com
Cc: acme@redhat.com
Cc: linux-tip-commits@vger.kernel.org
Cc: davidcc@google.com
Cc: kan.liang@intel.com
Cc: Dmitry.Prohorov@intel.com
Cc: jolsa@redhat.com
Link: https://lkml.kernel.org/r/20180315170129.GX4043@hirez.programming.kicks-ass.net
2018-03-16 20:44:12 +01:00
Rajvi Jingar fc804f65d4 x86/tsc: Convert ART in nanoseconds to TSC
Device drivers use get_device_system_crosststamp() to produce precise
system/device cross-timestamps. The PHC clock and ALSA interfaces, for
example, make the cross-timestamps available to user applications.  On
Intel platforms, get_device_system_crosststamp() requires a TSC value
derived from ART (Always Running Timer) to compute the monotonic raw and
realtime system timestamps.

Starting with Intel Goldmont platforms, the PCIe root complex supports the
PTM time sync protocol. PTM requires all timestamps to be in units of
nanoseconds. The Intel root complex hardware propagates system time derived
from ART in units of nanoseconds performing the conversion as follows:

     ART_NS = ART * 1e9 / <crystal frequency>

When user software requests a cross-timestamp, the system timestamps
(generally read from device registers) must be converted to TSC by the
driver software as follows:

    TSC = ART_NS * TSC_KHZ / 1e6

This is valid when CPU feature flag X86_FEATURE_TSC_KNOWN_FREQ is set
indicating that tsc_khz is derived from CPUID[15H]. Drivers should check
whether this flag is set before conversion to TSC is attempted.
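
The two conversions, worked through in Python with integer arithmetic.
The function names and the example frequencies are illustrative
assumptions; kernel code would additionally have to guard against 64-bit
overflow in the multiplication.

```python
NSEC_PER_SEC = 1_000_000_000
USEC_PER_SEC = 1_000_000


def art_to_art_ns(art: int, crystal_hz: int) -> int:
    """ART_NS = ART * 1e9 / <crystal frequency>"""
    return art * NSEC_PER_SEC // crystal_hz


def art_ns_to_tsc(art_ns: int, tsc_khz: int) -> int:
    """TSC = ART_NS * TSC_KHZ / 1e6"""
    return art_ns * tsc_khz // USEC_PER_SEC
```

The units work out because nanoseconds times kHz gives ticks scaled by
1e-6, hence the division by 1e6.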

Suggested-by: Christopher S. Hall <christopher.s.hall@intel.com>
Signed-off-by: Rajvi Jingar <rajvi.jingar@intel.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Cc: peterz@infradead.org
Link: https://lkml.kernel.org/r/1520530116-4925-1-git-send-email-rajvi.jingar@intel.com
2018-03-16 15:14:35 +01:00
Tom Lendacky daaf216c06 KVM: x86: Fix device passthrough when SME is active
When using device passthrough with SME active, the MMIO range that is
mapped for the device should not be mapped encrypted.  Add a check in
set_spte() to ensure that a page is not mapped encrypted if that page
is a device MMIO page as indicated by kvm_is_mmio_pfn().
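
The shape of such a check can be sketched as follows. This is a simplified userspace illustration: make_spte(), EXAMPLE_ME_MASK and the is_mmio parameter are hypothetical stand-ins, and the bit position of the mask is an assumption chosen only for the example.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical encryption-bit mask; the position is illustrative only. */
#define EXAMPLE_ME_MASK (1ULL << 47)

/*
 * Build a shadow PTE value from a page frame address and protection
 * bits. The encryption mask is applied to ordinary RAM pages only;
 * device MMIO pages are left unencrypted.
 */
static uint64_t make_spte(uint64_t pfn_addr, uint64_t prot, bool is_mmio)
{
	uint64_t spte = pfn_addr | prot;

	if (!is_mmio)
		spte |= EXAMPLE_ME_MASK;

	return spte;
}
```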

Cc: <stable@vger.kernel.org> # 4.14.x-
Signed-off-by: Tom Lendacky <thomas.lendacky@amd.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2018-03-16 14:32:23 +01:00
Alexander Sergeyev e3b3121fa8 x86/speculation: Remove Skylake C2 from Speculation Control microcode blacklist
In accordance with Intel's microcode revision guidance from March 6, MCU
rev 0xc2 is cleared on both Skylake H/S and Skylake Xeon E3 processors
that share CPUID 506E3.

Signed-off-by: Alexander Sergeyev <sergeev917@gmail.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Cc: Jia Zhang <qianyue.zj@alibaba-inc.com>
Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Cc: Kyle Huey <me@kylehuey.com>
Cc: David Woodhouse <dwmw@amazon.co.uk>
Link: https://lkml.kernel.org/r/20180313193856.GA8580@localhost.localdomain
2018-03-16 12:33:11 +01:00
Dave Martin 266da65e91 signal: Add FPE_FLTUNK si_code for undiagnosable fp exceptions
Some architectures cannot always report accurately what kind of
floating-point exception triggered a floating-point exception trap.

This can occur with fp exceptions occurring on lanes in a vector
instruction on arm64 for example.

Rather than have every architecture come up with its own way of
describing such a condition, this patch adds a common FPE_FLTUNK
si_code value to report that an fp exception caused a trap but we
cannot be certain which kind of fp exception it was.

Signed-off-by: Dave Martin <Dave.Martin@arm.com>
Signed-off-by: Eric W. Biederman <ebiederm@xmission.com>
2018-03-15 16:04:25 -05:00
Toshi Kani 565977a3d9 x86/mm: Remove pointless checks in vmalloc_fault
vmalloc_fault() sets user's pgd or p4d from the kernel page table.  Once
it's set, all tables underneath are identical. There is no point in
following the same page table with two separate pointers and making sure
they see the same with BUG().

Remove the pointless checks in vmalloc_fault(). Also rename the kernel
pgd/p4d pointers to pgd_k/p4d_k so that their names are consistent in the
file.

Suggested-by: Andy Lutomirski <luto@kernel.org>
Signed-off-by: Toshi Kani <toshi.kani@hpe.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Cc: linux-mm@kvack.org
Cc: Borislav Petkov <bp@alien8.de>
Cc: Gratian Crisan <gratian.crisan@ni.com>
Link: https://lkml.kernel.org/r/20180314205932.7193-1-toshi.kani@hpe.com
2018-03-15 15:27:47 +01:00
Benjamin Gaignard 13cc36d76b x86/rtc: Stop using deprecated functions
rtc_time_to_tm() and rtc_tm_to_time() are deprecated because they
rely on 32-bit variables, which will make the RTC break in y2038/y2106.

Use the proper y2038 safe functions.
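
The 32-bit limits behind this change can be illustrated with a small sketch. The helper names fits_s32() and fits_u32() are hypothetical; they only show where a seconds-since-epoch value stops fitting in a 32-bit variable. In the kernel, the 64-bit counterparts of the deprecated helpers are rtc_time64_to_tm() and rtc_tm_to_time64().

```c
#include <stdint.h>

/* True if a seconds-since-1970 value fits in a signed 32-bit variable. */
static int fits_s32(int64_t secs)
{
	return secs >= INT32_MIN && secs <= INT32_MAX;
}

/* True if a seconds-since-1970 value fits in an unsigned 32-bit variable. */
static int fits_u32(int64_t secs)
{
	return secs >= 0 && secs <= UINT32_MAX;
}
```

The first value that no longer fits a signed 32-bit counter is 2147483648 (2038-01-19 03:14:08 UTC); for an unsigned 32-bit counter it is 4294967296 (2106-02-07 06:28:16 UTC).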

Signed-off-by: Benjamin Gaignard <benjamin.gaignard@linaro.org>
Signed-off-by: John Stultz <john.stultz@linaro.org>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Cc: Prarit Bhargava <prarit@redhat.com>
Cc: Arnd Bergmann <arnd@arndb.de>
Cc: Richard Cochran <richardcochran@gmail.com>
Cc: Stephen Boyd <stephen.boyd@linaro.org>
Cc: Miroslav Lichvar <mlichvar@redhat.com>
Cc: Alexandre Belloni <alexandre.belloni@free-electrons.com>
Link: https://lkml.kernel.org/r/1520620971-9567-5-git-send-email-john.stultz@linaro.org
2018-03-15 09:47:24 +01:00
Dan Williams a7e6c7015b x86, memremap: fix altmap accounting at free
Commit 24b6d41643 "mm: pass the vmem_altmap to vmemmap_free" converted
the vmemmap_free() path to pass the altmap argument all the way through
the call chain rather than looking it up based on the page.
Unfortunately that ends up over freeing altmap allocated pages in some
cases since free_pagetable() is used to free both memmap space and pte
space, where only the memmap stored in huge pages uses altmap
allocations.

Given that altmap allocations for memmap space are special cased in
vmemmap_populate_hugepages() add a symmetric / special case
free_hugepage_table() to handle altmap freeing, and cleanup the unneeded
passing of altmap to leaf functions that do not require it.

Without this change the sanity check accounting in
devm_memremap_pages_release() will throw a warning with the following
signature.

 nd_pmem pfn10.1: devm_memremap_pages_release: failed to free all reserved pages
 WARNING: CPU: 44 PID: 3539 at kernel/memremap.c:310 devm_memremap_pages_release+0x1c7/0x220
 CPU: 44 PID: 3539 Comm: ndctl Tainted: G             L   4.16.0-rc1-linux-stable #7
 RIP: 0010:devm_memremap_pages_release+0x1c7/0x220
 [..]
 Call Trace:
  release_nodes+0x225/0x270
  device_release_driver_internal+0x15d/0x210
  bus_remove_device+0xe2/0x160
  device_del+0x130/0x310
  ? klist_release+0x56/0x100
  ? nd_region_notify+0xc0/0xc0 [libnvdimm]
  device_unregister+0x16/0x60

This was missed in testing since not all configurations will trigger
this warning.

Fixes: 24b6d41643 ("mm: pass the vmem_altmap to vmemmap_free")
Reported-by: Jane Chu <jane.chu@oracle.com>
Cc: Ross Zwisler <ross.zwisler@linux.intel.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Dan Williams <dan.j.williams@intel.com>
2018-03-14 14:46:23 -07:00
Thomas Gleixner 745dd37f9d Merge branch 'x86/urgent' into x86/mm to pick up dependencies 2018-03-14 20:23:25 +01:00
Toshi Kani 18a955219b x86/mm: Fix vmalloc_fault to use pXd_large
Gratian Crisan reported that vmalloc_fault() crashes when CONFIG_HUGETLBFS
is not set since the function inadvertently uses pXd_huge(), which always
returns 0 in this case.  ioremap() does not depend on CONFIG_HUGETLBFS.

Fix vmalloc_fault() to call pXd_large() instead.

Fixes: f4eafd8bcd ("x86/mm: Fix vmalloc_fault() to handle large pages properly")
Reported-by: Gratian Crisan <gratian.crisan@ni.com>
Signed-off-by: Toshi Kani <toshi.kani@hpe.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Cc: stable@vger.kernel.org
Cc: linux-mm@kvack.org
Cc: Borislav Petkov <bp@alien8.de>
Cc: Andy Lutomirski <luto@kernel.org>
Link: https://lkml.kernel.org/r/20180313170347.3829-2-toshi.kani@hpe.com
2018-03-14 20:22:42 +01:00
Andy Whitcroft a14bff1311 x86/speculation, objtool: Annotate indirect calls/jumps for objtool on 32-bit kernels
In the following commit:

  9e0e3c5130 ("x86/speculation, objtool: Annotate indirect calls/jumps for objtool")

... we added annotations for CALL_NOSPEC/JMP_NOSPEC on 64-bit x86 kernels,
but we did not annotate the 32-bit path.

Annotate it similarly.

Signed-off-by: Andy Whitcroft <apw@canonical.com>
Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Andy Lutomirski <luto@kernel.org>
Cc: Arjan van de Ven <arjan@linux.intel.com>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Dan Williams <dan.j.williams@intel.com>
Cc: Dave Hansen <dave.hansen@linux.intel.com>
Cc: David Woodhouse <dwmw2@infradead.org>
Cc: David Woodhouse <dwmw@amazon.co.uk>
Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Cc: Josh Poimboeuf <jpoimboe@redhat.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Link: http://lkml.kernel.org/r/20180314112427.22351-1-apw@canonical.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2018-03-14 13:24:31 +01:00
Andy Lutomirski b506978245 x86/vm86/32: Fix POPF emulation
POPF would trap if VIP was set regardless of whether IF was set.  Fix it.

Suggested-by: Stas Sergeev <stsp@list.ru>
Reported-by: Bart Oldeman <bartoldeman@gmail.com>
Signed-off-by: Andy Lutomirski <luto@kernel.org>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Brian Gerst <brgerst@gmail.com>
Cc: Denys Vlasenko <dvlasenk@redhat.com>
Cc: H. Peter Anvin <hpa@zytor.com>
Cc: Josh Poimboeuf <jpoimboe@redhat.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: stable@vger.kernel.org
Fixes: 5ed92a8ab7 ("x86/vm86: Use the normal pt_regs area for vm86")
Link: http://lkml.kernel.org/r/ce95f40556e7b2178b6bc06ee9557827ff94bd28.1521003603.git.luto@kernel.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2018-03-14 09:21:01 +01:00
Peter Zijlstra 8343aae661 perf/core: Remove perf_event::group_entry
Now that all the grouping is done with RB trees, we no longer need
group_entry and can replace the whole thing with sibling_list.

Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Acked-by: Mark Rutland <mark.rutland@arm.com>
Cc: Alexander Shishkin <alexander.shishkin@linux.intel.com>
Cc: Alexey Budankov <alexey.budankov@linux.intel.com>
Cc: Arnaldo Carvalho de Melo <acme@redhat.com>
Cc: David Carrillo-Cisneros <davidcc@google.com>
Cc: Dmitri Prokhorov <Dmitry.Prohorov@intel.com>
Cc: Jiri Olsa <jolsa@redhat.com>
Cc: Kan Liang <kan.liang@intel.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Stephane Eranian <eranian@google.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Valery Cherepennikov <valery.cherepennikov@intel.com>
Cc: Vince Weaver <vincent.weaver@maine.edu>
Cc: linux-kernel@vger.kernel.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2018-03-12 15:28:49 +01:00
Andy Shevchenko 0242874263 x86/platform/intel-mid: Add special handling for ACPI HW reduced platforms
When switching to ACPI HW reduced platforms we still want to initialize timers.
Override x86_init.acpi.reduced_hw_init to achieve that.

Signed-off-by: Andy Shevchenko <andriy.shevchenko@linux.intel.com>
Reviewed-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
Cc: Eric Biederman <ebiederm@xmission.com>
Cc: Juergen Gross <jgross@suse.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Rafael J . Wysocki <rafael.j.wysocki@intel.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: linux-acpi@vger.kernel.org
Link: http://lkml.kernel.org/r/20180220180506.65523-3-andriy.shevchenko@linux.intel.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2018-03-12 12:32:57 +01:00
Andy Shevchenko 81b53e5ff2 ACPI, x86/boot: Introduce the ->reduced_hw_early_init() ACPI callback
Some ACPI hardware reduced platforms need to initialize certain devices
defined by the ACPI hardware specification even though in principle
those devices should not be present in an ACPI hardware reduced platform.

To allow that to happen, make it possible to override the generic
x86_init callbacks and provide a custom legacy_pic value, add a new
->reduced_hw_early_init() callback to struct x86_init_acpi and make
acpi_reduced_hw_init() use it.

Signed-off-by: Andy Shevchenko <andriy.shevchenko@linux.intel.com>
Reviewed-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
Cc: Eric Biederman <ebiederm@xmission.com>
Cc: Juergen Gross <jgross@suse.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Rafael J . Wysocki <rafael.j.wysocki@intel.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: linux-acpi@vger.kernel.org
Link: http://lkml.kernel.org/r/20180220180506.65523-2-andriy.shevchenko@linux.intel.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2018-03-12 12:32:57 +01:00
Andy Shevchenko 50beba07a0 ACPI, x86/boot: Split out acpi_generic_reduce_hw_init() and export
This is a preparation patch to allow overriding the hardware-reduced
initialization on ACPI-enabled platforms.

No functional change intended.

Signed-off-by: Andy Shevchenko <andriy.shevchenko@linux.intel.com>
Reviewed-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
Cc: Eric Biederman <ebiederm@xmission.com>
Cc: Juergen Gross <jgross@suse.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Rafael J . Wysocki <rafael.j.wysocki@intel.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: linux-acpi@vger.kernel.org
Link: http://lkml.kernel.org/r/20180220180506.65523-1-andriy.shevchenko@linux.intel.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2018-03-12 12:32:57 +01:00
Dmitry Vyukov ac605bee0b locking/atomic, asm-generic, x86: Add comments for atomic instrumentation
The comments are factored out from the code changes to make them
easier to read. Add them separately to explain some non-obvious
aspects.

Signed-off-by: Dmitry Vyukov <dvyukov@google.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Andrey Ryabinin <aryabinin@virtuozzo.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Mark Rutland <mark.rutland@arm.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Will Deacon <will.deacon@arm.com>
Cc: kasan-dev@googlegroups.com
Cc: linux-mm@kvack.org
Link: http://lkml.kernel.org/r/cc595efc644bb905407012d82d3eb8bac3368e7a.1517246437.git.dvyukov@google.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2018-03-12 12:15:35 +01:00
Dmitry Vyukov 8bf705d130 locking/atomic/x86: Switch atomic.h to use atomic-instrumented.h
Add arch_ prefix to all atomic operations and include
<asm-generic/atomic-instrumented.h>. This will allow adding
KASAN instrumentation to all atomic ops.
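
The wrapper pattern described here can be sketched in userspace C. This is an illustration under stated assumptions, not the kernel header: check_access() is a hypothetical stand-in for an instrumentation hook and merely counts calls, and the arch_ operations use the GCC/Clang __atomic builtins.

```c
#include <stddef.h>

typedef struct { int counter; } atomic_t;

/* Hypothetical instrumentation hook; it only counts checked accesses. */
static unsigned long checked_accesses;

static void check_access(const volatile void *addr, size_t size)
{
	(void)addr;
	(void)size;
	checked_accesses++;
}

/* Architecture-level operations carry the arch_ prefix. */
static int arch_atomic_read(const atomic_t *v)
{
	return __atomic_load_n(&v->counter, __ATOMIC_RELAXED);
}

static void arch_atomic_set(atomic_t *v, int i)
{
	__atomic_store_n(&v->counter, i, __ATOMIC_RELAXED);
}

/* Generic wrappers: instrument the access, then call the arch_ op. */
static int atomic_read(const atomic_t *v)
{
	check_access(v, sizeof(*v));
	return arch_atomic_read(v);
}

static void atomic_set(atomic_t *v, int i)
{
	check_access(v, sizeof(*v));
	arch_atomic_set(v, i);
}
```

Callers keep using the unprefixed names, so instrumentation can be added in one place without touching the architecture code.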

Signed-off-by: Dmitry Vyukov <dvyukov@google.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Andrey Ryabinin <aryabinin@virtuozzo.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Mark Rutland <mark.rutland@arm.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Will Deacon <will.deacon@arm.com>
Cc: kasan-dev@googlegroups.com
Cc: linux-mm@kvack.org
Link: http://lkml.kernel.org/r/54f0eb64260b84199e538652e079a89b5423ad41.1517246437.git.dvyukov@google.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2018-03-12 12:15:35 +01:00
Kirill A. Shutemov 24c517856a x86/pconfig: Provide defines and helper to run MKTME_KEY_PROG leaf
MKTME_KEY_PROG allows manipulating MKTME keys in the CPU.

Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Cc: Dave Hansen <dave.hansen@intel.com>
Cc: Kai Huang <kai.huang@linux.intel.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Tom Lendacky <thomas.lendacky@amd.com>
Cc: linux-mm@kvack.org
Link: http://lkml.kernel.org/r/20180305162610.37510-6-kirill.shutemov@linux.intel.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2018-03-12 12:10:54 +01:00
Kirill A. Shutemov be7825c19b x86/pconfig: Detect PCONFIG targets
Intel PCONFIG targets are enumerated via the new CPUID leaf 0x1b. This patch
detects all supported PCONFIG targets and implements a helper to check
whether a given target is supported.
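
A helper of this kind could be backed by a simple bitmap, as in the sketch below. The names record_target() and target_supported() are hypothetical, and the decoding of the CPUID sub-leaves that would feed record_target() is deliberately left out.

```c
#include <stdbool.h>
#include <stdint.h>

/* Bitmap of supported targets, filled in during enumeration. */
static uint64_t targets_supported;

/* Mark one enumerated target as supported (hypothetical helper). */
static void record_target(unsigned int target)
{
	if (target < 64)
		targets_supported |= 1ULL << target;
}

/* Check whether a target was seen during enumeration. */
static bool target_supported(unsigned int target)
{
	return target < 64 && (targets_supported & (1ULL << target)) != 0;
}
```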

Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Cc: Dave Hansen <dave.hansen@intel.com>
Cc: Kai Huang <kai.huang@linux.intel.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Tom Lendacky <thomas.lendacky@amd.com>
Cc: linux-mm@kvack.org
Link: http://lkml.kernel.org/r/20180305162610.37510-5-kirill.shutemov@linux.intel.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2018-03-12 12:10:54 +01:00
Kirill A. Shutemov cb06d8e3d0 x86/tme: Detect if TME and MKTME is activated by BIOS
IA32_TME_ACTIVATE MSR (0x982) can be used to check if BIOS has enabled
TME and MKTME. It includes which encryption policy/algorithm is selected
for TME or available for MKTME. For MKTME, the MSR also enumerates how
many KeyIDs are available.

We would need to exclude KeyID bits from physical address bits.
detect_tme() would adjust cpuinfo_x86::x86_phys_bits accordingly.

We have to do this even if we are not going to use KeyID bits
ourselves. VM guests still have to know that these bits are not usable
for physical addresses.
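
The adjustment can be sketched as below. This is an illustration only: usable_phys_bits() is a hypothetical helper, and the field layout (KeyID bit count in bits 35:32 of the MSR value) is stated here as an assumption rather than taken from the commit message.

```c
#include <stdint.h>

/* Assumed layout: number of KeyID bits in bits 35:32 of the MSR value. */
#define EXAMPLE_KEYID_BITS(msr) ((unsigned int)(((msr) >> 32) & 0xf))

/*
 * Return the number of physical address bits left once the KeyID bits
 * have been excluded from the reported width.
 */
static unsigned int usable_phys_bits(unsigned int phys_bits,
				     uint64_t tme_activate)
{
	unsigned int keyid_bits = EXAMPLE_KEYID_BITS(tme_activate);

	return keyid_bits < phys_bits ? phys_bits - keyid_bits : 0;
}
```

With 46 reported physical address bits and 6 KeyID bits, 40 bits remain usable for addressing.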

Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Cc: Dave Hansen <dave.hansen@intel.com>
Cc: Kai Huang <kai.huang@linux.intel.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Tom Lendacky <thomas.lendacky@amd.com>
Cc: linux-mm@kvack.org
Link: http://lkml.kernel.org/r/20180305162610.37510-3-kirill.shutemov@linux.intel.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2018-03-12 12:10:54 +01:00
Ingo Molnar 3c76db70eb Merge branch 'x86/pti' into x86/mm, to pick up dependencies
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2018-03-12 12:10:03 +01:00
Kirill A. Shutemov 7958b2246f x86/cpufeatures: Add Intel PCONFIG cpufeature
CPUID.0x7.0x0:EDX[18] indicates whether an Intel CPU supports the PCONFIG instruction.

Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Cc: Dave Hansen <dave.hansen@intel.com>
Cc: Kai Huang <kai.huang@linux.intel.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Tom Lendacky <thomas.lendacky@amd.com>
Cc: linux-mm@kvack.org
Link: http://lkml.kernel.org/r/20180305162610.37510-4-kirill.shutemov@linux.intel.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2018-03-12 12:09:53 +01:00
Kirill A. Shutemov 1da961d72a x86/cpufeatures: Add Intel Total Memory Encryption cpufeature
CPUID.0x7.0x0:ECX[13] indicates whether CPU supports Intel Total Memory
Encryption.

Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Cc: Dave Hansen <dave.hansen@intel.com>
Cc: Kai Huang <kai.huang@linux.intel.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Tom Lendacky <thomas.lendacky@amd.com>
Cc: linux-mm@kvack.org
Link: http://lkml.kernel.org/r/20180305162610.37510-2-kirill.shutemov@linux.intel.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2018-03-12 12:09:53 +01:00
Kirill A. Shutemov 194a9749c7 x86/boot/compressed/64: Handle 5-level paging boot if kernel is above 4G
This patch addresses a shortcoming in the current boot process on machines
that support 5-level paging.

If a bootloader enables 64-bit mode with 4-level paging, we might need to
switch over to 5-level paging. Switching requires disabling paging, which
works fine if the kernel itself is loaded below 4G.

But if the bootloader puts the kernel above 4G (not sure if anybody does
this), we would lose control as soon as paging is disabled, because the
code becomes unreachable to the CPU.

This patch implements a trampoline in lower memory to handle this
situation.

We only need the memory for a very short time, until the main kernel
image sets up its own page tables.

We go through the trampoline even if we don't have to: if we're already
in 5-level paging mode or if we don't need to switch to it. This way the
trampoline gets tested on every boot.

Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Cc: Andy Lutomirski <luto@amacapital.net>
Cc: Andy Lutomirski <luto@kernel.org>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Borislav Petkov <bp@suse.de>
Cc: Brian Gerst <brgerst@gmail.com>
Cc: Cyrill Gorcunov <gorcunov@openvz.org>
Cc: Denys Vlasenko <dvlasenk@redhat.com>
Cc: H. Peter Anvin <hpa@zytor.com>
Cc: Josh Poimboeuf <jpoimboe@redhat.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Matthew Wilcox <willy@infradead.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: linux-mm@kvack.org
Link: http://lkml.kernel.org/r/20180312100246.89175-5-kirill.shutemov@linux.intel.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2018-03-12 11:49:25 +01:00
Kirill A. Shutemov 0a1756bd28 x86/boot/compressed/64: Use page table in trampoline memory
If a bootloader enables 64-bit mode with 4-level paging, we might need to
switch over to 5-level paging. Switching requires disabling paging, which
works fine if the kernel itself is loaded below 4G.

But if the bootloader puts the kernel above 4G (e.g. in the kexec() case),
we would lose control as soon as paging is disabled, because the code
becomes unreachable to the CPU.

To handle the situation, we need a trampoline in lower memory that would
take care of switching on 5-level paging.

Apart from the trampoline code itself, we also need a place to store the
top-level page table in lower memory, as we don't have a way to load
64-bit values into CR3 in 32-bit mode. We only really need 8 bytes there
as we only use the very first entry of the page table. But we allocate a
whole page anyway.

This patch switches 32-bit code to use the page table in trampoline memory.

Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Cc: Andy Lutomirski <luto@amacapital.net>
Cc: Andy Lutomirski <luto@kernel.org>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Borislav Petkov <bp@suse.de>
Cc: Brian Gerst <brgerst@gmail.com>
Cc: Cyrill Gorcunov <gorcunov@openvz.org>
Cc: Denys Vlasenko <dvlasenk@redhat.com>
Cc: H. Peter Anvin <hpa@zytor.com>
Cc: Josh Poimboeuf <jpoimboe@redhat.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Matthew Wilcox <willy@infradead.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: linux-mm@kvack.org
Link: http://lkml.kernel.org/r/20180312100246.89175-4-kirill.shutemov@linux.intel.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2018-03-12 11:49:25 +01:00
Kirill A. Shutemov f7ff53e470 x86/boot/compressed/64: Use stack from trampoline memory
As the first step in using trampoline memory, let's make 32-bit code use
a stack there.

A separate stack is required to return from the trampoline, and we cannot
use the stack from 64-bit mode as it may be above 4G.

Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Cc: Andy Lutomirski <luto@amacapital.net>
Cc: Andy Lutomirski <luto@kernel.org>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Borislav Petkov <bp@suse.de>
Cc: Brian Gerst <brgerst@gmail.com>
Cc: Cyrill Gorcunov <gorcunov@openvz.org>
Cc: Denys Vlasenko <dvlasenk@redhat.com>
Cc: H. Peter Anvin <hpa@zytor.com>
Cc: Josh Poimboeuf <jpoimboe@redhat.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Matthew Wilcox <willy@infradead.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: linux-mm@kvack.org
Link: http://lkml.kernel.org/r/20180312100246.89175-3-kirill.shutemov@linux.intel.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2018-03-12 11:49:24 +01:00
Kirill A. Shutemov 7beebaccd5 x86/boot/compressed/64: Make sure we have a 32-bit code segment
When the kernel starts in 64-bit mode we inherit the GDT from the bootloader.
It may cause a problem if the GDT doesn't have a 32-bit code segment
where we expect it to be.

Load our own GDT with known segments.

Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Cc: Andy Lutomirski <luto@amacapital.net>
Cc: Andy Lutomirski <luto@kernel.org>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Borislav Petkov <bp@suse.de>
Cc: Brian Gerst <brgerst@gmail.com>
Cc: Cyrill Gorcunov <gorcunov@openvz.org>
Cc: Denys Vlasenko <dvlasenk@redhat.com>
Cc: H. Peter Anvin <hpa@zytor.com>
Cc: Josh Poimboeuf <jpoimboe@redhat.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Matthew Wilcox <willy@infradead.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: linux-mm@kvack.org
Link: http://lkml.kernel.org/r/20180312100246.89175-2-kirill.shutemov@linux.intel.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2018-03-12 11:49:24 +01:00
Sai Praneeth 03781e4089 x86/efi: Use efi_switch_mm() rather than manually twiddling with %cr3
Use helper function efi_switch_mm() to switch to/from efi_mm when
invoking any UEFI runtime services.

Likewise, we need to switch back to the previous mm (the mm context stolen
by efi_mm) after the above calls return successfully. We can use the
efi_switch_mm() helper function only with an x86_64 kernel and
"efi=old_map" disabled, because x86_32 and efi=old_map do not use
efi_pgd; they use swapper_pg_dir instead.

Tested-by: Bhupesh Sharma <bhsharma@redhat.com>
[ardb: add #include of sched/task.h for task_lock/_unlock]
Signed-off-by: Sai Praneeth Prakhya <sai.praneeth.prakhya@intel.com>
Signed-off-by: Ard Biesheuvel <ard.biesheuvel@linaro.org>
Reviewed-by: Matt Fleming <matt@codeblueprint.co.uk>
Cc: Andy Lutomirski <luto@kernel.org>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Lee, Chun-Yi <jlee@suse.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Michael S. Tsirkin <mst@redhat.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Ravi Shankar <ravi.v.shankar@intel.com>
Cc: Ricardo Neri <ricardo.neri@intel.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Tony Luck <tony.luck@intel.com>
Cc: linux-efi@vger.kernel.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2018-03-12 11:05:05 +01:00
Sai Praneeth 3ede3417f8 x86/efi: Replace efi_pgd with efi_mm.pgd
Since the previous patch added support for efi_mm, let's handle efi_pgd
through efi_mm and remove global variable efi_pgd.

Tested-by: Bhupesh Sharma <bhsharma@redhat.com>
Signed-off-by: Sai Praneeth Prakhya <sai.praneeth.prakhya@intel.com>
Signed-off-by: Ard Biesheuvel <ard.biesheuvel@linaro.org>
Reviewed-by: Matt Fleming <matt@codeblueprint.co.uk>
Cc: Andy Lutomirski <luto@kernel.org>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Lee, Chun-Yi <jlee@suse.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Michael S. Tsirkin <mst@redhat.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Ravi Shankar <ravi.v.shankar@intel.com>
Cc: Ricardo Neri <ricardo.neri@intel.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Tony Luck <tony.luck@intel.com>
Cc: linux-efi@vger.kernel.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2018-03-12 11:05:05 +01:00
Kirill A. Shutemov a5b162b2ec x86/mm: Do not use paravirtualized calls in native_set_p4d()
In 4-level paging mode, native_set_p4d() updates the entry in the top-level
page table. With PTI, update to the top-level kernel page table requires
update to the userspace copy of the table as well, using pti_set_user_pgd().

native_set_p4d() uses p4d_val() and pgd_val() to convert types between
p4d_t and pgd_t.

p4d_val() and pgd_val() are paravirtualized and we must not use them in
native helpers, as they crash the boot in paravirtualized environments.

Replace p4d_val() and pgd_val() with native_p4d_val() and
native_pgd_val() in native_set_p4d().

Reported-by: Fengguang Wu <fengguang.wu@intel.com>
Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Cc: Andy Lutomirski <luto@kernel.org>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Brian Gerst <brgerst@gmail.com>
Cc: Denys Vlasenko <dvlasenk@redhat.com>
Cc: H. Peter Anvin <hpa@zytor.com>
Cc: Josh Poimboeuf <jpoimboe@redhat.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Fixes: 91f606a8fa ("x86/mm: Replace compile-time checks for 5-level paging with runtime-time checks")
Link: http://lkml.kernel.org/r/20180305081641.4290-1-kirill.shutemov@linux.intel.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2018-03-12 10:30:48 +01:00
Ard Biesheuvel 36b649760e efi: Use string literals for efi_char16_t variable initializers
Now that we unambiguously build the entire kernel with -fshort-wchar,
it is no longer necessary to open code efi_char16_t[] initializers as
arrays of characters, and we can move to the L"xxx" notation instead.
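
The two initializer styles can be compared in a small sketch. It uses wchar_t rather than efi_char16_t so that it compiles with or without -fshort-wchar; the variable names and the "Boot" string are hypothetical. With -fshort-wchar, wchar_t is 16 bits wide, which is what lets L"xxx" literals initialize 16-bit character arrays.

```c
#include <stddef.h>
#include <string.h>

/* Open-coded form: one element per character plus the terminator. */
static const wchar_t name_open_coded[] = { 'B', 'o', 'o', 't', 0 };

/* Equivalent wide string literal. */
static const wchar_t name_literal[] = L"Boot";

/* Returns 1 if both initializers produce identical arrays. */
static int same_initializer(void)
{
	return sizeof(name_open_coded) == sizeof(name_literal) &&
	       memcmp(name_open_coded, name_literal,
		      sizeof(name_literal)) == 0;
}
```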

Signed-off-by: Ard Biesheuvel <ard.biesheuvel@linaro.org>
Cc: Arnd Bergmann <arnd@arndb.de>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Lukas Wunner <lukas@wunner.de>
Cc: Matt Fleming <matt@codeblueprint.co.uk>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: linux-efi@vger.kernel.org
Link: http://lkml.kernel.org/r/20180312084500.10764-6-ard.biesheuvel@linaro.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2018-03-12 10:05:02 +01:00
Sai Praneeth 7e904a91bf efi: Use efi_mm in x86 as well as ARM
Presently, only ARM uses mm_struct to manage EFI page tables and EFI
runtime region mappings. As this is the preferred approach, let's make
this data structure common across architectures. Specifically, for x86,
using this data structure improves code maintainability and readability.

Tested-by: Bhupesh Sharma <bhsharma@redhat.com>
[ardb: don't #include the world to get a declaration of struct mm_struct]
Signed-off-by: Sai Praneeth Prakhya <sai.praneeth.prakhya@intel.com>
Signed-off-by: Ard Biesheuvel <ard.biesheuvel@linaro.org>
Reviewed-by: Matt Fleming <matt@codeblueprint.co.uk>
Cc: Andy Lutomirski <luto@kernel.org>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Lee, Chun-Yi <jlee@suse.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Michael S. Tsirkin <mst@redhat.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Ravi Shankar <ravi.v.shankar@intel.com>
Cc: Ricardo Neri <ricardo.neri@intel.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Tony Luck <tony.luck@intel.com>
Cc: linux-efi@vger.kernel.org
Link: http://lkml.kernel.org/r/20180312084500.10764-2-ard.biesheuvel@linaro.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2018-03-12 10:05:01 +01:00
Ingo Molnar b0599e2801 Merge branch 'x86/mm' into efi/core
This commit in x86/mm changed EFI code:

   116fef640859: x86/mm/dump_pagetables: Add the EFI pagetable to the debugfs 'page_tables' directory

So merge in that commit plus its dependencies, before continuing with
EFI work.

Signed-off-by: Ingo Molnar <mingo@kernel.org>
2018-03-12 10:03:09 +01:00
Baoquan He c100a58360 kdump, vmcoreinfo: Export pgtable_l5_enabled value
User-space utilities examining crash-kernels need to know if the
crashed kernel was in 5-level paging mode or not.

So write 'pgtable_l5_enabled' to vmcoreinfo, which covers these
three cases:

  pgtable_l5_enabled == 0 when:
   - Compiled with !CONFIG_X86_5LEVEL
   - Compiled with CONFIG_X86_5LEVEL=y while CPU has no 'la57' flag

  pgtable_l5_enabled != 0 when:
   - Compiled with CONFIG_X86_5LEVEL=y and CPU has 'la57' flag
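
The three cases above reduce to a single condition, sketched here. The function and parameter names are hypothetical, and returning 1 for the non-zero case is an assumption; the message only states that the value is non-zero.

```c
#include <stdbool.h>

/*
 * Value exported for a given build and CPU:
 *   - 0 if the kernel was built without 5-level paging support,
 *   - 0 if support is built in but the CPU lacks the 'la57' flag,
 *   - non-zero only when both are present.
 */
static int pgtable_l5_enabled_value(bool config_x86_5level, bool cpu_has_la57)
{
	return (config_x86_5level && cpu_has_la57) ? 1 : 0;
}
```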

Signed-off-by: Baoquan He <bhe@redhat.com>
Acked-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: douly.fnst@cn.fujitsu.com
Cc: dyoung@redhat.com
Cc: ebiederm@xmission.com
Cc: kirill.shutemov@linux.intel.com
Cc: vgoyal@redhat.com
Link: http://lkml.kernel.org/r/20180302051801.19594-1-bhe@redhat.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2018-03-12 09:43:56 +01:00
Kirill A. Shutemov e9d0e6330e x86/boot/compressed/64: Prepare new top-level page table for trampoline
If trampoline code would need to switch between 4- and 5-level paging
modes, we have to use a page table in trampoline memory.

Having it in trampoline memory guarantees that it's below 4G and we can
point CR3 to it from 32-bit trampoline code.

We only use the page table if the desired paging mode doesn't match the
mode we are in. Otherwise the page table is unused and trampoline code
wouldn't touch CR3.

For 4- to 5-level paging transition, we set up current (4-level paging)
CR3 as the first and the only entry in a new top-level page table.

For 5- to 4-level paging transition, copy the page table pointed to by the
first entry in the current top-level page table as our new top-level page
table.

If the page table is used by the trampoline, we need to copy it to a new
page table outside the trampoline and update CR3 before restoring
trampoline memory.
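
The two transitions can be sketched as a userspace simulation in which page tables are plain arrays and pointers stand in for physical addresses. The helper names, EXAMPLE_TABLE_FLAGS and ADDR_MASK are assumptions for illustration; real entries carry more flag bits than this mask accounts for.

```c
#include <stdint.h>
#include <string.h>

#define ENTRIES_PER_TABLE   512
#define TABLE_SIZE          (ENTRIES_PER_TABLE * sizeof(uint64_t))
#define EXAMPLE_TABLE_FLAGS 0x3ULL      /* present + writable, illustrative */
#define ADDR_MASK           (~0xfffULL) /* strips the low flag bits */

/*
 * 4- to 5-level: the current top-level table (from CR3) becomes the
 * first and only entry of the new top-level table.
 */
static void prepare_4_to_5(uint64_t *new_top, uint64_t current_cr3)
{
	memset(new_top, 0, TABLE_SIZE);
	new_top[0] = (current_cr3 & ADDR_MASK) | EXAMPLE_TABLE_FLAGS;
}

/*
 * 5- to 4-level: the table referenced by the first entry of the current
 * top-level table is copied and used as the new top-level table.
 * Tables must be 4096-byte aligned for the mask to recover the pointer.
 */
static void prepare_5_to_4(uint64_t *new_top, const uint64_t *current_top)
{
	const uint64_t *next =
		(const uint64_t *)(uintptr_t)(current_top[0] & ADDR_MASK);

	memcpy(new_top, next, TABLE_SIZE);
}
```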

Tested-by: Borislav Petkov <bp@suse.de>
Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Cc: Andy Lutomirski <luto@amacapital.net>
Cc: Andy Shevchenko <andy.shevchenko@gmail.com>
Cc: Cyrill Gorcunov <gorcunov@openvz.org>
Cc: Eric Biederman <ebiederm@xmission.com>
Cc: H. Peter Anvin <hpa@zytor.com>
Cc: Juergen Gross <jgross@suse.com>
Cc: Kees Cook <keescook@chromium.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Matthew Wilcox <willy@infradead.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: linux-mm@kvack.org
Link: http://lkml.kernel.org/r/20180226180451.86788-6-kirill.shutemov@linux.intel.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2018-03-12 09:37:26 +01:00
Kirill A. Shutemov 32fcefa2bf x86/boot/compressed/64: Set up trampoline memory
This patch clears the trampoline memory and copies the trampoline code
into place. It's not yet used though.

Tested-by: Borislav Petkov <bp@suse.de>
Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Cc: Andy Lutomirski <luto@amacapital.net>
Cc: Andy Shevchenko <andy.shevchenko@gmail.com>
Cc: Cyrill Gorcunov <gorcunov@openvz.org>
Cc: Eric Biederman <ebiederm@xmission.com>
Cc: H. Peter Anvin <hpa@zytor.com>
Cc: Juergen Gross <jgross@suse.com>
Cc: Kees Cook <keescook@chromium.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Matthew Wilcox <willy@infradead.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: linux-mm@kvack.org
Link: http://lkml.kernel.org/r/20180226180451.86788-5-kirill.shutemov@linux.intel.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2018-03-12 09:37:25 +01:00
Kirill A. Shutemov fb5268354d x86/boot/compressed/64: Save and restore trampoline memory
The memory area we found for the trampoline shouldn't contain anything
useful, but let's preserve the data anyway, just to be on the safe side.

paging_prepare() saves the data into a buffer.

cleanup_trampoline() restores it once we are done with the trampoline.
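
The save/restore pairing can be sketched as below. This is only an
illustration of the idea: save_trampoline_area(),
restore_trampoline_area(), trampoline_save and the TRAMPOLINE_SIZE value
are assumptions made for the example, not the names or sizes used in the
patch.

```c
#include <assert.h>
#include <string.h>

#define TRAMPOLINE_SIZE 8192   /* illustrative size, an assumption */

/* Backup buffer for whatever occupied the trampoline area. */
static unsigned char trampoline_save[TRAMPOLINE_SIZE];

/* Stash the current contents before the trampoline overwrites them. */
static void save_trampoline_area(const void *trampoline)
{
    assert(trampoline);
    memcpy(trampoline_save, trampoline, TRAMPOLINE_SIZE);
}

/* Put the original bytes back once the trampoline has done its job. */
static void restore_trampoline_area(void *trampoline)
{
    assert(trampoline);
    memcpy(trampoline, trampoline_save, TRAMPOLINE_SIZE);
}
```

The save step would run before the trampoline code is copied in, and the
restore step only after nothing in that area is still in use.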

Tested-by: Borislav Petkov <bp@suse.de>
Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Cc: Andy Lutomirski <luto@amacapital.net>
Cc: Andy Shevchenko <andy.shevchenko@gmail.com>
Cc: Cyrill Gorcunov <gorcunov@openvz.org>
Cc: Eric Biederman <ebiederm@xmission.com>
Cc: H. Peter Anvin <hpa@zytor.com>
Cc: Juergen Gross <jgross@suse.com>
Cc: Kees Cook <keescook@chromium.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Matthew Wilcox <willy@infradead.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: linux-mm@kvack.org
Link: http://lkml.kernel.org/r/20180226180451.86788-4-kirill.shutemov@linux.intel.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2018-03-12 09:37:25 +01:00
Kirill A. Shutemov 3548e131ec x86/boot/compressed/64: Find a place for 32-bit trampoline
If a bootloader enables 64-bit mode with 4-level paging, we might need to
switch over to 5-level paging. Switching requires disabling paging,
which works fine if the kernel itself is loaded below 4G.

But if the bootloader puts the kernel above 4G (not sure if anybody does
this), we would lose control as soon as paging is disabled, because the
code becomes unreachable to the CPU.

To handle the situation, we need a trampoline in lower memory that would
take care of switching on 5-level paging.

This patch finds a spot in low memory for a trampoline.

The heuristic is based on code in reserve_bios_regions().

We find the end of low memory based on BIOS and EBDA start addresses.
The trampoline is put just before the end of low memory. This mimics the
approach taken to allocate memory for the real-mode trampoline.
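
A rough sketch of such a heuristic, assuming the conventional BIOS Data
Area fields (the EBDA segment at 0x40e and the base memory size in KiB at
0x413) are passed in as plain values. The function name and the
LOWMEM_MIN, LOWMEM_MAX and TRAMPOLINE_SIZE constants are assumptions
chosen for illustration and may not match the values in the patch.

```c
#include <assert.h>
#include <stdint.h>

#define LOWMEM_MIN      0x20000u  /* assumed lower sanity bound (128 KiB) */
#define LOWMEM_MAX      0x9f000u  /* assumed upper bound (636 KiB) */
#define TRAMPOLINE_SIZE 0x2000u   /* assumed trampoline size */
#define PAGE_SIZE_4K    0x1000u

/*
 * Hypothetical helper: pick a page-aligned start address for the
 * trampoline just below the end of usable low memory.
 */
static uint32_t find_trampoline_start(uint16_t ebda_segment,
                                      uint16_t base_mem_kib)
{
    uint32_t ebda_start = (uint32_t)ebda_segment << 4;   /* segment -> bytes */
    uint32_t lowmem_end = (uint32_t)base_mem_kib << 10;  /* KiB -> bytes */

    /* Distrust implausible BIOS values and fall back to the upper bound. */
    if (lowmem_end < LOWMEM_MIN || lowmem_end > LOWMEM_MAX)
        lowmem_end = LOWMEM_MAX;

    /* A plausible EBDA below that limit ends usable low memory earlier. */
    if (ebda_start > LOWMEM_MIN && ebda_start < lowmem_end)
        lowmem_end = ebda_start;

    /* Step back by the trampoline size and round down to a page boundary. */
    uint32_t start = (lowmem_end - TRAMPOLINE_SIZE) & ~(PAGE_SIZE_4K - 1);
    assert(start + TRAMPOLINE_SIZE <= lowmem_end);
    return start;
}
```

With these assumed constants, an EBDA segment of 0x9000 and 640 KiB of
base memory give a trampoline start of 0x8e000.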

Tested-by: Borislav Petkov <bp@suse.de>
Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Cc: Andy Lutomirski <luto@amacapital.net>
Cc: Andy Shevchenko <andy.shevchenko@gmail.com>
Cc: Cyrill Gorcunov <gorcunov@openvz.org>
Cc: Eric Biederman <ebiederm@xmission.com>
Cc: H. Peter Anvin <hpa@zytor.com>
Cc: Juergen Gross <jgross@suse.com>
Cc: Kees Cook <keescook@chromium.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Matthew Wilcox <willy@infradead.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: linux-mm@kvack.org
Link: http://lkml.kernel.org/r/20180226180451.86788-3-kirill.shutemov@linux.intel.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2018-03-12 09:37:23 +01:00
Kirill A. Shutemov a403d79818 x86/boot/compressed/64: Describe the logic behind the LA57 check
The patch explains the LA57 check in more detail.
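
For orientation, the condition behind an LA57 check can be sketched as a
pure function over CPUID results. LA57 support is reported in
CPUID.(EAX=7,ECX=0):ECX bit 16, and leaf 7 is only meaningful if the
maximum basic leaf is at least 7. The function name cpu_has_la57() and
its parameters are hypothetical; this is not the assembly the patch
documents.

```c
#include <stdint.h>

#define LA57_ECX_BIT 16   /* CPUID.(EAX=7,ECX=0):ECX bit 16 */

/*
 * Hypothetical helper: max_basic_leaf is EAX from CPUID leaf 0,
 * leaf7_ecx is ECX from CPUID leaf 7, subleaf 0.
 */
static int cpu_has_la57(uint32_t max_basic_leaf, uint32_t leaf7_ecx)
{
    /* Leaf 7 does not exist on this CPU, so its output cannot be trusted. */
    if (max_basic_leaf < 7)
        return 0;
    return (int)((leaf7_ecx >> LA57_ECX_BIT) & 1u);
}
```

Hardware support alone does not mean 5-level paging is active; that
depends on whether the corresponding control bit has been enabled.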

Tested-by: Borislav Petkov <bp@suse.de>
Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Cc: Andy Lutomirski <luto@amacapital.net>
Cc: Andy Shevchenko <andy.shevchenko@gmail.com>
Cc: Cyrill Gorcunov <gorcunov@openvz.org>
Cc: Eric Biederman <ebiederm@xmission.com>
Cc: H. Peter Anvin <hpa@zytor.com>
Cc: Juergen Gross <jgross@suse.com>
Cc: Kees Cook <keescook@chromium.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Matthew Wilcox <willy@infradead.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: linux-mm@kvack.org
Link: http://lkml.kernel.org/r/20180226180451.86788-2-kirill.shutemov@linux.intel.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2018-03-12 09:29:24 +01:00