linux/arch/x86/kvm
Andrea Arcangeli 127393fbe5 mm: thp: kvm: fix memory corruption in KVM with THP enabled
After the THP refcounting change, obtaining a compound pages from
get_user_pages() no longer allows us to assume the entire compound page
is immediately mappable from a secondary MMU.

A secondary MMU doesn't want to call get_user_pages() more than once for
each compound page, in order to know if it can map the whole compound
page.  So a secondary MMU needs to know from a single get_user_pages()
invocation when it can map immediately the entire compound page to avoid
a flood of unnecessary secondary MMU faults and spurious
atomic_inc()/atomic_dec() (pages don't have to be pinned by MMU notifier
users).

Ideally instead of the page->_mapcount < 1 check, get_user_pages()
should return the granularity of the "page" mapping in the "mm" passed
to get_user_pages().  However it's non trivial change to pass the "pmd"
status belonging to the "mm" walked by get_user_pages up the stack (up
to the caller of get_user_pages).  So the fix just checks if there is
not a single pte mapping on the page returned by get_user_pages, and in
turn if the caller can assume that the whole compound page is mapped in
the current "mm" (in a pmd_trans_huge()).  In such case the entire
compound page is safe to map into the secondary MMU without additional
get_user_pages() calls on the surrounding tail/head pages.  In addition
of being faster, not having to run other get_user_pages() calls also
reduces the memory footprint of the secondary MMU fault in case the pmd
split happened as result of memory pressure.

Without this fix after a MADV_DONTNEED (like invoked by QEMU during
postcopy live migration or balloning) or after generic swapping (with a
failure in split_huge_page() that would only result in pmd splitting and
not a physical page split), KVM would map the whole compound page into
the shadow pagetables, despite regular faults or userfaults (like
UFFDIO_COPY) may map regular pages into the primary MMU as result of the
pte faults, leading to the guest mode and userland mode going out of
sync and not working on the same memory at all times.

Any other secondary MMU notifier manager (KVM is just one of the many
MMU notifier users) will need the same information if it doesn't want to
run a flood of get_user_pages_fast and it can support multiple
granularity in the secondary MMU mappings, so I think it is justified to
be exposed not just to KVM.

The other option would be to move transparent_hugepage_adjust to
mm/huge_memory.c but that currently has all kind of KVM data structures
in it, so it's definitely not a cut-and-paste work, so I couldn't do a
fix as cleaner as this one for 4.6.

Signed-off-by: Andrea Arcangeli <aarcange@redhat.com>
Cc: "Dr. David Alan Gilbert" <dgilbert@redhat.com>
Cc: "Kirill A. Shutemov" <kirill@shutemov.name>
Cc: "Li, Liang Z" <liang.z.li@intel.com>
Cc: Amit Shah <amit.shah@redhat.com>
Cc: Paolo Bonzini <pbonzini@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2016-05-05 17:38:53 -07:00
..
Kconfig KVM: x86: select IRQ_BYPASS_MANAGER 2015-10-01 15:06:52 +02:00
Makefile KVM: page track: add the framework of guest page tracking 2016-03-03 14:36:20 +01:00
assigned-dev.c KVM: x86: use list_for_each_entry* 2016-02-23 15:40:54 +01:00
assigned-dev.h KVM: x86: move device assignment out of kvm_host.h 2014-11-24 16:53:50 +01:00
cpuid.c KVM: x86: mask CPUID(0xD,0x1).EAX against host value 2016-04-10 21:53:50 +02:00
cpuid.h KVM, pkeys: expose CPUID/CR4 to guest 2016-03-22 16:38:17 +01:00
emulate.c Merge branch 'core-objtool-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip 2016-03-20 18:23:21 -07:00
hyperv.c KVM: Hyper-V: do not do hypercall userspace exits if SynIC is disabled 2016-04-01 12:10:09 +02:00
hyperv.h kvm/x86: Hyper-V SynIC timers 2015-12-16 18:49:45 +01:00
i8254.c KVM: i8254: drop local copy of mul_u64_u32_div 2016-03-04 22:39:17 +01:00
i8254.h KVM: i8254: turn kvm_kpit_state.reinject into atomic_t 2016-03-04 09:30:25 +01:00
i8259.c KVM: x86: clean/fix memory barriers in irqchip_in_kernel 2015-07-30 16:02:56 +02:00
ioapic.c kvm: x86: Check dest_map->vector to match eoi signals for rtc 2016-03-03 14:36:18 +01:00
ioapic.h kvm: x86: Track irq vectors in ioapic->rtc_status.dest_map 2016-03-03 14:36:18 +01:00
iommu.c kvm: rename pfn_t to kvm_pfn_t 2016-01-15 17:56:32 -08:00
irq.c KVM: x86: consolidate "has lapic" checks into irq.c 2016-02-09 16:57:39 +01:00
irq.h KVM: x86: consolidate different ways to test for in-kernel LAPIC 2016-02-09 16:57:45 +01:00
irq_comm.c kvm: x86: Convert ioapic->rtc_status.dest_map to a struct 2016-03-03 14:36:17 +01:00
kvm_cache_regs.h KVM, pkeys: add pkeys support for permission_fault 2016-03-22 16:23:37 +01:00
lapic.c kvm: x86: make lapic hrtimer pinned 2016-04-05 14:19:08 +02:00
lapic.h kvm: x86: Convert ioapic->rtc_status.dest_map to a struct 2016-03-03 14:36:17 +01:00
mmu.c mm: thp: kvm: fix memory corruption in KVM with THP enabled 2016-05-05 17:38:53 -07:00
mmu.h KVM: MMU: fix permission_fault() 2016-04-10 21:53:49 +02:00
mmu_audit.c kvm: rename pfn_t to kvm_pfn_t 2016-01-15 17:56:32 -08:00
mmutrace.h tracing: Rename ftrace_event.h to trace_events.h 2015-05-13 14:05:12 -04:00
mtrr.c KVM: MTRR: treat memory as writeback if MTRR is disabled in guest CPUID 2015-12-22 15:29:00 +01:00
page_track.c KVM: page_track: fix access to NULL slot 2016-03-22 17:27:28 +01:00
paging_tmpl.h KVM: MMU: fix permission_fault() 2016-04-10 21:53:49 +02:00
pmu.c KVM: x86: consolidate different ways to test for in-kernel LAPIC 2016-02-09 16:57:45 +01:00
pmu.h KVM: x86/vPMU: Define kvm_pmu_ops to support vPMU function dispatch 2015-06-23 14:12:14 +02:00
pmu_amd.c KVM: x86/vPMU: Fix unnecessary signed extension for AMD PERFCTRn 2015-08-11 15:19:41 +02:00
pmu_intel.c KVM: x86/vPMU: Define kvm_pmu_ops to support vPMU function dispatch 2015-06-23 14:12:14 +02:00
svm.c KVM, pkeys: add pkeys support for permission_fault 2016-03-22 16:23:37 +01:00
trace.h KVM/VMX: Add host irq information in trace event when updating IRTE for posted interrupts 2016-02-09 13:24:43 +01:00
tss.h KVM: x86: hardware task switching support 2008-04-27 12:00:39 +03:00
vmx.c perf/x86/intel/pt: Don't die on VMXON 2016-04-28 10:32:42 +02:00
x86.c kvm: x86: do not leak guest xcr0 into host interrupt handlers 2016-04-10 21:53:49 +02:00
x86.h KVM, pkeys: add pkeys support for xsave state 2016-03-22 16:21:05 +01:00