This merges in a couple of fixes from the kvm-ppc-fixes branch that
modify the same areas of code as some commits from the kvm-ppc-next
branch, in order to resolve the conflicts.
Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
Commit 5e9859699a ("KVM: PPC: Book3S HV: Outline of KVM-HV HPT resizing
implementation", 2016-12-20) added code that tries to exclude any use
or update of the hashed page table (HPT) while the HPT resizing code
is iterating through all the entries in the HPT. It does this by
taking the kvm->lock mutex, clearing the kvm->arch.hpte_setup_done
flag and then sending an IPI to all CPUs in the host. The idea is
that any VCPU task that tries to enter the guest will see that the
hpte_setup_done flag is clear and therefore call kvmppc_hv_setup_htab_rma,
which also takes the kvm->lock mutex and will therefore block until
we release kvm->lock.
However, any VCPU that is already in the guest, or is handling a
hypervisor page fault or hypercall, can re-enter the guest without
rechecking the hpte_setup_done flag. The IPI will cause a guest exit
of any VCPUs that are currently in the guest, but does not prevent
those VCPU tasks from immediately re-entering the guest.
The result is that after resize_hpt_rehash_hpte() has made a HPTE
absent, a hypervisor page fault can occur and make that HPTE present
again. This includes updating the rmap array for the guest real page,
meaning that we now have a pointer in the rmap array which connects
with pointers in the old rev array but not the new rev array. In
fact, if the HPT is being reduced in size, the pointer in the rmap
array could point outside the bounds of the new rev array. If that
happens, we can get a host crash later on such as this one:
[91652.628516] Unable to handle kernel paging request for data at address 0xd0000000157fb10c
[91652.628668] Faulting instruction address: 0xc0000000000e2640
[91652.628736] Oops: Kernel access of bad area, sig: 11 [#1]
[91652.628789] LE SMP NR_CPUS=1024 NUMA PowerNV
[91652.628847] Modules linked in: binfmt_misc vhost_net vhost tap xt_CHECKSUM ipt_MASQUERADE nf_nat_masquerade_ipv4 ip6t_rpfilter ip6t_REJECT nf_reject_ipv6 nf_conntrack_ipv6 nf_defrag_ipv6 xt_conntrack ip_set nfnetlink ebtable_nat ebtable_broute bridge stp llc ip6table_mangle ip6table_security ip6table_raw iptable_nat nf_conntrack_ipv4 nf_defrag_ipv4 nf_nat_ipv4 nf_nat nf_conntrack libcrc32c iptable_mangle iptable_security iptable_raw ebtable_filter ebtables ip6table_filter ip6_tables ses enclosure scsi_transport_sas i2c_opal ipmi_powernv ipmi_devintf i2c_core ipmi_msghandler powernv_op_panel nfsd auth_rpcgss oid_registry nfs_acl lockd grace sunrpc kvm_hv kvm_pr kvm scsi_dh_alua dm_service_time dm_multipath tg3 ptp pps_core [last unloaded: stap_552b612747aec2da355051e464fa72a1_14259]
[91652.629566] CPU: 136 PID: 41315 Comm: CPU 21/KVM Tainted: G O 4.14.0-1.rc4.dev.gitb27fc5c.el7.centos.ppc64le #1
[91652.629684] task: c0000007a419e400 task.stack: c0000000028d8000
[91652.629750] NIP: c0000000000e2640 LR: d00000000c36e498 CTR: c0000000000e25f0
[91652.629829] REGS: c0000000028db5d0 TRAP: 0300 Tainted: G O (4.14.0-1.rc4.dev.gitb27fc5c.el7.centos.ppc64le)
[91652.629932] MSR: 900000010280b033 <SF,HV,VEC,VSX,EE,FP,ME,IR,DR,RI,LE,TM[E]> CR: 44022422 XER: 00000000
[91652.630034] CFAR: d00000000c373f84 DAR: d0000000157fb10c DSISR: 40000000 SOFTE: 1
[91652.630034] GPR00: d00000000c36e498 c0000000028db850 c000000001403900 c0000007b7960000
[91652.630034] GPR04: d0000000117fb100 d000000007ab00d8 000000000033bb10 0000000000000000
[91652.630034] GPR08: fffffffffffffe7f 801001810073bb10 d00000000e440000 d00000000c373f70
[91652.630034] GPR12: c0000000000e25f0 c00000000fdb9400 f000000003b24680 0000000000000000
[91652.630034] GPR16: 00000000000004fb 00007ff7081a0000 00000000000ec91a 000000000033bb10
[91652.630034] GPR20: 0000000000010000 00000000001b1190 0000000000000001 0000000000010000
[91652.630034] GPR24: c0000007b7ab8038 d0000000117fb100 0000000ec91a1190 c000001e6a000000
[91652.630034] GPR28: 00000000033bb100 000000000073bb10 c0000007b7960000 d0000000157fb100
[91652.630735] NIP [c0000000000e2640] kvmppc_add_revmap_chain+0x50/0x120
[91652.630806] LR [d00000000c36e498] kvmppc_book3s_hv_page_fault+0xbb8/0xc40 [kvm_hv]
[91652.630884] Call Trace:
[91652.630913] [c0000000028db850] [c0000000028db8b0] 0xc0000000028db8b0 (unreliable)
[91652.630996] [c0000000028db8b0] [d00000000c36e498] kvmppc_book3s_hv_page_fault+0xbb8/0xc40 [kvm_hv]
[91652.631091] [c0000000028db9e0] [d00000000c36a078] kvmppc_vcpu_run_hv+0xdf8/0x1300 [kvm_hv]
[91652.631179] [c0000000028dbb30] [d00000000c2248c4] kvmppc_vcpu_run+0x34/0x50 [kvm]
[91652.631266] [c0000000028dbb50] [d00000000c220d54] kvm_arch_vcpu_ioctl_run+0x114/0x2a0 [kvm]
[91652.631351] [c0000000028dbbd0] [d00000000c2139d8] kvm_vcpu_ioctl+0x598/0x7a0 [kvm]
[91652.631433] [c0000000028dbd40] [c0000000003832e0] do_vfs_ioctl+0xd0/0x8c0
[91652.631501] [c0000000028dbde0] [c000000000383ba4] SyS_ioctl+0xd4/0x130
[91652.631569] [c0000000028dbe30] [c00000000000b8e0] system_call+0x58/0x6c
[91652.631635] Instruction dump:
[91652.631676] fba1ffe8 fbc1fff0 fbe1fff8 f8010010 f821ffa1 2fa70000 793d0020 e9432110
[91652.631814] 7bbf26e4 7c7e1b78 7feafa14 409e0094 <807f000c> 786326e4 7c6a1a14 93a40008
[91652.631959] ---[ end trace ac85ba6db72e5b2e ]---
To fix this, we tighten up the way that the hpte_setup_done flag is
checked to ensure that it does provide the guarantee that the resizing
code needs. In kvmppc_run_core(), we check the hpte_setup_done flag
after disabling interrupts and refuse to enter the guest if it is
clear (for a HPT guest). The code that checks hpte_setup_done and
calls kvmppc_hv_setup_htab_rma() is moved from kvmppc_vcpu_run_hv()
to a point inside the main loop in kvmppc_run_vcpu(), ensuring that
we don't just spin endlessly calling kvmppc_run_core() while
hpte_setup_done is clear, but instead have a chance to block on the
kvm->lock mutex.
Finally we also check hpte_setup_done inside the region in
kvmppc_book3s_hv_page_fault() where the HPTE is locked and we are about
to update the HPTE, and bail out if it is clear. If another CPU is
inside kvm_vm_ioctl_resize_hpt_commit) and has cleared hpte_setup_done,
then we know that either we are looking at a HPTE
that resize_hpt_rehash_hpte() has not yet processed, which is OK,
or else we will see hpte_setup_done clear and refuse to update it,
because of the full barrier formed by the unlock of the HPTE in
resize_hpt_rehash_hpte() combined with the locking of the HPTE
in kvmppc_book3s_hv_page_fault().
Fixes: 5e9859699a ("KVM: PPC: Book3S HV: Outline of KVM-HV HPT resizing implementation")
Cc: stable@vger.kernel.org # v4.10+
Reported-by: Satheesh Rajendran <satheera@in.ibm.com>
Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
This sets up the machinery for switching a guest between HPT (hashed
page table) and radix MMU modes, so that in future we can run a HPT
guest on a radix host on POWER9 machines.
* The KVM_PPC_CONFIGURE_V3_MMU ioctl can now specify either HPT or
radix mode, on a radix host.
* The KVM_CAP_PPC_MMU_HASH_V3 capability now returns 1 on POWER9
with HV KVM on a radix host.
* The KVM_PPC_GET_SMMU_INFO returns information about the HPT MMU on a
radix host.
* The KVM_PPC_ALLOCATE_HTAB ioctl on a radix host will switch the
guest to HPT mode and allocate a HPT.
* For simplicity, we now allocate the rmap array for each memslot,
even on a radix host, since it will be needed if the guest switches
to HPT mode.
* Since we cannot yet run a HPT guest on a radix host, the KVM_RUN
ioctl will return an EINVAL error in that case.
Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
Currently, the HPT code in HV KVM maintains a dirty bit per guest page
in the rmap array, whether or not dirty page tracking has been enabled
for the memory slot. In contrast, the radix code maintains a dirty
bit per guest page in memslot->dirty_bitmap, and only does so when
dirty page tracking has been enabled.
This changes the HPT code to maintain the dirty bits in the memslot
dirty_bitmap like radix does. This results in slightly less code
overall, and will mean that we do not lose the dirty bits when
transitioning between HPT and radix mode in future.
There is one minor change to behaviour as a result. With HPT, when
dirty tracking was enabled for a memslot, we would previously clear
all the dirty bits at that point (both in the HPT entries and in the
rmap arrays), meaning that a KVM_GET_DIRTY_LOG ioctl immediately
following would show no pages as dirty (assuming no vcpus have run
in the meantime). With this change, the dirty bits on HPT entries
are not cleared at the point where dirty tracking is enabled, so
KVM_GET_DIRTY_LOG would show as dirty any guest pages that are
resident in the HPT and dirty. This is consistent with what happens
on radix.
This also fixes a bug in the mark_pages_dirty() function for radix
(in the sense that the function no longer exists). In the case where
a large page of 64 normal pages or more is marked dirty, the
addressing of the dirty bitmap was incorrect and could write past
the end of the bitmap. Fortunately this case was never hit in
practice because a 2MB large page is only 32 x 64kB pages, and we
don't support backing the guest with 1GB huge pages at this point.
Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
This renames the kvm->arch.hpte_setup_done field to mmu_ready because
we will want to use it for radix guests too -- both for setting things
up before vcpu execution, and for excluding vcpus from executing while
MMU-related things get changed, such as in future switching the MMU
from radix to HPT mode or vice-versa.
This also moves the call to kvmppc_setup_partition_table() that was
done in kvmppc_hv_setup_htab_rma() for HPT guests, and the setting
of mmu_ready, into the caller in kvmppc_vcpu_run_hv().
Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
This removes the dependence of KVM on the mmu_psize_defs array (which
stores information about hardware support for various page sizes) and
the things derived from it, chiefly hpte_page_sizes[], hpte_page_size(),
hpte_actual_page_size() and get_sllp_encoding(). We also no longer
rely on the mmu_slb_size variable or the MMU_FTR_1T_SEGMENTS feature
bit.
The reason for doing this is so we can support a HPT guest on a radix
host. In a radix host, the mmu_psize_defs array contains information
about page sizes supported by the MMU in radix mode rather than the
page sizes supported by the MMU in HPT mode. Similarly, mmu_slb_size
and the MMU_FTR_1T_SEGMENTS bit are not set.
Instead we hard-code knowledge of the behaviour of the HPT MMU in the
POWER7, POWER8 and POWER9 processors (which are the only processors
supported by HV KVM) - specifically the encoding of the LP fields in
the HPT and SLB entries, and the fact that they have 32 SLB entries
and support 1TB segments.
Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
This adds code to make sure that we don't try to access the
non-existent HPT for a radix guest using the htab file for the VM
in debugfs, a file descriptor obtained using the KVM_PPC_GET_HTAB_FD
ioctl, or via the KVM_PPC_RESIZE_HPT_{PREPARE,COMMIT} ioctls.
At present nothing bad happens if userspace does access these
interfaces on a radix guest, mostly because kvmppc_hpt_npte()
gives 0 for a radix guest, which in turn is because 1 << -4
comes out as 0 on POWER processors. However, that relies on
undefined behaviour, so it is better to be explicit about not
accessing the HPT for a radix guest.
Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
Omit an extra message for a memory allocation failure in this function.
This issue was detected by using the Coccinelle software.
Signed-off-by: Markus Elfring <elfring@users.sourceforge.net>
Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
We do ctx = kzalloc(sizeof(*ctx), GFP_KERNEL) and then later on call
anon_inode_getfd(), but if that fails we don't free ctx, so that
memory gets leaked. To fix it, this adds kfree(ctx) in the failure
path.
Signed-off-by: nixiaoming <nixiaoming@huawei.com>
Reviewed-by: Paolo Bonzini <pbonzini@redhat.com>
Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
This merges in the 'ppc-kvm' topic branch from the powerpc tree in
order to bring in some fixes which touch both powerpc and KVM code.
Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
Add newer helpers to make the function usage simpler. It is always
recommended to use find_current_mm_pte() for walking the page table.
If we cannot use find_current_mm_pte(), it should be documented why
the said usage of __find_linux_pte() is safe against a parallel THP
split.
For now we have KVM code using __find_linux_pte(). This is because kvm
code ends up calling __find_linux_pte() in real mode with MSR_EE=0 but
with PACA soft_enabled = 1. We may want to fix that later and make
sure we keep the MSR_EE and PACA soft_enabled in sync. When we do that
we can switch kvm to use find_linux_pte().
Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Commit f98a8bf9ee ("KVM: PPC: Book3S HV: Allow KVM_PPC_ALLOCATE_HTAB
ioctl() to change HPT size", 2016-12-20) changed the behaviour of
the KVM_PPC_ALLOCATE_HTAB ioctl so that it now allocates a new HPT
and new revmap array if there was a previously-allocated HPT of a
different size from the size being requested. In this case, we need
to reset the rmap arrays of the memslots, because the rmap arrays
will contain references to HPTEs which are no longer valid. Worse,
these references are also references to slots in the new revmap
array (which parallels the HPT), and the new revmap array contains
random contents, since it doesn't get zeroed on allocation.
The effect of having these stale references to slots in the revmap
array that contain random contents is that subsequent calls to
functions such as kvmppc_add_revmap_chain will crash because they
will interpret the non-zero contents of the revmap array as HPTE
indexes and thus index outside of the revmap array. This leads to
host crashes such as the following.
[ 7072.862122] Unable to handle kernel paging request for data at address 0xd000000c250c00f8
[ 7072.862218] Faulting instruction address: 0xc0000000000e1c78
[ 7072.862233] Oops: Kernel access of bad area, sig: 11 [#1]
[ 7072.862286] SMP NR_CPUS=1024
[ 7072.862286] NUMA
[ 7072.862325] PowerNV
[ 7072.862378] Modules linked in: kvm_hv vhost_net vhost tap xt_CHECKSUM ipt_MASQUERADE nf_nat_masquerade_ipv4 ip6t_rpfilter ip6t_REJECT nf_reject_ipv6 nf_conntrack_ipv6 nf_defrag_ipv6 xt_conntrack ip_set nfnetlink ebtable_nat ebtable_broute bridge stp llc ip6table_mangle ip6table_security ip6table_raw iptable_nat nf_conntrack_ipv4 nf_defrag_ipv4 nf_nat_ipv4 nf_nat nf_conntrack iptable_mangle iptable_security iptable_raw ebtable_filter ebtables ip6table_filter ip6_tables rpcrdma ib_isert iscsi_target_mod ib_iser libiscsi scsi_transport_iscsi ib_srpt target_core_mod ib_srp scsi_transport_srp ib_ipoib rdma_ucm ib_ucm ib_uverbs ib_umad rdma_cm ib_cm iw_cm iw_cxgb3 mlx5_ib ib_core ses enclosure scsi_transport_sas ipmi_powernv ipmi_devintf ipmi_msghandler powernv_op_panel i2c_opal nfsd auth_rpcgss oid_registry
[ 7072.863085] nfs_acl lockd grace sunrpc kvm_pr kvm xfs libcrc32c scsi_dh_alua dm_service_time radeon lpfc nvme_fc nvme_fabrics nvme_core scsi_transport_fc i2c_algo_bit tg3 drm_kms_helper ptp pps_core syscopyarea sysfillrect sysimgblt fb_sys_fops ttm drm dm_multipath i2c_core cxgb3 mlx5_core mdio [last unloaded: kvm_hv]
[ 7072.863381] CPU: 72 PID: 56929 Comm: qemu-system-ppc Not tainted 4.12.0-kvm+ #59
[ 7072.863457] task: c000000fe29e7600 task.stack: c000001e3ffec000
[ 7072.863520] NIP: c0000000000e1c78 LR: c0000000000e2e3c CTR: c0000000000e25f0
[ 7072.863596] REGS: c000001e3ffef560 TRAP: 0300 Not tainted (4.12.0-kvm+)
[ 7072.863658] MSR: 9000000100009033 <SF,HV,EE,ME,IR,DR,RI,LE,TM[E]>
[ 7072.863667] CR: 44082882 XER: 20000000
[ 7072.863767] CFAR: c0000000000e2e38 DAR: d000000c250c00f8 DSISR: 42000000 SOFTE: 1
GPR00: c0000000000e2e3c c000001e3ffef7e0 c000000001407d00 d000000c250c00f0
GPR04: d00000006509fb70 d00000000b3d2048 0000000003ffdfb7 0000000000000000
GPR08: 00000001007fdfb7 00000000c000000f d0000000250c0000 000000000070f7bf
GPR12: 0000000000000008 c00000000fdad000 0000000010879478 00000000105a0d78
GPR16: 00007ffaf4080000 0000000000001190 0000000000000000 0000000000010000
GPR20: 4001ffffff000415 d00000006509fb70 0000000004091190 0000000ee1881190
GPR24: 0000000003ffdfb7 0000000003ffdfb7 00000000007fdfb7 c000000f5c958000
GPR28: d00000002d09fb70 0000000003ffdfb7 d00000006509fb70 d00000000b3d2048
[ 7072.864439] NIP [c0000000000e1c78] kvmppc_add_revmap_chain+0x88/0x130
[ 7072.864503] LR [c0000000000e2e3c] kvmppc_do_h_enter+0x84c/0x9e0
[ 7072.864566] Call Trace:
[ 7072.864594] [c000001e3ffef7e0] [c000001e3ffef830] 0xc000001e3ffef830 (unreliable)
[ 7072.864671] [c000001e3ffef830] [c0000000000e2e3c] kvmppc_do_h_enter+0x84c/0x9e0
[ 7072.864751] [c000001e3ffef920] [d00000000b38d878] kvmppc_map_vrma+0x168/0x200 [kvm_hv]
[ 7072.864831] [c000001e3ffef9e0] [d00000000b38a684] kvmppc_vcpu_run_hv+0x1284/0x1300 [kvm_hv]
[ 7072.864914] [c000001e3ffefb30] [d00000000f465664] kvmppc_vcpu_run+0x44/0x60 [kvm]
[ 7072.865008] [c000001e3ffefb60] [d00000000f461864] kvm_arch_vcpu_ioctl_run+0x114/0x290 [kvm]
[ 7072.865152] [c000001e3ffefbe0] [d00000000f453c98] kvm_vcpu_ioctl+0x598/0x7a0 [kvm]
[ 7072.865292] [c000001e3ffefd40] [c000000000389328] do_vfs_ioctl+0xd8/0x8c0
[ 7072.865410] [c000001e3ffefde0] [c000000000389be4] SyS_ioctl+0xd4/0x130
[ 7072.865526] [c000001e3ffefe30] [c00000000000b760] system_call+0x58/0x6c
[ 7072.865644] Instruction dump:
[ 7072.865715] e95b2110 793a0020 7b4926e4 7f8a4a14 409e0098 807c000c 786326e4 7c6a1a14
[ 7072.865857] 935e0008 7bbd0020 813c000c 913e000c <93a30008> 93bc000c 48000038 60000000
[ 7072.866001] ---[ end trace 627b6e4bf8080edc ]---
Note that to trigger this, it is necessary to use a recent upstream
QEMU (or other userspace that resizes the HPT at CAS time), specify
a maximum memory size substantially larger than the current memory
size, and boot a guest kernel that does not support HPT resizing.
This fixes the problem by resetting the rmap arrays when the old HPT
is freed.
Fixes: f98a8bf9ee ("KVM: PPC: Book3S HV: Allow KVM_PPC_ALLOCATE_HTAB ioctl() to change HPT size")
Cc: stable@vger.kernel.org # v4.11+
Reviewed-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
__GFP_REPEAT was designed to allow retry-but-eventually-fail semantic to
the page allocator. This has been true but only for allocations
requests larger than PAGE_ALLOC_COSTLY_ORDER. It has been always
ignored for smaller sizes. This is a bit unfortunate because there is
no way to express the same semantic for those requests and they are
considered too important to fail so they might end up looping in the
page allocator for ever, similarly to GFP_NOFAIL requests.
Now that the whole tree has been cleaned up and accidental or misled
usage of __GFP_REPEAT flag has been removed for !costly requests we can
give the original flag a better name and more importantly a more useful
semantic. Let's rename it to __GFP_RETRY_MAYFAIL which tells the user
that the allocator would try really hard but there is no promise of a
success. This will work independent of the order and overrides the
default allocator behavior. Page allocator users have several levels of
guarantee vs. cost options (take GFP_KERNEL as an example)
- GFP_KERNEL & ~__GFP_RECLAIM - optimistic allocation without _any_
attempt to free memory at all. The most light weight mode which even
doesn't kick the background reclaim. Should be used carefully because
it might deplete the memory and the next user might hit the more
aggressive reclaim
- GFP_KERNEL & ~__GFP_DIRECT_RECLAIM (or GFP_NOWAIT)- optimistic
allocation without any attempt to free memory from the current
context but can wake kswapd to reclaim memory if the zone is below
the low watermark. Can be used from either atomic contexts or when
the request is a performance optimization and there is another
fallback for a slow path.
- (GFP_KERNEL|__GFP_HIGH) & ~__GFP_DIRECT_RECLAIM (aka GFP_ATOMIC) -
non sleeping allocation with an expensive fallback so it can access
some portion of memory reserves. Usually used from interrupt/bh
context with an expensive slow path fallback.
- GFP_KERNEL - both background and direct reclaim are allowed and the
_default_ page allocator behavior is used. That means that !costly
allocation requests are basically nofail but there is no guarantee of
that behavior so failures have to be checked properly by callers
(e.g. OOM killer victim is allowed to fail currently).
- GFP_KERNEL | __GFP_NORETRY - overrides the default allocator behavior
and all allocation requests fail early rather than cause disruptive
reclaim (one round of reclaim in this implementation). The OOM killer
is not invoked.
- GFP_KERNEL | __GFP_RETRY_MAYFAIL - overrides the default allocator
behavior and all allocation requests try really hard. The request
will fail if the reclaim cannot make any progress. The OOM killer
won't be triggered.
- GFP_KERNEL | __GFP_NOFAIL - overrides the default allocator behavior
and all allocation requests will loop endlessly until they succeed.
This might be really dangerous especially for larger orders.
Existing users of __GFP_REPEAT are changed to __GFP_RETRY_MAYFAIL
because they already had their semantic. No new users are added.
__alloc_pages_slowpath is changed to bail out for __GFP_RETRY_MAYFAIL if
there is no progress and we have already passed the OOM point.
This means that all the reclaim opportunities have been exhausted except
the most disruptive one (the OOM killer) and a user defined fallback
behavior is more sensible than keep retrying in the page allocator.
[akpm@linux-foundation.org: fix arch/sparc/kernel/mdesc.c]
[mhocko@suse.com: semantic fix]
Link: http://lkml.kernel.org/r/20170626123847.GM11534@dhcp22.suse.cz
[mhocko@kernel.org: address other thing spotted by Vlastimil]
Link: http://lkml.kernel.org/r/20170626124233.GN11534@dhcp22.suse.cz
Link: http://lkml.kernel.org/r/20170623085345.11304-3-mhocko@kernel.org
Signed-off-by: Michal Hocko <mhocko@suse.com>
Acked-by: Vlastimil Babka <vbabka@suse.cz>
Cc: Alex Belits <alex.belits@cavium.com>
Cc: Chris Wilson <chris@chris-wilson.co.uk>
Cc: Christoph Hellwig <hch@infradead.org>
Cc: Darrick J. Wong <darrick.wong@oracle.com>
Cc: David Daney <david.daney@cavium.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Mel Gorman <mgorman@suse.de>
Cc: NeilBrown <neilb@suse.com>
Cc: Ralf Baechle <ralf@linux-mips.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
kzalloc() won't actually fail because sizeof(*resize) is small, but
static checkers complain.
Signed-off-by: Dan Carpenter <dan.carpenter@oracle.com>
Acked-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
We use pte_write() to check whethwer the pte entry is writable. This is
mostly used to later mark the pte read only if it is writable. The other
use of pte_write() is to check whether the pte_entry is writable so that
hardware page table entry can be marked accordingly. This is used in kvm
where we look at qemu page table entry and update hardware hash page table
for the guest with correct write enable bit.
With the above, for the first usage we should also check the savedwrite
bit so that we can correctly clear the savedwite bit. For the later, we
add a new variant __pte_write().
With this we can revert write_protect_page part of 595cd8f256 ("mm/ksm:
handle protnone saved writes when making page write protect"). But I left
it as it is as an example code for savedwrite check.
Fixes: c137a2757b ("powerpc/mm/autonuma: switch ppc64 to its own implementation of saved write")
Link: http://lkml.kernel.org/r/1488203787-17849-2-git-send-email-aneesh.kumar@linux.vnet.ibm.com
Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
Cc: Rik van Riel <riel@surriel.com>
Cc: Mel Gorman <mgorman@techsingularity.net>
Cc: Paul Mackerras <paulus@ozlabs.org>
Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Cc: Michael Ellerman <mpe@ellerman.id.au>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
The new HPT resizing code added in commit b5baa68773 ("KVM: PPC:
Book3S HV: KVM-HV HPT resizing implementation", 2016-12-20) doesn't
have code to handle the new HPTE format which POWER9 uses. Thus it
would be best not to advertise it to userspace on POWER9 systems
until it works properly.
Also, since resize_hpt_rehash_hpte() contains BUG_ON() calls that
could be hit on POWER9, let's prevent it from being called on POWER9
for now.
Acked-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
The average user likely does not know what a "htab" or "LPID" is,
and it's annoying that these messages are quickly filling the dmesg
log when you're doing boot cycle tests, so let's turn it into a debug
message instead.
Signed-off-by: Thomas Huth <thuth@redhat.com>
Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
resize_hpt_release(), called once the HPT resize of a KVM guest is
completed (successfully or unsuccessfully) frees the state structure for
the resize. It is currently not safe to call with a NULL pointer.
However, one of the error paths in kvm_vm_ioctl_resize_hpt_commit() can
invoke it with a NULL pointer. This will occur if userspace improperly
invokes KVM_PPC_RESIZE_HPT_COMMIT without previously calling
KVM_PPC_RESIZE_HPT_PREPARE, or if it calls COMMIT twice without an
intervening PREPARE.
To fix this potential crash bug - and maybe others like it, make it safe
(and a no-op) to call resize_hpt_release() with a NULL resize pointer.
Found by Dan Carpenter with a static checker.
Reported-by: Dan Carpenter <dan.carpenter@oracle.com>
Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
This adds the "guts" of the implementation for the HPT resizing PAPR
extension. It has the code to allocate and clear a new HPT, rehash an
existing HPT's entries into it, and accomplish the switchover for a
KVM guest from the old HPT to the new one.
Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
This adds a not yet working outline of the HPT resizing PAPR
extension. Specifically it adds the necessary ioctl() functions,
their basic steps, the work function which will handle preparation for
the resize, and synchronization between these, the guest page fault
path and guest HPT update path.
The actual guts of the implementation isn't here yet, so for now the
calls will always fail.
Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
The kvm_unmap_rmapp() function, called from certain MMU notifiers, is used
to force all guest mappings of a particular host page to be set ABSENT, and
removed from the reverse mappings.
For HPT resizing, we will have some cases where we want to set just a
single guest HPTE ABSENT and remove its reverse mappings. To prepare with
this, we split out the logic from kvm_unmap_rmapp() to evict a single HPTE,
moving it to a new helper function.
Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
The KVM_PPC_ALLOCATE_HTAB ioctl() is used to set the size of hashed page
table (HPT) that userspace expects a guest VM to have, and is also used to
clear that HPT when necessary (e.g. guest reboot).
At present, once the ioctl() is called for the first time, the HPT size can
never be changed thereafter - it will be cleared but always sized as from
the first call.
With upcoming HPT resize implementation, we're going to need to allow
userspace to resize the HPT at reset (to change it back to the default size
if the guest changed it).
So, we need to allow this ioctl() to change the HPT size.
This patch also updates Documentation/virtual/kvm/api.txt to reflect
the new behaviour. In fact the documentation was already slightly
incorrect since 572abd5 "KVM: PPC: Book3S HV: Don't fall back to
smaller HPT size in allocation ioctl"
Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
Currently, kvmppc_alloc_hpt() both allocates a new hashed page table (HPT)
and sets it up as the active page table for a VM. For the upcoming HPT
resize implementation we're going to want to allocate HPTs separately from
activating them.
So, split the allocation itself out into kvmppc_allocate_hpt() and perform
the activation with a new kvmppc_set_hpt() function. Likewise we split
kvmppc_free_hpt(), which just frees the HPT, from kvmppc_release_hpt()
which unsets it as an active HPT, then frees it.
We also move the logic to fall back to smaller HPT sizes if the first try
fails into the single caller which used that behaviour,
kvmppc_hv_setup_htab_rma(). This introduces a slight semantic change, in
that previously if the initial attempt at CMA allocation failed, we would
fall back to attempting smaller sizes with the page allocator. Now, we
try first CMA, then the page allocator at each size. As far as I can tell
this change should be harmless.
To match, we make kvmppc_free_hpt() just free the actual HPT itself. The
call to kvmppc_free_lpid() that was there, we move to the single caller.
Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
Currently the kvm_hpt_info structure stores the hashed page table's order,
and also the number of HPTEs it contains and a mask for its size. The
last two can be easily derived from the order, so remove them and just
calculate them as necessary with a couple of helper inlines.
Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Reviewed-by: Thomas Huth <thuth@redhat.com>
Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
Currently, the powerpc kvm_arch structure contains a number of variables
tracking the state of the guest's hashed page table (HPT) in KVM HV. This
patch gathers them all together into a single kvm_hpt_info substructure.
This makes life more convenient for the upcoming HPT resizing
implementation.
Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
The difference between kvm_alloc_hpt() and kvmppc_alloc_hpt() is not at
all obvious from the name. In practice kvmppc_alloc_hpt() allocates an HPT
by whatever means, and calls kvm_alloc_hpt() which will attempt to allocate
it with CMA only.
To make this less confusing, rename kvm_alloc_hpt() to kvm_alloc_hpt_cma().
Similarly, kvm_release_hpt() is renamed kvm_free_hpt_cma().
Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Reviewed-by: Thomas Huth <thuth@redhat.com>
Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
This adds a few last pieces of the support for radix guests:
* Implement the backends for the KVM_PPC_CONFIGURE_V3_MMU and
KVM_PPC_GET_RMMU_INFO ioctls for radix guests
* On POWER9, allow secondary threads to be on/off-lined while guests
are running.
* Set up LPCR and the partition table entry for radix guests.
* Don't allocate the rmap array in the kvm_memory_slot structure
on radix.
* Don't try to initialize the HPT for radix guests, since they don't
have an HPT.
* Take out the code that prevents the HV KVM module from
initializing on radix hosts.
At this stage, we only support radix guests if the host is running
in radix mode, and only support HPT guests if the host is running in
HPT mode. Thus a guest cannot switch from one mode to the other,
which enables some simplifications.
Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
This adds code to keep track of dirty pages when requested (that is,
when memslot->dirty_bitmap is non-NULL) for radix guests. We use the
dirty bits in the PTEs in the second-level (partition-scoped) page
tables, together with a bitmap of pages that were dirty when their
PTE was invalidated (e.g., when the page was paged out). This bitmap
is stored in the first half of the memslot->dirty_bitmap area, and
kvm_vm_ioctl_get_dirty_log_hv() now uses the second half for the
bitmap that gets returned to userspace.
Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
This adapts our implementations of the MMU notifier callbacks
(unmap_hva, unmap_hva_range, age_hva, test_age_hva, set_spte_hva)
to call radix functions when the guest is using radix. These
implementations are much simpler than for HPT guests because we
have only one PTE to deal with, so we don't need to traverse
rmap chains.
Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
This adds the code to construct the second-level ("partition-scoped" in
architecturese) page tables for guests using the radix MMU. Apart from
the PGD level, which is allocated when the guest is created, the rest
of the tree is all constructed in response to hypervisor page faults.
As well as hypervisor page faults for missing pages, we also get faults
for reference/change (RC) bits needing to be set, as well as various
other error conditions. For now, we only set the R or C bit in the
guest page table if the same bit is set in the host PTE for the
backing page.
This code can take advantage of the guest being backed with either
transparent or ordinary 2MB huge pages, and insert 2MB page entries
into the guest page tables. There is no support for 1GB huge pages
yet.
Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
This adds a field in struct kvm_arch and an inline helper to
indicate whether a guest is a radix guest or not, plus a new file
to contain the radix MMU code, which currently contains just a
translate function which knows how to traverse the guest page
tables to translate an address.
Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
This adapts the KVM-HV hashed page table (HPT) code to read and write
HPT entries in the new format defined in Power ISA v3.00 on POWER9
machines. The new format moves the B (segment size) field from the
first doubleword to the second, and trims some bits from the AVA
(abbreviated virtual address) and ARPN (abbreviated real page number)
fields. As far as possible, the conversion is done when reading or
writing the HPT entries, and the rest of the code continues to use
the old format.
Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
This keeps a per vcpu cache for recently page faulted MMIO entries.
On a page fault, if the entry exists in the cache, we can avoid some
time-consuming paths, for example, looking up HPT, locking HPTE twice
and searching mmio gfn from memslots, then directly call
kvmppc_hv_emulate_mmio().
In current implenment, we limit the size of cache to four. We think
it's enough to cover the high-frequency MMIO HPTEs in most case.
For example, considering the case of using virtio device, for virtio
legacy devices, one HPTE could handle notifications from up to
1024 (64K page / 64 byte Port IO register) devices, so one cache entry
is enough; for virtio modern devices, we always need one HPTE to handle
notification for each device because modern device would use a 8M MMIO
register to notify host instead of Port IO register, typically the
system's configuration should not exceed four virtio devices per
vcpu, four cache entry is also enough in this case. Of course, if needed,
we could also modify the macro to a module parameter in the future.
Signed-off-by: Yongji Xie <xyjxie@linux.vnet.ibm.com>
Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
Currently we mark a HPTE for emulated MMIO with HPTE_V_ABSENT bit
set as well as key 0x1f. However, those HPTEs may be conflicted with
the HPTE for real guest RAM page HPTE with key 0x1f when the page
get paged out.
This patch clears the key field of HPTE when the page is paged out,
then recover it when HPTE is re-established.
Signed-off-by: Yongji Xie <xyjxie@linux.vnet.ibm.com>
Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
Squash a couple of sparse warnings by making things static.
Build tested.
Signed-off-by: Daniel Axtens <dja@axtens.net>
Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
PowerISA 3.0 introduces two pte bits with the below meaning for radix:
00 -> Normal Memory
01 -> Strong Access Order (SAO)
10 -> Non idempotent I/O (Cache inhibited and guarded)
11 -> Tolerant I/O (Cache inhibited)
We drop the existing WIMG bits in the Linux page table in favour of the
above constants. We loose _PAGE_WRITETHRU with this conversion. We only
use writethru via pgprot_cached_wthru() which is used by
fbdev/controlfb.c which is Apple control display and also PPC32.
With respect to _PAGE_COHERENCE, we have been marking hpte always
coherent for some time now. htab_convert_pte_flags() always added
HPTE_R_M.
NOTE: KVM changes need closer review.
Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
- Kconfig: remove BE-only platforms from LE kernel build from Boqun Feng
- Refresh ps3_defconfig from Geoff Levand
- Emit GNU & SysV hashes for the vdso from Michael Ellerman
- Define an enum for the bolted SLB indexes from Anshuman Khandual
- Use a local to avoid multiple calls to get_slb_shadow() from Michael Ellerman
- Add gettimeofday() benchmark from Michael Neuling
- Avoid link stack corruption in __get_datapage() from Michael Neuling
- Add virt_to_pfn and use this instead of opencoding from Aneesh Kumar K.V
- Add ppc64le_defconfig from Michael Ellerman
- pseries: extract of_helpers module from Andy Shevchenko
- Correct string length in pseries_of_derive_parent() from Nathan Fontenot
- Free the MSI bitmap if it was slab allocated from Denis Kirjanov
- Shorten irq_chip name for the SIU from Christophe Leroy
- Wait 1s for secondaries to enter OPAL during kexec from Samuel Mendoza-Jonas
- Fix _ALIGN_* errors due to type difference. from Aneesh Kumar K.V
- powerpc/pseries/hvcserver: don't memset pi_buff if it is null from Colin Ian King
- Disable hugepd for 64K page size. from Aneesh Kumar K.V
- Differentiate between hugetlb and THP during page walk from Aneesh Kumar K.V
- Make PCI non-optional for pseries from Michael Ellerman
- Individual System V IPC system calls from Sam bobroff
- Add selftest of unmuxed IPC calls from Michael Ellerman
- discard .exit.data at runtime from Stephen Rothwell
- Delete old orphaned PrPMC 280/2800 DTS and boot file. from Paul Gortmaker
- Use of_get_next_parent to simplify code from Christophe Jaillet
- Paginate some xmon output from Sam bobroff
- Add some more elements to the xmon PACA dump from Michael Ellerman
- Allow the tm-syscall selftest to build with old headers from Michael Ellerman
- Run EBB selftests only on POWER8 from Denis Kirjanov
- Drop CONFIG_TUNE_CELL in favour of CONFIG_CELL_CPU from Michael Ellerman
- Avoid reference to potentially freed memory in prom.c from Christophe Jaillet
- Quieten boot wrapper output with run_cmd from Geoff Levand
- EEH fixes and cleanups from Gavin Shan
- Fix recursive fenced PHB on Broadcom shiner adapter from Gavin Shan
- Use of_get_next_parent() in of_get_ibm_chip_id() from Michael Ellerman
- Fix section mismatch warning in msi_bitmap_alloc() from Denis Kirjanov
- Fix ps3-lpm white space from Rudhresh Kumar J
- Fix ps3-vuart null dereference from Colin King
- nvram: Add missing kfree in error path from Christophe Jaillet
- nvram: Fix function name in some errors messages. from Christophe Jaillet
- drivers/macintosh: adb: fix misleading Kconfig help text from Aaro Koskinen
- agp/uninorth: fix a memleak in create_gatt_table from Denis Kirjanov
- cxl: Free virtual PHB when removing from Andrew Donnellan
- scripts/kconfig/Makefile: Allow KBUILD_DEFCONFIG to be a target from Michael Ellerman
- scripts/kconfig/Makefile: Fix KBUILD_DEFCONFIG check when building with O= from Michael Ellerman
- Freescale updates from Scott: Highlights include 64-bit book3e kexec/kdump
support, a rework of the qoriq clock driver, device tree changes including
qoriq fman nodes, support for a new 85xx board, and some fixes.
- MPC5xxx updates from Anatolij: Highlights include a driver for MPC512x
LocalPlus Bus FIFO with its device tree binding documentation, mpc512x
device tree updates and some minor fixes.
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1
iQIcBAABAgAGBQJWPEZgAAoJEFHr6jzI4aWANjYQAKX2Q/95hqKfCuF5FBcUmtMC
Pu/Nff027MVzxZ2ApDcvvLGps5Nz2bn3nIhc9zjkXc5E8DuL6X3Yl8ce7qyNcc3g
cJJ8RvtUo6J1OMWetXFehtPYniAAwKMhZYKnj0+WnLr2SyH/Vhl3ehDkFbGyPtuH
r+2E7krFjfVgU+bzciIFnOaDekFuFN/pXWMb6e6zQyBJe9N8ZIp96uouGCebKVd0
VDLItzdaKErT8JFfbymMPvZm3V0rMVx4WWu3kAbQX8LrD5a18NF1zrjAOHRXc61n
kkk8/DPuNOon1PbXXyiS5BcFyZRe+KE3VBnoW5sOMqMIRg5WdO1oU3e2pEfXMO8+
leXYwFLXiKzUZuOgQG2QiUhrzD2yC1o6/TJWATv0dSl9AwrecgPX+Vj6X357slAf
A9E3eMy5tgnpndBWZmvZS3W7YDKH+NkeZ+Q40+NErAlqr++ErrTcKVndk5vWlYTT
7mMZeTXagX66al/k5ATKqwB7iUSpnYHSAa9fcUYPSM2FnXsDxPyeJGkBbcoOmkGj
QrpgNYOvJaUJd076goZCV39v0c1xpfV9/9kyVch8HUadf6JcjpVZwYnbGw2qlJjh
ZanuBG2VOeSwaKQqXiRBSBetnpAg8CVpFjDmX9wOBfSek2wxEJqDX/vQExdbIDQQ
pUs7vnUxLzhmW/x+ygOI
=YwcM
-----END PGP SIGNATURE-----
Merge tag 'powerpc-4.4-1' of git://git.kernel.org/pub/scm/linux/kernel/git/powerpc/linux
Pull powerpc updates from Michael Ellerman:
- Kconfig: remove BE-only platforms from LE kernel build from Boqun
Feng
- Refresh ps3_defconfig from Geoff Levand
- Emit GNU & SysV hashes for the vdso from Michael Ellerman
- Define an enum for the bolted SLB indexes from Anshuman Khandual
- Use a local to avoid multiple calls to get_slb_shadow() from Michael
Ellerman
- Add gettimeofday() benchmark from Michael Neuling
- Avoid link stack corruption in __get_datapage() from Michael Neuling
- Add virt_to_pfn and use this instead of opencoding from Aneesh Kumar
K.V
- Add ppc64le_defconfig from Michael Ellerman
- pseries: extract of_helpers module from Andy Shevchenko
- Correct string length in pseries_of_derive_parent() from Nathan
Fontenot
- Free the MSI bitmap if it was slab allocated from Denis Kirjanov
- Shorten irq_chip name for the SIU from Christophe Leroy
- Wait 1s for secondaries to enter OPAL during kexec from Samuel
Mendoza-Jonas
- Fix _ALIGN_* errors due to type difference, from Aneesh Kumar K.V
- powerpc/pseries/hvcserver: don't memset pi_buff if it is null from
Colin Ian King
- Disable hugepd for 64K page size, from Aneesh Kumar K.V
- Differentiate between hugetlb and THP during page walk from Aneesh
Kumar K.V
- Make PCI non-optional for pseries from Michael Ellerman
- Individual System V IPC system calls from Sam bobroff
- Add selftest of unmuxed IPC calls from Michael Ellerman
- discard .exit.data at runtime from Stephen Rothwell
- Delete old orphaned PrPMC 280/2800 DTS and boot file, from Paul
Gortmaker
- Use of_get_next_parent to simplify code from Christophe Jaillet
- Paginate some xmon output from Sam bobroff
- Add some more elements to the xmon PACA dump from Michael Ellerman
- Allow the tm-syscall selftest to build with old headers from Michael
Ellerman
- Run EBB selftests only on POWER8 from Denis Kirjanov
- Drop CONFIG_TUNE_CELL in favour of CONFIG_CELL_CPU from Michael
Ellerman
- Avoid reference to potentially freed memory in prom.c from Christophe
Jaillet
- Quieten boot wrapper output with run_cmd from Geoff Levand
- EEH fixes and cleanups from Gavin Shan
- Fix recursive fenced PHB on Broadcom shiner adapter from Gavin Shan
- Use of_get_next_parent() in of_get_ibm_chip_id() from Michael
Ellerman
- Fix section mismatch warning in msi_bitmap_alloc() from Denis
Kirjanov
- Fix ps3-lpm white space from Rudhresh Kumar J
- Fix ps3-vuart null dereference from Colin King
- nvram: Add missing kfree in error path from Christophe Jaillet
- nvram: Fix function name in some errors messages, from Christophe
Jaillet
- drivers/macintosh: adb: fix misleading Kconfig help text from Aaro
Koskinen
- agp/uninorth: fix a memleak in create_gatt_table from Denis Kirjanov
- cxl: Free virtual PHB when removing from Andrew Donnellan
- scripts/kconfig/Makefile: Allow KBUILD_DEFCONFIG to be a target from
Michael Ellerman
- scripts/kconfig/Makefile: Fix KBUILD_DEFCONFIG check when building
with O= from Michael Ellerman
- Freescale updates from Scott: Highlights include 64-bit book3e
kexec/kdump support, a rework of the qoriq clock driver, device tree
changes including qoriq fman nodes, support for a new 85xx board, and
some fixes.
- MPC5xxx updates from Anatolij: Highlights include a driver for
MPC512x LocalPlus Bus FIFO with its device tree binding
documentation, mpc512x device tree updates and some minor fixes.
* tag 'powerpc-4.4-1' of git://git.kernel.org/pub/scm/linux/kernel/git/powerpc/linux: (106 commits)
powerpc/msi: Fix section mismatch warning in msi_bitmap_alloc()
powerpc/prom: Use of_get_next_parent() in of_get_ibm_chip_id()
powerpc/pseries: Correct string length in pseries_of_derive_parent()
powerpc/e6500: hw tablewalk: make sure we invalidate and write to the same tlb entry
powerpc/mpc85xx: Add FSL QorIQ DPAA FMan support to the SoC device tree(s)
powerpc/mpc85xx: Create dts components for the FSL QorIQ DPAA FMan
powerpc/fsl: Add #clock-cells and clockgen label to clockgen nodes
powerpc: handle error case in cpm_muram_alloc()
powerpc: mpic: use IRQCHIP_SKIP_SET_WAKE instead of redundant mpic_irq_set_wake
powerpc/book3e-64: Enable kexec
powerpc/book3e-64/kexec: Set "r4 = 0" when entering spinloop
powerpc/booke: Only use VIRT_PHYS_OFFSET on booke32
powerpc/book3e-64/kexec: Enable SMP release
powerpc/book3e-64/kexec: create an identity TLB mapping
powerpc/book3e-64: Don't limit paca to 256 MiB
powerpc/book3e/kdump: Enable crash_kexec_wait_realmode
powerpc/book3e: support CONFIG_RELOCATABLE
powerpc/booke64: Fix args to copy_and_flush
powerpc/book3e-64: rename interrupt_end_book3e with __end_interrupts
powerpc/e6500: kexec: Handle hardware threads
...
Currently the KVM_PPC_ALLOCATE_HTAB will try to allocate the requested
size of HPT, and if that is not possible, then try to allocate smaller
sizes (by factors of 2) until either a minimum is reached or the
allocation succeeds. This is not ideal for userspace, particularly in
migration scenarios, where the destination VM really does require the
size requested. Also, the minimum HPT size of 256kB may be
insufficient for the guest to run successfully.
This removes the fallback to smaller sizes on allocation failure for
the KVM_PPC_ALLOCATE_HTAB ioctl. The fallback still exists for the
case where the HPT is allocated at the time the first VCPU is run, if
no HPT has been allocated by ioctl by that time.
Signed-off-by: Paul Mackerras <paulus@samba.org>
We need to properly identify whether a hugepage is an explicit or
a transparent hugepage in follow_huge_addr(). We used to depend
on hugepage shift argument to do that. But in some case that can
result in wrong results. For ex:
On finding a transparent hugepage we set hugepage shift to PMD_SHIFT.
But we can end up clearing the thp pte, via pmdp_huge_get_and_clear.
We do prevent reusing the pfn page via the usage of
kick_all_cpus_sync(). But that happens after we updated the pte to 0.
Hence in follow_huge_addr() we can find hugepage shift set, but transparent
huge page check fail for a thp pte.
NOTE: We fixed a variant of this race against thp split in commit
691e95fd73
("powerpc/mm/thp: Make page table walk safe against thp split/collapse")
Without this patch, we may hit the BUG_ON(flags & FOLL_GET) in
follow_page_mask occasionally.
In the long term, we may want to switch ppc64 64k page size config to
enable CONFIG_ARCH_WANT_GENERAL_HUGETLB
Reported-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
This fixes a bug in the tracking of pages that get modified by the
guest. If the guest creates a large-page HPTE, writes to memory
somewhere within the large page, and then removes the HPTE, we only
record the modified state for the first normal page within the large
page, when in fact the guest might have modified some other normal
page within the large page.
To fix this we use some unused bits in the rmap entry to record the
order (log base 2) of the size of the page that was modified, when
removing an HPTE. Then in kvm_test_clear_dirty_npages() we use that
order to return the correct number of modified pages.
The same thing could in principle happen when removing a HPTE at the
host's request, i.e. when paging out a page, except that we never
page out large pages, and the guest can only create large-page HPTEs
if the guest RAM is backed by large pages. However, we also fix
this case for the sake of future-proofing.
The reference bit is also subject to the same loss of information. We
don't make the same fix here for the reference bit because there isn't
an interface for userspace to find out which pages the guest has
referenced, whereas there is one for userspace to find out which pages
the guest has modified. Because of this loss of information, the
kvm_age_hva_hv() and kvm_test_age_hva_hv() functions might incorrectly
say that a page has not been referenced when it has, but that doesn't
matter greatly because we never page or swap out large pages.
Signed-off-by: Paul Mackerras <paulus@samba.org>
Signed-off-by: Alexander Graf <agraf@suse.de>
kvm_memslots provides lockdep checking. Use it consistently instead of
explicit dereferencing of kvm->memslots.
Reviewed-by: Radim Krcmar <rkrcmar@redhat.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
- Fix for mm_dec_nr_pmds() from Scott.
- Fixes for oopses seen with KVM + THP from Aneesh.
- Build fixes from Aneesh & Shreyas.
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1
iQIcBAABAgAGBQJVOsk5AAoJEFHr6jzI4aWACWMP/3EaNoeA1g8VbZWZEdRaoLvX
W7D08DI3Dt8HLxyn2JR08jYZF0gr68XrF6OiscVVki7wVXT8fbH4jSNmBkbzNH95
d9taScJyR1CUavkhsXivnR1qEE1Fi2KA2OW9RaNfoSt1MVtdsvOK6xXklUGksuJQ
XygzyrRr4Dj82kuMUAMO0YDMvknMlzi3a8dzyrWZBXBZOOTWavGB6bQKtCTaOQ99
3OFGLQ10uY7lmdHDi0t0tQ99FuYfLiJpg5fTLoUni4J5tFp8JlZ+x0Gwc0apN0cy
Ym8EO6++qWDv8FXvYEPfVUEjbF1fyPiawUgpkMnyvXgd8K5G85SIrtkGW0Ml+6sX
GfJH8w9hpDbF5EnWlC9bn/jT7sHBHFdrxZuQUc0L4M2OtM73R2a0Xr3b7ZxFCD1q
7RpYu8MKKcyvaIXNg7VBJjj8zL+WmUJKF6J5uX5bGU2xH0khmp0vTknyyjbwrlcF
uHidv5ZhMt3aAI70v14jA5BTEmLyOYRu58Ei6cT/VT/DjdbpEApdK8BMAvKSEeib
+hzh6oDFT92AM0tbg15bNmqGbGfgqtVKe4GDS2QyGaHGAFOGs1nPuSa9se1xYDcM
CCtRyABwpzJsrCfwra2fsTU6FxlatK4ONViyWFBXa6mEjBNSZ4XmyZvdWUqlwpSC
F5jNGppm5Ama6xxcLphA
=6yQx
-----END PGP SIGNATURE-----
Merge tag 'powerpc-4.1-2' of git://git.kernel.org/pub/scm/linux/kernel/git/mpe/linux
Pull powerpc fixes from Michael Ellerman:
- fix for mm_dec_nr_pmds() from Scott.
- fixes for oopses seen with KVM + THP from Aneesh.
- build fixes from Aneesh & Shreyas.
* tag 'powerpc-4.1-2' of git://git.kernel.org/pub/scm/linux/kernel/git/mpe/linux:
powerpc/mm: Fix build error with CONFIG_PPC_TRANSACTIONAL_MEM disabled
powerpc/kvm: Fix ppc64_defconfig + PPC_POWERNV=n build error
powerpc/mm/thp: Return pte address if we find trans_splitting.
powerpc/mm/thp: Make page table walk safe against thp split/collapse
KVM: PPC: Remove page table walk helpers
KVM: PPC: Use READ_ONCE when dereferencing pte_t pointer
powerpc/hugetlb: Call mm_dec_nr_pmds() in hugetlb_free_pmd_range()
This creates a debugfs directory for each HV guest (assuming debugfs
is enabled in the kernel config), and within that directory, a file
by which the contents of the guest's HPT (hashed page table) can be
read. The directory is named vmnnnn, where nnnn is the PID of the
process that created the guest. The file is named "htab". This is
intended to help in debugging problems in the host's management
of guest memory.
The contents of the file consist of a series of lines like this:
3f48 4000d032bf003505 0000000bd7ff1196 00000003b5c71196
The first field is the index of the entry in the HPT, the second and
third are the HPT entry, so the third entry contains the real page
number that is mapped by the entry if the entry's valid bit is set.
The fourth field is the guest's view of the second doubleword of the
entry, so it contains the guest physical address. (The format of the
second through fourth fields are described in the Power ISA and also
in arch/powerpc/include/asm/mmu-hash64.h.)
Signed-off-by: Paul Mackerras <paulus@samba.org>
Signed-off-by: Alexander Graf <agraf@suse.de>
This adds helper routines for locking and unlocking HPTEs, and uses
them in the rest of the code. We don't change any locking rules in
this patch.
Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
Signed-off-by: Paul Mackerras <paulus@samba.org>
Signed-off-by: Alexander Graf <agraf@suse.de>
We don't support real-mode areas now that 970 support is removed.
Remove the remaining details of rma from the code. Also rename
rma_setup_done to hpte_setup_done to better reflect the changes.
Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
Signed-off-by: Paul Mackerras <paulus@samba.org>
Signed-off-by: Alexander Graf <agraf@suse.de>
For THP that is marked trans splitting, we return the pte.
This require the callers to handle the pmd_trans_splitting scenario,
if they care. All the current callers are either looking at pfn or
write_ok, hence we don't need to update them.
Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
We can disable a THP split or a hugepage collapse by disabling irq.
We do send IPI to all the cpus in the early part of split/collapse,
and disabling local irq ensure we don't make progress with
split/collapse. If the THP is getting split we return NULL from
find_linux_pte_or_hugepte(). For all the current callers it should be ok.
We need to be careful if we want to use returned pte_t pointer outside
the irq disabled region. W.r.t to THP split, the pfn remains the same,
but then a hugepage collapse will result in a pfn change. There are
few steps we can take to avoid a hugepage collapse.One way is to take page
reference inside the irq disable region. Other option is to take
mmap_sem so that a parallel collapse will not happen. We can also
disable collapse by taking pmd_lock. Another method used by kvm
subsystem is to check whether we had a mmu_notifer update in between
using mmu_notifier_retry().
Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
This removes the code that was added to enable HV KVM to work
on PPC970 processors. The PPC970 is an old CPU that doesn't
support virtualizing guest memory. Removing PPC970 support also
lets us remove the code for allocating and managing contiguous
real-mode areas, the code for the !kvm->arch.using_mmu_notifiers
case, the code for pinning pages of guest memory when first
accessed and keeping track of which pages have been pinned, and
the code for handling H_ENTER hypercalls in virtual mode.
Book3S HV KVM is now supported only on POWER7 and POWER8 processors.
The KVM_CAP_PPC_RMA capability now always returns 0.
Signed-off-by: Paul Mackerras <paulus@samba.org>
Signed-off-by: Alexander Graf <agraf@suse.de>
This patch adds trace points in the guest entry and exit code and also
for exceptions handled by the host in kernel mode - hypercalls and page
faults. The new events are added to /sys/kernel/debug/tracing/events
under a new subsystem called kvm_hv.
Acked-by: Paul Mackerras <paulus@samba.org>
Signed-off-by: Suresh Warrier <warrier@linux.vnet.ibm.com>
Signed-off-by: Alexander Graf <agraf@suse.de>