linux/arch/powerpc/mm/mmu_context_iommu.c

483 lines
11 KiB
C
Raw Normal View History

powerpc/mmu: Add userspace-to-physical addresses translation cache We are adding support for DMA memory pre-registration to be used in conjunction with VFIO. The idea is that the userspace which is going to run a guest may want to pre-register a user space memory region so it all gets pinned once and never goes away. Having this done, a hypervisor will not have to pin/unpin pages on every DMA map/unmap request. This is going to help with multiple pinning of the same memory. Another use of it is in-kernel real mode (mmu off) acceleration of DMA requests where real time translation of guest physical to host physical addresses is non-trivial and may fail as linux ptes may be temporarily invalid. Also, having cached host physical addresses (compared to just pinning at the start and then walking the page table again on every H_PUT_TCE), we can be sure that the addresses which we put into TCE table are the ones we already pinned. This adds a list of memory regions to mm_context_t. Each region consists of a header and a list of physical addresses. This adds API to: 1. register/unregister memory regions; 2. do final cleanup (which puts all pre-registered pages); 3. do userspace to physical address translation; 4. manage usage counters; multiple registration of the same memory is allowed (once per container). This implements 2 counters per registered memory region: - @mapped: incremented on every DMA mapping; decremented on unmapping; initialized to 1 when a region is just registered; once it becomes zero, no more mappings allowe; - @used: incremented on every "register" ioctl; decremented on "unregister"; unregistration is allowed for DMA mapped regions unless it is the very last reference. For the very last reference this checks that the region is still mapped and returns -EBUSY so the userspace gets to know that memory is still pinned and unregistration needs to be retried; @used remains 1. Host physical addresses are stored in vmalloc'ed array. In order to access these in the real mode (mmu off), there is a real_vmalloc_addr() helper. In-kernel acceleration patchset will move it from KVM to MMU code. Signed-off-by: Alexey Kardashevskiy <aik@ozlabs.ru> Reviewed-by: David Gibson <david@gibson.dropbear.id.au> Reviewed-by: David Gibson <david@gibson.dropbear.id.au> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
2015-06-05 14:35:24 +08:00
/*
* IOMMU helpers in MMU context.
*
* Copyright (C) 2015 IBM Corp. <aik@ozlabs.ru>
*
* This program is free software; you can redistribute it and/or
* modify it under the terms of the GNU General Public License
* as published by the Free Software Foundation; either version
* 2 of the License, or (at your option) any later version.
*
*/
#include <linux/sched/signal.h>
powerpc/mmu: Add userspace-to-physical addresses translation cache We are adding support for DMA memory pre-registration to be used in conjunction with VFIO. The idea is that the userspace which is going to run a guest may want to pre-register a user space memory region so it all gets pinned once and never goes away. Having this done, a hypervisor will not have to pin/unpin pages on every DMA map/unmap request. This is going to help with multiple pinning of the same memory. Another use of it is in-kernel real mode (mmu off) acceleration of DMA requests where real time translation of guest physical to host physical addresses is non-trivial and may fail as linux ptes may be temporarily invalid. Also, having cached host physical addresses (compared to just pinning at the start and then walking the page table again on every H_PUT_TCE), we can be sure that the addresses which we put into TCE table are the ones we already pinned. This adds a list of memory regions to mm_context_t. Each region consists of a header and a list of physical addresses. This adds API to: 1. register/unregister memory regions; 2. do final cleanup (which puts all pre-registered pages); 3. do userspace to physical address translation; 4. manage usage counters; multiple registration of the same memory is allowed (once per container). This implements 2 counters per registered memory region: - @mapped: incremented on every DMA mapping; decremented on unmapping; initialized to 1 when a region is just registered; once it becomes zero, no more mappings allowe; - @used: incremented on every "register" ioctl; decremented on "unregister"; unregistration is allowed for DMA mapped regions unless it is the very last reference. For the very last reference this checks that the region is still mapped and returns -EBUSY so the userspace gets to know that memory is still pinned and unregistration needs to be retried; @used remains 1. Host physical addresses are stored in vmalloc'ed array. In order to access these in the real mode (mmu off), there is a real_vmalloc_addr() helper. In-kernel acceleration patchset will move it from KVM to MMU code. Signed-off-by: Alexey Kardashevskiy <aik@ozlabs.ru> Reviewed-by: David Gibson <david@gibson.dropbear.id.au> Reviewed-by: David Gibson <david@gibson.dropbear.id.au> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
2015-06-05 14:35:24 +08:00
#include <linux/slab.h>
#include <linux/rculist.h>
#include <linux/vmalloc.h>
#include <linux/mutex.h>
#include <linux/migrate.h>
#include <linux/hugetlb.h>
#include <linux/swap.h>
KVM: PPC: Avoid marking DMA-mapped pages dirty in real mode At the moment the real mode handler of H_PUT_TCE calls iommu_tce_xchg_rm() which in turn reads the old TCE and if it was a valid entry, marks the physical page dirty if it was mapped for writing. Since it is in real mode, realmode_pfn_to_page() is used instead of pfn_to_page() to get the page struct. However SetPageDirty() itself reads the compound page head and returns a virtual address for the head page struct and setting dirty bit for that kills the system. This adds additional dirty bit tracking into the MM/IOMMU API for use in the real mode. Note that this does not change how VFIO and KVM (in virtual mode) set this bit. The KVM (real mode) changes include: - use the lowest bit of the cached host phys address to carry the dirty bit; - mark pages dirty when they are unpinned which happens when the preregistered memory is released which always happens in virtual mode; - add mm_iommu_ua_mark_dirty_rm() helper to set delayed dirty bit; - change iommu_tce_xchg_rm() to take the kvm struct for the mm to use in the new mm_iommu_ua_mark_dirty_rm() helper; - move iommu_tce_xchg_rm() to book3s_64_vio_hv.c (which is the only caller anyway) to reduce the real mode KVM and IOMMU knowledge across different subsystems. This removes realmode_pfn_to_page() as it is not used anymore. While we at it, remove some EXPORT_SYMBOL_GPL() as that code is for the real mode only and modules cannot call it anyway. Signed-off-by: Alexey Kardashevskiy <aik@ozlabs.ru> Reviewed-by: David Gibson <david@gibson.dropbear.id.au> Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
2018-09-10 16:29:07 +08:00
#include <linux/sizes.h>
powerpc/mmu: Add userspace-to-physical addresses translation cache We are adding support for DMA memory pre-registration to be used in conjunction with VFIO. The idea is that the userspace which is going to run a guest may want to pre-register a user space memory region so it all gets pinned once and never goes away. Having this done, a hypervisor will not have to pin/unpin pages on every DMA map/unmap request. This is going to help with multiple pinning of the same memory. Another use of it is in-kernel real mode (mmu off) acceleration of DMA requests where real time translation of guest physical to host physical addresses is non-trivial and may fail as linux ptes may be temporarily invalid. Also, having cached host physical addresses (compared to just pinning at the start and then walking the page table again on every H_PUT_TCE), we can be sure that the addresses which we put into TCE table are the ones we already pinned. This adds a list of memory regions to mm_context_t. Each region consists of a header and a list of physical addresses. This adds API to: 1. register/unregister memory regions; 2. do final cleanup (which puts all pre-registered pages); 3. do userspace to physical address translation; 4. manage usage counters; multiple registration of the same memory is allowed (once per container). This implements 2 counters per registered memory region: - @mapped: incremented on every DMA mapping; decremented on unmapping; initialized to 1 when a region is just registered; once it becomes zero, no more mappings allowe; - @used: incremented on every "register" ioctl; decremented on "unregister"; unregistration is allowed for DMA mapped regions unless it is the very last reference. For the very last reference this checks that the region is still mapped and returns -EBUSY so the userspace gets to know that memory is still pinned and unregistration needs to be retried; @used remains 1. Host physical addresses are stored in vmalloc'ed array. In order to access these in the real mode (mmu off), there is a real_vmalloc_addr() helper. In-kernel acceleration patchset will move it from KVM to MMU code. Signed-off-by: Alexey Kardashevskiy <aik@ozlabs.ru> Reviewed-by: David Gibson <david@gibson.dropbear.id.au> Reviewed-by: David Gibson <david@gibson.dropbear.id.au> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
2015-06-05 14:35:24 +08:00
#include <asm/mmu_context.h>
KVM: PPC: Check if IOMMU page is contained in the pinned physical page A VM which has: - a DMA capable device passed through to it (eg. network card); - running a malicious kernel that ignores H_PUT_TCE failure; - capability of using IOMMU pages bigger that physical pages can create an IOMMU mapping that exposes (for example) 16MB of the host physical memory to the device when only 64K was allocated to the VM. The remaining 16MB - 64K will be some other content of host memory, possibly including pages of the VM, but also pages of host kernel memory, host programs or other VMs. The attacking VM does not control the location of the page it can map, and is only allowed to map as many pages as it has pages of RAM. We already have a check in drivers/vfio/vfio_iommu_spapr_tce.c that an IOMMU page is contained in the physical page so the PCI hardware won't get access to unassigned host memory; however this check is missing in the KVM fastpath (H_PUT_TCE accelerated code). We were lucky so far and did not hit this yet as the very first time when the mapping happens we do not have tbl::it_userspace allocated yet and fall back to the userspace which in turn calls VFIO IOMMU driver, this fails and the guest does not retry, This stores the smallest preregistered page size in the preregistered region descriptor and changes the mm_iommu_xxx API to check this against the IOMMU page size. This calculates maximum page size as a minimum of the natural region alignment and compound page size. For the page shift this uses the shift returned by find_linux_pte() which indicates how the page is mapped to the current userspace - if the page is huge and this is not a zero, then it is a leaf pte and the page is mapped within the range. Fixes: 121f80ba68f1 ("KVM: PPC: VFIO: Add in-kernel acceleration for VFIO") Cc: stable@vger.kernel.org # v4.12+ Signed-off-by: Alexey Kardashevskiy <aik@ozlabs.ru> Reviewed-by: David Gibson <david@gibson.dropbear.id.au> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
2018-07-17 15:19:13 +08:00
#include <asm/pte-walk.h>
powerpc/mm/iommu: allow migration of cma allocated pages during mm_iommu_do_alloc The current code doesn't do page migration if the page allocated is a compound page. With HugeTLB migration support, we can end up allocating hugetlb pages from CMA region. Also, THP pages can be allocated from CMA region. This patch updates the code to handle compound pages correctly. The patch also switches to a single get_user_pages with the right count, instead of doing one get_user_pages per page. That avoids reading page table multiple times. This is done by using get_user_pages_longterm, because that also takes care of DAX backed pages. DAX pages lifetime is dictated by file system rules and as such, we need to make sure that we free these pages on operations like truncate and punch hole. If we have long term pin on these pages, which are mostly return to userspace with elevated page count, the entity holding the long term pin may not be aware of the fact that file got truncated and the file system blocks possibly got reused. That can result in corruption. The patch also converts the hpas member of mm_iommu_table_group_mem_t to a union. We use the same storage location to store pointers to struct page. We cannot update all the code path use struct page *, because we access hpas in real mode and we can't do that struct page * to pfn conversion in real mode. [aneesh.kumar@linux.ibm.com: address review feedback, update changelog] Link: http://lkml.kernel.org/r/20190227144736.5872-4-aneesh.kumar@linux.ibm.com Link: http://lkml.kernel.org/r/20190114095438.32470-5-aneesh.kumar@linux.ibm.com Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.ibm.com> Reviewed-by: Michael Ellerman <mpe@ellerman.id.au> Cc: Alexey Kardashevskiy <aik@ozlabs.ru> Cc: Andrea Arcangeli <aarcange@redhat.com> Cc: David Gibson <david@gibson.dropbear.id.au> Cc: Michal Hocko <mhocko@kernel.org> Cc: Mel Gorman <mgorman@techsingularity.net> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2019-03-06 07:47:47 +08:00
#include <linux/mm_inline.h>
powerpc/mmu: Add userspace-to-physical addresses translation cache We are adding support for DMA memory pre-registration to be used in conjunction with VFIO. The idea is that the userspace which is going to run a guest may want to pre-register a user space memory region so it all gets pinned once and never goes away. Having this done, a hypervisor will not have to pin/unpin pages on every DMA map/unmap request. This is going to help with multiple pinning of the same memory. Another use of it is in-kernel real mode (mmu off) acceleration of DMA requests where real time translation of guest physical to host physical addresses is non-trivial and may fail as linux ptes may be temporarily invalid. Also, having cached host physical addresses (compared to just pinning at the start and then walking the page table again on every H_PUT_TCE), we can be sure that the addresses which we put into TCE table are the ones we already pinned. This adds a list of memory regions to mm_context_t. Each region consists of a header and a list of physical addresses. This adds API to: 1. register/unregister memory regions; 2. do final cleanup (which puts all pre-registered pages); 3. do userspace to physical address translation; 4. manage usage counters; multiple registration of the same memory is allowed (once per container). This implements 2 counters per registered memory region: - @mapped: incremented on every DMA mapping; decremented on unmapping; initialized to 1 when a region is just registered; once it becomes zero, no more mappings allowe; - @used: incremented on every "register" ioctl; decremented on "unregister"; unregistration is allowed for DMA mapped regions unless it is the very last reference. For the very last reference this checks that the region is still mapped and returns -EBUSY so the userspace gets to know that memory is still pinned and unregistration needs to be retried; @used remains 1. Host physical addresses are stored in vmalloc'ed array. In order to access these in the real mode (mmu off), there is a real_vmalloc_addr() helper. In-kernel acceleration patchset will move it from KVM to MMU code. Signed-off-by: Alexey Kardashevskiy <aik@ozlabs.ru> Reviewed-by: David Gibson <david@gibson.dropbear.id.au> Reviewed-by: David Gibson <david@gibson.dropbear.id.au> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
2015-06-05 14:35:24 +08:00
static DEFINE_MUTEX(mem_list_mutex);
KVM: PPC: Avoid marking DMA-mapped pages dirty in real mode At the moment the real mode handler of H_PUT_TCE calls iommu_tce_xchg_rm() which in turn reads the old TCE and if it was a valid entry, marks the physical page dirty if it was mapped for writing. Since it is in real mode, realmode_pfn_to_page() is used instead of pfn_to_page() to get the page struct. However SetPageDirty() itself reads the compound page head and returns a virtual address for the head page struct and setting dirty bit for that kills the system. This adds additional dirty bit tracking into the MM/IOMMU API for use in the real mode. Note that this does not change how VFIO and KVM (in virtual mode) set this bit. The KVM (real mode) changes include: - use the lowest bit of the cached host phys address to carry the dirty bit; - mark pages dirty when they are unpinned which happens when the preregistered memory is released which always happens in virtual mode; - add mm_iommu_ua_mark_dirty_rm() helper to set delayed dirty bit; - change iommu_tce_xchg_rm() to take the kvm struct for the mm to use in the new mm_iommu_ua_mark_dirty_rm() helper; - move iommu_tce_xchg_rm() to book3s_64_vio_hv.c (which is the only caller anyway) to reduce the real mode KVM and IOMMU knowledge across different subsystems. This removes realmode_pfn_to_page() as it is not used anymore. While we at it, remove some EXPORT_SYMBOL_GPL() as that code is for the real mode only and modules cannot call it anyway. Signed-off-by: Alexey Kardashevskiy <aik@ozlabs.ru> Reviewed-by: David Gibson <david@gibson.dropbear.id.au> Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
2018-09-10 16:29:07 +08:00
#define MM_IOMMU_TABLE_GROUP_PAGE_DIRTY 0x1
#define MM_IOMMU_TABLE_GROUP_PAGE_MASK ~(SZ_4K - 1)
powerpc/mmu: Add userspace-to-physical addresses translation cache We are adding support for DMA memory pre-registration to be used in conjunction with VFIO. The idea is that the userspace which is going to run a guest may want to pre-register a user space memory region so it all gets pinned once and never goes away. Having this done, a hypervisor will not have to pin/unpin pages on every DMA map/unmap request. This is going to help with multiple pinning of the same memory. Another use of it is in-kernel real mode (mmu off) acceleration of DMA requests where real time translation of guest physical to host physical addresses is non-trivial and may fail as linux ptes may be temporarily invalid. Also, having cached host physical addresses (compared to just pinning at the start and then walking the page table again on every H_PUT_TCE), we can be sure that the addresses which we put into TCE table are the ones we already pinned. This adds a list of memory regions to mm_context_t. Each region consists of a header and a list of physical addresses. This adds API to: 1. register/unregister memory regions; 2. do final cleanup (which puts all pre-registered pages); 3. do userspace to physical address translation; 4. manage usage counters; multiple registration of the same memory is allowed (once per container). This implements 2 counters per registered memory region: - @mapped: incremented on every DMA mapping; decremented on unmapping; initialized to 1 when a region is just registered; once it becomes zero, no more mappings allowe; - @used: incremented on every "register" ioctl; decremented on "unregister"; unregistration is allowed for DMA mapped regions unless it is the very last reference. For the very last reference this checks that the region is still mapped and returns -EBUSY so the userspace gets to know that memory is still pinned and unregistration needs to be retried; @used remains 1. Host physical addresses are stored in vmalloc'ed array. In order to access these in the real mode (mmu off), there is a real_vmalloc_addr() helper. In-kernel acceleration patchset will move it from KVM to MMU code. Signed-off-by: Alexey Kardashevskiy <aik@ozlabs.ru> Reviewed-by: David Gibson <david@gibson.dropbear.id.au> Reviewed-by: David Gibson <david@gibson.dropbear.id.au> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
2015-06-05 14:35:24 +08:00
struct mm_iommu_table_group_mem_t {
struct list_head next;
struct rcu_head rcu;
unsigned long used;
atomic64_t mapped;
KVM: PPC: Check if IOMMU page is contained in the pinned physical page A VM which has: - a DMA capable device passed through to it (eg. network card); - running a malicious kernel that ignores H_PUT_TCE failure; - capability of using IOMMU pages bigger that physical pages can create an IOMMU mapping that exposes (for example) 16MB of the host physical memory to the device when only 64K was allocated to the VM. The remaining 16MB - 64K will be some other content of host memory, possibly including pages of the VM, but also pages of host kernel memory, host programs or other VMs. The attacking VM does not control the location of the page it can map, and is only allowed to map as many pages as it has pages of RAM. We already have a check in drivers/vfio/vfio_iommu_spapr_tce.c that an IOMMU page is contained in the physical page so the PCI hardware won't get access to unassigned host memory; however this check is missing in the KVM fastpath (H_PUT_TCE accelerated code). We were lucky so far and did not hit this yet as the very first time when the mapping happens we do not have tbl::it_userspace allocated yet and fall back to the userspace which in turn calls VFIO IOMMU driver, this fails and the guest does not retry, This stores the smallest preregistered page size in the preregistered region descriptor and changes the mm_iommu_xxx API to check this against the IOMMU page size. This calculates maximum page size as a minimum of the natural region alignment and compound page size. For the page shift this uses the shift returned by find_linux_pte() which indicates how the page is mapped to the current userspace - if the page is huge and this is not a zero, then it is a leaf pte and the page is mapped within the range. Fixes: 121f80ba68f1 ("KVM: PPC: VFIO: Add in-kernel acceleration for VFIO") Cc: stable@vger.kernel.org # v4.12+ Signed-off-by: Alexey Kardashevskiy <aik@ozlabs.ru> Reviewed-by: David Gibson <david@gibson.dropbear.id.au> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
2018-07-17 15:19:13 +08:00
unsigned int pageshift;
powerpc/mmu: Add userspace-to-physical addresses translation cache We are adding support for DMA memory pre-registration to be used in conjunction with VFIO. The idea is that the userspace which is going to run a guest may want to pre-register a user space memory region so it all gets pinned once and never goes away. Having this done, a hypervisor will not have to pin/unpin pages on every DMA map/unmap request. This is going to help with multiple pinning of the same memory. Another use of it is in-kernel real mode (mmu off) acceleration of DMA requests where real time translation of guest physical to host physical addresses is non-trivial and may fail as linux ptes may be temporarily invalid. Also, having cached host physical addresses (compared to just pinning at the start and then walking the page table again on every H_PUT_TCE), we can be sure that the addresses which we put into TCE table are the ones we already pinned. This adds a list of memory regions to mm_context_t. Each region consists of a header and a list of physical addresses. This adds API to: 1. register/unregister memory regions; 2. do final cleanup (which puts all pre-registered pages); 3. do userspace to physical address translation; 4. manage usage counters; multiple registration of the same memory is allowed (once per container). This implements 2 counters per registered memory region: - @mapped: incremented on every DMA mapping; decremented on unmapping; initialized to 1 when a region is just registered; once it becomes zero, no more mappings allowe; - @used: incremented on every "register" ioctl; decremented on "unregister"; unregistration is allowed for DMA mapped regions unless it is the very last reference. For the very last reference this checks that the region is still mapped and returns -EBUSY so the userspace gets to know that memory is still pinned and unregistration needs to be retried; @used remains 1. Host physical addresses are stored in vmalloc'ed array. In order to access these in the real mode (mmu off), there is a real_vmalloc_addr() helper. In-kernel acceleration patchset will move it from KVM to MMU code. Signed-off-by: Alexey Kardashevskiy <aik@ozlabs.ru> Reviewed-by: David Gibson <david@gibson.dropbear.id.au> Reviewed-by: David Gibson <david@gibson.dropbear.id.au> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
2015-06-05 14:35:24 +08:00
u64 ua; /* userspace address */
powerpc/mm/iommu: allow migration of cma allocated pages during mm_iommu_do_alloc The current code doesn't do page migration if the page allocated is a compound page. With HugeTLB migration support, we can end up allocating hugetlb pages from CMA region. Also, THP pages can be allocated from CMA region. This patch updates the code to handle compound pages correctly. The patch also switches to a single get_user_pages with the right count, instead of doing one get_user_pages per page. That avoids reading page table multiple times. This is done by using get_user_pages_longterm, because that also takes care of DAX backed pages. DAX pages lifetime is dictated by file system rules and as such, we need to make sure that we free these pages on operations like truncate and punch hole. If we have long term pin on these pages, which are mostly return to userspace with elevated page count, the entity holding the long term pin may not be aware of the fact that file got truncated and the file system blocks possibly got reused. That can result in corruption. The patch also converts the hpas member of mm_iommu_table_group_mem_t to a union. We use the same storage location to store pointers to struct page. We cannot update all the code path use struct page *, because we access hpas in real mode and we can't do that struct page * to pfn conversion in real mode. [aneesh.kumar@linux.ibm.com: address review feedback, update changelog] Link: http://lkml.kernel.org/r/20190227144736.5872-4-aneesh.kumar@linux.ibm.com Link: http://lkml.kernel.org/r/20190114095438.32470-5-aneesh.kumar@linux.ibm.com Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.ibm.com> Reviewed-by: Michael Ellerman <mpe@ellerman.id.au> Cc: Alexey Kardashevskiy <aik@ozlabs.ru> Cc: Andrea Arcangeli <aarcange@redhat.com> Cc: David Gibson <david@gibson.dropbear.id.au> Cc: Michal Hocko <mhocko@kernel.org> Cc: Mel Gorman <mgorman@techsingularity.net> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2019-03-06 07:47:47 +08:00
u64 entries; /* number of entries in hpas/hpages[] */
/*
* in mm_iommu_get we temporarily use this to store
* struct page address.
*
* We need to convert ua to hpa in real mode. Make it
* simpler by storing physical address.
*/
union {
struct page **hpages; /* vmalloc'ed */
phys_addr_t *hpas;
};
#define MM_IOMMU_TABLE_INVALID_HPA ((uint64_t)-1)
u64 dev_hpa; /* Device memory base address */
powerpc/mmu: Add userspace-to-physical addresses translation cache We are adding support for DMA memory pre-registration to be used in conjunction with VFIO. The idea is that the userspace which is going to run a guest may want to pre-register a user space memory region so it all gets pinned once and never goes away. Having this done, a hypervisor will not have to pin/unpin pages on every DMA map/unmap request. This is going to help with multiple pinning of the same memory. Another use of it is in-kernel real mode (mmu off) acceleration of DMA requests where real time translation of guest physical to host physical addresses is non-trivial and may fail as linux ptes may be temporarily invalid. Also, having cached host physical addresses (compared to just pinning at the start and then walking the page table again on every H_PUT_TCE), we can be sure that the addresses which we put into TCE table are the ones we already pinned. This adds a list of memory regions to mm_context_t. Each region consists of a header and a list of physical addresses. This adds API to: 1. register/unregister memory regions; 2. do final cleanup (which puts all pre-registered pages); 3. do userspace to physical address translation; 4. manage usage counters; multiple registration of the same memory is allowed (once per container). This implements 2 counters per registered memory region: - @mapped: incremented on every DMA mapping; decremented on unmapping; initialized to 1 when a region is just registered; once it becomes zero, no more mappings allowe; - @used: incremented on every "register" ioctl; decremented on "unregister"; unregistration is allowed for DMA mapped regions unless it is the very last reference. For the very last reference this checks that the region is still mapped and returns -EBUSY so the userspace gets to know that memory is still pinned and unregistration needs to be retried; @used remains 1. Host physical addresses are stored in vmalloc'ed array. In order to access these in the real mode (mmu off), there is a real_vmalloc_addr() helper. In-kernel acceleration patchset will move it from KVM to MMU code. Signed-off-by: Alexey Kardashevskiy <aik@ozlabs.ru> Reviewed-by: David Gibson <david@gibson.dropbear.id.au> Reviewed-by: David Gibson <david@gibson.dropbear.id.au> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
2015-06-05 14:35:24 +08:00
};
static long mm_iommu_adjust_locked_vm(struct mm_struct *mm,
unsigned long npages, bool incr)
{
long ret = 0, locked, lock_limit;
if (!npages)
return 0;
down_write(&mm->mmap_sem);
if (incr) {
locked = mm->locked_vm + npages;
lock_limit = rlimit(RLIMIT_MEMLOCK) >> PAGE_SHIFT;
if (locked > lock_limit && !capable(CAP_IPC_LOCK))
ret = -ENOMEM;
else
mm->locked_vm += npages;
} else {
if (WARN_ON_ONCE(npages > mm->locked_vm))
npages = mm->locked_vm;
mm->locked_vm -= npages;
}
pr_debug("[%d] RLIMIT_MEMLOCK HASH64 %c%ld %ld/%ld\n",
current ? current->pid : 0,
powerpc/mmu: Add userspace-to-physical addresses translation cache We are adding support for DMA memory pre-registration to be used in conjunction with VFIO. The idea is that the userspace which is going to run a guest may want to pre-register a user space memory region so it all gets pinned once and never goes away. Having this done, a hypervisor will not have to pin/unpin pages on every DMA map/unmap request. This is going to help with multiple pinning of the same memory. Another use of it is in-kernel real mode (mmu off) acceleration of DMA requests where real time translation of guest physical to host physical addresses is non-trivial and may fail as linux ptes may be temporarily invalid. Also, having cached host physical addresses (compared to just pinning at the start and then walking the page table again on every H_PUT_TCE), we can be sure that the addresses which we put into TCE table are the ones we already pinned. This adds a list of memory regions to mm_context_t. Each region consists of a header and a list of physical addresses. This adds API to: 1. register/unregister memory regions; 2. do final cleanup (which puts all pre-registered pages); 3. do userspace to physical address translation; 4. manage usage counters; multiple registration of the same memory is allowed (once per container). This implements 2 counters per registered memory region: - @mapped: incremented on every DMA mapping; decremented on unmapping; initialized to 1 when a region is just registered; once it becomes zero, no more mappings allowe; - @used: incremented on every "register" ioctl; decremented on "unregister"; unregistration is allowed for DMA mapped regions unless it is the very last reference. For the very last reference this checks that the region is still mapped and returns -EBUSY so the userspace gets to know that memory is still pinned and unregistration needs to be retried; @used remains 1. Host physical addresses are stored in vmalloc'ed array. In order to access these in the real mode (mmu off), there is a real_vmalloc_addr() helper. In-kernel acceleration patchset will move it from KVM to MMU code. Signed-off-by: Alexey Kardashevskiy <aik@ozlabs.ru> Reviewed-by: David Gibson <david@gibson.dropbear.id.au> Reviewed-by: David Gibson <david@gibson.dropbear.id.au> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
2015-06-05 14:35:24 +08:00
incr ? '+' : '-',
npages << PAGE_SHIFT,
mm->locked_vm << PAGE_SHIFT,
rlimit(RLIMIT_MEMLOCK));
up_write(&mm->mmap_sem);
return ret;
}
bool mm_iommu_preregistered(struct mm_struct *mm)
powerpc/mmu: Add userspace-to-physical addresses translation cache We are adding support for DMA memory pre-registration to be used in conjunction with VFIO. The idea is that the userspace which is going to run a guest may want to pre-register a user space memory region so it all gets pinned once and never goes away. Having this done, a hypervisor will not have to pin/unpin pages on every DMA map/unmap request. This is going to help with multiple pinning of the same memory. Another use of it is in-kernel real mode (mmu off) acceleration of DMA requests where real time translation of guest physical to host physical addresses is non-trivial and may fail as linux ptes may be temporarily invalid. Also, having cached host physical addresses (compared to just pinning at the start and then walking the page table again on every H_PUT_TCE), we can be sure that the addresses which we put into TCE table are the ones we already pinned. This adds a list of memory regions to mm_context_t. Each region consists of a header and a list of physical addresses. This adds API to: 1. register/unregister memory regions; 2. do final cleanup (which puts all pre-registered pages); 3. do userspace to physical address translation; 4. manage usage counters; multiple registration of the same memory is allowed (once per container). This implements 2 counters per registered memory region: - @mapped: incremented on every DMA mapping; decremented on unmapping; initialized to 1 when a region is just registered; once it becomes zero, no more mappings allowe; - @used: incremented on every "register" ioctl; decremented on "unregister"; unregistration is allowed for DMA mapped regions unless it is the very last reference. For the very last reference this checks that the region is still mapped and returns -EBUSY so the userspace gets to know that memory is still pinned and unregistration needs to be retried; @used remains 1. Host physical addresses are stored in vmalloc'ed array. In order to access these in the real mode (mmu off), there is a real_vmalloc_addr() helper. In-kernel acceleration patchset will move it from KVM to MMU code. Signed-off-by: Alexey Kardashevskiy <aik@ozlabs.ru> Reviewed-by: David Gibson <david@gibson.dropbear.id.au> Reviewed-by: David Gibson <david@gibson.dropbear.id.au> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
2015-06-05 14:35:24 +08:00
{
return !list_empty(&mm->context.iommu_group_mem_list);
powerpc/mmu: Add userspace-to-physical addresses translation cache We are adding support for DMA memory pre-registration to be used in conjunction with VFIO. The idea is that the userspace which is going to run a guest may want to pre-register a user space memory region so it all gets pinned once and never goes away. Having this done, a hypervisor will not have to pin/unpin pages on every DMA map/unmap request. This is going to help with multiple pinning of the same memory. Another use of it is in-kernel real mode (mmu off) acceleration of DMA requests where real time translation of guest physical to host physical addresses is non-trivial and may fail as linux ptes may be temporarily invalid. Also, having cached host physical addresses (compared to just pinning at the start and then walking the page table again on every H_PUT_TCE), we can be sure that the addresses which we put into TCE table are the ones we already pinned. This adds a list of memory regions to mm_context_t. Each region consists of a header and a list of physical addresses. This adds API to: 1. register/unregister memory regions; 2. do final cleanup (which puts all pre-registered pages); 3. do userspace to physical address translation; 4. manage usage counters; multiple registration of the same memory is allowed (once per container). This implements 2 counters per registered memory region: - @mapped: incremented on every DMA mapping; decremented on unmapping; initialized to 1 when a region is just registered; once it becomes zero, no more mappings allowe; - @used: incremented on every "register" ioctl; decremented on "unregister"; unregistration is allowed for DMA mapped regions unless it is the very last reference. For the very last reference this checks that the region is still mapped and returns -EBUSY so the userspace gets to know that memory is still pinned and unregistration needs to be retried; @used remains 1. Host physical addresses are stored in vmalloc'ed array. In order to access these in the real mode (mmu off), there is a real_vmalloc_addr() helper. In-kernel acceleration patchset will move it from KVM to MMU code. Signed-off-by: Alexey Kardashevskiy <aik@ozlabs.ru> Reviewed-by: David Gibson <david@gibson.dropbear.id.au> Reviewed-by: David Gibson <david@gibson.dropbear.id.au> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
2015-06-05 14:35:24 +08:00
}
EXPORT_SYMBOL_GPL(mm_iommu_preregistered);
static long mm_iommu_do_alloc(struct mm_struct *mm, unsigned long ua,
powerpc/mm/iommu: allow migration of cma allocated pages during mm_iommu_do_alloc The current code doesn't do page migration if the page allocated is a compound page. With HugeTLB migration support, we can end up allocating hugetlb pages from CMA region. Also, THP pages can be allocated from CMA region. This patch updates the code to handle compound pages correctly. The patch also switches to a single get_user_pages with the right count, instead of doing one get_user_pages per page. That avoids reading page table multiple times. This is done by using get_user_pages_longterm, because that also takes care of DAX backed pages. DAX pages lifetime is dictated by file system rules and as such, we need to make sure that we free these pages on operations like truncate and punch hole. If we have long term pin on these pages, which are mostly return to userspace with elevated page count, the entity holding the long term pin may not be aware of the fact that file got truncated and the file system blocks possibly got reused. That can result in corruption. The patch also converts the hpas member of mm_iommu_table_group_mem_t to a union. We use the same storage location to store pointers to struct page. We cannot update all the code path use struct page *, because we access hpas in real mode and we can't do that struct page * to pfn conversion in real mode. [aneesh.kumar@linux.ibm.com: address review feedback, update changelog] Link: http://lkml.kernel.org/r/20190227144736.5872-4-aneesh.kumar@linux.ibm.com Link: http://lkml.kernel.org/r/20190114095438.32470-5-aneesh.kumar@linux.ibm.com Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.ibm.com> Reviewed-by: Michael Ellerman <mpe@ellerman.id.au> Cc: Alexey Kardashevskiy <aik@ozlabs.ru> Cc: Andrea Arcangeli <aarcange@redhat.com> Cc: David Gibson <david@gibson.dropbear.id.au> Cc: Michal Hocko <mhocko@kernel.org> Cc: Mel Gorman <mgorman@techsingularity.net> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2019-03-06 07:47:47 +08:00
unsigned long entries, unsigned long dev_hpa,
struct mm_iommu_table_group_mem_t **pmem)
powerpc/mmu: Add userspace-to-physical addresses translation cache We are adding support for DMA memory pre-registration to be used in conjunction with VFIO. The idea is that the userspace which is going to run a guest may want to pre-register a user space memory region so it all gets pinned once and never goes away. Having this done, a hypervisor will not have to pin/unpin pages on every DMA map/unmap request. This is going to help with multiple pinning of the same memory. Another use of it is in-kernel real mode (mmu off) acceleration of DMA requests where real time translation of guest physical to host physical addresses is non-trivial and may fail as linux ptes may be temporarily invalid. Also, having cached host physical addresses (compared to just pinning at the start and then walking the page table again on every H_PUT_TCE), we can be sure that the addresses which we put into TCE table are the ones we already pinned. This adds a list of memory regions to mm_context_t. Each region consists of a header and a list of physical addresses. This adds API to: 1. register/unregister memory regions; 2. do final cleanup (which puts all pre-registered pages); 3. do userspace to physical address translation; 4. manage usage counters; multiple registration of the same memory is allowed (once per container). This implements 2 counters per registered memory region: - @mapped: incremented on every DMA mapping; decremented on unmapping; initialized to 1 when a region is just registered; once it becomes zero, no more mappings allowe; - @used: incremented on every "register" ioctl; decremented on "unregister"; unregistration is allowed for DMA mapped regions unless it is the very last reference. For the very last reference this checks that the region is still mapped and returns -EBUSY so the userspace gets to know that memory is still pinned and unregistration needs to be retried; @used remains 1. Host physical addresses are stored in vmalloc'ed array. In order to access these in the real mode (mmu off), there is a real_vmalloc_addr() helper. In-kernel acceleration patchset will move it from KVM to MMU code. Signed-off-by: Alexey Kardashevskiy <aik@ozlabs.ru> Reviewed-by: David Gibson <david@gibson.dropbear.id.au> Reviewed-by: David Gibson <david@gibson.dropbear.id.au> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
2015-06-05 14:35:24 +08:00
{
struct mm_iommu_table_group_mem_t *mem;
powerpc/mm/iommu: allow migration of cma allocated pages during mm_iommu_do_alloc The current code doesn't do page migration if the page allocated is a compound page. With HugeTLB migration support, we can end up allocating hugetlb pages from CMA region. Also, THP pages can be allocated from CMA region. This patch updates the code to handle compound pages correctly. The patch also switches to a single get_user_pages with the right count, instead of doing one get_user_pages per page. That avoids reading page table multiple times. This is done by using get_user_pages_longterm, because that also takes care of DAX backed pages. DAX pages lifetime is dictated by file system rules and as such, we need to make sure that we free these pages on operations like truncate and punch hole. If we have long term pin on these pages, which are mostly return to userspace with elevated page count, the entity holding the long term pin may not be aware of the fact that file got truncated and the file system blocks possibly got reused. That can result in corruption. The patch also converts the hpas member of mm_iommu_table_group_mem_t to a union. We use the same storage location to store pointers to struct page. We cannot update all the code path use struct page *, because we access hpas in real mode and we can't do that struct page * to pfn conversion in real mode. [aneesh.kumar@linux.ibm.com: address review feedback, update changelog] Link: http://lkml.kernel.org/r/20190227144736.5872-4-aneesh.kumar@linux.ibm.com Link: http://lkml.kernel.org/r/20190114095438.32470-5-aneesh.kumar@linux.ibm.com Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.ibm.com> Reviewed-by: Michael Ellerman <mpe@ellerman.id.au> Cc: Alexey Kardashevskiy <aik@ozlabs.ru> Cc: Andrea Arcangeli <aarcange@redhat.com> Cc: David Gibson <david@gibson.dropbear.id.au> Cc: Michal Hocko <mhocko@kernel.org> Cc: Mel Gorman <mgorman@techsingularity.net> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2019-03-06 07:47:47 +08:00
long i, ret, locked_entries = 0;
KVM: PPC: Check if IOMMU page is contained in the pinned physical page A VM which has: - a DMA capable device passed through to it (eg. network card); - running a malicious kernel that ignores H_PUT_TCE failure; - capability of using IOMMU pages bigger that physical pages can create an IOMMU mapping that exposes (for example) 16MB of the host physical memory to the device when only 64K was allocated to the VM. The remaining 16MB - 64K will be some other content of host memory, possibly including pages of the VM, but also pages of host kernel memory, host programs or other VMs. The attacking VM does not control the location of the page it can map, and is only allowed to map as many pages as it has pages of RAM. We already have a check in drivers/vfio/vfio_iommu_spapr_tce.c that an IOMMU page is contained in the physical page so the PCI hardware won't get access to unassigned host memory; however this check is missing in the KVM fastpath (H_PUT_TCE accelerated code). We were lucky so far and did not hit this yet as the very first time when the mapping happens we do not have tbl::it_userspace allocated yet and fall back to the userspace which in turn calls VFIO IOMMU driver, this fails and the guest does not retry, This stores the smallest preregistered page size in the preregistered region descriptor and changes the mm_iommu_xxx API to check this against the IOMMU page size. This calculates maximum page size as a minimum of the natural region alignment and compound page size. For the page shift this uses the shift returned by find_linux_pte() which indicates how the page is mapped to the current userspace - if the page is huge and this is not a zero, then it is a leaf pte and the page is mapped within the range. Fixes: 121f80ba68f1 ("KVM: PPC: VFIO: Add in-kernel acceleration for VFIO") Cc: stable@vger.kernel.org # v4.12+ Signed-off-by: Alexey Kardashevskiy <aik@ozlabs.ru> Reviewed-by: David Gibson <david@gibson.dropbear.id.au> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
2018-07-17 15:19:13 +08:00
unsigned int pageshift;
powerpc/mmu: Add userspace-to-physical addresses translation cache We are adding support for DMA memory pre-registration to be used in conjunction with VFIO. The idea is that the userspace which is going to run a guest may want to pre-register a user space memory region so it all gets pinned once and never goes away. Having this done, a hypervisor will not have to pin/unpin pages on every DMA map/unmap request. This is going to help with multiple pinning of the same memory. Another use of it is in-kernel real mode (mmu off) acceleration of DMA requests where real time translation of guest physical to host physical addresses is non-trivial and may fail as linux ptes may be temporarily invalid. Also, having cached host physical addresses (compared to just pinning at the start and then walking the page table again on every H_PUT_TCE), we can be sure that the addresses which we put into TCE table are the ones we already pinned. This adds a list of memory regions to mm_context_t. Each region consists of a header and a list of physical addresses. This adds API to: 1. register/unregister memory regions; 2. do final cleanup (which puts all pre-registered pages); 3. do userspace to physical address translation; 4. manage usage counters; multiple registration of the same memory is allowed (once per container). This implements 2 counters per registered memory region: - @mapped: incremented on every DMA mapping; decremented on unmapping; initialized to 1 when a region is just registered; once it becomes zero, no more mappings allowe; - @used: incremented on every "register" ioctl; decremented on "unregister"; unregistration is allowed for DMA mapped regions unless it is the very last reference. For the very last reference this checks that the region is still mapped and returns -EBUSY so the userspace gets to know that memory is still pinned and unregistration needs to be retried; @used remains 1. Host physical addresses are stored in vmalloc'ed array. In order to access these in the real mode (mmu off), there is a real_vmalloc_addr() helper. In-kernel acceleration patchset will move it from KVM to MMU code. Signed-off-by: Alexey Kardashevskiy <aik@ozlabs.ru> Reviewed-by: David Gibson <david@gibson.dropbear.id.au> Reviewed-by: David Gibson <david@gibson.dropbear.id.au> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
2015-06-05 14:35:24 +08:00
mutex_lock(&mem_list_mutex);
list_for_each_entry_rcu(mem, &mm->context.iommu_group_mem_list,
powerpc/mmu: Add userspace-to-physical addresses translation cache We are adding support for DMA memory pre-registration to be used in conjunction with VFIO. The idea is that the userspace which is going to run a guest may want to pre-register a user space memory region so it all gets pinned once and never goes away. Having this done, a hypervisor will not have to pin/unpin pages on every DMA map/unmap request. This is going to help with multiple pinning of the same memory. Another use of it is in-kernel real mode (mmu off) acceleration of DMA requests where real time translation of guest physical to host physical addresses is non-trivial and may fail as linux ptes may be temporarily invalid. Also, having cached host physical addresses (compared to just pinning at the start and then walking the page table again on every H_PUT_TCE), we can be sure that the addresses which we put into TCE table are the ones we already pinned. This adds a list of memory regions to mm_context_t. Each region consists of a header and a list of physical addresses. This adds API to: 1. register/unregister memory regions; 2. do final cleanup (which puts all pre-registered pages); 3. do userspace to physical address translation; 4. manage usage counters; multiple registration of the same memory is allowed (once per container). This implements 2 counters per registered memory region: - @mapped: incremented on every DMA mapping; decremented on unmapping; initialized to 1 when a region is just registered; once it becomes zero, no more mappings allowe; - @used: incremented on every "register" ioctl; decremented on "unregister"; unregistration is allowed for DMA mapped regions unless it is the very last reference. For the very last reference this checks that the region is still mapped and returns -EBUSY so the userspace gets to know that memory is still pinned and unregistration needs to be retried; @used remains 1. Host physical addresses are stored in vmalloc'ed array. In order to access these in the real mode (mmu off), there is a real_vmalloc_addr() helper. In-kernel acceleration patchset will move it from KVM to MMU code. Signed-off-by: Alexey Kardashevskiy <aik@ozlabs.ru> Reviewed-by: David Gibson <david@gibson.dropbear.id.au> Reviewed-by: David Gibson <david@gibson.dropbear.id.au> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
2015-06-05 14:35:24 +08:00
next) {
/* Overlap? */
if ((mem->ua < (ua + (entries << PAGE_SHIFT))) &&
(ua < (mem->ua +
(mem->entries << PAGE_SHIFT)))) {
ret = -EINVAL;
goto unlock_exit;
}
}
if (dev_hpa == MM_IOMMU_TABLE_INVALID_HPA) {
ret = mm_iommu_adjust_locked_vm(mm, entries, true);
if (ret)
goto unlock_exit;
powerpc/mmu: Add userspace-to-physical addresses translation cache We are adding support for DMA memory pre-registration to be used in conjunction with VFIO. The idea is that the userspace which is going to run a guest may want to pre-register a user space memory region so it all gets pinned once and never goes away. Having this done, a hypervisor will not have to pin/unpin pages on every DMA map/unmap request. This is going to help with multiple pinning of the same memory. Another use of it is in-kernel real mode (mmu off) acceleration of DMA requests where real time translation of guest physical to host physical addresses is non-trivial and may fail as linux ptes may be temporarily invalid. Also, having cached host physical addresses (compared to just pinning at the start and then walking the page table again on every H_PUT_TCE), we can be sure that the addresses which we put into TCE table are the ones we already pinned. This adds a list of memory regions to mm_context_t. Each region consists of a header and a list of physical addresses. This adds API to: 1. register/unregister memory regions; 2. do final cleanup (which puts all pre-registered pages); 3. do userspace to physical address translation; 4. manage usage counters; multiple registration of the same memory is allowed (once per container). This implements 2 counters per registered memory region: - @mapped: incremented on every DMA mapping; decremented on unmapping; initialized to 1 when a region is just registered; once it becomes zero, no more mappings allowe; - @used: incremented on every "register" ioctl; decremented on "unregister"; unregistration is allowed for DMA mapped regions unless it is the very last reference. For the very last reference this checks that the region is still mapped and returns -EBUSY so the userspace gets to know that memory is still pinned and unregistration needs to be retried; @used remains 1. Host physical addresses are stored in vmalloc'ed array. In order to access these in the real mode (mmu off), there is a real_vmalloc_addr() helper. In-kernel acceleration patchset will move it from KVM to MMU code. Signed-off-by: Alexey Kardashevskiy <aik@ozlabs.ru> Reviewed-by: David Gibson <david@gibson.dropbear.id.au> Reviewed-by: David Gibson <david@gibson.dropbear.id.au> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
2015-06-05 14:35:24 +08:00
locked_entries = entries;
}
powerpc/mmu: Add userspace-to-physical addresses translation cache We are adding support for DMA memory pre-registration to be used in conjunction with VFIO. The idea is that the userspace which is going to run a guest may want to pre-register a user space memory region so it all gets pinned once and never goes away. Having this done, a hypervisor will not have to pin/unpin pages on every DMA map/unmap request. This is going to help with multiple pinning of the same memory. Another use of it is in-kernel real mode (mmu off) acceleration of DMA requests where real time translation of guest physical to host physical addresses is non-trivial and may fail as linux ptes may be temporarily invalid. Also, having cached host physical addresses (compared to just pinning at the start and then walking the page table again on every H_PUT_TCE), we can be sure that the addresses which we put into TCE table are the ones we already pinned. This adds a list of memory regions to mm_context_t. Each region consists of a header and a list of physical addresses. This adds API to: 1. register/unregister memory regions; 2. do final cleanup (which puts all pre-registered pages); 3. do userspace to physical address translation; 4. manage usage counters; multiple registration of the same memory is allowed (once per container). This implements 2 counters per registered memory region: - @mapped: incremented on every DMA mapping; decremented on unmapping; initialized to 1 when a region is just registered; once it becomes zero, no more mappings allowe; - @used: incremented on every "register" ioctl; decremented on "unregister"; unregistration is allowed for DMA mapped regions unless it is the very last reference. For the very last reference this checks that the region is still mapped and returns -EBUSY so the userspace gets to know that memory is still pinned and unregistration needs to be retried; @used remains 1. Host physical addresses are stored in vmalloc'ed array. In order to access these in the real mode (mmu off), there is a real_vmalloc_addr() helper. In-kernel acceleration patchset will move it from KVM to MMU code. Signed-off-by: Alexey Kardashevskiy <aik@ozlabs.ru> Reviewed-by: David Gibson <david@gibson.dropbear.id.au> Reviewed-by: David Gibson <david@gibson.dropbear.id.au> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
2015-06-05 14:35:24 +08:00
mem = kzalloc(sizeof(*mem), GFP_KERNEL);
if (!mem) {
ret = -ENOMEM;
goto unlock_exit;
}
if (dev_hpa != MM_IOMMU_TABLE_INVALID_HPA) {
mem->pageshift = __ffs(dev_hpa | (entries << PAGE_SHIFT));
mem->dev_hpa = dev_hpa;
goto good_exit;
}
mem->dev_hpa = MM_IOMMU_TABLE_INVALID_HPA;
KVM: PPC: Check if IOMMU page is contained in the pinned physical page A VM which has: - a DMA capable device passed through to it (eg. network card); - running a malicious kernel that ignores H_PUT_TCE failure; - capability of using IOMMU pages bigger that physical pages can create an IOMMU mapping that exposes (for example) 16MB of the host physical memory to the device when only 64K was allocated to the VM. The remaining 16MB - 64K will be some other content of host memory, possibly including pages of the VM, but also pages of host kernel memory, host programs or other VMs. The attacking VM does not control the location of the page it can map, and is only allowed to map as many pages as it has pages of RAM. We already have a check in drivers/vfio/vfio_iommu_spapr_tce.c that an IOMMU page is contained in the physical page so the PCI hardware won't get access to unassigned host memory; however this check is missing in the KVM fastpath (H_PUT_TCE accelerated code). We were lucky so far and did not hit this yet as the very first time when the mapping happens we do not have tbl::it_userspace allocated yet and fall back to the userspace which in turn calls VFIO IOMMU driver, this fails and the guest does not retry, This stores the smallest preregistered page size in the preregistered region descriptor and changes the mm_iommu_xxx API to check this against the IOMMU page size. This calculates maximum page size as a minimum of the natural region alignment and compound page size. For the page shift this uses the shift returned by find_linux_pte() which indicates how the page is mapped to the current userspace - if the page is huge and this is not a zero, then it is a leaf pte and the page is mapped within the range. Fixes: 121f80ba68f1 ("KVM: PPC: VFIO: Add in-kernel acceleration for VFIO") Cc: stable@vger.kernel.org # v4.12+ Signed-off-by: Alexey Kardashevskiy <aik@ozlabs.ru> Reviewed-by: David Gibson <david@gibson.dropbear.id.au> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
2018-07-17 15:19:13 +08:00
/*
* For a starting point for a maximum page size calculation
* we use @ua and @entries natural alignment to allow IOMMU pages
* smaller than huge pages but still bigger than PAGE_SIZE.
*/
mem->pageshift = __ffs(ua | (entries << PAGE_SHIFT));
treewide: Use array_size() in vzalloc() The vzalloc() function has no 2-factor argument form, so multiplication factors need to be wrapped in array_size(). This patch replaces cases of: vzalloc(a * b) with: vzalloc(array_size(a, b)) as well as handling cases of: vzalloc(a * b * c) with: vzalloc(array3_size(a, b, c)) This does, however, attempt to ignore constant size factors like: vzalloc(4 * 1024) though any constants defined via macros get caught up in the conversion. Any factors with a sizeof() of "unsigned char", "char", and "u8" were dropped, since they're redundant. The Coccinelle script used for this was: // Fix redundant parens around sizeof(). @@ type TYPE; expression THING, E; @@ ( vzalloc( - (sizeof(TYPE)) * E + sizeof(TYPE) * E , ...) | vzalloc( - (sizeof(THING)) * E + sizeof(THING) * E , ...) ) // Drop single-byte sizes and redundant parens. @@ expression COUNT; typedef u8; typedef __u8; @@ ( vzalloc( - sizeof(u8) * (COUNT) + COUNT , ...) | vzalloc( - sizeof(__u8) * (COUNT) + COUNT , ...) | vzalloc( - sizeof(char) * (COUNT) + COUNT , ...) | vzalloc( - sizeof(unsigned char) * (COUNT) + COUNT , ...) | vzalloc( - sizeof(u8) * COUNT + COUNT , ...) | vzalloc( - sizeof(__u8) * COUNT + COUNT , ...) | vzalloc( - sizeof(char) * COUNT + COUNT , ...) | vzalloc( - sizeof(unsigned char) * COUNT + COUNT , ...) ) // 2-factor product with sizeof(type/expression) and identifier or constant. @@ type TYPE; expression THING; identifier COUNT_ID; constant COUNT_CONST; @@ ( vzalloc( - sizeof(TYPE) * (COUNT_ID) + array_size(COUNT_ID, sizeof(TYPE)) , ...) | vzalloc( - sizeof(TYPE) * COUNT_ID + array_size(COUNT_ID, sizeof(TYPE)) , ...) | vzalloc( - sizeof(TYPE) * (COUNT_CONST) + array_size(COUNT_CONST, sizeof(TYPE)) , ...) | vzalloc( - sizeof(TYPE) * COUNT_CONST + array_size(COUNT_CONST, sizeof(TYPE)) , ...) | vzalloc( - sizeof(THING) * (COUNT_ID) + array_size(COUNT_ID, sizeof(THING)) , ...) | vzalloc( - sizeof(THING) * COUNT_ID + array_size(COUNT_ID, sizeof(THING)) , ...) | vzalloc( - sizeof(THING) * (COUNT_CONST) + array_size(COUNT_CONST, sizeof(THING)) , ...) | vzalloc( - sizeof(THING) * COUNT_CONST + array_size(COUNT_CONST, sizeof(THING)) , ...) ) // 2-factor product, only identifiers. @@ identifier SIZE, COUNT; @@ vzalloc( - SIZE * COUNT + array_size(COUNT, SIZE) , ...) // 3-factor product with 1 sizeof(type) or sizeof(expression), with // redundant parens removed. @@ expression THING; identifier STRIDE, COUNT; type TYPE; @@ ( vzalloc( - sizeof(TYPE) * (COUNT) * (STRIDE) + array3_size(COUNT, STRIDE, sizeof(TYPE)) , ...) | vzalloc( - sizeof(TYPE) * (COUNT) * STRIDE + array3_size(COUNT, STRIDE, sizeof(TYPE)) , ...) | vzalloc( - sizeof(TYPE) * COUNT * (STRIDE) + array3_size(COUNT, STRIDE, sizeof(TYPE)) , ...) | vzalloc( - sizeof(TYPE) * COUNT * STRIDE + array3_size(COUNT, STRIDE, sizeof(TYPE)) , ...) | vzalloc( - sizeof(THING) * (COUNT) * (STRIDE) + array3_size(COUNT, STRIDE, sizeof(THING)) , ...) | vzalloc( - sizeof(THING) * (COUNT) * STRIDE + array3_size(COUNT, STRIDE, sizeof(THING)) , ...) | vzalloc( - sizeof(THING) * COUNT * (STRIDE) + array3_size(COUNT, STRIDE, sizeof(THING)) , ...) | vzalloc( - sizeof(THING) * COUNT * STRIDE + array3_size(COUNT, STRIDE, sizeof(THING)) , ...) ) // 3-factor product with 2 sizeof(variable), with redundant parens removed. @@ expression THING1, THING2; identifier COUNT; type TYPE1, TYPE2; @@ ( vzalloc( - sizeof(TYPE1) * sizeof(TYPE2) * COUNT + array3_size(COUNT, sizeof(TYPE1), sizeof(TYPE2)) , ...) | vzalloc( - sizeof(TYPE1) * sizeof(THING2) * (COUNT) + array3_size(COUNT, sizeof(TYPE1), sizeof(TYPE2)) , ...) | vzalloc( - sizeof(THING1) * sizeof(THING2) * COUNT + array3_size(COUNT, sizeof(THING1), sizeof(THING2)) , ...) | vzalloc( - sizeof(THING1) * sizeof(THING2) * (COUNT) + array3_size(COUNT, sizeof(THING1), sizeof(THING2)) , ...) | vzalloc( - sizeof(TYPE1) * sizeof(THING2) * COUNT + array3_size(COUNT, sizeof(TYPE1), sizeof(THING2)) , ...) | vzalloc( - sizeof(TYPE1) * sizeof(THING2) * (COUNT) + array3_size(COUNT, sizeof(TYPE1), sizeof(THING2)) , ...) ) // 3-factor product, only identifiers, with redundant parens removed. @@ identifier STRIDE, SIZE, COUNT; @@ ( vzalloc( - (COUNT) * STRIDE * SIZE + array3_size(COUNT, STRIDE, SIZE) , ...) | vzalloc( - COUNT * (STRIDE) * SIZE + array3_size(COUNT, STRIDE, SIZE) , ...) | vzalloc( - COUNT * STRIDE * (SIZE) + array3_size(COUNT, STRIDE, SIZE) , ...) | vzalloc( - (COUNT) * (STRIDE) * SIZE + array3_size(COUNT, STRIDE, SIZE) , ...) | vzalloc( - COUNT * (STRIDE) * (SIZE) + array3_size(COUNT, STRIDE, SIZE) , ...) | vzalloc( - (COUNT) * STRIDE * (SIZE) + array3_size(COUNT, STRIDE, SIZE) , ...) | vzalloc( - (COUNT) * (STRIDE) * (SIZE) + array3_size(COUNT, STRIDE, SIZE) , ...) | vzalloc( - COUNT * STRIDE * SIZE + array3_size(COUNT, STRIDE, SIZE) , ...) ) // Any remaining multi-factor products, first at least 3-factor products // when they're not all constants... @@ expression E1, E2, E3; constant C1, C2, C3; @@ ( vzalloc(C1 * C2 * C3, ...) | vzalloc( - E1 * E2 * E3 + array3_size(E1, E2, E3) , ...) ) // And then all remaining 2 factors products when they're not all constants. @@ expression E1, E2; constant C1, C2; @@ ( vzalloc(C1 * C2, ...) | vzalloc( - E1 * E2 + array_size(E1, E2) , ...) ) Signed-off-by: Kees Cook <keescook@chromium.org>
2018-06-13 05:27:37 +08:00
mem->hpas = vzalloc(array_size(entries, sizeof(mem->hpas[0])));
powerpc/mmu: Add userspace-to-physical addresses translation cache We are adding support for DMA memory pre-registration to be used in conjunction with VFIO. The idea is that the userspace which is going to run a guest may want to pre-register a user space memory region so it all gets pinned once and never goes away. Having this done, a hypervisor will not have to pin/unpin pages on every DMA map/unmap request. This is going to help with multiple pinning of the same memory. Another use of it is in-kernel real mode (mmu off) acceleration of DMA requests where real time translation of guest physical to host physical addresses is non-trivial and may fail as linux ptes may be temporarily invalid. Also, having cached host physical addresses (compared to just pinning at the start and then walking the page table again on every H_PUT_TCE), we can be sure that the addresses which we put into TCE table are the ones we already pinned. This adds a list of memory regions to mm_context_t. Each region consists of a header and a list of physical addresses. This adds API to: 1. register/unregister memory regions; 2. do final cleanup (which puts all pre-registered pages); 3. do userspace to physical address translation; 4. manage usage counters; multiple registration of the same memory is allowed (once per container). This implements 2 counters per registered memory region: - @mapped: incremented on every DMA mapping; decremented on unmapping; initialized to 1 when a region is just registered; once it becomes zero, no more mappings allowe; - @used: incremented on every "register" ioctl; decremented on "unregister"; unregistration is allowed for DMA mapped regions unless it is the very last reference. For the very last reference this checks that the region is still mapped and returns -EBUSY so the userspace gets to know that memory is still pinned and unregistration needs to be retried; @used remains 1. Host physical addresses are stored in vmalloc'ed array. In order to access these in the real mode (mmu off), there is a real_vmalloc_addr() helper. In-kernel acceleration patchset will move it from KVM to MMU code. Signed-off-by: Alexey Kardashevskiy <aik@ozlabs.ru> Reviewed-by: David Gibson <david@gibson.dropbear.id.au> Reviewed-by: David Gibson <david@gibson.dropbear.id.au> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
2015-06-05 14:35:24 +08:00
if (!mem->hpas) {
kfree(mem);
ret = -ENOMEM;
goto unlock_exit;
}
powerpc/mm/iommu: allow migration of cma allocated pages during mm_iommu_do_alloc The current code doesn't do page migration if the page allocated is a compound page. With HugeTLB migration support, we can end up allocating hugetlb pages from CMA region. Also, THP pages can be allocated from CMA region. This patch updates the code to handle compound pages correctly. The patch also switches to a single get_user_pages with the right count, instead of doing one get_user_pages per page. That avoids reading page table multiple times. This is done by using get_user_pages_longterm, because that also takes care of DAX backed pages. DAX pages lifetime is dictated by file system rules and as such, we need to make sure that we free these pages on operations like truncate and punch hole. If we have long term pin on these pages, which are mostly return to userspace with elevated page count, the entity holding the long term pin may not be aware of the fact that file got truncated and the file system blocks possibly got reused. That can result in corruption. The patch also converts the hpas member of mm_iommu_table_group_mem_t to a union. We use the same storage location to store pointers to struct page. We cannot update all the code path use struct page *, because we access hpas in real mode and we can't do that struct page * to pfn conversion in real mode. [aneesh.kumar@linux.ibm.com: address review feedback, update changelog] Link: http://lkml.kernel.org/r/20190227144736.5872-4-aneesh.kumar@linux.ibm.com Link: http://lkml.kernel.org/r/20190114095438.32470-5-aneesh.kumar@linux.ibm.com Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.ibm.com> Reviewed-by: Michael Ellerman <mpe@ellerman.id.au> Cc: Alexey Kardashevskiy <aik@ozlabs.ru> Cc: Andrea Arcangeli <aarcange@redhat.com> Cc: David Gibson <david@gibson.dropbear.id.au> Cc: Michal Hocko <mhocko@kernel.org> Cc: Mel Gorman <mgorman@techsingularity.net> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2019-03-06 07:47:47 +08:00
down_read(&mm->mmap_sem);
ret = get_user_pages_longterm(ua, entries, FOLL_WRITE, mem->hpages, NULL);
up_read(&mm->mmap_sem);
if (ret != entries) {
/* free the reference taken */
for (i = 0; i < ret; i++)
put_page(mem->hpages[i]);
vfree(mem->hpas);
kfree(mem);
ret = -EFAULT;
goto unlock_exit;
}
pageshift = PAGE_SHIFT;
powerpc/mmu: Add userspace-to-physical addresses translation cache We are adding support for DMA memory pre-registration to be used in conjunction with VFIO. The idea is that the userspace which is going to run a guest may want to pre-register a user space memory region so it all gets pinned once and never goes away. Having this done, a hypervisor will not have to pin/unpin pages on every DMA map/unmap request. This is going to help with multiple pinning of the same memory. Another use of it is in-kernel real mode (mmu off) acceleration of DMA requests where real time translation of guest physical to host physical addresses is non-trivial and may fail as linux ptes may be temporarily invalid. Also, having cached host physical addresses (compared to just pinning at the start and then walking the page table again on every H_PUT_TCE), we can be sure that the addresses which we put into TCE table are the ones we already pinned. This adds a list of memory regions to mm_context_t. Each region consists of a header and a list of physical addresses. This adds API to: 1. register/unregister memory regions; 2. do final cleanup (which puts all pre-registered pages); 3. do userspace to physical address translation; 4. manage usage counters; multiple registration of the same memory is allowed (once per container). This implements 2 counters per registered memory region: - @mapped: incremented on every DMA mapping; decremented on unmapping; initialized to 1 when a region is just registered; once it becomes zero, no more mappings allowe; - @used: incremented on every "register" ioctl; decremented on "unregister"; unregistration is allowed for DMA mapped regions unless it is the very last reference. For the very last reference this checks that the region is still mapped and returns -EBUSY so the userspace gets to know that memory is still pinned and unregistration needs to be retried; @used remains 1. Host physical addresses are stored in vmalloc'ed array. In order to access these in the real mode (mmu off), there is a real_vmalloc_addr() helper. In-kernel acceleration patchset will move it from KVM to MMU code. Signed-off-by: Alexey Kardashevskiy <aik@ozlabs.ru> Reviewed-by: David Gibson <david@gibson.dropbear.id.au> Reviewed-by: David Gibson <david@gibson.dropbear.id.au> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
2015-06-05 14:35:24 +08:00
for (i = 0; i < entries; ++i) {
powerpc/mm/iommu: allow migration of cma allocated pages during mm_iommu_do_alloc The current code doesn't do page migration if the page allocated is a compound page. With HugeTLB migration support, we can end up allocating hugetlb pages from CMA region. Also, THP pages can be allocated from CMA region. This patch updates the code to handle compound pages correctly. The patch also switches to a single get_user_pages with the right count, instead of doing one get_user_pages per page. That avoids reading page table multiple times. This is done by using get_user_pages_longterm, because that also takes care of DAX backed pages. DAX pages lifetime is dictated by file system rules and as such, we need to make sure that we free these pages on operations like truncate and punch hole. If we have long term pin on these pages, which are mostly return to userspace with elevated page count, the entity holding the long term pin may not be aware of the fact that file got truncated and the file system blocks possibly got reused. That can result in corruption. The patch also converts the hpas member of mm_iommu_table_group_mem_t to a union. We use the same storage location to store pointers to struct page. We cannot update all the code path use struct page *, because we access hpas in real mode and we can't do that struct page * to pfn conversion in real mode. [aneesh.kumar@linux.ibm.com: address review feedback, update changelog] Link: http://lkml.kernel.org/r/20190227144736.5872-4-aneesh.kumar@linux.ibm.com Link: http://lkml.kernel.org/r/20190114095438.32470-5-aneesh.kumar@linux.ibm.com Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.ibm.com> Reviewed-by: Michael Ellerman <mpe@ellerman.id.au> Cc: Alexey Kardashevskiy <aik@ozlabs.ru> Cc: Andrea Arcangeli <aarcange@redhat.com> Cc: David Gibson <david@gibson.dropbear.id.au> Cc: Michal Hocko <mhocko@kernel.org> Cc: Mel Gorman <mgorman@techsingularity.net> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2019-03-06 07:47:47 +08:00
struct page *page = mem->hpages[i];
/*
* Allow to use larger than 64k IOMMU pages. Only do that
* if we are backed by hugetlb.
*/
if ((mem->pageshift > PAGE_SHIFT) && PageHuge(page)) {
KVM: PPC: Check if IOMMU page is contained in the pinned physical page A VM which has: - a DMA capable device passed through to it (eg. network card); - running a malicious kernel that ignores H_PUT_TCE failure; - capability of using IOMMU pages bigger that physical pages can create an IOMMU mapping that exposes (for example) 16MB of the host physical memory to the device when only 64K was allocated to the VM. The remaining 16MB - 64K will be some other content of host memory, possibly including pages of the VM, but also pages of host kernel memory, host programs or other VMs. The attacking VM does not control the location of the page it can map, and is only allowed to map as many pages as it has pages of RAM. We already have a check in drivers/vfio/vfio_iommu_spapr_tce.c that an IOMMU page is contained in the physical page so the PCI hardware won't get access to unassigned host memory; however this check is missing in the KVM fastpath (H_PUT_TCE accelerated code). We were lucky so far and did not hit this yet as the very first time when the mapping happens we do not have tbl::it_userspace allocated yet and fall back to the userspace which in turn calls VFIO IOMMU driver, this fails and the guest does not retry, This stores the smallest preregistered page size in the preregistered region descriptor and changes the mm_iommu_xxx API to check this against the IOMMU page size. This calculates maximum page size as a minimum of the natural region alignment and compound page size. For the page shift this uses the shift returned by find_linux_pte() which indicates how the page is mapped to the current userspace - if the page is huge and this is not a zero, then it is a leaf pte and the page is mapped within the range. Fixes: 121f80ba68f1 ("KVM: PPC: VFIO: Add in-kernel acceleration for VFIO") Cc: stable@vger.kernel.org # v4.12+ Signed-off-by: Alexey Kardashevskiy <aik@ozlabs.ru> Reviewed-by: David Gibson <david@gibson.dropbear.id.au> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
2018-07-17 15:19:13 +08:00
struct page *head = compound_head(page);
pageshift = compound_order(head) + PAGE_SHIFT;
KVM: PPC: Check if IOMMU page is contained in the pinned physical page A VM which has: - a DMA capable device passed through to it (eg. network card); - running a malicious kernel that ignores H_PUT_TCE failure; - capability of using IOMMU pages bigger that physical pages can create an IOMMU mapping that exposes (for example) 16MB of the host physical memory to the device when only 64K was allocated to the VM. The remaining 16MB - 64K will be some other content of host memory, possibly including pages of the VM, but also pages of host kernel memory, host programs or other VMs. The attacking VM does not control the location of the page it can map, and is only allowed to map as many pages as it has pages of RAM. We already have a check in drivers/vfio/vfio_iommu_spapr_tce.c that an IOMMU page is contained in the physical page so the PCI hardware won't get access to unassigned host memory; however this check is missing in the KVM fastpath (H_PUT_TCE accelerated code). We were lucky so far and did not hit this yet as the very first time when the mapping happens we do not have tbl::it_userspace allocated yet and fall back to the userspace which in turn calls VFIO IOMMU driver, this fails and the guest does not retry, This stores the smallest preregistered page size in the preregistered region descriptor and changes the mm_iommu_xxx API to check this against the IOMMU page size. This calculates maximum page size as a minimum of the natural region alignment and compound page size. For the page shift this uses the shift returned by find_linux_pte() which indicates how the page is mapped to the current userspace - if the page is huge and this is not a zero, then it is a leaf pte and the page is mapped within the range. Fixes: 121f80ba68f1 ("KVM: PPC: VFIO: Add in-kernel acceleration for VFIO") Cc: stable@vger.kernel.org # v4.12+ Signed-off-by: Alexey Kardashevskiy <aik@ozlabs.ru> Reviewed-by: David Gibson <david@gibson.dropbear.id.au> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
2018-07-17 15:19:13 +08:00
}
mem->pageshift = min(mem->pageshift, pageshift);
powerpc/mm/iommu: allow migration of cma allocated pages during mm_iommu_do_alloc The current code doesn't do page migration if the page allocated is a compound page. With HugeTLB migration support, we can end up allocating hugetlb pages from CMA region. Also, THP pages can be allocated from CMA region. This patch updates the code to handle compound pages correctly. The patch also switches to a single get_user_pages with the right count, instead of doing one get_user_pages per page. That avoids reading page table multiple times. This is done by using get_user_pages_longterm, because that also takes care of DAX backed pages. DAX pages lifetime is dictated by file system rules and as such, we need to make sure that we free these pages on operations like truncate and punch hole. If we have long term pin on these pages, which are mostly return to userspace with elevated page count, the entity holding the long term pin may not be aware of the fact that file got truncated and the file system blocks possibly got reused. That can result in corruption. The patch also converts the hpas member of mm_iommu_table_group_mem_t to a union. We use the same storage location to store pointers to struct page. We cannot update all the code path use struct page *, because we access hpas in real mode and we can't do that struct page * to pfn conversion in real mode. [aneesh.kumar@linux.ibm.com: address review feedback, update changelog] Link: http://lkml.kernel.org/r/20190227144736.5872-4-aneesh.kumar@linux.ibm.com Link: http://lkml.kernel.org/r/20190114095438.32470-5-aneesh.kumar@linux.ibm.com Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.ibm.com> Reviewed-by: Michael Ellerman <mpe@ellerman.id.au> Cc: Alexey Kardashevskiy <aik@ozlabs.ru> Cc: Andrea Arcangeli <aarcange@redhat.com> Cc: David Gibson <david@gibson.dropbear.id.au> Cc: Michal Hocko <mhocko@kernel.org> Cc: Mel Gorman <mgorman@techsingularity.net> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2019-03-06 07:47:47 +08:00
/*
* We don't need struct page reference any more, switch
* to physical address.
*/
powerpc/mmu: Add userspace-to-physical addresses translation cache We are adding support for DMA memory pre-registration to be used in conjunction with VFIO. The idea is that the userspace which is going to run a guest may want to pre-register a user space memory region so it all gets pinned once and never goes away. Having this done, a hypervisor will not have to pin/unpin pages on every DMA map/unmap request. This is going to help with multiple pinning of the same memory. Another use of it is in-kernel real mode (mmu off) acceleration of DMA requests where real time translation of guest physical to host physical addresses is non-trivial and may fail as linux ptes may be temporarily invalid. Also, having cached host physical addresses (compared to just pinning at the start and then walking the page table again on every H_PUT_TCE), we can be sure that the addresses which we put into TCE table are the ones we already pinned. This adds a list of memory regions to mm_context_t. Each region consists of a header and a list of physical addresses. This adds API to: 1. register/unregister memory regions; 2. do final cleanup (which puts all pre-registered pages); 3. do userspace to physical address translation; 4. manage usage counters; multiple registration of the same memory is allowed (once per container). This implements 2 counters per registered memory region: - @mapped: incremented on every DMA mapping; decremented on unmapping; initialized to 1 when a region is just registered; once it becomes zero, no more mappings allowe; - @used: incremented on every "register" ioctl; decremented on "unregister"; unregistration is allowed for DMA mapped regions unless it is the very last reference. For the very last reference this checks that the region is still mapped and returns -EBUSY so the userspace gets to know that memory is still pinned and unregistration needs to be retried; @used remains 1. Host physical addresses are stored in vmalloc'ed array. In order to access these in the real mode (mmu off), there is a real_vmalloc_addr() helper. In-kernel acceleration patchset will move it from KVM to MMU code. Signed-off-by: Alexey Kardashevskiy <aik@ozlabs.ru> Reviewed-by: David Gibson <david@gibson.dropbear.id.au> Reviewed-by: David Gibson <david@gibson.dropbear.id.au> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
2015-06-05 14:35:24 +08:00
mem->hpas[i] = page_to_pfn(page) << PAGE_SHIFT;
}
good_exit:
powerpc/mm/iommu: allow migration of cma allocated pages during mm_iommu_do_alloc The current code doesn't do page migration if the page allocated is a compound page. With HugeTLB migration support, we can end up allocating hugetlb pages from CMA region. Also, THP pages can be allocated from CMA region. This patch updates the code to handle compound pages correctly. The patch also switches to a single get_user_pages with the right count, instead of doing one get_user_pages per page. That avoids reading page table multiple times. This is done by using get_user_pages_longterm, because that also takes care of DAX backed pages. DAX pages lifetime is dictated by file system rules and as such, we need to make sure that we free these pages on operations like truncate and punch hole. If we have long term pin on these pages, which are mostly return to userspace with elevated page count, the entity holding the long term pin may not be aware of the fact that file got truncated and the file system blocks possibly got reused. That can result in corruption. The patch also converts the hpas member of mm_iommu_table_group_mem_t to a union. We use the same storage location to store pointers to struct page. We cannot update all the code path use struct page *, because we access hpas in real mode and we can't do that struct page * to pfn conversion in real mode. [aneesh.kumar@linux.ibm.com: address review feedback, update changelog] Link: http://lkml.kernel.org/r/20190227144736.5872-4-aneesh.kumar@linux.ibm.com Link: http://lkml.kernel.org/r/20190114095438.32470-5-aneesh.kumar@linux.ibm.com Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.ibm.com> Reviewed-by: Michael Ellerman <mpe@ellerman.id.au> Cc: Alexey Kardashevskiy <aik@ozlabs.ru> Cc: Andrea Arcangeli <aarcange@redhat.com> Cc: David Gibson <david@gibson.dropbear.id.au> Cc: Michal Hocko <mhocko@kernel.org> Cc: Mel Gorman <mgorman@techsingularity.net> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2019-03-06 07:47:47 +08:00
ret = 0;
powerpc/mmu: Add userspace-to-physical addresses translation cache We are adding support for DMA memory pre-registration to be used in conjunction with VFIO. The idea is that the userspace which is going to run a guest may want to pre-register a user space memory region so it all gets pinned once and never goes away. Having this done, a hypervisor will not have to pin/unpin pages on every DMA map/unmap request. This is going to help with multiple pinning of the same memory. Another use of it is in-kernel real mode (mmu off) acceleration of DMA requests where real time translation of guest physical to host physical addresses is non-trivial and may fail as linux ptes may be temporarily invalid. Also, having cached host physical addresses (compared to just pinning at the start and then walking the page table again on every H_PUT_TCE), we can be sure that the addresses which we put into TCE table are the ones we already pinned. This adds a list of memory regions to mm_context_t. Each region consists of a header and a list of physical addresses. This adds API to: 1. register/unregister memory regions; 2. do final cleanup (which puts all pre-registered pages); 3. do userspace to physical address translation; 4. manage usage counters; multiple registration of the same memory is allowed (once per container). This implements 2 counters per registered memory region: - @mapped: incremented on every DMA mapping; decremented on unmapping; initialized to 1 when a region is just registered; once it becomes zero, no more mappings allowe; - @used: incremented on every "register" ioctl; decremented on "unregister"; unregistration is allowed for DMA mapped regions unless it is the very last reference. For the very last reference this checks that the region is still mapped and returns -EBUSY so the userspace gets to know that memory is still pinned and unregistration needs to be retried; @used remains 1. Host physical addresses are stored in vmalloc'ed array. In order to access these in the real mode (mmu off), there is a real_vmalloc_addr() helper. In-kernel acceleration patchset will move it from KVM to MMU code. Signed-off-by: Alexey Kardashevskiy <aik@ozlabs.ru> Reviewed-by: David Gibson <david@gibson.dropbear.id.au> Reviewed-by: David Gibson <david@gibson.dropbear.id.au> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
2015-06-05 14:35:24 +08:00
atomic64_set(&mem->mapped, 1);
mem->used = 1;
mem->ua = ua;
mem->entries = entries;
*pmem = mem;
list_add_rcu(&mem->next, &mm->context.iommu_group_mem_list);
powerpc/mmu: Add userspace-to-physical addresses translation cache We are adding support for DMA memory pre-registration to be used in conjunction with VFIO. The idea is that the userspace which is going to run a guest may want to pre-register a user space memory region so it all gets pinned once and never goes away. Having this done, a hypervisor will not have to pin/unpin pages on every DMA map/unmap request. This is going to help with multiple pinning of the same memory. Another use of it is in-kernel real mode (mmu off) acceleration of DMA requests where real time translation of guest physical to host physical addresses is non-trivial and may fail as linux ptes may be temporarily invalid. Also, having cached host physical addresses (compared to just pinning at the start and then walking the page table again on every H_PUT_TCE), we can be sure that the addresses which we put into TCE table are the ones we already pinned. This adds a list of memory regions to mm_context_t. Each region consists of a header and a list of physical addresses. This adds API to: 1. register/unregister memory regions; 2. do final cleanup (which puts all pre-registered pages); 3. do userspace to physical address translation; 4. manage usage counters; multiple registration of the same memory is allowed (once per container). This implements 2 counters per registered memory region: - @mapped: incremented on every DMA mapping; decremented on unmapping; initialized to 1 when a region is just registered; once it becomes zero, no more mappings allowe; - @used: incremented on every "register" ioctl; decremented on "unregister"; unregistration is allowed for DMA mapped regions unless it is the very last reference. For the very last reference this checks that the region is still mapped and returns -EBUSY so the userspace gets to know that memory is still pinned and unregistration needs to be retried; @used remains 1. Host physical addresses are stored in vmalloc'ed array. In order to access these in the real mode (mmu off), there is a real_vmalloc_addr() helper. In-kernel acceleration patchset will move it from KVM to MMU code. Signed-off-by: Alexey Kardashevskiy <aik@ozlabs.ru> Reviewed-by: David Gibson <david@gibson.dropbear.id.au> Reviewed-by: David Gibson <david@gibson.dropbear.id.au> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
2015-06-05 14:35:24 +08:00
unlock_exit:
if (locked_entries && ret)
mm_iommu_adjust_locked_vm(mm, locked_entries, false);
powerpc/mmu: Add userspace-to-physical addresses translation cache We are adding support for DMA memory pre-registration to be used in conjunction with VFIO. The idea is that the userspace which is going to run a guest may want to pre-register a user space memory region so it all gets pinned once and never goes away. Having this done, a hypervisor will not have to pin/unpin pages on every DMA map/unmap request. This is going to help with multiple pinning of the same memory. Another use of it is in-kernel real mode (mmu off) acceleration of DMA requests where real time translation of guest physical to host physical addresses is non-trivial and may fail as linux ptes may be temporarily invalid. Also, having cached host physical addresses (compared to just pinning at the start and then walking the page table again on every H_PUT_TCE), we can be sure that the addresses which we put into TCE table are the ones we already pinned. This adds a list of memory regions to mm_context_t. Each region consists of a header and a list of physical addresses. This adds API to: 1. register/unregister memory regions; 2. do final cleanup (which puts all pre-registered pages); 3. do userspace to physical address translation; 4. manage usage counters; multiple registration of the same memory is allowed (once per container). This implements 2 counters per registered memory region: - @mapped: incremented on every DMA mapping; decremented on unmapping; initialized to 1 when a region is just registered; once it becomes zero, no more mappings allowe; - @used: incremented on every "register" ioctl; decremented on "unregister"; unregistration is allowed for DMA mapped regions unless it is the very last reference. For the very last reference this checks that the region is still mapped and returns -EBUSY so the userspace gets to know that memory is still pinned and unregistration needs to be retried; @used remains 1. Host physical addresses are stored in vmalloc'ed array. In order to access these in the real mode (mmu off), there is a real_vmalloc_addr() helper. In-kernel acceleration patchset will move it from KVM to MMU code. Signed-off-by: Alexey Kardashevskiy <aik@ozlabs.ru> Reviewed-by: David Gibson <david@gibson.dropbear.id.au> Reviewed-by: David Gibson <david@gibson.dropbear.id.au> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
2015-06-05 14:35:24 +08:00
mutex_unlock(&mem_list_mutex);
return ret;
}
long mm_iommu_new(struct mm_struct *mm, unsigned long ua, unsigned long entries,
struct mm_iommu_table_group_mem_t **pmem)
{
return mm_iommu_do_alloc(mm, ua, entries, MM_IOMMU_TABLE_INVALID_HPA,
pmem);
}
EXPORT_SYMBOL_GPL(mm_iommu_new);
powerpc/mmu: Add userspace-to-physical addresses translation cache We are adding support for DMA memory pre-registration to be used in conjunction with VFIO. The idea is that the userspace which is going to run a guest may want to pre-register a user space memory region so it all gets pinned once and never goes away. Having this done, a hypervisor will not have to pin/unpin pages on every DMA map/unmap request. This is going to help with multiple pinning of the same memory. Another use of it is in-kernel real mode (mmu off) acceleration of DMA requests where real time translation of guest physical to host physical addresses is non-trivial and may fail as linux ptes may be temporarily invalid. Also, having cached host physical addresses (compared to just pinning at the start and then walking the page table again on every H_PUT_TCE), we can be sure that the addresses which we put into TCE table are the ones we already pinned. This adds a list of memory regions to mm_context_t. Each region consists of a header and a list of physical addresses. This adds API to: 1. register/unregister memory regions; 2. do final cleanup (which puts all pre-registered pages); 3. do userspace to physical address translation; 4. manage usage counters; multiple registration of the same memory is allowed (once per container). This implements 2 counters per registered memory region: - @mapped: incremented on every DMA mapping; decremented on unmapping; initialized to 1 when a region is just registered; once it becomes zero, no more mappings allowe; - @used: incremented on every "register" ioctl; decremented on "unregister"; unregistration is allowed for DMA mapped regions unless it is the very last reference. For the very last reference this checks that the region is still mapped and returns -EBUSY so the userspace gets to know that memory is still pinned and unregistration needs to be retried; @used remains 1. Host physical addresses are stored in vmalloc'ed array. In order to access these in the real mode (mmu off), there is a real_vmalloc_addr() helper. In-kernel acceleration patchset will move it from KVM to MMU code. Signed-off-by: Alexey Kardashevskiy <aik@ozlabs.ru> Reviewed-by: David Gibson <david@gibson.dropbear.id.au> Reviewed-by: David Gibson <david@gibson.dropbear.id.au> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
2015-06-05 14:35:24 +08:00
long mm_iommu_newdev(struct mm_struct *mm, unsigned long ua,
unsigned long entries, unsigned long dev_hpa,
struct mm_iommu_table_group_mem_t **pmem)
{
return mm_iommu_do_alloc(mm, ua, entries, dev_hpa, pmem);
}
EXPORT_SYMBOL_GPL(mm_iommu_newdev);
powerpc/mmu: Add userspace-to-physical addresses translation cache We are adding support for DMA memory pre-registration to be used in conjunction with VFIO. The idea is that the userspace which is going to run a guest may want to pre-register a user space memory region so it all gets pinned once and never goes away. Having this done, a hypervisor will not have to pin/unpin pages on every DMA map/unmap request. This is going to help with multiple pinning of the same memory. Another use of it is in-kernel real mode (mmu off) acceleration of DMA requests where real time translation of guest physical to host physical addresses is non-trivial and may fail as linux ptes may be temporarily invalid. Also, having cached host physical addresses (compared to just pinning at the start and then walking the page table again on every H_PUT_TCE), we can be sure that the addresses which we put into TCE table are the ones we already pinned. This adds a list of memory regions to mm_context_t. Each region consists of a header and a list of physical addresses. This adds API to: 1. register/unregister memory regions; 2. do final cleanup (which puts all pre-registered pages); 3. do userspace to physical address translation; 4. manage usage counters; multiple registration of the same memory is allowed (once per container). This implements 2 counters per registered memory region: - @mapped: incremented on every DMA mapping; decremented on unmapping; initialized to 1 when a region is just registered; once it becomes zero, no more mappings allowe; - @used: incremented on every "register" ioctl; decremented on "unregister"; unregistration is allowed for DMA mapped regions unless it is the very last reference. For the very last reference this checks that the region is still mapped and returns -EBUSY so the userspace gets to know that memory is still pinned and unregistration needs to be retried; @used remains 1. Host physical addresses are stored in vmalloc'ed array. In order to access these in the real mode (mmu off), there is a real_vmalloc_addr() helper. In-kernel acceleration patchset will move it from KVM to MMU code. Signed-off-by: Alexey Kardashevskiy <aik@ozlabs.ru> Reviewed-by: David Gibson <david@gibson.dropbear.id.au> Reviewed-by: David Gibson <david@gibson.dropbear.id.au> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
2015-06-05 14:35:24 +08:00
static void mm_iommu_unpin(struct mm_iommu_table_group_mem_t *mem)
{
long i;
struct page *page = NULL;
if (!mem->hpas)
return;
powerpc/mmu: Add userspace-to-physical addresses translation cache We are adding support for DMA memory pre-registration to be used in conjunction with VFIO. The idea is that the userspace which is going to run a guest may want to pre-register a user space memory region so it all gets pinned once and never goes away. Having this done, a hypervisor will not have to pin/unpin pages on every DMA map/unmap request. This is going to help with multiple pinning of the same memory. Another use of it is in-kernel real mode (mmu off) acceleration of DMA requests where real time translation of guest physical to host physical addresses is non-trivial and may fail as linux ptes may be temporarily invalid. Also, having cached host physical addresses (compared to just pinning at the start and then walking the page table again on every H_PUT_TCE), we can be sure that the addresses which we put into TCE table are the ones we already pinned. This adds a list of memory regions to mm_context_t. Each region consists of a header and a list of physical addresses. This adds API to: 1. register/unregister memory regions; 2. do final cleanup (which puts all pre-registered pages); 3. do userspace to physical address translation; 4. manage usage counters; multiple registration of the same memory is allowed (once per container). This implements 2 counters per registered memory region: - @mapped: incremented on every DMA mapping; decremented on unmapping; initialized to 1 when a region is just registered; once it becomes zero, no more mappings allowe; - @used: incremented on every "register" ioctl; decremented on "unregister"; unregistration is allowed for DMA mapped regions unless it is the very last reference. For the very last reference this checks that the region is still mapped and returns -EBUSY so the userspace gets to know that memory is still pinned and unregistration needs to be retried; @used remains 1. Host physical addresses are stored in vmalloc'ed array. In order to access these in the real mode (mmu off), there is a real_vmalloc_addr() helper. In-kernel acceleration patchset will move it from KVM to MMU code. Signed-off-by: Alexey Kardashevskiy <aik@ozlabs.ru> Reviewed-by: David Gibson <david@gibson.dropbear.id.au> Reviewed-by: David Gibson <david@gibson.dropbear.id.au> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
2015-06-05 14:35:24 +08:00
for (i = 0; i < mem->entries; ++i) {
if (!mem->hpas[i])
continue;
page = pfn_to_page(mem->hpas[i] >> PAGE_SHIFT);
if (!page)
continue;
KVM: PPC: Avoid marking DMA-mapped pages dirty in real mode At the moment the real mode handler of H_PUT_TCE calls iommu_tce_xchg_rm() which in turn reads the old TCE and if it was a valid entry, marks the physical page dirty if it was mapped for writing. Since it is in real mode, realmode_pfn_to_page() is used instead of pfn_to_page() to get the page struct. However SetPageDirty() itself reads the compound page head and returns a virtual address for the head page struct and setting dirty bit for that kills the system. This adds additional dirty bit tracking into the MM/IOMMU API for use in the real mode. Note that this does not change how VFIO and KVM (in virtual mode) set this bit. The KVM (real mode) changes include: - use the lowest bit of the cached host phys address to carry the dirty bit; - mark pages dirty when they are unpinned which happens when the preregistered memory is released which always happens in virtual mode; - add mm_iommu_ua_mark_dirty_rm() helper to set delayed dirty bit; - change iommu_tce_xchg_rm() to take the kvm struct for the mm to use in the new mm_iommu_ua_mark_dirty_rm() helper; - move iommu_tce_xchg_rm() to book3s_64_vio_hv.c (which is the only caller anyway) to reduce the real mode KVM and IOMMU knowledge across different subsystems. This removes realmode_pfn_to_page() as it is not used anymore. While we at it, remove some EXPORT_SYMBOL_GPL() as that code is for the real mode only and modules cannot call it anyway. Signed-off-by: Alexey Kardashevskiy <aik@ozlabs.ru> Reviewed-by: David Gibson <david@gibson.dropbear.id.au> Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
2018-09-10 16:29:07 +08:00
if (mem->hpas[i] & MM_IOMMU_TABLE_GROUP_PAGE_DIRTY)
SetPageDirty(page);
powerpc/mmu: Add userspace-to-physical addresses translation cache We are adding support for DMA memory pre-registration to be used in conjunction with VFIO. The idea is that the userspace which is going to run a guest may want to pre-register a user space memory region so it all gets pinned once and never goes away. Having this done, a hypervisor will not have to pin/unpin pages on every DMA map/unmap request. This is going to help with multiple pinning of the same memory. Another use of it is in-kernel real mode (mmu off) acceleration of DMA requests where real time translation of guest physical to host physical addresses is non-trivial and may fail as linux ptes may be temporarily invalid. Also, having cached host physical addresses (compared to just pinning at the start and then walking the page table again on every H_PUT_TCE), we can be sure that the addresses which we put into TCE table are the ones we already pinned. This adds a list of memory regions to mm_context_t. Each region consists of a header and a list of physical addresses. This adds API to: 1. register/unregister memory regions; 2. do final cleanup (which puts all pre-registered pages); 3. do userspace to physical address translation; 4. manage usage counters; multiple registration of the same memory is allowed (once per container). This implements 2 counters per registered memory region: - @mapped: incremented on every DMA mapping; decremented on unmapping; initialized to 1 when a region is just registered; once it becomes zero, no more mappings allowe; - @used: incremented on every "register" ioctl; decremented on "unregister"; unregistration is allowed for DMA mapped regions unless it is the very last reference. For the very last reference this checks that the region is still mapped and returns -EBUSY so the userspace gets to know that memory is still pinned and unregistration needs to be retried; @used remains 1. Host physical addresses are stored in vmalloc'ed array. In order to access these in the real mode (mmu off), there is a real_vmalloc_addr() helper. In-kernel acceleration patchset will move it from KVM to MMU code. Signed-off-by: Alexey Kardashevskiy <aik@ozlabs.ru> Reviewed-by: David Gibson <david@gibson.dropbear.id.au> Reviewed-by: David Gibson <david@gibson.dropbear.id.au> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
2015-06-05 14:35:24 +08:00
put_page(page);
mem->hpas[i] = 0;
}
}
static void mm_iommu_do_free(struct mm_iommu_table_group_mem_t *mem)
{
mm_iommu_unpin(mem);
vfree(mem->hpas);
kfree(mem);
}
static void mm_iommu_free(struct rcu_head *head)
{
struct mm_iommu_table_group_mem_t *mem = container_of(head,
struct mm_iommu_table_group_mem_t, rcu);
mm_iommu_do_free(mem);
}
static void mm_iommu_release(struct mm_iommu_table_group_mem_t *mem)
{
list_del_rcu(&mem->next);
call_rcu(&mem->rcu, mm_iommu_free);
}
long mm_iommu_put(struct mm_struct *mm, struct mm_iommu_table_group_mem_t *mem)
powerpc/mmu: Add userspace-to-physical addresses translation cache We are adding support for DMA memory pre-registration to be used in conjunction with VFIO. The idea is that the userspace which is going to run a guest may want to pre-register a user space memory region so it all gets pinned once and never goes away. Having this done, a hypervisor will not have to pin/unpin pages on every DMA map/unmap request. This is going to help with multiple pinning of the same memory. Another use of it is in-kernel real mode (mmu off) acceleration of DMA requests where real time translation of guest physical to host physical addresses is non-trivial and may fail as linux ptes may be temporarily invalid. Also, having cached host physical addresses (compared to just pinning at the start and then walking the page table again on every H_PUT_TCE), we can be sure that the addresses which we put into TCE table are the ones we already pinned. This adds a list of memory regions to mm_context_t. Each region consists of a header and a list of physical addresses. This adds API to: 1. register/unregister memory regions; 2. do final cleanup (which puts all pre-registered pages); 3. do userspace to physical address translation; 4. manage usage counters; multiple registration of the same memory is allowed (once per container). This implements 2 counters per registered memory region: - @mapped: incremented on every DMA mapping; decremented on unmapping; initialized to 1 when a region is just registered; once it becomes zero, no more mappings allowe; - @used: incremented on every "register" ioctl; decremented on "unregister"; unregistration is allowed for DMA mapped regions unless it is the very last reference. For the very last reference this checks that the region is still mapped and returns -EBUSY so the userspace gets to know that memory is still pinned and unregistration needs to be retried; @used remains 1. Host physical addresses are stored in vmalloc'ed array. In order to access these in the real mode (mmu off), there is a real_vmalloc_addr() helper. In-kernel acceleration patchset will move it from KVM to MMU code. Signed-off-by: Alexey Kardashevskiy <aik@ozlabs.ru> Reviewed-by: David Gibson <david@gibson.dropbear.id.au> Reviewed-by: David Gibson <david@gibson.dropbear.id.au> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
2015-06-05 14:35:24 +08:00
{
long ret = 0;
unsigned long entries, dev_hpa;
powerpc/mmu: Add userspace-to-physical addresses translation cache We are adding support for DMA memory pre-registration to be used in conjunction with VFIO. The idea is that the userspace which is going to run a guest may want to pre-register a user space memory region so it all gets pinned once and never goes away. Having this done, a hypervisor will not have to pin/unpin pages on every DMA map/unmap request. This is going to help with multiple pinning of the same memory. Another use of it is in-kernel real mode (mmu off) acceleration of DMA requests where real time translation of guest physical to host physical addresses is non-trivial and may fail as linux ptes may be temporarily invalid. Also, having cached host physical addresses (compared to just pinning at the start and then walking the page table again on every H_PUT_TCE), we can be sure that the addresses which we put into TCE table are the ones we already pinned. This adds a list of memory regions to mm_context_t. Each region consists of a header and a list of physical addresses. This adds API to: 1. register/unregister memory regions; 2. do final cleanup (which puts all pre-registered pages); 3. do userspace to physical address translation; 4. manage usage counters; multiple registration of the same memory is allowed (once per container). This implements 2 counters per registered memory region: - @mapped: incremented on every DMA mapping; decremented on unmapping; initialized to 1 when a region is just registered; once it becomes zero, no more mappings allowe; - @used: incremented on every "register" ioctl; decremented on "unregister"; unregistration is allowed for DMA mapped regions unless it is the very last reference. For the very last reference this checks that the region is still mapped and returns -EBUSY so the userspace gets to know that memory is still pinned and unregistration needs to be retried; @used remains 1. Host physical addresses are stored in vmalloc'ed array. In order to access these in the real mode (mmu off), there is a real_vmalloc_addr() helper. In-kernel acceleration patchset will move it from KVM to MMU code. Signed-off-by: Alexey Kardashevskiy <aik@ozlabs.ru> Reviewed-by: David Gibson <david@gibson.dropbear.id.au> Reviewed-by: David Gibson <david@gibson.dropbear.id.au> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
2015-06-05 14:35:24 +08:00
mutex_lock(&mem_list_mutex);
if (mem->used == 0) {
ret = -ENOENT;
goto unlock_exit;
}
--mem->used;
/* There are still users, exit */
if (mem->used)
goto unlock_exit;
/* Are there still mappings? */
if (atomic_cmpxchg(&mem->mapped, 1, 0) != 1) {
++mem->used;
ret = -EBUSY;
goto unlock_exit;
}
/* @mapped became 0 so now mappings are disabled, release the region */
entries = mem->entries;
dev_hpa = mem->dev_hpa;
powerpc/mmu: Add userspace-to-physical addresses translation cache We are adding support for DMA memory pre-registration to be used in conjunction with VFIO. The idea is that the userspace which is going to run a guest may want to pre-register a user space memory region so it all gets pinned once and never goes away. Having this done, a hypervisor will not have to pin/unpin pages on every DMA map/unmap request. This is going to help with multiple pinning of the same memory. Another use of it is in-kernel real mode (mmu off) acceleration of DMA requests where real time translation of guest physical to host physical addresses is non-trivial and may fail as linux ptes may be temporarily invalid. Also, having cached host physical addresses (compared to just pinning at the start and then walking the page table again on every H_PUT_TCE), we can be sure that the addresses which we put into TCE table are the ones we already pinned. This adds a list of memory regions to mm_context_t. Each region consists of a header and a list of physical addresses. This adds API to: 1. register/unregister memory regions; 2. do final cleanup (which puts all pre-registered pages); 3. do userspace to physical address translation; 4. manage usage counters; multiple registration of the same memory is allowed (once per container). This implements 2 counters per registered memory region: - @mapped: incremented on every DMA mapping; decremented on unmapping; initialized to 1 when a region is just registered; once it becomes zero, no more mappings allowe; - @used: incremented on every "register" ioctl; decremented on "unregister"; unregistration is allowed for DMA mapped regions unless it is the very last reference. For the very last reference this checks that the region is still mapped and returns -EBUSY so the userspace gets to know that memory is still pinned and unregistration needs to be retried; @used remains 1. Host physical addresses are stored in vmalloc'ed array. In order to access these in the real mode (mmu off), there is a real_vmalloc_addr() helper. In-kernel acceleration patchset will move it from KVM to MMU code. Signed-off-by: Alexey Kardashevskiy <aik@ozlabs.ru> Reviewed-by: David Gibson <david@gibson.dropbear.id.au> Reviewed-by: David Gibson <david@gibson.dropbear.id.au> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
2015-06-05 14:35:24 +08:00
mm_iommu_release(mem);
if (dev_hpa == MM_IOMMU_TABLE_INVALID_HPA)
mm_iommu_adjust_locked_vm(mm, entries, false);
powerpc/mmu: Add userspace-to-physical addresses translation cache We are adding support for DMA memory pre-registration to be used in conjunction with VFIO. The idea is that the userspace which is going to run a guest may want to pre-register a user space memory region so it all gets pinned once and never goes away. Having this done, a hypervisor will not have to pin/unpin pages on every DMA map/unmap request. This is going to help with multiple pinning of the same memory. Another use of it is in-kernel real mode (mmu off) acceleration of DMA requests where real time translation of guest physical to host physical addresses is non-trivial and may fail as linux ptes may be temporarily invalid. Also, having cached host physical addresses (compared to just pinning at the start and then walking the page table again on every H_PUT_TCE), we can be sure that the addresses which we put into TCE table are the ones we already pinned. This adds a list of memory regions to mm_context_t. Each region consists of a header and a list of physical addresses. This adds API to: 1. register/unregister memory regions; 2. do final cleanup (which puts all pre-registered pages); 3. do userspace to physical address translation; 4. manage usage counters; multiple registration of the same memory is allowed (once per container). This implements 2 counters per registered memory region: - @mapped: incremented on every DMA mapping; decremented on unmapping; initialized to 1 when a region is just registered; once it becomes zero, no more mappings allowe; - @used: incremented on every "register" ioctl; decremented on "unregister"; unregistration is allowed for DMA mapped regions unless it is the very last reference. For the very last reference this checks that the region is still mapped and returns -EBUSY so the userspace gets to know that memory is still pinned and unregistration needs to be retried; @used remains 1. Host physical addresses are stored in vmalloc'ed array. In order to access these in the real mode (mmu off), there is a real_vmalloc_addr() helper. In-kernel acceleration patchset will move it from KVM to MMU code. Signed-off-by: Alexey Kardashevskiy <aik@ozlabs.ru> Reviewed-by: David Gibson <david@gibson.dropbear.id.au> Reviewed-by: David Gibson <david@gibson.dropbear.id.au> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
2015-06-05 14:35:24 +08:00
unlock_exit:
mutex_unlock(&mem_list_mutex);
return ret;
}
EXPORT_SYMBOL_GPL(mm_iommu_put);
struct mm_iommu_table_group_mem_t *mm_iommu_lookup(struct mm_struct *mm,
unsigned long ua, unsigned long size)
powerpc/mmu: Add userspace-to-physical addresses translation cache We are adding support for DMA memory pre-registration to be used in conjunction with VFIO. The idea is that the userspace which is going to run a guest may want to pre-register a user space memory region so it all gets pinned once and never goes away. Having this done, a hypervisor will not have to pin/unpin pages on every DMA map/unmap request. This is going to help with multiple pinning of the same memory. Another use of it is in-kernel real mode (mmu off) acceleration of DMA requests where real time translation of guest physical to host physical addresses is non-trivial and may fail as linux ptes may be temporarily invalid. Also, having cached host physical addresses (compared to just pinning at the start and then walking the page table again on every H_PUT_TCE), we can be sure that the addresses which we put into TCE table are the ones we already pinned. This adds a list of memory regions to mm_context_t. Each region consists of a header and a list of physical addresses. This adds API to: 1. register/unregister memory regions; 2. do final cleanup (which puts all pre-registered pages); 3. do userspace to physical address translation; 4. manage usage counters; multiple registration of the same memory is allowed (once per container). This implements 2 counters per registered memory region: - @mapped: incremented on every DMA mapping; decremented on unmapping; initialized to 1 when a region is just registered; once it becomes zero, no more mappings allowe; - @used: incremented on every "register" ioctl; decremented on "unregister"; unregistration is allowed for DMA mapped regions unless it is the very last reference. For the very last reference this checks that the region is still mapped and returns -EBUSY so the userspace gets to know that memory is still pinned and unregistration needs to be retried; @used remains 1. Host physical addresses are stored in vmalloc'ed array. In order to access these in the real mode (mmu off), there is a real_vmalloc_addr() helper. In-kernel acceleration patchset will move it from KVM to MMU code. Signed-off-by: Alexey Kardashevskiy <aik@ozlabs.ru> Reviewed-by: David Gibson <david@gibson.dropbear.id.au> Reviewed-by: David Gibson <david@gibson.dropbear.id.au> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
2015-06-05 14:35:24 +08:00
{
struct mm_iommu_table_group_mem_t *mem, *ret = NULL;
list_for_each_entry_rcu(mem, &mm->context.iommu_group_mem_list, next) {
powerpc/mmu: Add userspace-to-physical addresses translation cache We are adding support for DMA memory pre-registration to be used in conjunction with VFIO. The idea is that the userspace which is going to run a guest may want to pre-register a user space memory region so it all gets pinned once and never goes away. Having this done, a hypervisor will not have to pin/unpin pages on every DMA map/unmap request. This is going to help with multiple pinning of the same memory. Another use of it is in-kernel real mode (mmu off) acceleration of DMA requests where real time translation of guest physical to host physical addresses is non-trivial and may fail as linux ptes may be temporarily invalid. Also, having cached host physical addresses (compared to just pinning at the start and then walking the page table again on every H_PUT_TCE), we can be sure that the addresses which we put into TCE table are the ones we already pinned. This adds a list of memory regions to mm_context_t. Each region consists of a header and a list of physical addresses. This adds API to: 1. register/unregister memory regions; 2. do final cleanup (which puts all pre-registered pages); 3. do userspace to physical address translation; 4. manage usage counters; multiple registration of the same memory is allowed (once per container). This implements 2 counters per registered memory region: - @mapped: incremented on every DMA mapping; decremented on unmapping; initialized to 1 when a region is just registered; once it becomes zero, no more mappings allowe; - @used: incremented on every "register" ioctl; decremented on "unregister"; unregistration is allowed for DMA mapped regions unless it is the very last reference. For the very last reference this checks that the region is still mapped and returns -EBUSY so the userspace gets to know that memory is still pinned and unregistration needs to be retried; @used remains 1. Host physical addresses are stored in vmalloc'ed array. In order to access these in the real mode (mmu off), there is a real_vmalloc_addr() helper. In-kernel acceleration patchset will move it from KVM to MMU code. Signed-off-by: Alexey Kardashevskiy <aik@ozlabs.ru> Reviewed-by: David Gibson <david@gibson.dropbear.id.au> Reviewed-by: David Gibson <david@gibson.dropbear.id.au> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
2015-06-05 14:35:24 +08:00
if ((mem->ua <= ua) &&
(ua + size <= mem->ua +
(mem->entries << PAGE_SHIFT))) {
ret = mem;
break;
}
}
return ret;
}
EXPORT_SYMBOL_GPL(mm_iommu_lookup);
struct mm_iommu_table_group_mem_t *mm_iommu_lookup_rm(struct mm_struct *mm,
unsigned long ua, unsigned long size)
{
struct mm_iommu_table_group_mem_t *mem, *ret = NULL;
list_for_each_entry_lockless(mem, &mm->context.iommu_group_mem_list,
next) {
if ((mem->ua <= ua) &&
(ua + size <= mem->ua +
(mem->entries << PAGE_SHIFT))) {
ret = mem;
break;
}
}
return ret;
}
struct mm_iommu_table_group_mem_t *mm_iommu_get(struct mm_struct *mm,
unsigned long ua, unsigned long entries)
powerpc/mmu: Add userspace-to-physical addresses translation cache We are adding support for DMA memory pre-registration to be used in conjunction with VFIO. The idea is that the userspace which is going to run a guest may want to pre-register a user space memory region so it all gets pinned once and never goes away. Having this done, a hypervisor will not have to pin/unpin pages on every DMA map/unmap request. This is going to help with multiple pinning of the same memory. Another use of it is in-kernel real mode (mmu off) acceleration of DMA requests where real time translation of guest physical to host physical addresses is non-trivial and may fail as linux ptes may be temporarily invalid. Also, having cached host physical addresses (compared to just pinning at the start and then walking the page table again on every H_PUT_TCE), we can be sure that the addresses which we put into TCE table are the ones we already pinned. This adds a list of memory regions to mm_context_t. Each region consists of a header and a list of physical addresses. This adds API to: 1. register/unregister memory regions; 2. do final cleanup (which puts all pre-registered pages); 3. do userspace to physical address translation; 4. manage usage counters; multiple registration of the same memory is allowed (once per container). This implements 2 counters per registered memory region: - @mapped: incremented on every DMA mapping; decremented on unmapping; initialized to 1 when a region is just registered; once it becomes zero, no more mappings allowe; - @used: incremented on every "register" ioctl; decremented on "unregister"; unregistration is allowed for DMA mapped regions unless it is the very last reference. For the very last reference this checks that the region is still mapped and returns -EBUSY so the userspace gets to know that memory is still pinned and unregistration needs to be retried; @used remains 1. Host physical addresses are stored in vmalloc'ed array. In order to access these in the real mode (mmu off), there is a real_vmalloc_addr() helper. In-kernel acceleration patchset will move it from KVM to MMU code. Signed-off-by: Alexey Kardashevskiy <aik@ozlabs.ru> Reviewed-by: David Gibson <david@gibson.dropbear.id.au> Reviewed-by: David Gibson <david@gibson.dropbear.id.au> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
2015-06-05 14:35:24 +08:00
{
struct mm_iommu_table_group_mem_t *mem, *ret = NULL;
mutex_lock(&mem_list_mutex);
list_for_each_entry_rcu(mem, &mm->context.iommu_group_mem_list, next) {
powerpc/mmu: Add userspace-to-physical addresses translation cache We are adding support for DMA memory pre-registration to be used in conjunction with VFIO. The idea is that the userspace which is going to run a guest may want to pre-register a user space memory region so it all gets pinned once and never goes away. Having this done, a hypervisor will not have to pin/unpin pages on every DMA map/unmap request. This is going to help with multiple pinning of the same memory. Another use of it is in-kernel real mode (mmu off) acceleration of DMA requests where real time translation of guest physical to host physical addresses is non-trivial and may fail as linux ptes may be temporarily invalid. Also, having cached host physical addresses (compared to just pinning at the start and then walking the page table again on every H_PUT_TCE), we can be sure that the addresses which we put into TCE table are the ones we already pinned. This adds a list of memory regions to mm_context_t. Each region consists of a header and a list of physical addresses. This adds API to: 1. register/unregister memory regions; 2. do final cleanup (which puts all pre-registered pages); 3. do userspace to physical address translation; 4. manage usage counters; multiple registration of the same memory is allowed (once per container). This implements 2 counters per registered memory region: - @mapped: incremented on every DMA mapping; decremented on unmapping; initialized to 1 when a region is just registered; once it becomes zero, no more mappings allowe; - @used: incremented on every "register" ioctl; decremented on "unregister"; unregistration is allowed for DMA mapped regions unless it is the very last reference. For the very last reference this checks that the region is still mapped and returns -EBUSY so the userspace gets to know that memory is still pinned and unregistration needs to be retried; @used remains 1. Host physical addresses are stored in vmalloc'ed array. In order to access these in the real mode (mmu off), there is a real_vmalloc_addr() helper. In-kernel acceleration patchset will move it from KVM to MMU code. Signed-off-by: Alexey Kardashevskiy <aik@ozlabs.ru> Reviewed-by: David Gibson <david@gibson.dropbear.id.au> Reviewed-by: David Gibson <david@gibson.dropbear.id.au> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
2015-06-05 14:35:24 +08:00
if ((mem->ua == ua) && (mem->entries == entries)) {
ret = mem;
++mem->used;
powerpc/mmu: Add userspace-to-physical addresses translation cache We are adding support for DMA memory pre-registration to be used in conjunction with VFIO. The idea is that the userspace which is going to run a guest may want to pre-register a user space memory region so it all gets pinned once and never goes away. Having this done, a hypervisor will not have to pin/unpin pages on every DMA map/unmap request. This is going to help with multiple pinning of the same memory. Another use of it is in-kernel real mode (mmu off) acceleration of DMA requests where real time translation of guest physical to host physical addresses is non-trivial and may fail as linux ptes may be temporarily invalid. Also, having cached host physical addresses (compared to just pinning at the start and then walking the page table again on every H_PUT_TCE), we can be sure that the addresses which we put into TCE table are the ones we already pinned. This adds a list of memory regions to mm_context_t. Each region consists of a header and a list of physical addresses. This adds API to: 1. register/unregister memory regions; 2. do final cleanup (which puts all pre-registered pages); 3. do userspace to physical address translation; 4. manage usage counters; multiple registration of the same memory is allowed (once per container). This implements 2 counters per registered memory region: - @mapped: incremented on every DMA mapping; decremented on unmapping; initialized to 1 when a region is just registered; once it becomes zero, no more mappings allowe; - @used: incremented on every "register" ioctl; decremented on "unregister"; unregistration is allowed for DMA mapped regions unless it is the very last reference. For the very last reference this checks that the region is still mapped and returns -EBUSY so the userspace gets to know that memory is still pinned and unregistration needs to be retried; @used remains 1. Host physical addresses are stored in vmalloc'ed array. In order to access these in the real mode (mmu off), there is a real_vmalloc_addr() helper. In-kernel acceleration patchset will move it from KVM to MMU code. Signed-off-by: Alexey Kardashevskiy <aik@ozlabs.ru> Reviewed-by: David Gibson <david@gibson.dropbear.id.au> Reviewed-by: David Gibson <david@gibson.dropbear.id.au> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
2015-06-05 14:35:24 +08:00
break;
}
}
mutex_unlock(&mem_list_mutex);
powerpc/mmu: Add userspace-to-physical addresses translation cache We are adding support for DMA memory pre-registration to be used in conjunction with VFIO. The idea is that the userspace which is going to run a guest may want to pre-register a user space memory region so it all gets pinned once and never goes away. Having this done, a hypervisor will not have to pin/unpin pages on every DMA map/unmap request. This is going to help with multiple pinning of the same memory. Another use of it is in-kernel real mode (mmu off) acceleration of DMA requests where real time translation of guest physical to host physical addresses is non-trivial and may fail as linux ptes may be temporarily invalid. Also, having cached host physical addresses (compared to just pinning at the start and then walking the page table again on every H_PUT_TCE), we can be sure that the addresses which we put into TCE table are the ones we already pinned. This adds a list of memory regions to mm_context_t. Each region consists of a header and a list of physical addresses. This adds API to: 1. register/unregister memory regions; 2. do final cleanup (which puts all pre-registered pages); 3. do userspace to physical address translation; 4. manage usage counters; multiple registration of the same memory is allowed (once per container). This implements 2 counters per registered memory region: - @mapped: incremented on every DMA mapping; decremented on unmapping; initialized to 1 when a region is just registered; once it becomes zero, no more mappings allowe; - @used: incremented on every "register" ioctl; decremented on "unregister"; unregistration is allowed for DMA mapped regions unless it is the very last reference. For the very last reference this checks that the region is still mapped and returns -EBUSY so the userspace gets to know that memory is still pinned and unregistration needs to be retried; @used remains 1. Host physical addresses are stored in vmalloc'ed array. In order to access these in the real mode (mmu off), there is a real_vmalloc_addr() helper. In-kernel acceleration patchset will move it from KVM to MMU code. Signed-off-by: Alexey Kardashevskiy <aik@ozlabs.ru> Reviewed-by: David Gibson <david@gibson.dropbear.id.au> Reviewed-by: David Gibson <david@gibson.dropbear.id.au> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
2015-06-05 14:35:24 +08:00
return ret;
}
EXPORT_SYMBOL_GPL(mm_iommu_get);
powerpc/mmu: Add userspace-to-physical addresses translation cache We are adding support for DMA memory pre-registration to be used in conjunction with VFIO. The idea is that the userspace which is going to run a guest may want to pre-register a user space memory region so it all gets pinned once and never goes away. Having this done, a hypervisor will not have to pin/unpin pages on every DMA map/unmap request. This is going to help with multiple pinning of the same memory. Another use of it is in-kernel real mode (mmu off) acceleration of DMA requests where real time translation of guest physical to host physical addresses is non-trivial and may fail as linux ptes may be temporarily invalid. Also, having cached host physical addresses (compared to just pinning at the start and then walking the page table again on every H_PUT_TCE), we can be sure that the addresses which we put into TCE table are the ones we already pinned. This adds a list of memory regions to mm_context_t. Each region consists of a header and a list of physical addresses. This adds API to: 1. register/unregister memory regions; 2. do final cleanup (which puts all pre-registered pages); 3. do userspace to physical address translation; 4. manage usage counters; multiple registration of the same memory is allowed (once per container). This implements 2 counters per registered memory region: - @mapped: incremented on every DMA mapping; decremented on unmapping; initialized to 1 when a region is just registered; once it becomes zero, no more mappings allowe; - @used: incremented on every "register" ioctl; decremented on "unregister"; unregistration is allowed for DMA mapped regions unless it is the very last reference. For the very last reference this checks that the region is still mapped and returns -EBUSY so the userspace gets to know that memory is still pinned and unregistration needs to be retried; @used remains 1. Host physical addresses are stored in vmalloc'ed array. In order to access these in the real mode (mmu off), there is a real_vmalloc_addr() helper. In-kernel acceleration patchset will move it from KVM to MMU code. Signed-off-by: Alexey Kardashevskiy <aik@ozlabs.ru> Reviewed-by: David Gibson <david@gibson.dropbear.id.au> Reviewed-by: David Gibson <david@gibson.dropbear.id.au> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
2015-06-05 14:35:24 +08:00
long mm_iommu_ua_to_hpa(struct mm_iommu_table_group_mem_t *mem,
KVM: PPC: Check if IOMMU page is contained in the pinned physical page A VM which has: - a DMA capable device passed through to it (eg. network card); - running a malicious kernel that ignores H_PUT_TCE failure; - capability of using IOMMU pages bigger that physical pages can create an IOMMU mapping that exposes (for example) 16MB of the host physical memory to the device when only 64K was allocated to the VM. The remaining 16MB - 64K will be some other content of host memory, possibly including pages of the VM, but also pages of host kernel memory, host programs or other VMs. The attacking VM does not control the location of the page it can map, and is only allowed to map as many pages as it has pages of RAM. We already have a check in drivers/vfio/vfio_iommu_spapr_tce.c that an IOMMU page is contained in the physical page so the PCI hardware won't get access to unassigned host memory; however this check is missing in the KVM fastpath (H_PUT_TCE accelerated code). We were lucky so far and did not hit this yet as the very first time when the mapping happens we do not have tbl::it_userspace allocated yet and fall back to the userspace which in turn calls VFIO IOMMU driver, this fails and the guest does not retry, This stores the smallest preregistered page size in the preregistered region descriptor and changes the mm_iommu_xxx API to check this against the IOMMU page size. This calculates maximum page size as a minimum of the natural region alignment and compound page size. For the page shift this uses the shift returned by find_linux_pte() which indicates how the page is mapped to the current userspace - if the page is huge and this is not a zero, then it is a leaf pte and the page is mapped within the range. Fixes: 121f80ba68f1 ("KVM: PPC: VFIO: Add in-kernel acceleration for VFIO") Cc: stable@vger.kernel.org # v4.12+ Signed-off-by: Alexey Kardashevskiy <aik@ozlabs.ru> Reviewed-by: David Gibson <david@gibson.dropbear.id.au> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
2018-07-17 15:19:13 +08:00
unsigned long ua, unsigned int pageshift, unsigned long *hpa)
powerpc/mmu: Add userspace-to-physical addresses translation cache We are adding support for DMA memory pre-registration to be used in conjunction with VFIO. The idea is that the userspace which is going to run a guest may want to pre-register a user space memory region so it all gets pinned once and never goes away. Having this done, a hypervisor will not have to pin/unpin pages on every DMA map/unmap request. This is going to help with multiple pinning of the same memory. Another use of it is in-kernel real mode (mmu off) acceleration of DMA requests where real time translation of guest physical to host physical addresses is non-trivial and may fail as linux ptes may be temporarily invalid. Also, having cached host physical addresses (compared to just pinning at the start and then walking the page table again on every H_PUT_TCE), we can be sure that the addresses which we put into TCE table are the ones we already pinned. This adds a list of memory regions to mm_context_t. Each region consists of a header and a list of physical addresses. This adds API to: 1. register/unregister memory regions; 2. do final cleanup (which puts all pre-registered pages); 3. do userspace to physical address translation; 4. manage usage counters; multiple registration of the same memory is allowed (once per container). This implements 2 counters per registered memory region: - @mapped: incremented on every DMA mapping; decremented on unmapping; initialized to 1 when a region is just registered; once it becomes zero, no more mappings allowe; - @used: incremented on every "register" ioctl; decremented on "unregister"; unregistration is allowed for DMA mapped regions unless it is the very last reference. For the very last reference this checks that the region is still mapped and returns -EBUSY so the userspace gets to know that memory is still pinned and unregistration needs to be retried; @used remains 1. Host physical addresses are stored in vmalloc'ed array. In order to access these in the real mode (mmu off), there is a real_vmalloc_addr() helper. In-kernel acceleration patchset will move it from KVM to MMU code. Signed-off-by: Alexey Kardashevskiy <aik@ozlabs.ru> Reviewed-by: David Gibson <david@gibson.dropbear.id.au> Reviewed-by: David Gibson <david@gibson.dropbear.id.au> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
2015-06-05 14:35:24 +08:00
{
const long entry = (ua - mem->ua) >> PAGE_SHIFT;
u64 *va;
powerpc/mmu: Add userspace-to-physical addresses translation cache We are adding support for DMA memory pre-registration to be used in conjunction with VFIO. The idea is that the userspace which is going to run a guest may want to pre-register a user space memory region so it all gets pinned once and never goes away. Having this done, a hypervisor will not have to pin/unpin pages on every DMA map/unmap request. This is going to help with multiple pinning of the same memory. Another use of it is in-kernel real mode (mmu off) acceleration of DMA requests where real time translation of guest physical to host physical addresses is non-trivial and may fail as linux ptes may be temporarily invalid. Also, having cached host physical addresses (compared to just pinning at the start and then walking the page table again on every H_PUT_TCE), we can be sure that the addresses which we put into TCE table are the ones we already pinned. This adds a list of memory regions to mm_context_t. Each region consists of a header and a list of physical addresses. This adds API to: 1. register/unregister memory regions; 2. do final cleanup (which puts all pre-registered pages); 3. do userspace to physical address translation; 4. manage usage counters; multiple registration of the same memory is allowed (once per container). This implements 2 counters per registered memory region: - @mapped: incremented on every DMA mapping; decremented on unmapping; initialized to 1 when a region is just registered; once it becomes zero, no more mappings allowe; - @used: incremented on every "register" ioctl; decremented on "unregister"; unregistration is allowed for DMA mapped regions unless it is the very last reference. For the very last reference this checks that the region is still mapped and returns -EBUSY so the userspace gets to know that memory is still pinned and unregistration needs to be retried; @used remains 1. Host physical addresses are stored in vmalloc'ed array. In order to access these in the real mode (mmu off), there is a real_vmalloc_addr() helper. In-kernel acceleration patchset will move it from KVM to MMU code. Signed-off-by: Alexey Kardashevskiy <aik@ozlabs.ru> Reviewed-by: David Gibson <david@gibson.dropbear.id.au> Reviewed-by: David Gibson <david@gibson.dropbear.id.au> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
2015-06-05 14:35:24 +08:00
if (entry >= mem->entries)
return -EFAULT;
KVM: PPC: Check if IOMMU page is contained in the pinned physical page A VM which has: - a DMA capable device passed through to it (eg. network card); - running a malicious kernel that ignores H_PUT_TCE failure; - capability of using IOMMU pages bigger that physical pages can create an IOMMU mapping that exposes (for example) 16MB of the host physical memory to the device when only 64K was allocated to the VM. The remaining 16MB - 64K will be some other content of host memory, possibly including pages of the VM, but also pages of host kernel memory, host programs or other VMs. The attacking VM does not control the location of the page it can map, and is only allowed to map as many pages as it has pages of RAM. We already have a check in drivers/vfio/vfio_iommu_spapr_tce.c that an IOMMU page is contained in the physical page so the PCI hardware won't get access to unassigned host memory; however this check is missing in the KVM fastpath (H_PUT_TCE accelerated code). We were lucky so far and did not hit this yet as the very first time when the mapping happens we do not have tbl::it_userspace allocated yet and fall back to the userspace which in turn calls VFIO IOMMU driver, this fails and the guest does not retry, This stores the smallest preregistered page size in the preregistered region descriptor and changes the mm_iommu_xxx API to check this against the IOMMU page size. This calculates maximum page size as a minimum of the natural region alignment and compound page size. For the page shift this uses the shift returned by find_linux_pte() which indicates how the page is mapped to the current userspace - if the page is huge and this is not a zero, then it is a leaf pte and the page is mapped within the range. Fixes: 121f80ba68f1 ("KVM: PPC: VFIO: Add in-kernel acceleration for VFIO") Cc: stable@vger.kernel.org # v4.12+ Signed-off-by: Alexey Kardashevskiy <aik@ozlabs.ru> Reviewed-by: David Gibson <david@gibson.dropbear.id.au> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
2018-07-17 15:19:13 +08:00
if (pageshift > mem->pageshift)
return -EFAULT;
if (!mem->hpas) {
*hpa = mem->dev_hpa + (ua - mem->ua);
return 0;
}
va = &mem->hpas[entry];
KVM: PPC: Avoid marking DMA-mapped pages dirty in real mode At the moment the real mode handler of H_PUT_TCE calls iommu_tce_xchg_rm() which in turn reads the old TCE and if it was a valid entry, marks the physical page dirty if it was mapped for writing. Since it is in real mode, realmode_pfn_to_page() is used instead of pfn_to_page() to get the page struct. However SetPageDirty() itself reads the compound page head and returns a virtual address for the head page struct and setting dirty bit for that kills the system. This adds additional dirty bit tracking into the MM/IOMMU API for use in the real mode. Note that this does not change how VFIO and KVM (in virtual mode) set this bit. The KVM (real mode) changes include: - use the lowest bit of the cached host phys address to carry the dirty bit; - mark pages dirty when they are unpinned which happens when the preregistered memory is released which always happens in virtual mode; - add mm_iommu_ua_mark_dirty_rm() helper to set delayed dirty bit; - change iommu_tce_xchg_rm() to take the kvm struct for the mm to use in the new mm_iommu_ua_mark_dirty_rm() helper; - move iommu_tce_xchg_rm() to book3s_64_vio_hv.c (which is the only caller anyway) to reduce the real mode KVM and IOMMU knowledge across different subsystems. This removes realmode_pfn_to_page() as it is not used anymore. While we at it, remove some EXPORT_SYMBOL_GPL() as that code is for the real mode only and modules cannot call it anyway. Signed-off-by: Alexey Kardashevskiy <aik@ozlabs.ru> Reviewed-by: David Gibson <david@gibson.dropbear.id.au> Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
2018-09-10 16:29:07 +08:00
*hpa = (*va & MM_IOMMU_TABLE_GROUP_PAGE_MASK) | (ua & ~PAGE_MASK);
powerpc/mmu: Add userspace-to-physical addresses translation cache We are adding support for DMA memory pre-registration to be used in conjunction with VFIO. The idea is that the userspace which is going to run a guest may want to pre-register a user space memory region so it all gets pinned once and never goes away. Having this done, a hypervisor will not have to pin/unpin pages on every DMA map/unmap request. This is going to help with multiple pinning of the same memory. Another use of it is in-kernel real mode (mmu off) acceleration of DMA requests where real time translation of guest physical to host physical addresses is non-trivial and may fail as linux ptes may be temporarily invalid. Also, having cached host physical addresses (compared to just pinning at the start and then walking the page table again on every H_PUT_TCE), we can be sure that the addresses which we put into TCE table are the ones we already pinned. This adds a list of memory regions to mm_context_t. Each region consists of a header and a list of physical addresses. This adds API to: 1. register/unregister memory regions; 2. do final cleanup (which puts all pre-registered pages); 3. do userspace to physical address translation; 4. manage usage counters; multiple registration of the same memory is allowed (once per container). This implements 2 counters per registered memory region: - @mapped: incremented on every DMA mapping; decremented on unmapping; initialized to 1 when a region is just registered; once it becomes zero, no more mappings allowe; - @used: incremented on every "register" ioctl; decremented on "unregister"; unregistration is allowed for DMA mapped regions unless it is the very last reference. For the very last reference this checks that the region is still mapped and returns -EBUSY so the userspace gets to know that memory is still pinned and unregistration needs to be retried; @used remains 1. Host physical addresses are stored in vmalloc'ed array. In order to access these in the real mode (mmu off), there is a real_vmalloc_addr() helper. In-kernel acceleration patchset will move it from KVM to MMU code. Signed-off-by: Alexey Kardashevskiy <aik@ozlabs.ru> Reviewed-by: David Gibson <david@gibson.dropbear.id.au> Reviewed-by: David Gibson <david@gibson.dropbear.id.au> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
2015-06-05 14:35:24 +08:00
return 0;
}
EXPORT_SYMBOL_GPL(mm_iommu_ua_to_hpa);
long mm_iommu_ua_to_hpa_rm(struct mm_iommu_table_group_mem_t *mem,
KVM: PPC: Check if IOMMU page is contained in the pinned physical page A VM which has: - a DMA capable device passed through to it (eg. network card); - running a malicious kernel that ignores H_PUT_TCE failure; - capability of using IOMMU pages bigger that physical pages can create an IOMMU mapping that exposes (for example) 16MB of the host physical memory to the device when only 64K was allocated to the VM. The remaining 16MB - 64K will be some other content of host memory, possibly including pages of the VM, but also pages of host kernel memory, host programs or other VMs. The attacking VM does not control the location of the page it can map, and is only allowed to map as many pages as it has pages of RAM. We already have a check in drivers/vfio/vfio_iommu_spapr_tce.c that an IOMMU page is contained in the physical page so the PCI hardware won't get access to unassigned host memory; however this check is missing in the KVM fastpath (H_PUT_TCE accelerated code). We were lucky so far and did not hit this yet as the very first time when the mapping happens we do not have tbl::it_userspace allocated yet and fall back to the userspace which in turn calls VFIO IOMMU driver, this fails and the guest does not retry, This stores the smallest preregistered page size in the preregistered region descriptor and changes the mm_iommu_xxx API to check this against the IOMMU page size. This calculates maximum page size as a minimum of the natural region alignment and compound page size. For the page shift this uses the shift returned by find_linux_pte() which indicates how the page is mapped to the current userspace - if the page is huge and this is not a zero, then it is a leaf pte and the page is mapped within the range. Fixes: 121f80ba68f1 ("KVM: PPC: VFIO: Add in-kernel acceleration for VFIO") Cc: stable@vger.kernel.org # v4.12+ Signed-off-by: Alexey Kardashevskiy <aik@ozlabs.ru> Reviewed-by: David Gibson <david@gibson.dropbear.id.au> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
2018-07-17 15:19:13 +08:00
unsigned long ua, unsigned int pageshift, unsigned long *hpa)
{
const long entry = (ua - mem->ua) >> PAGE_SHIFT;
unsigned long *pa;
if (entry >= mem->entries)
return -EFAULT;
KVM: PPC: Check if IOMMU page is contained in the pinned physical page A VM which has: - a DMA capable device passed through to it (eg. network card); - running a malicious kernel that ignores H_PUT_TCE failure; - capability of using IOMMU pages bigger that physical pages can create an IOMMU mapping that exposes (for example) 16MB of the host physical memory to the device when only 64K was allocated to the VM. The remaining 16MB - 64K will be some other content of host memory, possibly including pages of the VM, but also pages of host kernel memory, host programs or other VMs. The attacking VM does not control the location of the page it can map, and is only allowed to map as many pages as it has pages of RAM. We already have a check in drivers/vfio/vfio_iommu_spapr_tce.c that an IOMMU page is contained in the physical page so the PCI hardware won't get access to unassigned host memory; however this check is missing in the KVM fastpath (H_PUT_TCE accelerated code). We were lucky so far and did not hit this yet as the very first time when the mapping happens we do not have tbl::it_userspace allocated yet and fall back to the userspace which in turn calls VFIO IOMMU driver, this fails and the guest does not retry, This stores the smallest preregistered page size in the preregistered region descriptor and changes the mm_iommu_xxx API to check this against the IOMMU page size. This calculates maximum page size as a minimum of the natural region alignment and compound page size. For the page shift this uses the shift returned by find_linux_pte() which indicates how the page is mapped to the current userspace - if the page is huge and this is not a zero, then it is a leaf pte and the page is mapped within the range. Fixes: 121f80ba68f1 ("KVM: PPC: VFIO: Add in-kernel acceleration for VFIO") Cc: stable@vger.kernel.org # v4.12+ Signed-off-by: Alexey Kardashevskiy <aik@ozlabs.ru> Reviewed-by: David Gibson <david@gibson.dropbear.id.au> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
2018-07-17 15:19:13 +08:00
if (pageshift > mem->pageshift)
return -EFAULT;
if (!mem->hpas) {
*hpa = mem->dev_hpa + (ua - mem->ua);
return 0;
}
pa = (void *) vmalloc_to_phys(&mem->hpas[entry]);
if (!pa)
return -EFAULT;
KVM: PPC: Avoid marking DMA-mapped pages dirty in real mode At the moment the real mode handler of H_PUT_TCE calls iommu_tce_xchg_rm() which in turn reads the old TCE and if it was a valid entry, marks the physical page dirty if it was mapped for writing. Since it is in real mode, realmode_pfn_to_page() is used instead of pfn_to_page() to get the page struct. However SetPageDirty() itself reads the compound page head and returns a virtual address for the head page struct and setting dirty bit for that kills the system. This adds additional dirty bit tracking into the MM/IOMMU API for use in the real mode. Note that this does not change how VFIO and KVM (in virtual mode) set this bit. The KVM (real mode) changes include: - use the lowest bit of the cached host phys address to carry the dirty bit; - mark pages dirty when they are unpinned which happens when the preregistered memory is released which always happens in virtual mode; - add mm_iommu_ua_mark_dirty_rm() helper to set delayed dirty bit; - change iommu_tce_xchg_rm() to take the kvm struct for the mm to use in the new mm_iommu_ua_mark_dirty_rm() helper; - move iommu_tce_xchg_rm() to book3s_64_vio_hv.c (which is the only caller anyway) to reduce the real mode KVM and IOMMU knowledge across different subsystems. This removes realmode_pfn_to_page() as it is not used anymore. While we at it, remove some EXPORT_SYMBOL_GPL() as that code is for the real mode only and modules cannot call it anyway. Signed-off-by: Alexey Kardashevskiy <aik@ozlabs.ru> Reviewed-by: David Gibson <david@gibson.dropbear.id.au> Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
2018-09-10 16:29:07 +08:00
*hpa = (*pa & MM_IOMMU_TABLE_GROUP_PAGE_MASK) | (ua & ~PAGE_MASK);
return 0;
}
KVM: PPC: Avoid marking DMA-mapped pages dirty in real mode At the moment the real mode handler of H_PUT_TCE calls iommu_tce_xchg_rm() which in turn reads the old TCE and if it was a valid entry, marks the physical page dirty if it was mapped for writing. Since it is in real mode, realmode_pfn_to_page() is used instead of pfn_to_page() to get the page struct. However SetPageDirty() itself reads the compound page head and returns a virtual address for the head page struct and setting dirty bit for that kills the system. This adds additional dirty bit tracking into the MM/IOMMU API for use in the real mode. Note that this does not change how VFIO and KVM (in virtual mode) set this bit. The KVM (real mode) changes include: - use the lowest bit of the cached host phys address to carry the dirty bit; - mark pages dirty when they are unpinned which happens when the preregistered memory is released which always happens in virtual mode; - add mm_iommu_ua_mark_dirty_rm() helper to set delayed dirty bit; - change iommu_tce_xchg_rm() to take the kvm struct for the mm to use in the new mm_iommu_ua_mark_dirty_rm() helper; - move iommu_tce_xchg_rm() to book3s_64_vio_hv.c (which is the only caller anyway) to reduce the real mode KVM and IOMMU knowledge across different subsystems. This removes realmode_pfn_to_page() as it is not used anymore. While we at it, remove some EXPORT_SYMBOL_GPL() as that code is for the real mode only and modules cannot call it anyway. Signed-off-by: Alexey Kardashevskiy <aik@ozlabs.ru> Reviewed-by: David Gibson <david@gibson.dropbear.id.au> Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
2018-09-10 16:29:07 +08:00
extern void mm_iommu_ua_mark_dirty_rm(struct mm_struct *mm, unsigned long ua)
{
struct mm_iommu_table_group_mem_t *mem;
long entry;
void *va;
unsigned long *pa;
mem = mm_iommu_lookup_rm(mm, ua, PAGE_SIZE);
if (!mem)
return;
if (mem->dev_hpa != MM_IOMMU_TABLE_INVALID_HPA)
return;
KVM: PPC: Avoid marking DMA-mapped pages dirty in real mode At the moment the real mode handler of H_PUT_TCE calls iommu_tce_xchg_rm() which in turn reads the old TCE and if it was a valid entry, marks the physical page dirty if it was mapped for writing. Since it is in real mode, realmode_pfn_to_page() is used instead of pfn_to_page() to get the page struct. However SetPageDirty() itself reads the compound page head and returns a virtual address for the head page struct and setting dirty bit for that kills the system. This adds additional dirty bit tracking into the MM/IOMMU API for use in the real mode. Note that this does not change how VFIO and KVM (in virtual mode) set this bit. The KVM (real mode) changes include: - use the lowest bit of the cached host phys address to carry the dirty bit; - mark pages dirty when they are unpinned which happens when the preregistered memory is released which always happens in virtual mode; - add mm_iommu_ua_mark_dirty_rm() helper to set delayed dirty bit; - change iommu_tce_xchg_rm() to take the kvm struct for the mm to use in the new mm_iommu_ua_mark_dirty_rm() helper; - move iommu_tce_xchg_rm() to book3s_64_vio_hv.c (which is the only caller anyway) to reduce the real mode KVM and IOMMU knowledge across different subsystems. This removes realmode_pfn_to_page() as it is not used anymore. While we at it, remove some EXPORT_SYMBOL_GPL() as that code is for the real mode only and modules cannot call it anyway. Signed-off-by: Alexey Kardashevskiy <aik@ozlabs.ru> Reviewed-by: David Gibson <david@gibson.dropbear.id.au> Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
2018-09-10 16:29:07 +08:00
entry = (ua - mem->ua) >> PAGE_SHIFT;
va = &mem->hpas[entry];
pa = (void *) vmalloc_to_phys(va);
if (!pa)
return;
*pa |= MM_IOMMU_TABLE_GROUP_PAGE_DIRTY;
}
bool mm_iommu_is_devmem(struct mm_struct *mm, unsigned long hpa,
unsigned int pageshift, unsigned long *size)
{
struct mm_iommu_table_group_mem_t *mem;
unsigned long end;
list_for_each_entry_rcu(mem, &mm->context.iommu_group_mem_list, next) {
if (mem->dev_hpa == MM_IOMMU_TABLE_INVALID_HPA)
continue;
end = mem->dev_hpa + (mem->entries << PAGE_SHIFT);
if ((mem->dev_hpa <= hpa) && (hpa < end)) {
/*
* Since the IOMMU page size might be bigger than
* PAGE_SIZE, the amount of preregistered memory
* starting from @hpa might be smaller than 1<<pageshift
* and the caller needs to distinguish this situation.
*/
*size = min(1UL << pageshift, end - hpa);
return true;
}
}
return false;
}
EXPORT_SYMBOL_GPL(mm_iommu_is_devmem);
powerpc/mmu: Add userspace-to-physical addresses translation cache We are adding support for DMA memory pre-registration to be used in conjunction with VFIO. The idea is that the userspace which is going to run a guest may want to pre-register a user space memory region so it all gets pinned once and never goes away. Having this done, a hypervisor will not have to pin/unpin pages on every DMA map/unmap request. This is going to help with multiple pinning of the same memory. Another use of it is in-kernel real mode (mmu off) acceleration of DMA requests where real time translation of guest physical to host physical addresses is non-trivial and may fail as linux ptes may be temporarily invalid. Also, having cached host physical addresses (compared to just pinning at the start and then walking the page table again on every H_PUT_TCE), we can be sure that the addresses which we put into TCE table are the ones we already pinned. This adds a list of memory regions to mm_context_t. Each region consists of a header and a list of physical addresses. This adds API to: 1. register/unregister memory regions; 2. do final cleanup (which puts all pre-registered pages); 3. do userspace to physical address translation; 4. manage usage counters; multiple registration of the same memory is allowed (once per container). This implements 2 counters per registered memory region: - @mapped: incremented on every DMA mapping; decremented on unmapping; initialized to 1 when a region is just registered; once it becomes zero, no more mappings allowe; - @used: incremented on every "register" ioctl; decremented on "unregister"; unregistration is allowed for DMA mapped regions unless it is the very last reference. For the very last reference this checks that the region is still mapped and returns -EBUSY so the userspace gets to know that memory is still pinned and unregistration needs to be retried; @used remains 1. Host physical addresses are stored in vmalloc'ed array. In order to access these in the real mode (mmu off), there is a real_vmalloc_addr() helper. In-kernel acceleration patchset will move it from KVM to MMU code. Signed-off-by: Alexey Kardashevskiy <aik@ozlabs.ru> Reviewed-by: David Gibson <david@gibson.dropbear.id.au> Reviewed-by: David Gibson <david@gibson.dropbear.id.au> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
2015-06-05 14:35:24 +08:00
long mm_iommu_mapped_inc(struct mm_iommu_table_group_mem_t *mem)
{
if (atomic64_inc_not_zero(&mem->mapped))
return 0;
/* Last mm_iommu_put() has been called, no more mappings allowed() */
return -ENXIO;
}
EXPORT_SYMBOL_GPL(mm_iommu_mapped_inc);
void mm_iommu_mapped_dec(struct mm_iommu_table_group_mem_t *mem)
{
atomic64_add_unless(&mem->mapped, -1, 1);
}
EXPORT_SYMBOL_GPL(mm_iommu_mapped_dec);
void mm_iommu_init(struct mm_struct *mm)
powerpc/mmu: Add userspace-to-physical addresses translation cache We are adding support for DMA memory pre-registration to be used in conjunction with VFIO. The idea is that the userspace which is going to run a guest may want to pre-register a user space memory region so it all gets pinned once and never goes away. Having this done, a hypervisor will not have to pin/unpin pages on every DMA map/unmap request. This is going to help with multiple pinning of the same memory. Another use of it is in-kernel real mode (mmu off) acceleration of DMA requests where real time translation of guest physical to host physical addresses is non-trivial and may fail as linux ptes may be temporarily invalid. Also, having cached host physical addresses (compared to just pinning at the start and then walking the page table again on every H_PUT_TCE), we can be sure that the addresses which we put into TCE table are the ones we already pinned. This adds a list of memory regions to mm_context_t. Each region consists of a header and a list of physical addresses. This adds API to: 1. register/unregister memory regions; 2. do final cleanup (which puts all pre-registered pages); 3. do userspace to physical address translation; 4. manage usage counters; multiple registration of the same memory is allowed (once per container). This implements 2 counters per registered memory region: - @mapped: incremented on every DMA mapping; decremented on unmapping; initialized to 1 when a region is just registered; once it becomes zero, no more mappings allowe; - @used: incremented on every "register" ioctl; decremented on "unregister"; unregistration is allowed for DMA mapped regions unless it is the very last reference. For the very last reference this checks that the region is still mapped and returns -EBUSY so the userspace gets to know that memory is still pinned and unregistration needs to be retried; @used remains 1. Host physical addresses are stored in vmalloc'ed array. In order to access these in the real mode (mmu off), there is a real_vmalloc_addr() helper. In-kernel acceleration patchset will move it from KVM to MMU code. Signed-off-by: Alexey Kardashevskiy <aik@ozlabs.ru> Reviewed-by: David Gibson <david@gibson.dropbear.id.au> Reviewed-by: David Gibson <david@gibson.dropbear.id.au> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
2015-06-05 14:35:24 +08:00
{
INIT_LIST_HEAD_RCU(&mm->context.iommu_group_mem_list);
powerpc/mmu: Add userspace-to-physical addresses translation cache We are adding support for DMA memory pre-registration to be used in conjunction with VFIO. The idea is that the userspace which is going to run a guest may want to pre-register a user space memory region so it all gets pinned once and never goes away. Having this done, a hypervisor will not have to pin/unpin pages on every DMA map/unmap request. This is going to help with multiple pinning of the same memory. Another use of it is in-kernel real mode (mmu off) acceleration of DMA requests where real time translation of guest physical to host physical addresses is non-trivial and may fail as linux ptes may be temporarily invalid. Also, having cached host physical addresses (compared to just pinning at the start and then walking the page table again on every H_PUT_TCE), we can be sure that the addresses which we put into TCE table are the ones we already pinned. This adds a list of memory regions to mm_context_t. Each region consists of a header and a list of physical addresses. This adds API to: 1. register/unregister memory regions; 2. do final cleanup (which puts all pre-registered pages); 3. do userspace to physical address translation; 4. manage usage counters; multiple registration of the same memory is allowed (once per container). This implements 2 counters per registered memory region: - @mapped: incremented on every DMA mapping; decremented on unmapping; initialized to 1 when a region is just registered; once it becomes zero, no more mappings allowe; - @used: incremented on every "register" ioctl; decremented on "unregister"; unregistration is allowed for DMA mapped regions unless it is the very last reference. For the very last reference this checks that the region is still mapped and returns -EBUSY so the userspace gets to know that memory is still pinned and unregistration needs to be retried; @used remains 1. Host physical addresses are stored in vmalloc'ed array. In order to access these in the real mode (mmu off), there is a real_vmalloc_addr() helper. In-kernel acceleration patchset will move it from KVM to MMU code. Signed-off-by: Alexey Kardashevskiy <aik@ozlabs.ru> Reviewed-by: David Gibson <david@gibson.dropbear.id.au> Reviewed-by: David Gibson <david@gibson.dropbear.id.au> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
2015-06-05 14:35:24 +08:00
}