docs/vm: Minor editorial changes in the THP and hugetlbfs
Some minor wording changes and typo corrections.

Signed-off-by: Ralph Campbell <rcampbell@nvidia.com>
Acked-by: Randy Dunlap <rdunlap@infradead.org>
Acked-by: Mike Kravetz <mike.kravetz@oracle.com>
Cc: Mike Rapoport <rppt@linux.vnet.ibm.com>
Signed-off-by: Jonathan Corbet <corbet@lwn.net>
parent 7d10bdbd6d
commit 41f0a9542a
@@ -85,10 +85,10 @@ Reservation Map Location (Private or Shared)
 A huge page mapping or segment is either private or shared. If private,
 it is typically only available to a single address space (task). If shared,
 it can be mapped into multiple address spaces (tasks). The location and
-semantics of the reservation map is significantly different for two types
+semantics of the reservation map is significantly different for the two types
 of mappings. Location differences are:
 
-- For private mappings, the reservation map hangs off the the VMA structure.
+- For private mappings, the reservation map hangs off the VMA structure.
   Specifically, vma->vm_private_data. This reserve map is created at the
   time the mapping (mmap(MAP_PRIVATE)) is created.
 - For shared mappings, the reservation map hangs off the inode. Specifically,
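Illustration only (not part of the patch): how the two locations differ in code. The private-mapping flag bits are glossed over and the shared-case lookup helper is omitted::

  /* Private mapping: the reserve map hangs off the VMA.  The cast is
   * simplified -- the kernel also keeps flag bits (such as the "owner"
   * flag) in the low bits of vm_private_data. */
  struct resv_map *resv = (struct resv_map *)vma->vm_private_data;

  /* Shared mapping: the reserve map hangs off the inode instead and is
   * reached through the inode rather than the VMA; the helper the
   * kernel uses for that lookup is not shown here. */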
@@ -109,15 +109,15 @@ These operations result in a call to the routine hugetlb_reserve_pages()::
                                   struct vm_area_struct *vma,
                                   vm_flags_t vm_flags)
 
-The first thing hugetlb_reserve_pages() does is check for the NORESERVE
+The first thing hugetlb_reserve_pages() does is check if the NORESERVE
 flag was specified in either the shmget() or mmap() call. If NORESERVE
-was specified, then this routine returns immediately as no reservation
+was specified, then this routine returns immediately as no reservations
 are desired.
 
 The arguments 'from' and 'to' are huge page indices into the mapping or
 underlying file. For shmget(), 'from' is always 0 and 'to' corresponds to
 the length of the segment/mapping. For mmap(), the offset argument could
-be used to specify the offset into the underlying file. In such a case
+be used to specify the offset into the underlying file. In such a case,
 the 'from' and 'to' arguments have been adjusted by this offset.
 
 One of the big differences between PRIVATE and SHARED mappings is the way
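To make the 'from'/'to' index arithmetic described above concrete, here is a small sketch (not from the patch; the 2 MiB huge page size is an assumption for the example, and offset/length are assumed huge-page aligned)::

  #include <stddef.h>

  /* Compute the huge page indices hugetlb_reserve_pages() is described
   * as receiving for an mmap() of a hugetlbfs file at 'offset' for
   * 'length' bytes. */
  static void huge_page_indices(size_t offset, size_t length,
                                size_t *from, size_t *to)
  {
          const size_t hpage_size = 2u << 20;     /* assumed 2 MiB pages */

          *from = offset / hpage_size;            /* first huge page index */
          *to = (offset + length) / hpage_size;   /* index past the range */
          /* For shmget(), offset is effectively 0, so *from == 0. */
  }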
@@ -138,7 +138,8 @@ to indicate this VMA owns the reservations.
 
 The reservation map is consulted to determine how many huge page reservations
 are needed for the current mapping/segment. For private mappings, this is
-always the value (to - from). However, for shared mappings it is possible that some reservations may already exist within the range (to - from). See the
+always the value (to - from). However, for shared mappings it is possible that
+some reservations may already exist within the range (to - from). See the
 section :ref:`Reservation Map Modifications <resv_map_modifications>`
 for details on how this is accomplished.
 
@@ -165,7 +166,7 @@ these counters.
 If there were enough free huge pages and the global count resv_huge_pages
 was adjusted, then the reservation map associated with the mapping is
 modified to reflect the reservations. In the case of a shared mapping, a
-file_region will exist that includes the range 'from' 'to'. For private
+file_region will exist that includes the range 'from' - 'to'. For private
 mappings, no modifications are made to the reservation map as lack of an
 entry indicates a reservation exists.
 
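For context, the file_region the text refers to is conceptually just an interval on the reserve map's region list. A simplified, illustrative layout (the name and exact fields here are a sketch, not quoted from the kernel)::

  /* Sketch of a reserve map entry: after a successful shared
   * reservation, one such region covers the huge page indices
   * [from, to). */
  struct file_region_sketch {
          struct list_head link;  /* linked into the reserve map's region list */
          long from;              /* first reserved huge page index */
          long to;                /* index one past the last reserved page */
  };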
@@ -239,7 +240,7 @@ subpool accounting when the page is freed.
 The routine vma_commit_reservation() is then called to adjust the reserve
 map based on the consumption of the reservation. In general, this involves
 ensuring the page is represented within a file_region structure of the region
-map. For shared mappings where the the reservation was present, an entry
+map. For shared mappings where the reservation was present, an entry
 in the reserve map already existed so no change is made. However, if there
 was no reservation in a shared mapping or this was a private mapping a new
 entry must be created.
@@ -4,8 +4,9 @@
 Transparent Hugepage Support
 ============================
 
-This document describes design principles Transparent Hugepage (THP)
-Support and its interaction with other parts of the memory management.
+This document describes design principles for Transparent Hugepage (THP)
+support and its interaction with other parts of the memory management
+system.
 
 Design principles
 =================
@@ -37,23 +38,23 @@ get_user_pages and follow_page
 
 get_user_pages and follow_page if run on a hugepage, will return the
 head or tail pages as usual (exactly as they would do on
-hugetlbfs). Most gup users will only care about the actual physical
+hugetlbfs). Most GUP users will only care about the actual physical
 address of the page and its temporary pinning to release after the I/O
 is complete, so they won't ever notice the fact the page is huge. But
 if any driver is going to mangle over the page structure of the tail
 page (like for checking page->mapping or other bits that are relevant
 for the head page and not the tail page), it should be updated to jump
-to check head page instead. Taking reference on any head/tail page would
-prevent page from being split by anyone.
+to check head page instead. Taking a reference on any head/tail page would
+prevent the page from being split by anyone.
 
 .. note::
    these aren't new constraints to the GUP API, and they match the
-   same constrains that applies to hugetlbfs too, so any driver capable
+   same constraints that apply to hugetlbfs too, so any driver capable
    of handling GUP on hugetlbfs will also work fine on transparent
    hugepage backed mappings.
 
 In case you can't handle compound pages if they're returned by
-follow_page, the FOLL_SPLIT bit can be specified as parameter to
+follow_page, the FOLL_SPLIT bit can be specified as a parameter to
 follow_page, so that it will split the hugepages before returning
 them.
 
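A hedged sketch of the FOLL_SPLIT usage described above (not from the patch; error handling and mmap_sem locking are omitted, and vma/addr are assumed to come from the caller)::

  struct page *page;

  /* Ask follow_page() to split any THP first so only base pages come
   * back; FOLL_GET takes the temporary pin the text mentions. */
  page = follow_page(vma, addr, FOLL_GET | FOLL_SPLIT);
  if (!page)
          return -EFAULT;         /* simplified error path */

  /* ... operate on the base page ... */
  put_page(page);                 /* drop the pin once the I/O is done */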
@@ -66,11 +67,11 @@ pmd_offset. It's trivial to make the code transparent hugepage aware
 by just grepping for "pmd_offset" and adding split_huge_pmd where
 missing after pmd_offset returns the pmd. Thanks to the graceful
 fallback design, with a one liner change, you can avoid to write
-hundred if not thousand of lines of complex code to make your code
+hundreds if not thousands of lines of complex code to make your code
 hugepage aware.
 
 If you're not walking pagetables but you run into a physical hugepage
-but you can't handle it natively in your code, you can split it by
+that you can't handle natively in your code, you can split it by
 calling split_huge_page(page). This is what the Linux VM does before
 it tries to swapout the hugepage for example. split_huge_page() can fail
 if the page is pinned and you must handle this correctly.
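The "one liner" fallback mentioned above looks roughly like this (sketch only, not part of the patch; pud, addr and vma are assumed to exist in the surrounding walk)::

  pmd = pmd_offset(pud, addr);
  split_huge_pmd(vma, pmd, addr);   /* harmless if *pmd is not a huge pmd */
  /* ... continue with the existing pte-level code ... */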
@@ -97,18 +98,18 @@ split_huge_page() or split_huge_pmd() has a cost.
 
 To make pagetable walks huge pmd aware, all you need to do is to call
 pmd_trans_huge() on the pmd returned by pmd_offset. You must hold the
-mmap_sem in read (or write) mode to be sure an huge pmd cannot be
+mmap_sem in read (or write) mode to be sure a huge pmd cannot be
 created from under you by khugepaged (khugepaged collapse_huge_page
 takes the mmap_sem in write mode in addition to the anon_vma lock). If
 pmd_trans_huge returns false, you just fallback in the old code
 paths. If instead pmd_trans_huge returns true, you have to take the
 page table lock (pmd_lock()) and re-run pmd_trans_huge. Taking the
-page table lock will prevent the huge pmd to be converted into a
+page table lock will prevent the huge pmd being converted into a
 regular pmd from under you (split_huge_pmd can run in parallel to the
 pagetable walk). If the second pmd_trans_huge returns false, you
 should just drop the page table lock and fallback to the old code as
-before. Otherwise you can proceed to process the huge pmd and the
-hugepage natively. Once finished you can drop the page table lock.
+before. Otherwise, you can proceed to process the huge pmd and the
+hugepage natively. Once finished, you can drop the page table lock.
 
 Refcounts and transparent huge pages
 ====================================
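A minimal sketch of the locking pattern described above (not from the patch; it assumes mmap_sem is already held for read and that pud/addr come from the usual page table walk)::

  pmd_t *pmd = pmd_offset(pud, addr);

  if (pmd_trans_huge(*pmd)) {
          spinlock_t *ptl = pmd_lock(vma->vm_mm, pmd);

          if (pmd_trans_huge(*pmd)) {
                  /* still huge: process the huge pmd/hugepage natively */
                  spin_unlock(ptl);
          } else {
                  /* split under us: drop the lock, use the old path */
                  spin_unlock(ptl);
          }
  } else {
          /* regular pmd: fall back to the existing pte-level walk */
  }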
@@ -116,61 +117,61 @@ Refcounts and transparent huge pages
 Refcounting on THP is mostly consistent with refcounting on other compound
 pages:
 
-- get_page()/put_page() and GUP operate in head page's ->_refcount.
+- get_page()/put_page() and GUP operate on head page's ->_refcount.
 
 - ->_refcount in tail pages is always zero: get_page_unless_zero() never
-  succeed on tail pages.
+  succeeds on tail pages.
 
 - map/unmap of the pages with PTE entry increment/decrement ->_mapcount
   on relevant sub-page of the compound page.
 
-- map/unmap of the whole compound page accounted in compound_mapcount
+- map/unmap of the whole compound page is accounted for in compound_mapcount
   (stored in first tail page). For file huge pages, we also increment
   ->_mapcount of all sub-pages in order to have race-free detection of
   last unmap of subpages.
 
 PageDoubleMap() indicates that the page is *possibly* mapped with PTEs.
 
-For anonymous pages PageDoubleMap() also indicates ->_mapcount in all
+For anonymous pages, PageDoubleMap() also indicates ->_mapcount in all
 subpages is offset up by one. This additional reference is required to
 get race-free detection of unmap of subpages when we have them mapped with
 both PMDs and PTEs.
 
-This is optimization required to lower overhead of per-subpage mapcount
-tracking. The alternative is alter ->_mapcount in all subpages on each
+This optimization is required to lower the overhead of per-subpage mapcount
+tracking. The alternative is to alter ->_mapcount in all subpages on each
 map/unmap of the whole compound page.
 
-For anonymous pages, we set PG_double_map when a PMD of the page got split
-for the first time, but still have PMD mapping. The additional references
-go away with last compound_mapcount.
+For anonymous pages, we set PG_double_map when a PMD of the page is split
+for the first time, but still have a PMD mapping. The additional references
+go away with the last compound_mapcount.
 
-File pages get PG_double_map set on first map of the page with PTE and
-goes away when the page gets evicted from page cache.
+File pages get PG_double_map set on the first map of the page with PTE and
+goes away when the page gets evicted from the page cache.
 
 split_huge_page internally has to distribute the refcounts in the head
 page to the tail pages before clearing all PG_head/tail bits from the page
 structures. It can be done easily for refcounts taken by page table
-entries. But we don't have enough information on how to distribute any
+entries, but we don't have enough information on how to distribute any
 additional pins (i.e. from get_user_pages). split_huge_page() fails any
-requests to split pinned huge page: it expects page count to be equal to
-sum of mapcount of all sub-pages plus one (split_huge_page caller must
-have reference for head page).
+requests to split pinned huge pages: it expects page count to be equal to
+the sum of mapcount of all sub-pages plus one (split_huge_page caller must
+have a reference to the head page).
 
 split_huge_page uses migration entries to stabilize page->_refcount and
-page->_mapcount of anonymous pages. File pages just got unmapped.
+page->_mapcount of anonymous pages. File pages just get unmapped.
 
-We safe against physical memory scanners too: the only legitimate way
-scanner can get reference to a page is get_page_unless_zero().
+We are safe against physical memory scanners too: the only legitimate way
+a scanner can get a reference to a page is get_page_unless_zero().
 
 All tail pages have zero ->_refcount until atomic_add(). This prevents the
 scanner from getting a reference to the tail page up to that point. After the
-atomic_add() we don't care about the ->_refcount value. We already known how
+atomic_add() we don't care about the ->_refcount value. We already know how
 many references should be uncharged from the head page.
 
 For head page get_page_unless_zero() will succeed and we don't mind. It's
-clear where reference should go after split: it will stay on head page.
+clear where references should go after split: it will stay on the head page.
 
-Note that split_huge_pmd() doesn't have any limitation on refcounting:
+Note that split_huge_pmd() doesn't have any limitations on refcounting:
 pmd can be split at any point and never fails.
 
 Partial unmap and deferred_split_huge_page()
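The "page count equals the sum of mapcounts plus one" rule above can be written as a simplified expression (illustrative only; the kernel's actual check is more involved and also accounts for extra pins such as page cache or swap cache references)::

  /* 'head' is the compound head page the caller holds a reference on. */
  bool may_split = page_count(head) == total_mapcount(head) + 1;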
@@ -182,10 +183,10 @@ in page_remove_rmap() and queue the THP for splitting if memory pressure
 comes. Splitting will free up unused subpages.
 
 Splitting the page right away is not an option due to locking context in
-the place where we can detect partial unmap. It's also might be
+the place where we can detect partial unmap. It also might be
 counterproductive since in many cases partial unmap happens during exit(2) if
 a THP crosses a VMA boundary.
 
-Function deferred_split_huge_page() is used to queue page for splitting.
+The function deferred_split_huge_page() is used to queue a page for splitting.
 The splitting itself will happen when we get memory pressure via shrinker
 interface.