try_to_compact_pages() can currently return COMPACT_SKIPPED even when
the compaction is defered for some zone just because zone DMA is skipped
in 99% of cases due to watermark checks. This makes COMPACT_DEFERRED
basically unusable for the page allocator as a feedback mechanism.
Make sure we distinguish those two states properly and switch their
ordering in the enum. This would mean that the COMPACT_SKIPPED will be
returned only when all eligible zones are skipped.
As a result COMPACT_DEFERRED handling for THP in __alloc_pages_slowpath
will be more precise and we would bail out rather than reclaim.
Signed-off-by: Michal Hocko <mhocko@suse.com>
Acked-by: Vlastimil Babka <vbabka@suse.cz>
Acked-by: Hillf Danton <hillf.zj@alibaba-inc.com>
Cc: David Rientjes <rientjes@google.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Joonsoo Kim <js1304@gmail.com>
Cc: Mel Gorman <mgorman@techsingularity.net>
Cc: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
Cc: Vladimir Davydov <vdavydov@virtuozzo.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
The compiler is complaining after "mm, compaction: change COMPACT_
constants into enum"
mm/compaction.c: In function `compact_zone':
mm/compaction.c:1350:2: warning: enumeration value `COMPACT_DEFERRED' not handled in switch [-Wswitch]
switch (ret) {
^
mm/compaction.c:1350:2: warning: enumeration value `COMPACT_COMPLETE' not handled in switch [-Wswitch]
mm/compaction.c:1350:2: warning: enumeration value `COMPACT_NO_SUITABLE_PAGE' not handled in switch [-Wswitch]
mm/compaction.c:1350:2: warning: enumeration value `COMPACT_NOT_SUITABLE_ZONE' not handled in switch [-Wswitch]
mm/compaction.c:1350:2: warning: enumeration value `COMPACT_CONTENDED' not handled in switch [-Wswitch]
compaction_suitable is allowed to return only COMPACT_PARTIAL,
COMPACT_SKIPPED and COMPACT_CONTINUE so other cases are simply
impossible. Put a VM_BUG_ON to catch an impossible return value.
Signed-off-by: Michal Hocko <mhocko@suse.com>
Acked-by: Vlastimil Babka <vbabka@suse.cz>
Acked-by: Hillf Danton <hillf.zj@alibaba-inc.com>
Cc: David Rientjes <rientjes@google.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Joonsoo Kim <js1304@gmail.com>
Cc: Mel Gorman <mgorman@techsingularity.net>
Cc: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
Cc: Vladimir Davydov <vdavydov@virtuozzo.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Compaction code is doing weird dances between COMPACT_FOO -> int ->
unsigned long
But there doesn't seem to be any reason for that. All functions which
return/use one of those constants are not expecting any other value so it
really makes sense to define an enum for them and make it clear that no
other values are expected.
This is a pure cleanup and shouldn't introduce any functional changes.
Signed-off-by: Michal Hocko <mhocko@suse.com>
Acked-by: Vlastimil Babka <vbabka@suse.cz>
Acked-by: Hillf Danton <hillf.zj@alibaba-inc.com>
Cc: David Rientjes <rientjes@google.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Joonsoo Kim <js1304@gmail.com>
Cc: Mel Gorman <mgorman@techsingularity.net>
Cc: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
Cc: Vladimir Davydov <vdavydov@virtuozzo.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Motivation:
As pointed out by Linus [2][3] relying on zone_reclaimable as a way to
communicate the reclaim progress is rater dubious. I tend to agree,
not only it is really obscure, it is not hard to imagine cases where a
single page freed in the loop keeps all the reclaimers looping without
getting any progress because their gfp_mask wouldn't allow to get that
page anyway (e.g. single GFP_ATOMIC alloc and free loop). This is rather
rare so it doesn't happen in the practice but the current logic which we
have is rather obscure and hard to follow a also non-deterministic.
This is an attempt to make the OOM detection more deterministic and
easier to follow because each reclaimer basically tracks its own
progress which is implemented at the page allocator layer rather spread
out between the allocator and the reclaim. The more on the
implementation is described in the first patch.
I have tested several different scenarios but it should be clear that
testing OOM killer is quite hard to be representative. There is usually
a tiny gap between almost OOM and full blown OOM which is often time
sensitive. Anyway, I have tested the following 2 scenarios and I would
appreciate if there are more to test.
Testing environment: a virtual machine with 2G of RAM and 2CPUs without
any swap to make the OOM more deterministic.
1) 2 writers (each doing dd with 4M blocks to an xfs partition with 1G
file size, removes the files and starts over again) running in
parallel for 10s to build up a lot of dirty pages when 100 parallel
mem_eaters (anon private populated mmap which waits until it gets
signal) with 80M each.
This causes an OOM flood of course and I have compared both patched
and unpatched kernels. The test is considered finished after there
are no OOM conditions detected. This should tell us whether there are
any excessive kills or some of them premature (e.g. due to dirty pages):
I have performed two runs this time each after a fresh boot.
* base kernel
$ grep "Out of memory:" base-oom-run1.log | wc -l
78
$ grep "Out of memory:" base-oom-run2.log | wc -l
78
$ grep "Kill process" base-oom-run1.log | tail -n1
[ 91.391203] Out of memory: Kill process 3061 (mem_eater) score 39 or sacrifice child
$ grep "Kill process" base-oom-run2.log | tail -n1
[ 82.141919] Out of memory: Kill process 3086 (mem_eater) score 39 or sacrifice child
$ grep "DMA32 free:" base-oom-run1.log | sed 's@.*free:\([0-9]*\)kB.*@\1@' | calc_min_max.awk
min: 5376.00 max: 6776.00 avg: 5530.75 std: 166.50 nr: 61
$ grep "DMA32 free:" base-oom-run2.log | sed 's@.*free:\([0-9]*\)kB.*@\1@' | calc_min_max.awk
min: 5416.00 max: 5608.00 avg: 5514.15 std: 42.94 nr: 52
$ grep "DMA32.*all_unreclaimable? no" base-oom-run1.log | wc -l
1
$ grep "DMA32.*all_unreclaimable? no" base-oom-run2.log | wc -l
3
* patched kernel
$ grep "Out of memory:" patched-oom-run1.log | wc -l
78
miso@tiehlicka /mnt/share/devel/miso/kvm $ grep "Out of memory:" patched-oom-run2.log | wc -l
77
e grep "Kill process" patched-oom-run1.log | tail -n1
[ 497.317732] Out of memory: Kill process 3108 (mem_eater) score 39 or sacrifice child
$ grep "Kill process" patched-oom-run2.log | tail -n1
[ 316.169920] Out of memory: Kill process 3093 (mem_eater) score 39 or sacrifice child
$ grep "DMA32 free:" patched-oom-run1.log | sed 's@.*free:\([0-9]*\)kB.*@\1@' | calc_min_max.awk
min: 5420.00 max: 5808.00 avg: 5513.90 std: 60.45 nr: 78
$ grep "DMA32 free:" patched-oom-run2.log | sed 's@.*free:\([0-9]*\)kB.*@\1@' | calc_min_max.awk
min: 5380.00 max: 6384.00 avg: 5520.94 std: 136.84 nr: 77
e grep "DMA32.*all_unreclaimable? no" patched-oom-run1.log | wc -l
2
$ grep "DMA32.*all_unreclaimable? no" patched-oom-run2.log | wc -l
3
The patched kernel run noticeably longer while invoking OOM killer same
number of times. This means that the original implementation is much
more aggressive and triggers the OOM killer sooner. free pages stats
show that neither kernels went OOM too early most of the time, though. I
guess the difference is in the backoff when retries without any progress
do sleep for a while if there is memory under writeback or dirty which
is highly likely considering the parallel IO.
Both kernels have seen races where zone wasn't marked unreclaimable
and we still hit the OOM killer. This is most likely a race where
a task managed to exit between the last allocation attempt and the oom
killer invocation.
2) 2 writers again with 10s of run and then 10 mem_eaters to consume as much
memory as possible without triggering the OOM killer. This required a lot
of tuning but I've considered 3 consecutive runs in three different boots
without OOM as a success.
* base kernel
size=$(awk '/MemFree/{printf "%dK", ($2/10)-(16*1024)}' /proc/meminfo)
* patched kernel
size=$(awk '/MemFree/{printf "%dK", ($2/10)-(12*1024)}' /proc/meminfo)
That means 40M more memory was usable without triggering OOM killer. The
base kernel sometimes managed to handle the same as patched but it
wasn't consistent and failed in at least on of the 3 runs. This seems
like a minor improvement.
I was testing also GPF_REPEAT costly requests (hughetlb) with fragmented
memory and under memory pressure. The results are in patch 11 where the
logic is implemented. In short I can see huge improvement there.
I am certainly interested in other usecases as well as well as any
feedback. Especially those which require higher order requests.
This patch (of 14):
While playing with the oom detection rework [1] I have noticed that my
heavy order-9 (hugetlb) load close to OOM ended up in an endless loop
where the reclaim hasn't made any progress but did_some_progress didn't
reflect that and compaction_suitable was backing off because no zone is
above low wmark + 1 << order.
It turned out that this is in fact an old standing bug in
compaction_ready which ignores the requested_highidx and did the
watermark check for 0 classzone_idx. This succeeds for zone DMA most
of the time as the zone is mostly unused because of lowmem protection.
As a result costly high order allocatios always report a successfull
progress even when there was none. This wasn't a problem so far
because these allocations usually fail quite early or retry only few
times with __GFP_REPEAT but this will change after later patch in this
series so make sure to not lie about the progress and propagate
requested_highidx down to compaction_ready and use it for both the
watermak check and compaction_suitable to fix this issue.
[1] http://lkml.kernel.org/r/1459855533-4600-1-git-send-email-mhocko@kernel.org
[2] https://lkml.org/lkml/2015/10/12/808
[3] https://lkml.org/lkml/2015/10/13/597
Signed-off-by: Michal Hocko <mhocko@suse.com>
Acked-by: Vlastimil Babka <vbabka@suse.cz>
Acked-by: Hillf Danton <hillf.zj@alibaba-inc.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Mel Gorman <mgorman@suse.de>
Cc: David Rientjes <rientjes@google.com>
Cc: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
Cc: Joonsoo Kim <js1304@gmail.com>
Cc: Vladimir Davydov <vdavydov@virtuozzo.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
The inactive file list should still be large enough to contain readahead
windows and freshly written file data, but it no longer is the only
source for detecting multiple accesses to file pages. The workingset
refault measurement code causes recently evicted file pages that get
accessed again after a shorter interval to be promoted directly to the
active list.
With that mechanism in place, we can afford to (on a larger system)
dedicate more memory to the active file list, so we can actually cache
more of the frequently used file pages in memory, and not have them
pushed out by streaming writes, once-used streaming file reads, etc.
This can help things like database workloads, where only half the page
cache can currently be used to cache the database working set. This
patch automatically increases that fraction on larger systems, using the
same ratio that has already been used for anonymous memory.
[hannes@cmpxchg.org: cgroup-awareness]
Signed-off-by: Rik van Riel <riel@redhat.com>
Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
Reported-by: Andres Freund <andres@anarazel.de>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Andres observed that his database workload is struggling with the
transaction journal creating pressure on frequently read pages.
Access patterns like transaction journals frequently write the same
pages over and over, but in the majority of cases those pages are never
read back. There are no caching benefits to be had for those pages, so
activating them and having them put pressure on pages that do benefit
from caching is a bad choice.
Leave page activations to read accesses and don't promote pages based on
writes alone.
It could be said that partially written pages do contain cache-worthy
data, because even if *userspace* does not access the unwritten part,
the kernel still has to read it from the filesystem for correctness.
However, a counter argument is that these pages enjoy at least *some*
protection over other inactive file pages through the writeback cache,
in the sense that dirty pages are written back with a delay and cache
reclaim leaves them alone until they have been written back to disk.
Should that turn out to be insufficient and we see increased read IO
from partial writes under memory pressure, we can always go back and
update grab_cache_page_write_begin() to take (pos, len) so that it can
tell partial writes from pages that don't need partial reads. But for
now, keep it simple.
Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
Reported-by: Andres Freund <andres@anarazel.de>
Cc: Rik van Riel <riel@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
This is a follow-up to
http://www.spinics.net/lists/linux-mm/msg101739.html
where Andres reported his database workingset being pushed out by the
minimum size enforcement of the inactive file list - currently 50% of
cache - as well as repeatedly written file pages that are never actually
read.
Two changes fell out of the discussions. The first change observes that
pages that are only ever written don't benefit from caching beyond what
the writeback cache does for partial page writes, and so we shouldn't
promote them to the active file list where they compete with pages whose
cached data is actually accessed repeatedly. This change comes in two
patches - one for in-cache write accesses and one for refaults triggered
by writes, neither of which should promote a cache page.
Second, with the refault detection we don't need to set 50% of the cache
aside for used-once cache anymore since we can detect frequently used
pages even when they are evicted between accesses. We can allow the
active list to be bigger and thus protect a bigger workingset that isn't
challenged by streamers. Depending on the access patterns, this can
increase major faults during workingset transitions for better
performance during stable phases.
This patch (of 3):
When rewriting a page, the data in that page is replaced with new data.
This means that evicting something else from the active file list, in
order to cache data that will be replaced by something else, is likely
to be a waste of memory.
It is better to save the active list for frequently read pages, because
reads actually use the data that is in the page.
This patch ignores partial writes, because it is unclear whether the
complexity of identifying those is worth any potential performance gain
obtained from better caching pages that see repeated partial writes at
large enough intervals to not get caught by the use-twice promotion code
used for the inactive file list.
Signed-off-by: Rik van Riel <riel@redhat.com>
Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
Reported-by: Andres Freund <andres@anarazel.de>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
* imxfb: fix lcd power up
* small fixes and cleanups
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1
iQIcBAABAgAGBQJXPshLAAoJEPo9qoy8lh71QNIQAKlSAb0C4PUK67NaymL3w1FO
RhCAsmAeJdODV2iK7RhsKMUSnlFPkCZ3urFJz7tHv7B3k09QkzBBSHQdQpF6mWyu
+Rna7NehCO44k+NM7zTnd3dbL5dq8H7wWCLKambHHWC8/tcsD+DLcoKHsFwdcZwe
WeWA38R09uH0z6idPYv2vyjMH+cHcAE86SOaeQ/8jgqDmlX8kkXRVPlHaowhUYio
dap9gtiy/+OWOIVDRTV7uWKtKZwt3KTIafCpZxRuynqZWET0jurDAyGiUqvXTGvC
0RCuDsKMAhiysLkXRFRRBAeo6Yj6TGsGCNyLec/SE8VZbOxpfUJx+INQg/oqOdSq
xg3PvMtAJH7JiGxSHZZGAR7mxYuzmWW8dXMk/Fu7UZgkCCiLbjdDmytAY5T6qTHz
4Nq5omczxpj38d3UbcljXmNwudmIEes82sIYOalN2nxCEXuycCWyQO0U0swmkaqH
4ISSJY8Gej10QXO8mivv1q9y9BaOJfUeoqtyNR4zQAiulfjZIbtyLeX3FRKLTjI/
50K2YCGm06GP9CHGefzVl3PBgeZ49NGegu+Z8sRfYxAac3R0V4QaE+40ygFaOHCY
Gq5dvEKXgJeq2i3nanlKwZkr7kB0FF/XqaH9RSYO7TaoyOg8GgeZYcj1sGzJwplu
c4lRcKRlzRShV2HymHFv
=iHUZ
-----END PGP SIGNATURE-----
Merge tag 'fbdev-4.7' of git://git.kernel.org/pub/scm/linux/kernel/git/tomba/linux
Pull fbdev updates from Tomi Valkeinen:
- imxfb: fix lcd power up
- small fixes and cleanups
* tag 'fbdev-4.7' of git://git.kernel.org/pub/scm/linux/kernel/git/tomba/linux:
fbdev: Use IS_ENABLED() instead of checking for built-in or module
efifb: Don't show the mapping VA
video: AMBA CLCD: Remove unncessary include in amba-clcd.c
fbdev: ssd1307fb: Fix charge pump setting
Documentation: fb: fix spelling mistakes
fbdev: fbmem: implement error handling in fbmem_init()
fbdev: sh_mipi_dsi: remove driver
video: fbdev: imxfb: add some error handling
video: fbdev: imxfb: fix semantic of .get_power and .set_power
video: fbdev: omap2: Remove deprecated regulator_can_change_voltage() usage
Pull crypto fix from Herbert Xu:
"Fix a regression that causes sha-mb to crash"
* 'linus' of git://git.kernel.org/pub/scm/linux/kernel/git/herbert/crypto-2.6:
crypto: sha1-mb - make sha1_x8_avx2() conform to C function ABI
The newly added nps irqchip driver causes build warnings on ARM64.
include/soc/nps/common.h: In function 'nps_host_reg_non_cl':
include/soc/nps/common.h:148:9: warning: cast to pointer from integer of different size [-Wint-to-pointer-cast]
As the driver is only used on ARC, we don't need to see it without
COMPILE_TEST elsewhere, and we can avoid the warnings by only building
on 32-bit architectures even with CONFIG_COMPILE_TEST.
Acked-by: Marc Zyngier <narc.zyngier@arm.com>
Signed-off-by: Arnd Bergmann <arnd@arndb.de>
Signed-off-by: Vineet Gupta <vgupta@synopsys.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Highlights:
- Support for Power ISA 3.0 (Power9) Radix Tree MMU from Aneesh Kumar K.V
- Live patching support for ppc64le (also merged via livepatching.git)
Various cleanups & minor fixes from:
- Aaro Koskinen, Alexey Kardashevskiy, Andrew Donnellan, Aneesh Kumar K.V,
Chris Smart, Daniel Axtens, Frederic Barrat, Gavin Shan, Ian Munsie, Lennart
Sorensen, Madhavan Srinivasan, Mahesh Salgaonkar, Markus Elfring, Michael
Ellerman, Oliver O'Halloran, Paul Gortmaker, Paul Mackerras, Rashmica Gupta,
Russell Currey, Suraj Jitindar Singh, Thiago Jung Bauermann, Valentin
Rothberg, Vipin K Parashar.
General:
- Update LMB associativity index during DLPAR add/remove from Nathan Fontenot
- Fix branching to OOL handlers in relocatable kernel from Hari Bathini
- Add support for userspace Power9 copy/paste from Chris Smart
- Always use STRICT_MM_TYPECHECKS from Michael Ellerman
- Add mask of possible MMU features from Michael Ellerman
PCI:
- Enable pass through of NVLink to guests from Alexey Kardashevskiy
- Cleanups in preparation for powernv PCI hotplug from Gavin Shan
- Don't report error in eeh_pe_reset_and_recover() from Gavin Shan
- Restore initial state in eeh_pe_reset_and_recover() from Gavin Shan
- Revert "powerpc/eeh: Fix crash in eeh_add_device_early() on Cell" from Guilherme G. Piccoli
- Remove the dependency on EEH struct in DDW mechanism from Guilherme G. Piccoli
selftests:
- Test cp_abort during context switch from Chris Smart
- Add several tests for transactional memory support from Rashmica Gupta
perf:
- Add support for sampling interrupt register state from Anju T
- Add support for unwinding perf-stackdump from Chandan Kumar
cxl:
- Configure the PSL for two CAPI ports on POWER8NVL from Philippe Bergheaud
- Allow initialization on timebase sync failures from Frederic Barrat
- Increase timeout for detection of AFU mmio hang from Frederic Barrat
- Handle num_of_processes larger than can fit in the SPA from Ian Munsie
- Ensure PSL interrupt is configured for contexts with no AFU IRQs from Ian Munsie
- Add kernel API to allow a context to operate with relocate disabled from Ian Munsie
- Check periodically the coherent platform function's state from Christophe Lombard
Freescale:
- Updates from Scott: "Contains 86xx fixes, minor device tree fixes, an erratum
workaround, and a kconfig dependency fix."
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1
iQIcBAABAgAGBQJXPsGzAAoJEFHr6jzI4aWAVoAP/iKdrDe0eYHlVAE9SqnbsiZs
lgDxdsC8P3fsmP1G9o/HkKhC82zHl/La8Ztz8dtqa+LkSzbfliWP1ztJsI7GsBFo
tyCKzWnX9Rwvd3meHu/o/SQ29TNLm/PbPyyRqpj5QPbJ8XCXkAXR7ZZZqjvcMsJW
/AgIr7Cgf53tl9oZzzl/c7CnNHhMq+NBdA71vhWtUx+T97wfJEGyKW6HhZyHDbEU
iAki7fu77ZpEqC/Fh9swf0dCGBJ+a132NoMVo0AdV7EQLznUYlQpQEqa+1PyHZOP
/ArOzf2mDg6m3PfCo1eiB07v8PnVZ3llEUbVAJNg3GUxbE4SHrqq/kwm0iElm3p/
DvFxerCwdX9vmskJX4wDs+pSZRabXYj9XVMptsgFzA4joWrqqb7mBHqaort88YcY
YSljEt1bHyXmiJ+dBya40qARsWUkCVN7ZgEzdxckq0KI3w7g2tqpqIbO2lClWT6t
B3GpqQ4jp34+d1M14FB91fIGK7tMvOhSInE0Mv9+tPvRsepXqiiU/SwdAtRlr3m2
zs/K+4FYcVjJ3Rmpgc+tI38PbZxHe212I35YN6L1LP+4ZfAtzz0NyKdooTIBtkbO
19pX4WbBjKq8zK+YutrySncBIrbnI6VjW51vtRhgVKZliPFO/6zKagyU6FbxM+E5
udQES+t3F/9gvtxgxtDe
=YvyQ
-----END PGP SIGNATURE-----
Merge tag 'powerpc-4.7-1' of git://git.kernel.org/pub/scm/linux/kernel/git/powerpc/linux
Pull powerpc updates from Michael Ellerman:
"Highlights:
- Support for Power ISA 3.0 (Power9) Radix Tree MMU from Aneesh Kumar K.V
- Live patching support for ppc64le (also merged via livepatching.git)
Various cleanups & minor fixes from:
- Aaro Koskinen, Alexey Kardashevskiy, Andrew Donnellan, Aneesh Kumar K.V,
Chris Smart, Daniel Axtens, Frederic Barrat, Gavin Shan, Ian Munsie,
Lennart Sorensen, Madhavan Srinivasan, Mahesh Salgaonkar, Markus Elfring,
Michael Ellerman, Oliver O'Halloran, Paul Gortmaker, Paul Mackerras,
Rashmica Gupta, Russell Currey, Suraj Jitindar Singh, Thiago Jung
Bauermann, Valentin Rothberg, Vipin K Parashar.
General:
- Update LMB associativity index during DLPAR add/remove from Nathan
Fontenot
- Fix branching to OOL handlers in relocatable kernel from Hari Bathini
- Add support for userspace Power9 copy/paste from Chris Smart
- Always use STRICT_MM_TYPECHECKS from Michael Ellerman
- Add mask of possible MMU features from Michael Ellerman
PCI:
- Enable pass through of NVLink to guests from Alexey Kardashevskiy
- Cleanups in preparation for powernv PCI hotplug from Gavin Shan
- Don't report error in eeh_pe_reset_and_recover() from Gavin Shan
- Restore initial state in eeh_pe_reset_and_recover() from Gavin Shan
- Revert "powerpc/eeh: Fix crash in eeh_add_device_early() on Cell"
from Guilherme G Piccoli
- Remove the dependency on EEH struct in DDW mechanism from Guilherme
G Piccoli
selftests:
- Test cp_abort during context switch from Chris Smart
- Add several tests for transactional memory support from Rashmica
Gupta
perf:
- Add support for sampling interrupt register state from Anju T
- Add support for unwinding perf-stackdump from Chandan Kumar
cxl:
- Configure the PSL for two CAPI ports on POWER8NVL from Philippe
Bergheaud
- Allow initialization on timebase sync failures from Frederic Barrat
- Increase timeout for detection of AFU mmio hang from Frederic
Barrat
- Handle num_of_processes larger than can fit in the SPA from Ian
Munsie
- Ensure PSL interrupt is configured for contexts with no AFU IRQs
from Ian Munsie
- Add kernel API to allow a context to operate with relocate disabled
from Ian Munsie
- Check periodically the coherent platform function's state from
Christophe Lombard
Freescale:
- Updates from Scott: "Contains 86xx fixes, minor device tree fixes,
an erratum workaround, and a kconfig dependency fix."
* tag 'powerpc-4.7-1' of git://git.kernel.org/pub/scm/linux/kernel/git/powerpc/linux: (192 commits)
powerpc/86xx: Fix PCI interrupt map definition
powerpc/86xx: Move pci1 definition to the include file
powerpc/fsl: Fix build of the dtb embedded kernel images
powerpc/fsl: Fix rcpm compatible string
powerpc/fsl: Remove FSL_SOC dependency from FSL_LBC
powerpc/fsl-pci: Add a workaround for PCI 5 errata
powerpc/fsl: Fix SPI compatible on t208xrdb and t1040rdb
powerpc/powernv/npu: Add PE to PHB's list
powerpc/powernv: Fix insufficient memory allocation
powerpc/iommu: Remove the dependency on EEH struct in DDW mechanism
Revert "powerpc/eeh: Fix crash in eeh_add_device_early() on Cell"
powerpc/eeh: Drop unnecessary label in eeh_pe_change_owner()
powerpc/eeh: Ignore handlers in eeh_pe_reset_and_recover()
powerpc/eeh: Restore initial state in eeh_pe_reset_and_recover()
powerpc/eeh: Don't report error in eeh_pe_reset_and_recover()
Revert "powerpc/powernv: Exclude root bus in pnv_pci_reset_secondary_bus()"
powerpc/powernv/npu: Enable NVLink pass through
powerpc/powernv/npu: Rework TCE Kill handling
powerpc/powernv/npu: Add set/unset window helpers
powerpc/powernv/ioda2: Export debug helper pe_level_printk()
...
Pull ARM updates from Russell King:
"Changes included in this pull request:
- revert pxa2xx-flash back to using ioremap_cached() and switch
memremap() to use arch_memremap_wb()
- remove pci=firmware command line argument handling
- remove unnecessary arm_dma_set_mask() implementation, the generic
implementation will do for ARM
- removal of the ARM kallsyms "hack" to work around mode switching
veneers and vectors located below PAGE_OFFSET
- tidy up build system output a little
- add L2 cache power management DT bindings
- remove duplicated local_irq_disable() in reboot paths
- handle AMBA primecell devices better at registration time with PM
domains (needed for Samsung SoCs)
- ARM specific preparation to support Keystone II kexec"
* 'for-linus' of git://git.armlinux.org.uk/~rmk/linux-arm:
ARM: 8567/1: cache-uniphier: activate ways for secondary CPUs
ARM: 8570/2: Documentation: devicetree: Add PL310 PM bindings
ARM: 8569/1: pl2x0: Add OF control of cache power management
ARM: 8568/1: reboot: remove duplicated local_irq_disable()
ARM: 8566/1: drivers: amba: properly handle devices with power domains
ARM: provide arm_has_idmap_alias() helper
ARM: kexec: remove 512MB restriction on kexec crashdump
ARM: provide improved virt_to_idmap() functionality
ARM: kexec: fix crashkernel= handling
ARM: 8557/1: specify install, zinstall, and uinstall as PHONY targets
ARM: 8562/1: suppress "include/generated/mach-types.h is up to date."
ARM: 8553/1: kallsyms: remove --page-offset command line option
ARM: 8552/1: kallsyms: remove special lower address limit for CONFIG_ARM
ARM: 8555/1: kallsyms: ignore ARM mode switching veneers
ARM: 8548/1: dma-mapping: remove arm_dma_set_mask()
ARM: 8554/1: kernel: pci: remove pci=firmware command line parameter handling
ARM: memremap: implement arch_memremap_wb()
memremap: add arch specific hook for MEMREMAP_WB mappings
mtd: pxa2xx-flash: switch back from memremap to ioremap_cached
ARM: reintroduce ioremap_cached() for creating cached I/O mappings
Merge updates from Andrew Morton:
- fsnotify fix
- poll() timeout fix
- a few scripts/ tweaks
- debugobjects updates
- the (small) ocfs2 queue
- Minor fixes to kernel/padata.c
- Maybe half of the MM queue
* emailed patches from Andrew Morton <akpm@linux-foundation.org>: (117 commits)
mm, page_alloc: restore the original nodemask if the fast path allocation failed
mm, page_alloc: uninline the bad page part of check_new_page()
mm, page_alloc: don't duplicate code in free_pcp_prepare
mm, page_alloc: defer debugging checks of pages allocated from the PCP
mm, page_alloc: defer debugging checks of freed pages until a PCP drain
cpuset: use static key better and convert to new API
mm, page_alloc: inline pageblock lookup in page free fast paths
mm, page_alloc: remove unnecessary variable from free_pcppages_bulk
mm, page_alloc: pull out side effects from free_pages_check
mm, page_alloc: un-inline the bad part of free_pages_check
mm, page_alloc: check multiple page fields with a single branch
mm, page_alloc: remove field from alloc_context
mm, page_alloc: avoid looking up the first zone in a zonelist twice
mm, page_alloc: shortcut watermark checks for order-0 pages
mm, page_alloc: reduce cost of fair zone allocation policy retry
mm, page_alloc: shorten the page allocator fast path
mm, page_alloc: check once if a zone has isolated pageblocks
mm, page_alloc: move __GFP_HARDWALL modifications out of the fastpath
mm, page_alloc: simplify last cpupid reset
mm, page_alloc: remove unnecessary initialisation from __alloc_pages_nodemask()
...
The page allocator fast path uses either the requested nodemask or
cpuset_current_mems_allowed if cpusets are enabled. If the allocation
context allows watermarks to be ignored then it can also ignore memory
policies. However, on entering the allocator slowpath the nodemask may
still be cpuset_current_mems_allowed and the policies are enforced.
This patch resets the nodemask appropriately before entering the
slowpath.
Link: http://lkml.kernel.org/r/20160504143628.GU2858@techsingularity.net
Signed-off-by: Vlastimil Babka <vbabka@suse.cz>
Signed-off-by: Mel Gorman <mgorman@techsingularity.net>
Cc: Jesper Dangaard Brouer <brouer@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Bad pages should be rare so the code handling them doesn't need to be
inline for performance reasons. Put it to separate function which
returns void. This also assumes that the initial page_expected_state()
result will match the result of the thorough check, i.e. the page
doesn't become "good" in the meanwhile. This matches the same
expectations already in place in free_pages_check().
!DEBUG_VM bloat-o-meter:
add/remove: 1/0 grow/shrink: 0/1 up/down: 134/-274 (-140)
function old new delta
check_new_page_bad - 134 +134
get_page_from_freelist 3468 3194 -274
Signed-off-by: Vlastimil Babka <vbabka@suse.cz>
Acked-by: Mel Gorman <mgorman@techsingularity.net>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
The new free_pcp_prepare() function shares a lot of code with
free_pages_prepare(), which makes this a maintenance risk when some
future patch modifies only one of them. We should be able to achieve
the same effect (skipping free_pages_check() from !DEBUG_VM configs) by
adding a parameter to free_pages_prepare() and making it inline, so the
checks (and the order != 0 parts) are eliminated from the call from
free_pcp_prepare().
!DEBUG_VM: bloat-o-meter reports no difference, as my gcc was already
inlining free_pages_prepare() and the elimination seems to work as
expected
DEBUG_VM bloat-o-meter:
add/remove: 0/1 grow/shrink: 2/0 up/down: 1035/-778 (257)
function old new delta
__free_pages_ok 297 1060 +763
free_hot_cold_page 480 752 +272
free_pages_prepare 778 - -778
Here inlining didn't occur before, and added some code, but it's ok for
a debug option.
[akpm@linux-foundation.org: fix build]
Signed-off-by: Vlastimil Babka <vbabka@suse.cz>
Signed-off-by: Mel Gorman <mgorman@techsingularity.net>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Every page allocated checks a number of page fields for validity. This
catches corruption bugs of pages that are already freed but it is
expensive. This patch weakens the debugging check by checking PCP pages
only when the PCP lists are being refilled. All compound pages are
checked. This potentially avoids debugging checks entirely if the PCP
lists are never emptied and refilled so some corruption issues may be
missed. Full checking requires DEBUG_VM.
With the two deferred debugging patches applied, the impact to a page
allocator microbenchmark is
4.6.0-rc3 4.6.0-rc3
inline-v3r6 deferalloc-v3r7
Min alloc-odr0-1 344.00 ( 0.00%) 317.00 ( 7.85%)
Min alloc-odr0-2 248.00 ( 0.00%) 231.00 ( 6.85%)
Min alloc-odr0-4 209.00 ( 0.00%) 192.00 ( 8.13%)
Min alloc-odr0-8 181.00 ( 0.00%) 166.00 ( 8.29%)
Min alloc-odr0-16 168.00 ( 0.00%) 154.00 ( 8.33%)
Min alloc-odr0-32 161.00 ( 0.00%) 148.00 ( 8.07%)
Min alloc-odr0-64 158.00 ( 0.00%) 145.00 ( 8.23%)
Min alloc-odr0-128 156.00 ( 0.00%) 143.00 ( 8.33%)
Min alloc-odr0-256 168.00 ( 0.00%) 154.00 ( 8.33%)
Min alloc-odr0-512 178.00 ( 0.00%) 167.00 ( 6.18%)
Min alloc-odr0-1024 186.00 ( 0.00%) 174.00 ( 6.45%)
Min alloc-odr0-2048 192.00 ( 0.00%) 180.00 ( 6.25%)
Min alloc-odr0-4096 198.00 ( 0.00%) 184.00 ( 7.07%)
Min alloc-odr0-8192 200.00 ( 0.00%) 188.00 ( 6.00%)
Min alloc-odr0-16384 201.00 ( 0.00%) 188.00 ( 6.47%)
Min free-odr0-1 189.00 ( 0.00%) 180.00 ( 4.76%)
Min free-odr0-2 132.00 ( 0.00%) 126.00 ( 4.55%)
Min free-odr0-4 104.00 ( 0.00%) 99.00 ( 4.81%)
Min free-odr0-8 90.00 ( 0.00%) 85.00 ( 5.56%)
Min free-odr0-16 84.00 ( 0.00%) 80.00 ( 4.76%)
Min free-odr0-32 80.00 ( 0.00%) 76.00 ( 5.00%)
Min free-odr0-64 78.00 ( 0.00%) 74.00 ( 5.13%)
Min free-odr0-128 77.00 ( 0.00%) 73.00 ( 5.19%)
Min free-odr0-256 94.00 ( 0.00%) 91.00 ( 3.19%)
Min free-odr0-512 108.00 ( 0.00%) 112.00 ( -3.70%)
Min free-odr0-1024 115.00 ( 0.00%) 118.00 ( -2.61%)
Min free-odr0-2048 120.00 ( 0.00%) 125.00 ( -4.17%)
Min free-odr0-4096 123.00 ( 0.00%) 129.00 ( -4.88%)
Min free-odr0-8192 126.00 ( 0.00%) 130.00 ( -3.17%)
Min free-odr0-16384 126.00 ( 0.00%) 131.00 ( -3.97%)
Note that the free paths for large numbers of pages is impacted as the
debugging cost gets shifted into that path when the page data is no
longer necessarily cache-hot.
Signed-off-by: Mel Gorman <mgorman@techsingularity.net>
Acked-by: Vlastimil Babka <vbabka@suse.cz>
Cc: Jesper Dangaard Brouer <brouer@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Every page free checks a number of page fields for validity. This
catches premature frees and corruptions but it is also expensive. This
patch weakens the debugging check by checking PCP pages at the time they
are drained from the PCP list. This will trigger the bug but the site
that freed the corrupt page will be lost. To get the full context, a
kernel rebuild with DEBUG_VM is necessary.
[akpm@linux-foundation.org: fix build]
Signed-off-by: Mel Gorman <mgorman@techsingularity.net>
Cc: Vlastimil Babka <vbabka@suse.cz>
Cc: Jesper Dangaard Brouer <brouer@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
An important function for cpusets is cpuset_node_allowed(), which
optimizes on the fact if there's a single root CPU set, it must be
trivially allowed. But the check "nr_cpusets() <= 1" doesn't use the
cpusets_enabled_key static key the right way where static keys eliminate
branching overhead with jump labels.
This patch converts it so that static key is used properly. It's also
switched to the new static key API and the checking functions are
converted to return bool instead of int. We also provide a new variant
__cpuset_zone_allowed() which expects that the static key check was
already done and they key was enabled. This is needed for
get_page_from_freelist() where we want to also avoid the relatively
slower check when ALLOC_CPUSET is not set in alloc_flags.
The impact on the page allocator microbenchmark is less than expected
but the cleanup in itself is worthwhile.
4.6.0-rc2 4.6.0-rc2
multcheck-v1r20 cpuset-v1r20
Min alloc-odr0-1 348.00 ( 0.00%) 348.00 ( 0.00%)
Min alloc-odr0-2 254.00 ( 0.00%) 254.00 ( 0.00%)
Min alloc-odr0-4 213.00 ( 0.00%) 213.00 ( 0.00%)
Min alloc-odr0-8 186.00 ( 0.00%) 183.00 ( 1.61%)
Min alloc-odr0-16 173.00 ( 0.00%) 171.00 ( 1.16%)
Min alloc-odr0-32 166.00 ( 0.00%) 163.00 ( 1.81%)
Min alloc-odr0-64 162.00 ( 0.00%) 159.00 ( 1.85%)
Min alloc-odr0-128 160.00 ( 0.00%) 157.00 ( 1.88%)
Min alloc-odr0-256 169.00 ( 0.00%) 166.00 ( 1.78%)
Min alloc-odr0-512 180.00 ( 0.00%) 180.00 ( 0.00%)
Min alloc-odr0-1024 188.00 ( 0.00%) 187.00 ( 0.53%)
Min alloc-odr0-2048 194.00 ( 0.00%) 193.00 ( 0.52%)
Min alloc-odr0-4096 199.00 ( 0.00%) 198.00 ( 0.50%)
Min alloc-odr0-8192 202.00 ( 0.00%) 201.00 ( 0.50%)
Min alloc-odr0-16384 203.00 ( 0.00%) 202.00 ( 0.49%)
Signed-off-by: Vlastimil Babka <vbabka@suse.cz>
Signed-off-by: Mel Gorman <mgorman@techsingularity.net>
Acked-by: Zefan Li <lizefan@huawei.com>
Cc: Jesper Dangaard Brouer <brouer@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
The function call overhead of get_pfnblock_flags_mask() is measurable in
the page free paths. This patch uses an inlined version that is faster.
Signed-off-by: Mel Gorman <mgorman@techsingularity.net>
Acked-by: Vlastimil Babka <vbabka@suse.cz>
Cc: Jesper Dangaard Brouer <brouer@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
The original count is never reused so it can be removed.
Signed-off-by: Mel Gorman <mgorman@techsingularity.net>
Acked-by: Vlastimil Babka <vbabka@suse.cz>
Cc: Jesper Dangaard Brouer <brouer@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Check without side-effects should be easier to maintain. It also
removes the duplicated cpupid and flags reset done in !DEBUG_VM variant
of both free_pcp_prepare() and then bulkfree_pcp_prepare(). Finally, it
enables the next patch.
It shouldn't result in new branches, thanks to inlining of the check.
!DEBUG_VM bloat-o-meter:
add/remove: 0/0 grow/shrink: 0/2 up/down: 0/-27 (-27)
function old new delta
__free_pages_ok 748 739 -9
free_pcppages_bulk 1403 1385 -18
DEBUG_VM:
add/remove: 0/0 grow/shrink: 0/1 up/down: 0/-28 (-28)
function old new delta
free_pages_prepare 806 778 -28
This is also slightly faster because cpupid information is not set on
tail pages so we can avoid resets there.
Signed-off-by: Vlastimil Babka <vbabka@suse.cz>
Signed-off-by: Mel Gorman <mgorman@techsingularity.net>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Every page allocated or freed is checked for sanity to avoid corruptions
that are difficult to detect later. A bad page could be due to a number
of fields. Instead of using multiple branches, this patch combines
multiple fields into a single branch. A detailed check is only
necessary if that check fails.
4.6.0-rc2 4.6.0-rc2
initonce-v1r20 multcheck-v1r20
Min alloc-odr0-1 359.00 ( 0.00%) 348.00 ( 3.06%)
Min alloc-odr0-2 260.00 ( 0.00%) 254.00 ( 2.31%)
Min alloc-odr0-4 214.00 ( 0.00%) 213.00 ( 0.47%)
Min alloc-odr0-8 186.00 ( 0.00%) 186.00 ( 0.00%)
Min alloc-odr0-16 173.00 ( 0.00%) 173.00 ( 0.00%)
Min alloc-odr0-32 165.00 ( 0.00%) 166.00 ( -0.61%)
Min alloc-odr0-64 162.00 ( 0.00%) 162.00 ( 0.00%)
Min alloc-odr0-128 161.00 ( 0.00%) 160.00 ( 0.62%)
Min alloc-odr0-256 170.00 ( 0.00%) 169.00 ( 0.59%)
Min alloc-odr0-512 181.00 ( 0.00%) 180.00 ( 0.55%)
Min alloc-odr0-1024 190.00 ( 0.00%) 188.00 ( 1.05%)
Min alloc-odr0-2048 196.00 ( 0.00%) 194.00 ( 1.02%)
Min alloc-odr0-4096 202.00 ( 0.00%) 199.00 ( 1.49%)
Min alloc-odr0-8192 205.00 ( 0.00%) 202.00 ( 1.46%)
Min alloc-odr0-16384 205.00 ( 0.00%) 203.00 ( 0.98%)
Again, the benefit is marginal but avoiding excessive branches is
important. Ideally the paths would not have to check these conditions
at all but regrettably abandoning the tests would make use-after-free
bugs much harder to detect.
Signed-off-by: Mel Gorman <mgorman@techsingularity.net>
Acked-by: Vlastimil Babka <vbabka@suse.cz>
Cc: Jesper Dangaard Brouer <brouer@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
The classzone_idx can be inferred from preferred_zoneref so remove the
unnecessary field and save stack space.
Signed-off-by: Mel Gorman <mgorman@techsingularity.net>
Cc: Vlastimil Babka <vbabka@suse.cz>
Cc: Jesper Dangaard Brouer <brouer@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
The allocator fast path looks up the first usable zone in a zonelist and
then get_page_from_freelist does the same job in the zonelist iterator.
This patch preserves the necessary information.
4.6.0-rc2 4.6.0-rc2
fastmark-v1r20 initonce-v1r20
Min alloc-odr0-1 364.00 ( 0.00%) 359.00 ( 1.37%)
Min alloc-odr0-2 262.00 ( 0.00%) 260.00 ( 0.76%)
Min alloc-odr0-4 214.00 ( 0.00%) 214.00 ( 0.00%)
Min alloc-odr0-8 186.00 ( 0.00%) 186.00 ( 0.00%)
Min alloc-odr0-16 173.00 ( 0.00%) 173.00 ( 0.00%)
Min alloc-odr0-32 165.00 ( 0.00%) 165.00 ( 0.00%)
Min alloc-odr0-64 161.00 ( 0.00%) 162.00 ( -0.62%)
Min alloc-odr0-128 159.00 ( 0.00%) 161.00 ( -1.26%)
Min alloc-odr0-256 168.00 ( 0.00%) 170.00 ( -1.19%)
Min alloc-odr0-512 180.00 ( 0.00%) 181.00 ( -0.56%)
Min alloc-odr0-1024 190.00 ( 0.00%) 190.00 ( 0.00%)
Min alloc-odr0-2048 196.00 ( 0.00%) 196.00 ( 0.00%)
Min alloc-odr0-4096 202.00 ( 0.00%) 202.00 ( 0.00%)
Min alloc-odr0-8192 206.00 ( 0.00%) 205.00 ( 0.49%)
Min alloc-odr0-16384 206.00 ( 0.00%) 205.00 ( 0.49%)
The benefit is negligible and the results are within the noise but each
cycle counts.
Signed-off-by: Mel Gorman <mgorman@techsingularity.net>
Cc: Vlastimil Babka <vbabka@suse.cz>
Cc: Jesper Dangaard Brouer <brouer@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Watermarks have to be checked on every allocation including the number
of pages being allocated and whether reserves can be accessed. The
reserves only matter if memory is limited and the free_pages adjustment
only applies to high-order pages. This patch adds a shortcut for
order-0 pages that avoids numerous calculations if there is plenty of
free memory yielding the following performance difference in a page
allocator microbenchmark;
4.6.0-rc2 4.6.0-rc2
optfair-v1r20 fastmark-v1r20
Min alloc-odr0-1 380.00 ( 0.00%) 364.00 ( 4.21%)
Min alloc-odr0-2 273.00 ( 0.00%) 262.00 ( 4.03%)
Min alloc-odr0-4 227.00 ( 0.00%) 214.00 ( 5.73%)
Min alloc-odr0-8 196.00 ( 0.00%) 186.00 ( 5.10%)
Min alloc-odr0-16 183.00 ( 0.00%) 173.00 ( 5.46%)
Min alloc-odr0-32 173.00 ( 0.00%) 165.00 ( 4.62%)
Min alloc-odr0-64 169.00 ( 0.00%) 161.00 ( 4.73%)
Min alloc-odr0-128 169.00 ( 0.00%) 159.00 ( 5.92%)
Min alloc-odr0-256 180.00 ( 0.00%) 168.00 ( 6.67%)
Min alloc-odr0-512 190.00 ( 0.00%) 180.00 ( 5.26%)
Min alloc-odr0-1024 198.00 ( 0.00%) 190.00 ( 4.04%)
Min alloc-odr0-2048 204.00 ( 0.00%) 196.00 ( 3.92%)
Min alloc-odr0-4096 209.00 ( 0.00%) 202.00 ( 3.35%)
Min alloc-odr0-8192 213.00 ( 0.00%) 206.00 ( 3.29%)
Min alloc-odr0-16384 214.00 ( 0.00%) 206.00 ( 3.74%)
Signed-off-by: Mel Gorman <mgorman@techsingularity.net>
Acked-by: Vlastimil Babka <vbabka@suse.cz>
Cc: Jesper Dangaard Brouer <brouer@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
The fair zone allocation policy is not without cost but it can be
reduced slightly. This patch removes an unnecessary local variable,
checks the likely conditions of the fair zone policy first, uses a bool
instead of a flags check and falls through when a remote node is
encountered instead of doing a full restart. The benefit is marginal
but it's there
4.6.0-rc2 4.6.0-rc2
decstat-v1r20 optfair-v1r20
Min alloc-odr0-1 377.00 ( 0.00%) 380.00 ( -0.80%)
Min alloc-odr0-2 273.00 ( 0.00%) 273.00 ( 0.00%)
Min alloc-odr0-4 226.00 ( 0.00%) 227.00 ( -0.44%)
Min alloc-odr0-8 196.00 ( 0.00%) 196.00 ( 0.00%)
Min alloc-odr0-16 183.00 ( 0.00%) 183.00 ( 0.00%)
Min alloc-odr0-32 175.00 ( 0.00%) 173.00 ( 1.14%)
Min alloc-odr0-64 172.00 ( 0.00%) 169.00 ( 1.74%)
Min alloc-odr0-128 170.00 ( 0.00%) 169.00 ( 0.59%)
Min alloc-odr0-256 183.00 ( 0.00%) 180.00 ( 1.64%)
Min alloc-odr0-512 191.00 ( 0.00%) 190.00 ( 0.52%)
Min alloc-odr0-1024 199.00 ( 0.00%) 198.00 ( 0.50%)
Min alloc-odr0-2048 204.00 ( 0.00%) 204.00 ( 0.00%)
Min alloc-odr0-4096 210.00 ( 0.00%) 209.00 ( 0.48%)
Min alloc-odr0-8192 213.00 ( 0.00%) 213.00 ( 0.00%)
Min alloc-odr0-16384 214.00 ( 0.00%) 214.00 ( 0.00%)
The benefit is marginal at best but one of the most important benefits,
avoiding a second search when falling back to another node is not
triggered by this particular test so the benefit for some corner cases
is understated.
Signed-off-by: Mel Gorman <mgorman@techsingularity.net>
Acked-by: Vlastimil Babka <vbabka@suse.cz>
Cc: Jesper Dangaard Brouer <brouer@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
The page allocator fast path checks page multiple times unnecessarily.
This patch avoids all the slowpath checks if the first allocation
attempt succeeds.
Signed-off-by: Mel Gorman <mgorman@techsingularity.net>
Acked-by: Vlastimil Babka <vbabka@suse.cz>
Cc: Jesper Dangaard Brouer <brouer@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
When bulk freeing pages from the per-cpu lists the zone is checked for
isolated pageblocks on every release. This patch checks it once per
drain.
[mgorman@techsingularity.net: fix locking radce, per Vlastimil]
Signed-off-by: Mel Gorman <mgorman@techsingularity.net>
Signed-off-by: Vlastimil Babka <vbabka@suse.cz>
Cc: Vlastimil Babka <vbabka@suse.cz>
Cc: Jesper Dangaard Brouer <brouer@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
__GFP_HARDWALL only has meaning in the context of cpusets but the fast
path always applies the flag on the first attempt. Move the
manipulations into the cpuset paths where they will be masked by a
static branch in the common case.
With the other micro-optimisations in this series combined, the impact
on a page allocator microbenchmark is
4.6.0-rc2 4.6.0-rc2
decstat-v1r20 micro-v1r20
Min alloc-odr0-1 381.00 ( 0.00%) 377.00 ( 1.05%)
Min alloc-odr0-2 275.00 ( 0.00%) 273.00 ( 0.73%)
Min alloc-odr0-4 229.00 ( 0.00%) 226.00 ( 1.31%)
Min alloc-odr0-8 199.00 ( 0.00%) 196.00 ( 1.51%)
Min alloc-odr0-16 186.00 ( 0.00%) 183.00 ( 1.61%)
Min alloc-odr0-32 179.00 ( 0.00%) 175.00 ( 2.23%)
Min alloc-odr0-64 174.00 ( 0.00%) 172.00 ( 1.15%)
Min alloc-odr0-128 172.00 ( 0.00%) 170.00 ( 1.16%)
Min alloc-odr0-256 181.00 ( 0.00%) 183.00 ( -1.10%)
Min alloc-odr0-512 193.00 ( 0.00%) 191.00 ( 1.04%)
Min alloc-odr0-1024 201.00 ( 0.00%) 199.00 ( 1.00%)
Min alloc-odr0-2048 206.00 ( 0.00%) 204.00 ( 0.97%)
Min alloc-odr0-4096 212.00 ( 0.00%) 210.00 ( 0.94%)
Min alloc-odr0-8192 215.00 ( 0.00%) 213.00 ( 0.93%)
Min alloc-odr0-16384 216.00 ( 0.00%) 214.00 ( 0.93%)
Signed-off-by: Mel Gorman <mgorman@techsingularity.net>
Acked-by: Vlastimil Babka <vbabka@suse.cz>
Cc: Jesper Dangaard Brouer <brouer@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
page is guaranteed to be set before it is read with or without the
initialisation.
[akpm@linux-foundation.org: fix warning]
Signed-off-by: Mel Gorman <mgorman@techsingularity.net>
Acked-by: Vlastimil Babka <vbabka@suse.cz>
Cc: Jesper Dangaard Brouer <brouer@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
zonelist here is a copy of a struct field that is used once. Ditch it.
Signed-off-by: Mel Gorman <mgorman@techsingularity.net>
Acked-by: Vlastimil Babka <vbabka@suse.cz>
Cc: Jesper Dangaard Brouer <brouer@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
The number of zones skipped to a zone expiring its fair zone allocation
quota is irrelevant. Convert to bool.
Signed-off-by: Mel Gorman <mgorman@techsingularity.net>
Acked-by: Vlastimil Babka <vbabka@suse.cz>
Cc: Jesper Dangaard Brouer <brouer@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
alloc_flags is a bitmask of flags but it is signed which does not
necessarily generate the best code depending on the compiler. Even
without an impact, it makes more sense that this be unsigned.
Signed-off-by: Mel Gorman <mgorman@techsingularity.net>
Acked-by: Vlastimil Babka <vbabka@suse.cz>
Cc: Jesper Dangaard Brouer <brouer@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Pageblocks have an associated bitmap to store migrate types and whether
the pageblock should be skipped during compaction. The bitmap may be
associated with a memory section or a zone but the zone is looked up
unconditionally. The compiler should optimise this away automatically
so this is a cosmetic patch only in many cases.
Signed-off-by: Mel Gorman <mgorman@techsingularity.net>
Acked-by: Vlastimil Babka <vbabka@suse.cz>
Cc: Jesper Dangaard Brouer <brouer@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
The page allocator iterates through a zonelist for zones that match the
addressing limitations and nodemask of the caller but many allocations
will not be restricted. Despite this, there is always functional call
overhead which builds up.
This patch inlines the optimistic basic case and only calls the iterator
function for the complex case. A hindrance was the fact that
cpuset_current_mems_allowed is used in the fastpath as the allowed
nodemask even though all nodes are allowed on most systems. The patch
handles this by only considering cpuset_current_mems_allowed if a cpuset
exists. As well as being faster in the fast-path, this removes some
junk in the slowpath.
The performance difference on a page allocator microbenchmark is;
4.6.0-rc2 4.6.0-rc2
statinline-v1r20 optiter-v1r20
Min alloc-odr0-1 412.00 ( 0.00%) 382.00 ( 7.28%)
Min alloc-odr0-2 301.00 ( 0.00%) 282.00 ( 6.31%)
Min alloc-odr0-4 247.00 ( 0.00%) 233.00 ( 5.67%)
Min alloc-odr0-8 215.00 ( 0.00%) 203.00 ( 5.58%)
Min alloc-odr0-16 199.00 ( 0.00%) 188.00 ( 5.53%)
Min alloc-odr0-32 191.00 ( 0.00%) 182.00 ( 4.71%)
Min alloc-odr0-64 187.00 ( 0.00%) 177.00 ( 5.35%)
Min alloc-odr0-128 185.00 ( 0.00%) 175.00 ( 5.41%)
Min alloc-odr0-256 193.00 ( 0.00%) 184.00 ( 4.66%)
Min alloc-odr0-512 207.00 ( 0.00%) 197.00 ( 4.83%)
Min alloc-odr0-1024 213.00 ( 0.00%) 203.00 ( 4.69%)
Min alloc-odr0-2048 220.00 ( 0.00%) 209.00 ( 5.00%)
Min alloc-odr0-4096 226.00 ( 0.00%) 214.00 ( 5.31%)
Min alloc-odr0-8192 229.00 ( 0.00%) 218.00 ( 4.80%)
Min alloc-odr0-16384 229.00 ( 0.00%) 219.00 ( 4.37%)
perf indicated that next_zones_zonelist disappeared in the profile and
__next_zones_zonelist did not appear. This is expected as the
micro-benchmark would hit the inlined fast-path every time.
Signed-off-by: Mel Gorman <mgorman@techsingularity.net>
Cc: Vlastimil Babka <vbabka@suse.cz>
Cc: Jesper Dangaard Brouer <brouer@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
The PageAnon check always checks for compound_head but this is a
relatively expensive check if the caller already knows the page is a
head page. This patch creates a helper and uses it in the page free
path which only operates on head pages.
With this patch and "Only check PageCompound for high-order pages", the
performance difference on a page allocator microbenchmark is;
4.6.0-rc2 4.6.0-rc2
vanilla nocompound-v1r20
Min alloc-odr0-1 425.00 ( 0.00%) 417.00 ( 1.88%)
Min alloc-odr0-2 313.00 ( 0.00%) 308.00 ( 1.60%)
Min alloc-odr0-4 257.00 ( 0.00%) 253.00 ( 1.56%)
Min alloc-odr0-8 224.00 ( 0.00%) 221.00 ( 1.34%)
Min alloc-odr0-16 208.00 ( 0.00%) 205.00 ( 1.44%)
Min alloc-odr0-32 199.00 ( 0.00%) 199.00 ( 0.00%)
Min alloc-odr0-64 195.00 ( 0.00%) 193.00 ( 1.03%)
Min alloc-odr0-128 192.00 ( 0.00%) 191.00 ( 0.52%)
Min alloc-odr0-256 204.00 ( 0.00%) 200.00 ( 1.96%)
Min alloc-odr0-512 213.00 ( 0.00%) 212.00 ( 0.47%)
Min alloc-odr0-1024 219.00 ( 0.00%) 219.00 ( 0.00%)
Min alloc-odr0-2048 225.00 ( 0.00%) 225.00 ( 0.00%)
Min alloc-odr0-4096 230.00 ( 0.00%) 231.00 ( -0.43%)
Min alloc-odr0-8192 235.00 ( 0.00%) 234.00 ( 0.43%)
Min alloc-odr0-16384 235.00 ( 0.00%) 234.00 ( 0.43%)
Min free-odr0-1 215.00 ( 0.00%) 191.00 ( 11.16%)
Min free-odr0-2 152.00 ( 0.00%) 136.00 ( 10.53%)
Min free-odr0-4 119.00 ( 0.00%) 107.00 ( 10.08%)
Min free-odr0-8 106.00 ( 0.00%) 96.00 ( 9.43%)
Min free-odr0-16 97.00 ( 0.00%) 87.00 ( 10.31%)
Min free-odr0-32 91.00 ( 0.00%) 83.00 ( 8.79%)
Min free-odr0-64 89.00 ( 0.00%) 81.00 ( 8.99%)
Min free-odr0-128 88.00 ( 0.00%) 80.00 ( 9.09%)
Min free-odr0-256 106.00 ( 0.00%) 95.00 ( 10.38%)
Min free-odr0-512 116.00 ( 0.00%) 111.00 ( 4.31%)
Min free-odr0-1024 125.00 ( 0.00%) 118.00 ( 5.60%)
Min free-odr0-2048 133.00 ( 0.00%) 126.00 ( 5.26%)
Min free-odr0-4096 136.00 ( 0.00%) 130.00 ( 4.41%)
Min free-odr0-8192 138.00 ( 0.00%) 130.00 ( 5.80%)
Min free-odr0-16384 137.00 ( 0.00%) 130.00 ( 5.11%)
There is a sizable boost to the free allocator performance. While there
is an apparent boost on the allocation side, it's likely a co-incidence
or due to the patches slightly reducing cache footprint.
Signed-off-by: Mel Gorman <mgorman@techsingularity.net>
Acked-by: Vlastimil Babka <vbabka@suse.cz>
Cc: Jesper Dangaard Brouer <brouer@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Right now the oom reaper will clear TIF_MEMDIE only for tasks which were
successfully reaped. This is the safest option because we know that
such an oom victim would only block forward progress of the oom killer
without a good reason because it is highly unlikely it would release
much more memory. Basically most of its memory has been already torn
down.
We can relax this assumption to catch more corner cases though.
The first obvious one is when the oom victim clears its mm and gets
stuck later on. oom_reaper would back of on find_lock_task_mm returning
NULL. We can safely try to clear TIF_MEMDIE in this case because such a
task would be ignored by the oom killer anyway. The flag would be
cleared by that time already most of the time anyway.
The less obvious one is when the oom reaper fails due to mmap_sem
contention. Even if we clear TIF_MEMDIE for this task then it is not
very likely that we would select another task too easily because we
haven't reaped the last victim and so it would be still the #1
candidate. There is a rare race condition possible when the current
victim terminates before the next select_bad_process but considering
that oom_reap_task had retried several times before giving up then this
sounds like a borderline thing.
After this patch we should have a guarantee that the OOM killer will not
be block for unbounded amount of time for most cases.
Signed-off-by: Michal Hocko <mhocko@suse.com>
Cc: Raushaniya Maksudova <rmaksudova@parallels.com>
Cc: Michael S. Tsirkin <mst@redhat.com>
Cc: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Cc: David Rientjes <rientjes@google.com>
Cc: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
Cc: Daniel Vetter <daniel.vetter@intel.com>
Cc: Oleg Nesterov <oleg@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
If either the current task is already killed or PF_EXITING or a selected
task is PF_EXITING then the oom killer is suppressed and so is the oom
reaper. This patch adds try_oom_reaper which checks the given task and
queues it for the oom reaper if that is safe to be done meaning that the
task doesn't share the mm with an alive process.
This might help to release the memory pressure while the task tries to
exit.
[akpm@linux-foundation.org: fix nommu build]
Signed-off-by: Michal Hocko <mhocko@suse.com>
Cc: Raushaniya Maksudova <rmaksudova@parallels.com>
Cc: Michael S. Tsirkin <mst@redhat.com>
Cc: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Cc: David Rientjes <rientjes@google.com>
Cc: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
Cc: Daniel Vetter <daniel.vetter@intel.com>
Cc: Oleg Nesterov <oleg@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
__alloc_pages_may_oom is the central place to decide when the
out_of_memory should be invoked. This is a good approach for most
checks there because they are page allocator specific and the allocation
fails right after for all of them.
The notable exception is GFP_NOFS context which is faking
did_some_progress and keep the page allocator looping even though there
couldn't have been any progress from the OOM killer. This patch doesn't
change this behavior because we are not ready to allow those allocation
requests to fail yet (and maybe we will face the reality that we will
never manage to safely fail these request). Instead __GFP_FS check is
moved down to out_of_memory and prevent from OOM victim selection there.
There are two reasons for that
- OOM notifiers might release some memory even from this context
as none of the registered notifier seems to be FS related
- this might help a dying thread to get an access to memory
reserves and move on which will make the behavior more
consistent with the case when the task gets killed from a
different context.
Keep a comment in __alloc_pages_may_oom to make sure we do not forget
how GFP_NOFS is special and that we really want to do something about
it.
Note to the current oom_notifier users:
The observable difference for you is that oom notifiers cannot depend on
any fs locks because we could deadlock. Not that this would be allowed
today because that would just lockup machine in most of the cases and
ruling out the OOM killer along the way. Another difference is that
callbacks might be invoked sooner now because GFP_NOFS is a weaker
reclaim context and so there could be reclaimable memory which is just
not reachable now. That would require GFP_NOFS only loads which are
really rare and more importantly the observable result would be dropping
of reconstructible object and potential performance drop which is not
such a big deal when we are struggling to fulfill other important
allocation requests.
Signed-off-by: Michal Hocko <mhocko@suse.com>
Cc: Raushaniya Maksudova <rmaksudova@parallels.com>
Cc: Michael S. Tsirkin <mst@redhat.com>
Cc: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Cc: David Rientjes <rientjes@google.com>
Cc: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
Cc: Daniel Vetter <daniel.vetter@intel.com>
Cc: Oleg Nesterov <oleg@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
CONFIG_MEMORY_HOTPLUG_DEFAULT_ONLINE specifies the default value for the
memory hotplug onlining policy. Add a command line parameter to make it
possible to override the default. It may come handy for debug and
testing purposes.
Signed-off-by: Vitaly Kuznetsov <vkuznets@redhat.com>
Cc: Jonathan Corbet <corbet@lwn.net>
Cc: Dan Williams <dan.j.williams@intel.com>
Cc: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>
Cc: Mel Gorman <mgorman@techsingularity.net>
Cc: David Vrabel <david.vrabel@citrix.com>
Cc: David Rientjes <rientjes@google.com>
Cc: Igor Mammedov <imammedo@redhat.com>
Cc: Lennart Poettering <lennart@poettering.net>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>