Commit 4942642080 ("mm: memcg: handle non-error OOM situations more
gracefully") allowed tasks that already entered a memcg OOM condition to
bypass the memcg limit on subsequent allocation attempts hoping this
would expedite finishing the page fault and executing the kill.
David Rientjes is worried that this breaks memcg isolation guarantees
and since there is no evidence that the bypass actually speeds up fault
processing just change it so that these subsequent charge attempts fail
outright. The notable exception being __GFP_NOFAIL charges which are
required to bypass the limit regardless.
Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
Reported-by: David Rientjes <rientjes@google.com>
Acked-by: Michal Hocko <mhocko@suse.cz>
Acked-bt: David Rientjes <rientjes@google.com>
Cc: <stable@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
There is a race condition between a memcg being torn down and a swapin
triggered from a different memcg of a page that was recorded to belong
to the exiting memcg on swapout (with CONFIG_MEMCG_SWAP extension). The
result is unreclaimable pages pointing to dead memcgs, which can lead to
anything from endless loops in later memcg teardown (the page is charged
to all hierarchical parents but is not on any LRU list) or crashes from
following the dangling memcg pointer.
Memcgs with tasks in them can not be torn down and usually charges don't
show up in memcgs without tasks. Swapin with the CONFIG_MEMCG_SWAP
extension is the notable exception because it charges the cgroup that
was recorded as owner during swapout, which may be empty and in the
process of being torn down when a task in another memcg triggers the
swapin:
teardown: swapin:
lookup_swap_cgroup_id()
rcu_read_lock()
mem_cgroup_lookup()
css_tryget()
rcu_read_unlock()
disable css_tryget()
call_rcu()
offline_css()
reparent_charges()
res_counter_charge() (hierarchical!)
css_put()
css_free()
pc->mem_cgroup = dead memcg
add page to dead lru
Add a final reparenting step into css_free() to make sure any such raced
charges are moved out of the memcg before it's finally freed.
In the longer term it would be cleaner to have the css_tryget() and the
res_counter charge under the same RCU lock section so that the charge
reparenting is deferred until the last charge whose tryget succeeded is
visible. But this will require more invasive changes that will be
harder to evaluate and backport into stable, so better defer them to a
separate change set.
Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
Acked-by: Michal Hocko <mhocko@suse.cz>
Cc: David Rientjes <rientjes@google.com>
Cc: <stable@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Commit 84235de394 ("fs: buffer: move allocation failure loop into the
allocator") started recognizing __GFP_NOFAIL in memory cgroups but
forgot to disable the OOM killer.
Any task that does not fail allocation will also not enter the OOM
completion path. So don't declare an OOM state in this case or it'll be
leaked and the task be able to bypass the limit until the next
userspace-triggered page fault cleans up the OOM state.
Reported-by: William Dauchy <wdauchy@gmail.com>
Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
Acked-by: Michal Hocko <mhocko@suse.cz>
Cc: David Rientjes <rientjes@google.com>
Cc: <stable@vger.kernel.org> [3.12.x]
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
With split ptlock it's important to know which lock
pmd_trans_huge_lock() took. This patch adds one more parameter to the
function to return the lock.
In most places migration to new api is trivial. Exception is
move_huge_pmd(): we need to take two locks if pmd tables are different.
Signed-off-by: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Tested-by: Alex Thorlton <athorlton@sgi.com>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: "Eric W . Biederman" <ebiederm@xmission.com>
Cc: "Paul E . McKenney" <paulmck@linux.vnet.ibm.com>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: Andi Kleen <ak@linux.intel.com>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: Dave Hansen <dave.hansen@intel.com>
Cc: Dave Jones <davej@redhat.com>
Cc: David Howells <dhowells@redhat.com>
Cc: Frederic Weisbecker <fweisbec@gmail.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Kees Cook <keescook@chromium.org>
Cc: Mel Gorman <mgorman@suse.de>
Cc: Michael Kerrisk <mtk.manpages@gmail.com>
Cc: Oleg Nesterov <oleg@redhat.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Rik van Riel <riel@redhat.com>
Cc: Robin Holt <robinmholt@gmail.com>
Cc: Sedat Dilek <sedat.dilek@gmail.com>
Cc: Srikar Dronamraju <srikar@linux.vnet.ibm.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Hugh Dickins <hughd@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Pull networking updates from David Miller:
1) The addition of nftables. No longer will we need protocol aware
firewall filtering modules, it can all live in userspace.
At the core of nftables is a, for lack of a better term, virtual
machine that executes byte codes to inspect packet or metadata
(arriving interface index, etc.) and make verdict decisions.
Besides support for loading packet contents and comparing them, the
interpreter supports lookups in various datastructures as
fundamental operations. For example sets are supports, and
therefore one could create a set of whitelist IP address entries
which have ACCEPT verdicts attached to them, and use the appropriate
byte codes to do such lookups.
Since the interpreted code is composed in userspace, userspace can
do things like optimize things before giving it to the kernel.
Another major improvement is the capability of atomically updating
portions of the ruleset. In the existing netfilter implementation,
one has to update the entire rule set in order to make a change and
this is very expensive.
Userspace tools exist to create nftables rules using existing
netfilter rule sets, but both kernel implementations will need to
co-exist for quite some time as we transition from the old to the
new stuff.
Kudos to Patrick McHardy, Pablo Neira Ayuso, and others who have
worked so hard on this.
2) Daniel Borkmann and Hannes Frederic Sowa made several improvements
to our pseudo-random number generator, mostly used for things like
UDP port randomization and netfitler, amongst other things.
In particular the taus88 generater is updated to taus113, and test
cases are added.
3) Support 64-bit rates in HTB and TBF schedulers, from Eric Dumazet
and Yang Yingliang.
4) Add support for new 577xx tigon3 chips to tg3 driver, from Nithin
Sujir.
5) Fix two fatal flaws in TCP dynamic right sizing, from Eric Dumazet,
Neal Cardwell, and Yuchung Cheng.
6) Allow IP_TOS and IP_TTL to be specified in sendmsg() ancillary
control message data, much like other socket option attributes.
From Francesco Fusco.
7) Allow applications to specify a cap on the rate computed
automatically by the kernel for pacing flows, via a new
SO_MAX_PACING_RATE socket option. From Eric Dumazet.
8) Make the initial autotuned send buffer sizing in TCP more closely
reflect actual needs, from Eric Dumazet.
9) Currently early socket demux only happens for TCP sockets, but we
can do it for connected UDP sockets too. Implementation from Shawn
Bohrer.
10) Refactor inet socket demux with the goal of improving hash demux
performance for listening sockets. With the main goals being able
to use RCU lookups on even request sockets, and eliminating the
listening lock contention. From Eric Dumazet.
11) The bonding layer has many demuxes in it's fast path, and an RCU
conversion was started back in 3.11, several changes here extend the
RCU usage to even more locations. From Ding Tianhong and Wang
Yufen, based upon suggestions by Nikolay Aleksandrov and Veaceslav
Falico.
12) Allow stackability of segmentation offloads to, in particular, allow
segmentation offloading over tunnels. From Eric Dumazet.
13) Significantly improve the handling of secret keys we input into the
various hash functions in the inet hashtables, TCP fast open, as
well as syncookies. From Hannes Frederic Sowa. The key fundamental
operation is "net_get_random_once()" which uses static keys.
Hannes even extended this to ipv4/ipv6 fragmentation handling and
our generic flow dissector.
14) The generic driver layer takes care now to set the driver data to
NULL on device removal, so it's no longer necessary for drivers to
explicitly set it to NULL any more. Many drivers have been cleaned
up in this way, from Jingoo Han.
15) Add a BPF based packet scheduler classifier, from Daniel Borkmann.
16) Improve CRC32 interfaces and generic SKB checksum iterators so that
SCTP's checksumming can more cleanly be handled. Also from Daniel
Borkmann.
17) Add a new PMTU discovery mode, IP_PMTUDISC_INTERFACE, which forces
using the interface MTU value. This helps avoid PMTU attacks,
particularly on DNS servers. From Hannes Frederic Sowa.
18) Use generic XPS for transmit queue steering rather than internal
(re-)implementation in virtio-net. From Jason Wang.
* git://git.kernel.org/pub/scm/linux/kernel/git/davem/net-next: (1622 commits)
random32: add test cases for taus113 implementation
random32: upgrade taus88 generator to taus113 from errata paper
random32: move rnd_state to linux/random.h
random32: add prandom_reseed_late() and call when nonblocking pool becomes initialized
random32: add periodic reseeding
random32: fix off-by-one in seeding requirement
PHY: Add RTL8201CP phy_driver to realtek
xtsonic: add missing platform_set_drvdata() in xtsonic_probe()
macmace: add missing platform_set_drvdata() in mace_probe()
ethernet/arc/arc_emac: add missing platform_set_drvdata() in arc_emac_probe()
ipv6: protect for_each_sk_fl_rcu in mem_check with rcu_read_lock_bh
vlan: Implement vlan_dev_get_egress_qos_mask as an inline.
ixgbe: add warning when max_vfs is out of range.
igb: Update link modes display in ethtool
netfilter: push reasm skb through instead of original frag skbs
ip6_output: fragment outgoing reassembled skb properly
MAINTAINERS: mv643xx_eth: take over maintainership from Lennart
net_sched: tbf: support of 64bit rates
ixgbe: deleting dfwd stations out of order can cause null ptr deref
ixgbe: fix build err, num_rx_queues is only available with CONFIG_RPS
...
Merge first patch-bomb from Andrew Morton:
"Quite a lot of other stuff is banked up awaiting further
next->mainline merging, but this batch contains:
- Lots of random misc patches
- OCFS2
- Most of MM
- backlight updates
- lib/ updates
- printk updates
- checkpatch updates
- epoll tweaking
- rtc updates
- hfs
- hfsplus
- documentation
- procfs
- update gcov to gcc-4.7 format
- IPC"
* emailed patches from Andrew Morton <akpm@linux-foundation.org>: (269 commits)
ipc, msg: fix message length check for negative values
ipc/util.c: remove unnecessary work pending test
devpts: plug the memory leak in kill_sb
./Makefile: export initial ramdisk compression config option
init/Kconfig: add option to disable kernel compression
drivers: w1: make w1_slave::flags long to avoid memory corruption
drivers/w1/masters/ds1wm.cuse dev_get_platdata()
drivers/memstick/core/ms_block.c: fix unreachable state in h_msb_read_page()
drivers/memstick/core/mspro_block.c: fix attributes array allocation
drivers/pps/clients/pps-gpio.c: remove redundant of_match_ptr
kernel/panic.c: reduce 1 byte usage for print tainted buffer
gcov: reuse kbasename helper
kernel/gcov/fs.c: use pr_warn()
kernel/module.c: use pr_foo()
gcov: compile specific gcov implementation based on gcc version
gcov: add support for gcc 4.7 gcov format
gcov: move gcov structs definitions to a gcc version specific file
kernel/taskstats.c: return -ENOMEM when alloc memory fails in add_del_listener()
kernel/taskstats.c: add nla_nest_cancel() for failure processing between nla_nest_start() and nla_nest_end()
kernel/sysctl_binary.c: use scnprintf() instead of snprintf()
...
Pull cgroup changes from Tejun Heo:
"Not too much activity this time around. css_id is finally killed and
a minor update to device_cgroup"
* 'for-3.13' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/cgroup:
device_cgroup: remove can_attach
cgroup: kill css_id
memcg: stop using css id
memcg: fail to create cgroup if the cgroup id is too big
memcg: convert to use cgroup id
memcg: convert to use cgroup_is_descendant()
The memory.numa_stat file was not hierarchical. Memory charged to the
children was not shown in parent's numa_stat.
This change adds the "hierarchical_" stats to the existing stats. The
new hierarchical stats include the sum of all children's values in
addition to the value of the memcg.
Tested: Create cgroup a, a/b and run workload under b. The values of
b are included in the "hierarchical_*" under a.
$ cd /sys/fs/cgroup
$ echo 1 > memory.use_hierarchy
$ mkdir a a/b
Run workload in a/b:
$ (echo $BASHPID >> a/b/cgroup.procs && cat /some/file && bash) &
The hierarchical_ fields in parent (a) show use of workload in a/b:
$ cat a/memory.numa_stat
total=0 N0=0 N1=0 N2=0 N3=0
file=0 N0=0 N1=0 N2=0 N3=0
anon=0 N0=0 N1=0 N2=0 N3=0
unevictable=0 N0=0 N1=0 N2=0 N3=0
hierarchical_total=908 N0=552 N1=317 N2=39 N3=0
hierarchical_file=850 N0=549 N1=301 N2=0 N3=0
hierarchical_anon=58 N0=3 N1=16 N2=39 N3=0
hierarchical_unevictable=0 N0=0 N1=0 N2=0 N3=0
$ cat a/b/memory.numa_stat
total=908 N0=552 N1=317 N2=39 N3=0
file=850 N0=549 N1=301 N2=0 N3=0
anon=58 N0=3 N1=16 N2=39 N3=0
unevictable=0 N0=0 N1=0 N2=0 N3=0
hierarchical_total=908 N0=552 N1=317 N2=39 N3=0
hierarchical_file=850 N0=549 N1=301 N2=0 N3=0
hierarchical_anon=58 N0=3 N1=16 N2=39 N3=0
hierarchical_unevictable=0 N0=0 N1=0 N2=0 N3=0
Signed-off-by: Ying Han <yinghan@google.com>
Signed-off-by: Greg Thelen <gthelen@google.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Michal Hocko <mhocko@suse.cz>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Refactor mem_control_numa_stat_show() to use a new stats structure for
smaller and simpler code. This consolidates nearly identical code.
text data bss dec hex filename
8,137,679 1,703,496 1,896,448 11,737,623 b31a17 vmlinux.before
8,136,911 1,703,496 1,896,448 11,736,855 b31717 vmlinux.after
Signed-off-by: Greg Thelen <gthelen@google.com>
Signed-off-by: Ying Han <yinghan@google.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Michal Hocko <mhocko@suse.cz>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Use helper function to check if we need to deal with oom condition.
Signed-off-by: Qiang Huang <h.huangqiang@huawei.com>
Acked-by: David Rientjes <rientjes@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Conflicts:
drivers/net/ethernet/emulex/benet/be.h
drivers/net/netconsole.c
net/bridge/br_private.h
Three mostly trivial conflicts.
The net/bridge/br_private.h conflict was a function signature (argument
addition) change overlapping with the extern removals from Joe Perches.
In drivers/net/netconsole.c we had one change adjusting a printk message
whilst another changed "printk(KERN_INFO" into "pr_info(".
Lastly, the emulex change was a new inline function addition overlapping
with Joe Perches's extern removals.
Signed-off-by: David S. Miller <davem@davemloft.net>
When a memcg is deleted mem_cgroup_reparent_charges() moves charged
memory to the parent memcg. As of v3.11-9444-g3ea67d0 "memcg: add per
cgroup writeback pages accounting" there's bad pointer read. The goal
was to check for counter underflow. The counter is a per cpu counter
and there are two problems with the code:
(1) per cpu access function isn't used, instead a naked pointer is used
which easily causes oops.
(2) the check doesn't sum all cpus
Test:
$ cd /sys/fs/cgroup/memory
$ mkdir x
$ echo 3 > /proc/sys/vm/drop_caches
$ (echo $BASHPID >> x/tasks && exec cat) &
[1] 7154
$ grep ^mapped x/memory.stat
mapped_file 53248
$ echo 7154 > tasks
$ rmdir x
<OOPS>
The fix is to remove the check. It's currently dangerous and isn't
worth fixing it to use something expensive, such as
percpu_counter_sum(), for each reparented page. __this_cpu_read() isn't
enough to fix this because there's no guarantees of the current cpus
count. The only guarantees is that the sum of all per-cpu counter is >=
nr_pages.
Fixes: 3ea67d06e4 ("memcg: add per cgroup writeback pages accounting")
Reported-and-tested-by: Flavio Leitner <fbl@redhat.com>
Signed-off-by: Greg Thelen <gthelen@google.com>
Reviewed-by: Sha Zhengju <handai.szj@taobao.com>
Acked-by: Johannes Weiner <hannes@cmpxchg.org>
Signed-off-by: Hugh Dickins <hughd@google.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
When memcg code needs to know whether any given memcg has children, it
uses the cgroup child iteration primitives and returns true/false
depending on whether the iteration loop is executed at least once or
not.
Because a cgroup's list of children is RCU protected, these primitives
require the RCU read-lock to be held, which is not the case for all
memcg callers. This results in the following splat when e.g. enabling
hierarchy mode:
WARNING: CPU: 3 PID: 1 at kernel/cgroup.c:3043 css_next_child+0xa3/0x160()
CPU: 3 PID: 1 Comm: systemd Not tainted 3.12.0-rc5-00117-g83f11a9-dirty #18
Hardware name: LENOVO 3680B56/3680B56, BIOS 6QET69WW (1.39 ) 04/26/2012
Call Trace:
dump_stack+0x54/0x74
warn_slowpath_common+0x78/0xa0
warn_slowpath_null+0x1a/0x20
css_next_child+0xa3/0x160
mem_cgroup_hierarchy_write+0x5b/0xa0
cgroup_file_write+0x108/0x2a0
vfs_write+0xbd/0x1e0
SyS_write+0x4c/0xa0
system_call_fastpath+0x16/0x1b
In the memcg case, we only care about children when we are attempting to
modify inheritable attributes interactively. Racing with deletion could
mean a spurious -EBUSY, no problem. Racing with addition is handled
just fine as well through the memcg_create_mutex: if the child group is
not on the list after the mutex is acquired, it won't be initialized
from the parent's attributes until after the unlock.
Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
Acked-by: Michal Hocko <mhocko@suse.cz>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
The memcg OOM lock is a mutex-type lock that is open-coded due to
memcg's special needs. Add annotations for lockdep coverage.
Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
Cc: Michal Hocko <mhocko@suse.cz>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Commit 84235de394 ("fs: buffer: move allocation failure loop into the
allocator") allowed __GFP_NOFAIL allocations to bypass the limit if they
fail to reclaim enough memory for the charge. But because the main test
case was on a 3.2-based system, the patch missed the fact that on newer
kernels the charge function needs to return root_mem_cgroup when
bypassing the limit, and not NULL. This will corrupt whatever memory is
at NULL + percpu pointer offset. Fix this quickly before problems are
reported.
Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
Acked-by: Michal Hocko <mhocko@suse.cz>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
As of commit 3ea67d06e4 ("memcg: add per cgroup writeback pages
accounting") memcg counter errors are possible when moving charged
memory to a different memcg. Charge movement occurs when processing
writes to memory.force_empty, moving tasks to a memcg with
memcg.move_charge_at_immigrate=1, or memcg deletion.
An example showing error after memory.force_empty:
$ cd /sys/fs/cgroup/memory
$ mkdir x
$ rm /data/tmp/file
$ (echo $BASHPID >> x/tasks && exec mmap_writer /data/tmp/file 1M) &
[1] 13600
$ grep ^mapped x/memory.stat
mapped_file 1048576
$ echo 13600 > tasks
$ echo 1 > x/memory.force_empty
$ grep ^mapped x/memory.stat
mapped_file 4503599627370496
mapped_file should end with 0.
4503599627370496 == 0x10,0000,0000,0000 == 0x100,0000,0000 pages
1048576 == 0x10,0000 == 0x100 pages
This issue only affects the source memcg on 64 bit machines; the
destination memcg counters are correct. So the rmdir case is not too
important because such counters are soon disappearing with the entire
memcg. But the memcg.force_empty and memory.move_charge_at_immigrate=1
cases are larger problems as the bogus counters are visible for the
(possibly long) remaining life of the source memcg.
The problem is due to memcg use of __this_cpu_from(.., -nr_pages), which
is subtly wrong because it subtracts the unsigned int nr_pages (either
-1 or -512 for THP) from a signed long percpu counter. When
nr_pages=-1, -nr_pages=0xffffffff. On 64 bit machines stat->count[idx]
is signed 64 bit. So memcg's attempt to simply decrement a count (e.g.
from 1 to 0) boils down to:
long count = 1
unsigned int nr_pages = 1
count += -nr_pages /* -nr_pages == 0xffff,ffff */
count is now 0x1,0000,0000 instead of 0
The fix is to subtract the unsigned page count rather than adding its
negation. This only works once "percpu: fix this_cpu_sub() subtrahend
casting for unsigneds" is applied to fix this_cpu_sub().
Signed-off-by: Greg Thelen <gthelen@google.com>
Acked-by: Tejun Heo <tj@kernel.org>
Acked-by: Johannes Weiner <hannes@cmpxchg.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Conflicts:
drivers/net/usb/qmi_wwan.c
include/net/dst.h
Trivial merge conflicts, both were overlapping changes.
Signed-off-by: David S. Miller <davem@davemloft.net>
Replace the pointers in struct cg_proto with actual data fields and kill
struct tcp_memcontrol as it is not fully redundant.
This removes a confusing, unnecessary layer of abstraction.
Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Buffer allocation has a very crude indefinite loop around waking the
flusher threads and performing global NOFS direct reclaim because it can
not handle allocation failures.
The most immediate problem with this is that the allocation may fail due
to a memory cgroup limit, where flushers + direct reclaim might not make
any progress towards resolving the situation at all. Because unlike the
global case, a memory cgroup may not have any cache at all, only
anonymous pages but no swap. This situation will lead to a reclaim
livelock with insane IO from waking the flushers and thrashing unrelated
filesystem cache in a tight loop.
Use __GFP_NOFAIL allocations for buffers for now. This makes sure that
any looping happens in the page allocator, which knows how to
orchestrate kswapd, direct reclaim, and the flushers sensibly. It also
allows memory cgroups to detect allocations that can't handle failure
and will allow them to ultimately bypass the limit if reclaim can not
make progress.
Reported-by: azurIt <azurit@pobox.sk>
Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
Cc: Michal Hocko <mhocko@suse.cz>
Cc: <stable@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Commit 3812c8c8f3 ("mm: memcg: do not trap chargers with full
callstack on OOM") assumed that only a few places that can trigger a
memcg OOM situation do not return VM_FAULT_OOM, like optional page cache
readahead. But there are many more and it's impractical to annotate
them all.
First of all, we don't want to invoke the OOM killer when the failed
allocation is gracefully handled, so defer the actual kill to the end of
the fault handling as well. This simplifies the code quite a bit for
added bonus.
Second, since a failed allocation might not be the abrupt end of the
fault, the memcg OOM handler needs to be re-entrant until the fault
finishes for subsequent allocation attempts. If an allocation is
attempted after the task already OOMed, allow it to bypass the limit so
that it can quickly finish the fault and invoke the OOM killer.
Reported-by: azurIt <azurit@pobox.sk>
Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
Cc: Michal Hocko <mhocko@suse.cz>
Cc: <stable@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
for_each_online_cpu() needs the protection of {get,put}_online_cpus() so
cpu_online_mask doesn't change during the iteration.
cpu_hotplug.lock is held while a cpu is going down, it's a coarse lock
that is used kernel-wide to synchronize cpu hotplug activity. Memcg has
a cpu hotplug notifier, called while there may not be any cpu hotplug
refcounts, which drains per-cpu event counts to memcg->nocpu_base.events
to maintain a cumulative event count as cpus disappear. Without
get_online_cpus() in mem_cgroup_read_events(), it's possible to account
for the event count on a dying cpu twice, and this value may be
significantly large.
In fact, all memcg->pcp_counter_lock use should be nested by
{get,put}_online_cpus().
This fixes that issue and ensures the reported statistics are not vastly
over-reported during cpu hotplug.
Signed-off-by: David Rientjes <rientjes@google.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Michal Hocko <mhocko@suse.cz>
Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Acked-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Revert commit 3b38722efd ("memcg, vmscan: integrate soft reclaim
tighter with zone shrinking code")
I merged this prematurely - Michal and Johannes still disagree about the
overall design direction and the future remains unclear.
Cc: Michal Hocko <mhocko@suse.cz>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Revert commit e883110aad ("memcg: get rid of soft-limit tree
infrastructure")
I merged this prematurely - Michal and Johannes still disagree about the
overall design direction and the future remains unclear.
Cc: Michal Hocko <mhocko@suse.cz>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Revert commit a5b7c87f92 ("vmscan, memcg: do softlimit reclaim also
for targeted reclaim")
I merged this prematurely - Michal and Johannes still disagree about the
overall design direction and the future remains unclear.
Cc: Michal Hocko <mhocko@suse.cz>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Revert commit de57780dc6 ("memcg: enhance memcg iterator to support
predicates")
I merged this prematurely - Michal and Johannes still disagree about the
overall design direction and the future remains unclear.
Cc: Michal Hocko <mhocko@suse.cz>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Revert commit 7d910c054b ("memcg: track children in soft limit excess
to improve soft limit")
I merged this prematurely - Michal and Johannes still disagree about the
overall design direction and the future remains unclear.
Cc: Michal Hocko <mhocko@suse.cz>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Revert commit e839b6a1c8 ("memcg, vmscan: do not attempt soft limit
reclaim if it would not scan anything")
I merged this prematurely - Michal and Johannes still disagree about the
overall design direction and the future remains unclear.
Cc: Michal Hocko <mhocko@suse.cz>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Revert commit 1be171d60b ("memcg: track all children over limit in the
root")
I merged this prematurely - Michal and Johannes still disagree about the
overall design direction and the future remains unclear.
Cc: Michal Hocko <mhocko@suse.cz>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Now memcg uses cgroup id instead of css id. Update some comments and
set mem_cgroup_subsys->use_id to 0.
Signed-off-by: Li Zefan <lizefan@huawei.com>
Acked-by: Michal Hocko <mhocko@suse.cz>
Signed-off-by: Tejun Heo <tj@kernel.org>
memcg requires the cgroup id to be smaller than 65536.
This is a preparation to kill css id.
Signed-off-by: Li Zefan <lizefan@huawei.com>
Acked-by: Michal Hocko <mhocko@suse.cz>
Signed-off-by: Tejun Heo <tj@kernel.org>
Use cgroup id instead of css id. This is a preparation to kill css id.
Note, as memcg treat 0 as an invalid id, while cgroup id starts with 0,
we define memcg_id == cgroup_id + 1.
Signed-off-by: Li Zefan <lizefan@huawei.com>
Acked-by: Michal Hocko <mhocko@suse.cz>
Signed-off-by: Tejun Heo <tj@kernel.org>
This is a preparation to kill css_id.
Signed-off-by: Li Zefan <lizefan@huawei.com>
Acked-by: Michal Hocko <mhocko@suse.cz>
Signed-off-by: Tejun Heo <tj@kernel.org>
Add memcg routines to count writeback pages, later dirty pages will also
be accounted.
After Kame's commit 89c06bd52f ("memcg: use new logic for page stat
accounting"), we can use 'struct page' flag to test page state instead
of per page_cgroup flag. But memcg has a feature to move a page from a
cgroup to another one and may have race between "move" and "page stat
accounting". So in order to avoid the race we have designed a new lock:
mem_cgroup_begin_update_page_stat()
modify page information -->(a)
mem_cgroup_update_page_stat() -->(b)
mem_cgroup_end_update_page_stat()
It requires both (a) and (b)(writeback pages accounting) to be pretected
in mem_cgroup_{begin/end}_update_page_stat(). It's full no-op for
!CONFIG_MEMCG, almost no-op if memcg is disabled (but compiled in), rcu
read lock in the most cases (no task is moving), and spin_lock_irqsave
on top in the slow path.
There're two writeback interfaces to modify: test_{clear/set}_page_writeback().
And the lock order is:
--> memcg->move_lock
--> mapping->tree_lock
Signed-off-by: Sha Zhengju <handai.szj@taobao.com>
Acked-by: Michal Hocko <mhocko@suse.cz>
Reviewed-by: Greg Thelen <gthelen@google.com>
Cc: Fengguang Wu <fengguang.wu@intel.com>
Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
We should call mem_cgroup_begin_update_page_stat() before
mem_cgroup_update_page_stat() to get proper locks, however the latter
doesn't do any checking that we use proper locking, which would be hard.
Suggested by Michal Hock we could at least test for rcu_read_lock_held()
because RCU is held if !mem_cgroup_disabled().
Signed-off-by: Sha Zhengju <handai.szj@taobao.com>
Acked-by: Michal Hocko <mhocko@suse.cz>
Reviewed-by: Greg Thelen <gthelen@google.com>
Cc: Fengguang Wu <fengguang.wu@intel.com>
Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
While accounting memcg page stat, it's not worth to use
MEMCG_NR_FILE_MAPPED as an extra layer of indirection because of the
complexity and presumed performance overhead. We can use
MEM_CGROUP_STAT_FILE_MAPPED directly.
Signed-off-by: Sha Zhengju <handai.szj@taobao.com>
Acked-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Acked-by: Michal Hocko <mhocko@suse.cz>
Acked-by: Fengguang Wu <fengguang.wu@intel.com>
Reviewed-by: Greg Thelen <gthelen@google.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
RESOURCE_MAX is far too general name, change it to RES_COUNTER_MAX.
Signed-off-by: Sha Zhengju <handai.szj@taobao.com>
Signed-off-by: Qiang Huang <h.huangqiang@huawei.com>
Acked-by: Michal Hocko <mhocko@suse.cz>
Cc: Daisuke Nishimura <nishimura@mxp.nes.nec.co.jp>
Cc: Jeff Liu <jeff.liu@oracle.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
The memcg OOM handling is incredibly fragile and can deadlock. When a
task fails to charge memory, it invokes the OOM killer and loops right
there in the charge code until it succeeds. Comparably, any other task
that enters the charge path at this point will go to a waitqueue right
then and there and sleep until the OOM situation is resolved. The problem
is that these tasks may hold filesystem locks and the mmap_sem; locks that
the selected OOM victim may need to exit.
For example, in one reported case, the task invoking the OOM killer was
about to charge a page cache page during a write(), which holds the
i_mutex. The OOM killer selected a task that was just entering truncate()
and trying to acquire the i_mutex:
OOM invoking task:
mem_cgroup_handle_oom+0x241/0x3b0
mem_cgroup_cache_charge+0xbe/0xe0
add_to_page_cache_locked+0x4c/0x140
add_to_page_cache_lru+0x22/0x50
grab_cache_page_write_begin+0x8b/0xe0
ext3_write_begin+0x88/0x270
generic_file_buffered_write+0x116/0x290
__generic_file_aio_write+0x27c/0x480
generic_file_aio_write+0x76/0xf0 # takes ->i_mutex
do_sync_write+0xea/0x130
vfs_write+0xf3/0x1f0
sys_write+0x51/0x90
system_call_fastpath+0x18/0x1d
OOM kill victim:
do_truncate+0x58/0xa0 # takes i_mutex
do_last+0x250/0xa30
path_openat+0xd7/0x440
do_filp_open+0x49/0xa0
do_sys_open+0x106/0x240
sys_open+0x20/0x30
system_call_fastpath+0x18/0x1d
The OOM handling task will retry the charge indefinitely while the OOM
killed task is not releasing any resources.
A similar scenario can happen when the kernel OOM killer for a memcg is
disabled and a userspace task is in charge of resolving OOM situations.
In this case, ALL tasks that enter the OOM path will be made to sleep on
the OOM waitqueue and wait for userspace to free resources or increase
the group's limit. But a userspace OOM handler is prone to deadlock
itself on the locks held by the waiting tasks. For example one of the
sleeping tasks may be stuck in a brk() call with the mmap_sem held for
writing but the userspace handler, in order to pick an optimal victim,
may need to read files from /proc/<pid>, which tries to acquire the same
mmap_sem for reading and deadlocks.
This patch changes the way tasks behave after detecting a memcg OOM and
makes sure nobody loops or sleeps with locks held:
1. When OOMing in a user fault, invoke the OOM killer and restart the
fault instead of looping on the charge attempt. This way, the OOM
victim can not get stuck on locks the looping task may hold.
2. When OOMing in a user fault but somebody else is handling it
(either the kernel OOM killer or a userspace handler), don't go to
sleep in the charge context. Instead, remember the OOMing memcg in
the task struct and then fully unwind the page fault stack with
-ENOMEM. pagefault_out_of_memory() will then call back into the
memcg code to check if the -ENOMEM came from the memcg, and then
either put the task to sleep on the memcg's OOM waitqueue or just
restart the fault. The OOM victim can no longer get stuck on any
lock a sleeping task may hold.
Debugged by Michal Hocko.
Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
Reported-by: azurIt <azurit@pobox.sk>
Acked-by: Michal Hocko <mhocko@suse.cz>
Cc: David Rientjes <rientjes@google.com>
Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
The memcg OOM handler open-codes a sleeping lock for OOM serialization
(trylock, wait, repeat) because the required locking is so specific to
memcg hierarchies. However, it would be nice if this construct would be
clearly recognizable and not be as obfuscated as it is right now. Clean
up as follows:
1. Remove the return value of mem_cgroup_oom_unlock()
2. Rename mem_cgroup_oom_lock() to mem_cgroup_oom_trylock().
3. Pull the prepare_to_wait() out of the memcg_oom_lock scope. This
makes it more obvious that the task has to be on the waitqueue
before attempting to OOM-trylock the hierarchy, to not miss any
wakeups before going to sleep. It just didn't matter until now
because it was all lumped together into the global memcg_oom_lock
spinlock section.
4. Pull the mem_cgroup_oom_notify() out of the memcg_oom_lock scope.
It is proctected by the hierarchical OOM-lock.
5. The memcg_oom_lock spinlock is only required to propagate the OOM
lock in any given hierarchy atomically. Restrict its scope to
mem_cgroup_oom_(trylock|unlock).
6. Do not wake up the waitqueue unconditionally at the end of the
function. Only the lockholder has to wake up the next in line
after releasing the lock.
Note that the lockholder kicks off the OOM-killer, which in turn
leads to wakeups from the uncharges of the exiting task. But a
contender is not guaranteed to see them if it enters the OOM path
after the OOM kills but before the lockholder releases the lock.
Thus there has to be an explicit wakeup after releasing the lock.
7. Put the OOM task on the waitqueue before marking the hierarchy as
under OOM as that is the point where we start to receive wakeups.
No point in listening before being on the waitqueue.
8. Likewise, unmark the hierarchy before finishing the sleep, for
symmetry.
Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
Acked-by: Michal Hocko <mhocko@suse.cz>
Cc: David Rientjes <rientjes@google.com>
Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Cc: azurIt <azurit@pobox.sk>
Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
System calls and kernel faults (uaccess, gup) can handle an out of memory
situation gracefully and just return -ENOMEM.
Enable the memcg OOM killer only for user faults, where it's really the
only option available.
Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
Acked-by: Michal Hocko <mhocko@suse.cz>
Cc: David Rientjes <rientjes@google.com>
Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Cc: azurIt <azurit@pobox.sk>
Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Clean up some mess made by the "Soft limit rework" series, and a few other
things.
Cc: Michal Hocko <mhocko@suse.cz>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Children in soft limit excess are currently tracked up the hierarchy in
memcg->children_in_excess. Nevertheless there still might exist tons of
groups that are not in hierarchy relation to the root cgroup (e.g. all
first level groups if root_mem_cgroup->use_hierarchy == false).
As the whole tree walk has to be done when the iteration starts at
root_mem_cgroup the iterator should be able to skip the walk if there is
no child above the limit without iterating them. This can be done
easily if the root tracks all children rather than only hierarchical
children. This is done by this patch which updates root_mem_cgroup
children_in_excess if root_mem_cgroup->use_hierarchy == false so the
root knows about all children in excess.
Please note that this is not an issue for inner memcgs which have
use_hierarchy == false because then only the single group is visited so
no special optimization is necessary.
Signed-off-by: Michal Hocko <mhocko@suse.cz>
Cc: Balbir Singh <bsingharora@gmail.com>
Cc: Glauber Costa <glommer@openvz.org>
Cc: Greg Thelen <gthelen@google.com>
Cc: Hugh Dickins <hughd@google.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
Cc: Michel Lespinasse <walken@google.com>
Cc: Tejun Heo <tj@kernel.org>
Cc: Ying Han <yinghan@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
mem_cgroup_should_soft_reclaim controls whether soft reclaim pass is
done and it always says yes currently. Memcg iterators are clever to
skip nodes that are not soft reclaimable quite efficiently but
mem_cgroup_should_soft_reclaim can be more clever and do not start the
soft reclaim pass at all if it knows that nothing would be scanned
anyway.
In order to do that, simply reuse mem_cgroup_soft_reclaim_eligible for
the target group of the reclaim and allow the pass only if the whole
subtree wouldn't be skipped.
Signed-off-by: Michal Hocko <mhocko@suse.cz>
Cc: Balbir Singh <bsingharora@gmail.com>
Cc: Glauber Costa <glommer@openvz.org>
Cc: Greg Thelen <gthelen@google.com>
Cc: Hugh Dickins <hughd@google.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
Cc: Michel Lespinasse <walken@google.com>
Cc: Tejun Heo <tj@kernel.org>
Cc: Ying Han <yinghan@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
The caller of the iterator might know that some nodes or even subtrees
should be skipped but there is no way to tell iterators about that so the
only choice left is to let iterators to visit each node and do the
selection outside of the iterating code. This, however, doesn't scale
well with hierarchies with many groups where only few groups are
interesting.
This patch adds mem_cgroup_iter_cond variant of the iterator with a
callback which gets called for every visited node. There are three
possible ways how the callback can influence the walk. Either the node is
visited, it is skipped but the tree walk continues down the tree or the
whole subtree of the current group is skipped.
[hughd@google.com: fix memcg-less page reclaim]
Signed-off-by: Michal Hocko <mhocko@suse.cz>
Cc: Balbir Singh <bsingharora@gmail.com>
Cc: Glauber Costa <glommer@openvz.org>
Cc: Greg Thelen <gthelen@google.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
Cc: Michel Lespinasse <walken@google.com>
Cc: Tejun Heo <tj@kernel.org>
Cc: Ying Han <yinghan@google.com>
Signed-off-by: Hugh Dickins <hughd@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Soft reclaim has been done only for the global reclaim (both background
and direct). Since "memcg: integrate soft reclaim tighter with zone
shrinking code" there is no reason for this limitation anymore as the soft
limit reclaim doesn't use any special code paths and it is a part of the
zone shrinking code which is used by both global and targeted reclaims.
From the semantic point of view it is natural to consider soft limit
before touching all groups in the hierarchy tree which is touching the
hard limit because soft limit tells us where to push back when there is a
memory pressure. It is not important whether the pressure comes from the
limit or imbalanced zones.
This patch simply enables soft reclaim unconditionally in
mem_cgroup_should_soft_reclaim so it is enabled for both global and
targeted reclaim paths. mem_cgroup_soft_reclaim_eligible needs to learn
about the root of the reclaim to know where to stop checking soft limit
state of parents up the hierarchy. Say we have
A (over soft limit)
\
B (below s.l., hit the hard limit)
/ \
C D (below s.l.)
B is the source of the outside memory pressure now for D but we shouldn't
soft reclaim it because it is behaving well under B subtree and we can
still reclaim from C (pressumably it is over the limit).
mem_cgroup_soft_reclaim_eligible should therefore stop climbing up the
hierarchy at B (root of the memory pressure).
Signed-off-by: Michal Hocko <mhocko@suse.cz>
Reviewed-by: Glauber Costa <glommer@openvz.org>
Reviewed-by: Tejun Heo <tj@kernel.org>
Cc: Balbir Singh <bsingharora@gmail.com>
Cc: Greg Thelen <gthelen@google.com>
Cc: Hugh Dickins <hughd@google.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
Cc: Michel Lespinasse <walken@google.com>
Cc: Ying Han <yinghan@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Now that the soft limit is integrated to the reclaim directly the whole
soft-limit tree infrastructure is not needed anymore. Rip it out.
Signed-off-by: Michal Hocko <mhocko@suse.cz>
Reviewed-by: Glauber Costa <glommer@openvz.org>
Reviewed-by: Tejun Heo <tj@kernel.org>
Cc: Balbir Singh <bsingharora@gmail.com>
Cc: Greg Thelen <gthelen@google.com>
Cc: Hugh Dickins <hughd@google.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
Cc: Michel Lespinasse <walken@google.com>
Cc: Ying Han <yinghan@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
This patchset is sitting out of tree for quite some time without any
objections. I would be really happy if it made it into 3.12. I do not
want to push it too hard but I think this work is basically ready and
waiting more doesn't help.
The basic idea is quite simple. Pull soft reclaim into shrink_zone in the
first step and get rid of the previous soft reclaim infrastructure.
shrink_zone is done in two passes now. First it tries to do the soft
limit reclaim and it falls back to reclaim-all mode if no group is over
the limit or no pages have been scanned. The second pass happens at the
same priority so the only time we waste is the memcg tree walk which has
been updated in the third step to have only negligible overhead.
As a bonus we will get rid of a _lot_ of code by this and soft reclaim
will not stand out like before when it wasn't integrated into the zone
shrinking code and it reclaimed at priority 0 (the testing results show
that some workloads suffers from such an aggressive reclaim). The clean
up is in a separate patch because I felt it would be easier to review that
way.
The second step is soft limit reclaim integration into targeted reclaim.
It should be rather straight forward. Soft limit has been used only for
the global reclaim so far but it makes sense for any kind of pressure
coming from up-the-hierarchy, including targeted reclaim.
The third step (patches 4-8) addresses the tree walk overhead by enhancing
memcg iterators to enable skipping whole subtrees and tracking number of
over soft limit children at each level of the hierarchy. This information
is updated same way the old soft limit tree was updated (from
memcg_check_events) so we shouldn't see an additional overhead. In fact
mem_cgroup_update_soft_limit is much simpler than tree manipulation done
previously.
__shrink_zone uses mem_cgroup_soft_reclaim_eligible as a predicate for
mem_cgroup_iter so the decision whether a particular group should be
visited is done at the iterator level which allows us to decide to skip
the whole subtree as well (if there is no child in excess). This reduces
the tree walk overhead considerably.
* TEST 1
========
My primary test case was a parallel kernel build with 2 groups (make is
running with -j8 with a distribution .config in a separate cgroup without
any hard limit) on a 32 CPU machine booted with 1GB memory and both builds
run taskset to Node 0 cpus.
I was mostly interested in 2 setups. Default - no soft limit set and -
and 0 soft limit set to both groups. The first one should tell us whether
the rework regresses the default behavior while the second one should show
us improvements in an extreme case where both workloads are always over
the soft limit.
/usr/bin/time -v has been used to collect the statistics and each
configuration had 3 runs after fresh boot without any other load on the
system.
base is mmotm-2013-07-18-16-40
rework all 8 patches applied on top of base
* No-limit
User
no-limit/base: min: 651.92 max: 672.65 avg: 664.33 std: 8.01 runs: 6
no-limit/rework: min: 657.34 [100.8%] max: 668.39 [99.4%] avg: 663.13 [99.8%] std: 3.61 runs: 6
System
no-limit/base: min: 69.33 max: 71.39 avg: 70.32 std: 0.79 runs: 6
no-limit/rework: min: 69.12 [99.7%] max: 71.05 [99.5%] avg: 70.04 [99.6%] std: 0.59 runs: 6
Elapsed
no-limit/base: min: 398.27 max: 422.36 avg: 408.85 std: 7.74 runs: 6
no-limit/rework: min: 386.36 [97.0%] max: 438.40 [103.8%] avg: 416.34 [101.8%] std: 18.85 runs: 6
The results are within noise. Elapsed time has a bigger variance but the
average looks good.
* 0-limit
User
0-limit/base: min: 573.76 max: 605.63 avg: 585.73 std: 12.21 runs: 6
0-limit/rework: min: 645.77 [112.6%] max: 666.25 [110.0%] avg: 656.97 [112.2%] std: 7.77 runs: 6
System
0-limit/base: min: 69.57 max: 71.13 avg: 70.29 std: 0.54 runs: 6
0-limit/rework: min: 68.68 [98.7%] max: 71.40 [100.4%] avg: 69.91 [99.5%] std: 0.87 runs: 6
Elapsed
0-limit/base: min: 1306.14 max: 1550.17 avg: 1430.35 std: 90.86 runs: 6
0-limit/rework: min: 404.06 [30.9%] max: 465.94 [30.1%] avg: 434.81 [30.4%] std: 22.68 runs: 6
The improvement is really huge here (even bigger than with my previous
testing and I suspect that this highly depends on the storage). Page
fault statistics tell us at least part of the story:
Minor
0-limit/base: min: 37180461.00 max: 37319986.00 avg: 37247470.00 std: 54772.71 runs: 6
0-limit/rework: min: 36751685.00 [98.8%] max: 36805379.00 [98.6%] avg: 36774506.33 [98.7%] std: 17109.03 runs: 6
Major
0-limit/base: min: 170604.00 max: 221141.00 avg: 196081.83 std: 18217.01 runs: 6
0-limit/rework: min: 2864.00 [1.7%] max: 10029.00 [4.5%] avg: 5627.33 [2.9%] std: 2252.71 runs: 6
Same as with my previous testing Minor faults are more or less within
noise but Major fault count is way bellow the base kernel.
While this looks as a nice win it is fair to say that 0-limit
configuration is quite artificial. So I was playing with 0-no-limit
loads as well.
* TEST 2
========
The following results are from 2 groups configuration on a 16GB machine
(single NUMA node).
- A running stream IO (dd if=/dev/zero of=local.file bs=1024) with
2*TotalMem with 0 soft limit.
- B running a mem_eater which consumes TotalMem-1G without any limit. The
mem_eater consumes the memory in 100 chunks with 1s nap after each
mmap+poppulate so that both loads have chance to fight for the memory.
The expected result is that B shouldn't be reclaimed and A shouldn't see
a big dropdown in elapsed time.
User
base: min: 2.68 max: 2.89 avg: 2.76 std: 0.09 runs: 3
rework: min: 3.27 [122.0%] max: 3.74 [129.4%] avg: 3.44 [124.6%] std: 0.21 runs: 3
System
base: min: 86.26 max: 88.29 avg: 87.28 std: 0.83 runs: 3
rework: min: 81.05 [94.0%] max: 84.96 [96.2%] avg: 83.14 [95.3%] std: 1.61 runs: 3
Elapsed
base: min: 317.28 max: 332.39 avg: 325.84 std: 6.33 runs: 3
rework: min: 281.53 [88.7%] max: 298.16 [89.7%] avg: 290.99 [89.3%] std: 6.98 runs: 3
System time improved slightly as well as Elapsed. My previous testing
has shown worse numbers but this again seem to depend on the storage
speed.
My theory is that the writeback doesn't catch up and prio-0 soft reclaim
falls into wait on writeback page too often in the base kernel. The
patched kernel doesn't do that because the soft reclaim is done from the
kswapd/direct reclaim context. This can be seen on the following graph
nicely. The A's group usage_in_bytes regurarly drops really low very often.
All 3 runs
http://labs.suse.cz/mhocko/soft_limit_rework/stream_io-vs-mem_eater/stream.png
resp. a detail of the single run
http://labs.suse.cz/mhocko/soft_limit_rework/stream_io-vs-mem_eater/stream-one-run.png
mem_eater seems to be doing better as well. It gets to the full
allocation size faster as can be seen on the following graph:
http://labs.suse.cz/mhocko/soft_limit_rework/stream_io-vs-mem_eater/mem_eater-one-run.png
/proc/meminfo collected during the test also shows that rework kernel
hasn't swapped that much (well almost not at all):
base: max: 123900 K avg: 56388.29 K
rework: max: 300 K avg: 128.68 K
kswapd and direct reclaim statistics are of no use unfortunatelly because
soft reclaim is not accounted properly as the counters are hidden by
global_reclaim() checks in the base kernel.
* TEST 3
========
Another test was the same configuration as TEST2 except the stream IO was
replaced by a single kbuild (16 parallel jobs bound to Node0 cpus same as
in TEST1) and mem_eater allocated TotalMem-200M so kbuild had only 200MB
left.
Kbuild did better with the rework kernel here as well:
User
base: min: 860.28 max: 872.86 avg: 868.03 std: 5.54 runs: 3
rework: min: 880.81 [102.4%] max: 887.45 [101.7%] avg: 883.56 [101.8%] std: 2.83 runs: 3
System
base: min: 84.35 max: 85.06 avg: 84.79 std: 0.31 runs: 3
rework: min: 85.62 [101.5%] max: 86.09 [101.2%] avg: 85.79 [101.2%] std: 0.21 runs: 3
Elapsed
base: min: 135.36 max: 243.30 avg: 182.47 std: 45.12 runs: 3
rework: min: 110.46 [81.6%] max: 116.20 [47.8%] avg: 114.15 [62.6%] std: 2.61 runs: 3
Minor
base: min: 36635476.00 max: 36673365.00 avg: 36654812.00 std: 15478.03 runs: 3
rework: min: 36639301.00 [100.0%] max: 36695541.00 [100.1%] avg: 36665511.00 [100.0%] std: 23118.23 runs: 3
Major
base: min: 14708.00 max: 53328.00 avg: 31379.00 std: 16202.24 runs: 3
rework: min: 302.00 [2.1%] max: 414.00 [0.8%] avg: 366.33 [1.2%] std: 47.22 runs: 3
Again we can see a significant improvement in Elapsed (it also seems to
be more stable), there is a huge dropdown for the Major page faults and
much more swapping:
base: max: 583736 K avg: 112547.43 K
rework: max: 4012 K avg: 124.36 K
Graphs from all three runs show the variability of the kbuild quite
nicely. It even seems that it took longer after every run with the base
kernel which would be quite surprising as the source tree for the build is
removed and caches are dropped after each run so the build operates on a
freshly extracted sources everytime.
http://labs.suse.cz/mhocko/soft_limit_rework/stream_io-vs-mem_eater/kbuild-mem_eater.png
My other testing shows that this is just a matter of timing and other runs
behave differently the std for Elapsed time is similar ~50. Example of
other three runs:
http://labs.suse.cz/mhocko/soft_limit_rework/stream_io-vs-mem_eater/kbuild-mem_eater2.png
So to wrap this up. The series is still doing good and improves the soft
limit.
The testing results for bunch of cgroups with both stream IO and kbuild
loads can be found in "memcg: track children in soft limit excess to
improve soft limit".
This patch:
Memcg soft reclaim has been traditionally triggered from the global
reclaim paths before calling shrink_zone. mem_cgroup_soft_limit_reclaim
then picked up a group which exceeds the soft limit the most and reclaimed
it with 0 priority to reclaim at least SWAP_CLUSTER_MAX pages.
The infrastructure requires per-node-zone trees which hold over-limit
groups and keep them up-to-date (via memcg_check_events) which is not cost
free. Although this overhead hasn't turned out to be a bottle neck the
implementation is suboptimal because mem_cgroup_update_tree has no idea
which zones consumed memory over the limit so we could easily end up
having a group on a node-zone tree having only few pages from that
node-zone.
This patch doesn't try to fix node-zone trees management because it seems
that integrating soft reclaim into zone shrinking sounds much easier and
more appropriate for several reasons. First of all 0 priority reclaim was
a crude hack which might lead to big stalls if the group's LRUs are big
and hard to reclaim (e.g. a lot of dirty/writeback pages). Soft reclaim
should be applicable also to the targeted reclaim which is awkward right
now without additional hacks. Last but not least the whole infrastructure
eats quite some code.
After this patch shrink_zone is done in 2 passes. First it tries to do
the soft reclaim if appropriate (only for global reclaim for now to keep
compatible with the original state) and fall back to ignoring soft limit
if no group is eligible to soft reclaim or nothing has been scanned during
the first pass. Only groups which are over their soft limit or any of
their parents up the hierarchy is over the limit are considered eligible
during the first pass.
Soft limit tree which is not necessary anymore will be removed in the
follow up patch to make this patch smaller and easier to review.
Signed-off-by: Michal Hocko <mhocko@suse.cz>
Reviewed-by: Glauber Costa <glommer@openvz.org>
Reviewed-by: Tejun Heo <tj@kernel.org>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Cc: Ying Han <yinghan@google.com>
Cc: Hugh Dickins <hughd@google.com>
Cc: Michel Lespinasse <walken@google.com>
Cc: Greg Thelen <gthelen@google.com>
Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
Cc: Balbir Singh <bsingharora@gmail.com>
Cc: Glauber Costa <glommer@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
vfs guarantees the cgroup won't be destroyed, so it's redundant to get a
css reference.
Signed-off-by: Li Zefan <lizefan@huawei.com>
Acked-by: Michal Hocko <mhocko@suse.cz>
Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Tejun Heo <tj@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>