cgroup v2 is in the process of growing thread granularity support. A
threaded subtree is composed of a thread root and threaded cgroups
which are proper members of the subtree.
The root cgroup of the subtree serves as the domain cgroup to which
the processes (as opposed to threads / tasks) of the subtree
conceptually belong and domain-level resource consumptions not tied to
any specific task are charged. Inside the subtree, threads won't be
subject to process granularity or no-internal-task constraint and can
be distributed arbitrarily across the subtree.
This patch introduces cgroup->dom_cgrp along with threaded css_set
handling.
* cgroup->dom_cgrp points to self for normal and thread roots. For
proper thread subtree members, points to the dom_cgrp (the thread
root).
* css_set->dom_cset points to self if for normal and thread roots. If
threaded, points to the css_set which belongs to the cgrp->dom_cgrp.
The dom_cgrp serves as the resource domain and keeps the matching
csses available. The dom_cset holds those csses and makes them
easily accessible.
* All threaded csets are linked on their dom_csets to enable iteration
of all threaded tasks.
* cgroup->nr_threaded_children keeps track of the number of threaded
children.
This patch adds the above but doesn't actually use them yet. The
following patches will build on top.
v4: ->nr_threaded_children added.
v3: ->proc_cgrp/cset renamed to ->dom_cgrp/cset. Updated for the new
enable-threaded-per-cgroup behavior.
v2: Added cgroup_is_threaded() helper.
Signed-off-by: Tejun Heo <tj@kernel.org>
css_task_iter currently always walks all tasks. With the scheduled
cgroup v2 thread support, the iterator would need to handle multiple
types of iteration. As a preparation, add @flags to
css_task_iter_start() and implement CSS_TASK_ITER_PROCS. If the flag
is not specified, it walks all tasks as before. When asserted, the
iterator only walks the group leaders.
For now, the only user of the flag is cgroup v2 "cgroup.procs" file
which no longer needs to skip non-leader tasks in cgroup_procs_next().
Note that cgroup v1 "cgroup.procs" can't use the group leader walk as
v1 "cgroup.procs" doesn't mean "list all thread group leaders in the
cgroup" but "list all thread group id's with any threads in the
cgroup".
While at it, update cgroup_procs_show() to use task_pid_vnr() instead
of task_tgid_vnr(). As the iteration guarantees that the function
only sees group leaders, this doesn't change the output and will allow
sharing the function for thread iteration.
Signed-off-by: Tejun Heo <tj@kernel.org>
Currently, writes "cgroup.procs" and "cgroup.tasks" files are all
handled by __cgroup_procs_write() on both v1 and v2. This patch
reoragnizes the write path so that there are common helper functions
that different write paths use.
While this somewhat increases LOC, the different paths are no longer
intertwined and each path has more flexibility to implement different
behaviors which will be necessary for the planned v2 thread support.
v3: - Restructured so that cgroup_procs_write_permission() takes
@src_cgrp and @dst_cgrp.
v2: - Rolled in Waiman's task reference count fix.
- Updated on top of nsdelegate changes.
Signed-off-by: Tejun Heo <tj@kernel.org>
Cc: Waiman Long <longman@redhat.com>
On subsystem registration, css_populate_dir() is not called on the new
root css, so the interface files for the subsystem on cgrp_dfl_root
aren't created on registration. This is a residue from the days when
cgrp_dfl_root was used only as the parking spot for unused subsystems,
which no longer is true as it's used as the root for cgroup2.
This is often fine as later operations tend to create them as a part
of mount (cgroup1) or subtree_control operations (cgroup2); however,
it's not difficult to mount cgroup2 with the controller interface
files missing as Waiman found out.
Fix it by invoking css_populate_dir() on the root css on subsys
registration.
Signed-off-by: Tejun Heo <tj@kernel.org>
Reported-and-tested-by: Waiman Long <longman@redhat.com>
Cc: stable@vger.kernel.org # v4.5+
Signed-off-by: Tejun Heo <tj@kernel.org>
Implement trivial cgroup_has_tasks() which tests whether
cgrp->nr_populated_csets is zero and replace the explicit local
populated test in cgroup_subtree_control(). This simplifies the code
and cgroup_has_tasks() will be used in more places later.
Signed-off-by: Tejun Heo <tj@kernel.org>
cgrp->populated_cnt counts both local (the cgroup's populated
css_sets) and subtree proper (populated children) so that it's only
zero when the whole subtree, including self, is empty.
This patch splits the counter into two so that local and children
populated states are tracked separately. It allows finer-grained
tests on the state of the hierarchy which will be used to replace
css_set walking local populated test.
Signed-off-by: Tejun Heo <tj@kernel.org>
Subsystem migration methods shouldn't be called for empty migrations.
cgroup_migrate_execute() implements this guarantee by bailing early if
there are no source css_sets. This used to be correct before
a79a908fd2 ("cgroup: introduce cgroup namespaces"), but no longer
since the commit because css_sets can stay pinned without tasks in
them.
This caused cgroup_migrate_execute() call into cpuset migration
methods with an empty cgroup_taskset. cpuset migration methods
correctly assume that cgroup_taskset_first() never returns NULL;
however, due to the bug, it can, leading to the following oops.
Unable to handle kernel paging request for data at address 0x00000960
Faulting instruction address: 0xc0000000001d6868
Oops: Kernel access of bad area, sig: 11 [#1]
...
CPU: 14 PID: 16947 Comm: kworker/14:0 Tainted: G W
4.12.0-rc4-next-20170609 #2
Workqueue: events cpuset_hotplug_workfn
task: c00000000ca60580 task.stack: c00000000c728000
NIP: c0000000001d6868 LR: c0000000001d6858 CTR: c0000000001d6810
REGS: c00000000c72b720 TRAP: 0300 Tainted: GW (4.12.0-rc4-next-20170609)
MSR: 8000000000009033 <SF,EE,ME,IR,DR,RI,LE> CR: 44722422 XER: 20000000
CFAR: c000000000008710 DAR: 0000000000000960 DSISR: 40000000 SOFTE: 1
GPR00: c0000000001d6858 c00000000c72b9a0 c000000001536e00 0000000000000000
GPR04: c00000000c72b9c0 0000000000000000 c00000000c72bad0 c000000766367678
GPR08: c000000766366d10 c00000000c72b958 c000000001736e00 0000000000000000
GPR12: c0000000001d6810 c00000000e749300 c000000000123ef8 c000000775af4180
GPR16: 0000000000000000 0000000000000000 c00000075480e9c0 c00000075480e9e0
GPR20: c00000075480e8c0 0000000000000001 0000000000000000 c00000000c72ba20
GPR24: c00000000c72baa0 c00000000c72bac0 c000000001407248 c00000000c72ba20
GPR28: c00000000141fc80 c00000000c72bac0 c00000000c6bc790 0000000000000000
NIP [c0000000001d6868] cpuset_can_attach+0x58/0x1b0
LR [c0000000001d6858] cpuset_can_attach+0x48/0x1b0
Call Trace:
[c00000000c72b9a0] [c0000000001d6858] cpuset_can_attach+0x48/0x1b0 (unreliable)
[c00000000c72ba00] [c0000000001cbe80] cgroup_migrate_execute+0xb0/0x450
[c00000000c72ba80] [c0000000001d3754] cgroup_transfer_tasks+0x1c4/0x360
[c00000000c72bba0] [c0000000001d923c] cpuset_hotplug_workfn+0x86c/0xa20
[c00000000c72bca0] [c00000000011aa44] process_one_work+0x1e4/0x580
[c00000000c72bd30] [c00000000011ae78] worker_thread+0x98/0x5c0
[c00000000c72bdc0] [c000000000124058] kthread+0x168/0x1b0
[c00000000c72be30] [c00000000000b2e8] ret_from_kernel_thread+0x5c/0x74
Instruction dump:
f821ffa1 7c7d1b78 60000000 60000000 38810020 7fa3eb78 3f42ffed 4bff4c25
60000000 3b5a0448 3d420020 eb610020 <e9230960> 7f43d378 e9290000 f92af200
---[ end trace dcaaf98fb36d9e64 ]---
This patch fixes the bug by adding an explicit nr_tasks counter to
cgroup_taskset and skipping calling the migration methods if the
counter is zero. While at it, remove the now spurious check on no
source css_sets.
Signed-off-by: Tejun Heo <tj@kernel.org>
Reported-and-tested-by: Abdul Haleem <abdhalee@linux.vnet.ibm.com>
Cc: Roman Gushchin <guro@fb.com>
Cc: stable@vger.kernel.org # v4.6+
Fixes: a79a908fd2 ("cgroup: introduce cgroup namespaces")
Link: http://lkml.kernel.org/r/1497266622.15415.39.camel@abdul.in.ibm.com
When updating task's mems_allowed and rebinding its mempolicy due to
cpuset's mems being changed, we currently only take the seqlock for
writing when either the task has a mempolicy, or the new mems has no
intersection with the old mems.
This should be enough to prevent a parallel allocation seeing no
available nodes, but the optimization is IMHO unnecessary (cpuset
updates should not be frequent), and we still potentially risk issues if
the intersection of new and old nodes has limited amount of
free/reclaimable memory.
Let's just use the seqlock for all tasks.
Link: http://lkml.kernel.org/r/20170517081140.30654-6-vbabka@suse.cz
Signed-off-by: Vlastimil Babka <vbabka@suse.cz>
Acked-by: Michal Hocko <mhocko@suse.com>
Cc: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: Anshuman Khandual <khandual@linux.vnet.ibm.com>
Cc: Christoph Lameter <cl@linux.com>
Cc: David Rientjes <rientjes@google.com>
Cc: Dimitri Sivanich <sivanich@sgi.com>
Cc: Hugh Dickins <hughd@google.com>
Cc: Li Zefan <lizefan@huawei.com>
Cc: Mel Gorman <mgorman@techsingularity.net>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Commit c0ff7453bb ("cpuset,mm: fix no node to alloc memory when
changing cpuset's mems") has introduced a two-step protocol when
rebinding task's mempolicy due to cpuset update, in order to avoid a
parallel allocation seeing an empty effective nodemask and failing.
Later, commit cc9a6c8776 ("cpuset: mm: reduce large amounts of memory
barrier related damage v3") introduced a seqlock protection and removed
the synchronization point between the two update steps. At that point
(or perhaps later), the two-step rebinding became unnecessary.
Currently it only makes sure that the update first adds new nodes in
step 1 and then removes nodes in step 2. Without memory barriers the
effects are questionable, and even then this cannot prevent a parallel
zonelist iteration checking the nodemask at each step to observe all
nodes as unusable for allocation. We now fully rely on the seqlock to
prevent premature OOMs and allocation failures.
We can thus remove the two-step update parts and simplify the code.
Link: http://lkml.kernel.org/r/20170517081140.30654-5-vbabka@suse.cz
Signed-off-by: Vlastimil Babka <vbabka@suse.cz>
Acked-by: Michal Hocko <mhocko@suse.com>
Cc: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: Anshuman Khandual <khandual@linux.vnet.ibm.com>
Cc: Christoph Lameter <cl@linux.com>
Cc: David Rientjes <rientjes@google.com>
Cc: Dimitri Sivanich <sivanich@sgi.com>
Cc: Hugh Dickins <hughd@google.com>
Cc: Li Zefan <lizefan@huawei.com>
Cc: Mel Gorman <mgorman@techsingularity.net>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Currently, cgroup only supports delegation to !root users and cgroup
namespaces don't get any special treatments. This limits the
usefulness of cgroup namespaces as they by themselves can't be safe
delegation boundaries. A process inside a cgroup can change the
resource control knobs of the parent in the namespace root and may
move processes in and out of the namespace if cgroups outside its
namespace are visible somehow.
This patch adds a new mount option "nsdelegate" which makes cgroup
namespaces delegation boundaries. If set, cgroup behaves as if write
permission based delegation took place at namespace boundaries -
writes to the resource control knobs from the namespace root are
denied and migration crossing the namespace boundary aren't allowed
from inside the namespace.
This allows cgroup namespace to function as a delegation boundary by
itself.
v2: Silently ignore nsdelegate specified on !init mounts.
Signed-off-by: Tejun Heo <tj@kernel.org>
Cc: Aravind Anbudurai <aru7@fb.com>
Cc: Serge Hallyn <serge@hallyn.com>
Cc: Eric Biederman <ebiederm@xmission.com>
Restructure cgroup_procs_write_permission() to make extending
permission logic easier.
This patch doesn't cause any functional changes.
Signed-off-by: Tejun Heo <tj@kernel.org>
The debug controller grabs cgroup_mutex from interface file show
functions which can deadlock and triggers lockdep warnings. Fix it by
using cgroup_kn_lock_live()/cgroup_kn_unlock() instead.
Signed-off-by: Tejun Heo <tj@kernel.org>
Cc: Waiman Long <longman@redhat.com>
Factor out cgroup_masks_read_one() out of cgroup_masks_read() for
simplicity.
Signed-off-by: Tejun Heo <tj@kernel.org>
Cc: Waiman Long <longman@redhat.com>
Make debug an implicit controller on cgroup2 which is enabled by
"cgroup_debug" boot param.
Signed-off-by: Tejun Heo <tj@kernel.org>
Cc: Waiman Long <longman@redhat.com>
Besides supporting cgroup v2 and thread mode, the following changes
are also made:
1) current_* cgroup files now resides only at the root as we don't
need duplicated files of the same function all over the cgroup
hierarchy.
2) The cgroup_css_links_read() function is modified to report
the number of tasks that are skipped because of overflow.
3) The number of extra unaccounted references are displayed.
4) The current_css_set_read() function now prints out the addresses of
the css'es associated with the current css_set.
5) A new cgroup_subsys_states file is added to display the css objects
associated with a cgroup.
6) A new cgroup_masks file is added to display the various controller
bit masks in the cgroup.
tj: Dropped thread mode related information for now so that debug
controller changes aren't blocked on the thread mode.
Signed-off-by: Waiman Long <longman@redhat.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
The Kconfig prompt and description of the debug cgroup controller
more accurate by saying that it is for debug purpose only and its
interfaces are unstable.
Signed-off-by: Waiman Long <longman@redhat.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
The debug cgroup currently resides within cgroup-v1.c and is enabled
only for v1 cgroup. To enable the debug cgroup also for v2, it makes
sense to put the code into its own file as it will no longer be v1
specific. There is no change to the debug cgroup specific code.
Signed-off-by: Waiman Long <longman@redhat.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
The reference count in the css_set data structure was used as a
proxy of the number of tasks attached to that css_set. However, that
count is actually not an accurate measure especially with thread mode
support. So a new variable nr_tasks is added to the css_set to keep
track of the actual task count. This new variable is protected by
the css_set_lock. Functions that require the actual task count are
updated to use the new variable.
tj: s/task_count/nr_tasks/ for consistency with cgroup_root->nr_cgrps.
Refreshed on top of cgroup/for-v4.13 which dropped on
css_set_populated() -> nr_tasks conversion.
Signed-off-by: Waiman Long <longman@redhat.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
In most cases, a cgroup controller don't care about the liftimes of
cgroups. For the controller, a css becomes online when ->css_online()
is called on it and offline when ->css_offline() is called.
However, cpuset is special in that the user interface it exposes cares
whether certain cgroups exist or not. Combined with the RCU delay
between cgroup removal and css offlining, this can lead to user
visible behavior oddities where operations which should succeed after
cgroup removals fail for some time period. The effects of cgroup
removals are delayed when seen from userland.
This patch adds css_is_dying() which tests whether offline is pending
and updates is_cpuset_online() so that the function returns false also
while offline is pending. This gets rid of the userland visible
delays.
Signed-off-by: Tejun Heo <tj@kernel.org>
Reported-by: Daniel Jordan <daniel.m.jordan@oracle.com>
Link: http://lkml.kernel.org/r/327ca1f5-7957-fbb9-9e5f-9ba149d40ba2@oracle.com
Cc: stable@vger.kernel.org
Signed-off-by: Tejun Heo <tj@kernel.org>
The kill_css() function may be called more than once under the condition
that the css was killed but not physically removed yet followed by the
removal of the cgroup that is hosting the css. This patch prevents any
harmm from being done when that happens.
Signed-off-by: Waiman Long <longman@redhat.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
Cc: stable@vger.kernel.org # v4.5+
Pull cgroup updates from Tejun Heo:
"Nothing major. Two notable fixes are Li's second stab at fixing the
long-standing race condition in the mount path and suppression of
spurious warning from cgroup_get(). All other changes are trivial"
* 'for-4.12' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/cgroup:
cgroup: mark cgroup_get() with __maybe_unused
cgroup: avoid attaching a cgroup root to two different superblocks, take 2
cgroup: fix spurious warnings on cgroup_is_dead() from cgroup_sk_alloc()
cgroup: move cgroup_subsys_state parent field for cache locality
cpuset: Remove cpuset_update_active_cpus()'s parameter.
cgroup: switch to BUG_ON()
cgroup: drop duplicate header nsproxy.h
kernel: convert css_set.refcount from atomic_t to refcount_t
kernel: convert cgroup_namespace.count from atomic_t to refcount_t
a590b90d47 ("cgroup: fix spurious warnings on cgroup_is_dead() from
cgroup_sk_alloc()") converted most cgroup_get() usages to
cgroup_get_live() leaving cgroup_sk_alloc() the sole user of
cgroup_get(). When !CONFIG_SOCK_CGROUP_DATA, this ends up triggering
unused warning for cgroup_get().
Silence the warning by adding __maybe_unused to cgroup_get().
Reported-by: Stephen Rothwell <sfr@canb.auug.org.au>
Link: http://lkml.kernel.org/r/20170501145340.17e8ef86@canb.auug.org.au
Signed-off-by: Tejun Heo <tj@kernel.org>
Commit bfb0b80db5 ("cgroup: avoid attaching a cgroup root to two
different superblocks") is broken. Now we try to fix the race by
delaying the initialization of cgroup root refcnt until a superblock
has been allocated.
Reported-by: Dmitry Vyukov <dvyukov@google.com>
Reported-by: Andrei Vagin <avagin@virtuozzo.com>
Tested-by: Andrei Vagin <avagin@virtuozzo.com>
Signed-off-by: Zefan Li <lizefan@huawei.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
cgroup_get() expected to be called only on live cgroups and triggers
warning on a dead cgroup; however, cgroup_sk_alloc() may be called
while cloning a socket which is left in an empty and removed cgroup
and thus may legitimately duplicate its reference on a dead cgroup.
This currently triggers the following warning spuriously.
WARNING: CPU: 14 PID: 0 at kernel/cgroup.c:490 cgroup_get+0x55/0x60
...
[<ffffffff8107e123>] __warn+0xd3/0xf0
[<ffffffff8107e20e>] warn_slowpath_null+0x1e/0x20
[<ffffffff810ff465>] cgroup_get+0x55/0x60
[<ffffffff81106061>] cgroup_sk_alloc+0x51/0xe0
[<ffffffff81761beb>] sk_clone_lock+0x2db/0x390
[<ffffffff817cce06>] inet_csk_clone_lock+0x16/0xc0
[<ffffffff817e8173>] tcp_create_openreq_child+0x23/0x4b0
[<ffffffff818601a1>] tcp_v6_syn_recv_sock+0x91/0x670
[<ffffffff817e8b16>] tcp_check_req+0x3a6/0x4e0
[<ffffffff81861ba3>] tcp_v6_rcv+0x693/0xa00
[<ffffffff81837429>] ip6_input_finish+0x59/0x3e0
[<ffffffff81837cb2>] ip6_input+0x32/0xb0
[<ffffffff81837387>] ip6_rcv_finish+0x57/0xa0
[<ffffffff81837ac8>] ipv6_rcv+0x318/0x4d0
[<ffffffff817778c7>] __netif_receive_skb_core+0x2d7/0x9a0
[<ffffffff81777fa6>] __netif_receive_skb+0x16/0x70
[<ffffffff81778023>] netif_receive_skb_internal+0x23/0x80
[<ffffffff817787d8>] napi_gro_frags+0x208/0x270
[<ffffffff8168a9ec>] mlx4_en_process_rx_cq+0x74c/0xf40
[<ffffffff8168b270>] mlx4_en_poll_rx_cq+0x30/0x90
[<ffffffff81778b30>] net_rx_action+0x210/0x350
[<ffffffff8188c426>] __do_softirq+0x106/0x2c7
[<ffffffff81082bad>] irq_exit+0x9d/0xa0 [<ffffffff8188c0e4>] do_IRQ+0x54/0xd0
[<ffffffff8188a63f>] common_interrupt+0x7f/0x7f <EOI>
[<ffffffff8173d7e7>] cpuidle_enter+0x17/0x20
[<ffffffff810bdfd9>] cpu_startup_entry+0x2a9/0x2f0
[<ffffffff8103edd1>] start_secondary+0xf1/0x100
This patch renames the existing cgroup_get() with the dead cgroup
warning to cgroup_get_live() after cgroup_kn_lock_live() and
introduces the new cgroup_get() which doesn't check whether the cgroup
is live or dead.
All existing cgroup_get() users except for cgroup_sk_alloc() are
converted to use cgroup_get_live().
Fixes: d979a39d72 ("cgroup: duplicate cgroup reference when cloning sockets")
Cc: stable@vger.kernel.org # v4.5+
Cc: Johannes Weiner <hannes@cmpxchg.org>
Reported-by: Chris Mason <clm@fb.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
Pull cgroup fix from Tejun Heo:
"Unfortunately, the commit to fix the cgroup mount race in the previous
pull request can lead to hangs.
The original bug has been around for a while and isn't too likely to
be triggered in usual use cases. Revert the commit for now"
* 'for-4.11-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/cgroup:
Revert "cgroup: avoid attaching a cgroup root to two different superblocks"
This reverts commit bfb0b80db5.
Andrei reports CRIU test hangs with the patch applied. The bug fixed
by the patch isn't too likely to trigger in actual uses. Revert the
patch for now.
Signed-off-by: Tejun Heo <tj@kernel.org>
Reported-by: Andrei Vagin <avagin@virtuozzo.com>
Link: http://lkml.kernel.org/r/20170414232737.GC20350@outlook.office365.com
Pull cgroup fixes from Tejun Heo:
"This contains fixes for two long standing subtle bugs:
- kthread_bind() on a new kthread binds it to specific CPUs and
prevents userland from messing with the affinity or cgroup
membership. Unfortunately, for cgroup membership, there's a window
between kthread creation and kthread_bind*() invocation where the
kthread can be moved into a non-root cgroup by userland.
Depending on what controllers are in effect, this can assign the
kthread unexpected attributes. For example, in the reported case,
workqueue workers ended up in a non-root cpuset cgroups and had
their CPU affinities overridden. This broke workqueue invariants
and led to workqueue stalls.
Fixed by closing the window between kthread creation and
kthread_bind() as suggested by Oleg.
- There was a bug in cgroup mount path which could allow two
competing mount attempts to attach the same cgroup_root to two
different superblocks.
This was caused by mishandling return value from kernfs_pin_sb().
Fixed"
* 'for-4.11-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/cgroup:
cgroup: avoid attaching a cgroup root to two different superblocks
cgroup, kthread: close race window where new kthreads can be migrated to non-root cgroups
Run this:
touch file0
for ((; ;))
{
mount -t cpuset xxx file0
}
And this concurrently:
touch file1
for ((; ;))
{
mount -t cpuset xxx file1
}
We'll trigger a warning like this:
------------[ cut here ]------------
WARNING: CPU: 1 PID: 4675 at lib/percpu-refcount.c:317 percpu_ref_kill_and_confirm+0x92/0xb0
percpu_ref_kill_and_confirm called more than once on css_release!
CPU: 1 PID: 4675 Comm: mount Not tainted 4.11.0-rc5+ #5
Hardware name: Bochs Bochs, BIOS Bochs 01/01/2007
Call Trace:
dump_stack+0x63/0x84
__warn+0xd1/0xf0
warn_slowpath_fmt+0x5f/0x80
percpu_ref_kill_and_confirm+0x92/0xb0
cgroup_kill_sb+0x95/0xb0
deactivate_locked_super+0x43/0x70
deactivate_super+0x46/0x60
...
---[ end trace a79f61c2a2633700 ]---
Here's a race:
Thread A Thread B
cgroup1_mount()
# alloc a new cgroup root
cgroup_setup_root()
cgroup1_mount()
# no sb yet, returns NULL
kernfs_pin_sb()
# but succeeds in getting the refcnt,
# so re-use cgroup root
percpu_ref_tryget_live()
# alloc sb with cgroup root
cgroup_do_mount()
cgroup_kill_sb()
# alloc another sb with same root
cgroup_do_mount()
cgroup_kill_sb()
We end up using the same cgroup root for two different superblocks,
so percpu_ref_kill() will be called twice on the same root when the
two superblocks are destroyed.
We should fix to make sure the superblock pinning is really successful.
Cc: stable@vger.kernel.org # 3.16+
Reported-by: Dmitry Vyukov <dvyukov@google.com>
Signed-off-by: Zefan Li <lizefan@huawei.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
In cpuset_update_active_cpus(), cpu_online isn't used anymore. Remove
it.
Signed-off-by: Rakib Mullick<rakib.mullick@gmail.com>
Acked-by: Zefan Li <lizefan@huawei.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
Use BUG_ON() rather than an explicit if followed by BUG() for
improved readability and also consistency.
Signed-off-by: Nicholas Mc Guire <der.herr@hofr.at>
Signed-off-by: Tejun Heo <tj@kernel.org>
Creation of a kthread goes through a couple interlocked stages between
the kthread itself and its creator. Once the new kthread starts
running, it initializes itself and wakes up the creator. The creator
then can further configure the kthread and then let it start doing its
job by waking it up.
In this configuration-by-creator stage, the creator is the only one
that can wake it up but the kthread is visible to userland. When
altering the kthread's attributes from userland is allowed, this is
fine; however, for cases where CPU affinity is critical,
kthread_bind() is used to first disable affinity changes from userland
and then set the affinity. This also prevents the kthread from being
migrated into non-root cgroups as that can affect the CPU affinity and
many other things.
Unfortunately, the cgroup side of protection is racy. While the
PF_NO_SETAFFINITY flag prevents further migrations, userland can win
the race before the creator sets the flag with kthread_bind() and put
the kthread in a non-root cgroup, which can lead to all sorts of
problems including incorrect CPU affinity and starvation.
This bug got triggered by userland which periodically tries to migrate
all processes in the root cpuset cgroup to a non-root one. Per-cpu
workqueue workers got caught while being created and ended up with
incorrected CPU affinity breaking concurrency management and sometimes
stalling workqueue execution.
This patch adds task->no_cgroup_migration which disallows the task to
be migrated by userland. kthreadd starts with the flag set making
every child kthread start in the root cgroup with migration
disallowed. The flag is cleared after the kthread finishes
initialization by which time PF_NO_SETAFFINITY is set if the kthread
should stay in the root cgroup.
It'd be better to wait for the initialization instead of failing but I
couldn't think of a way of implementing that without adding either a
new PF flag, or sleeping and retrying from waiting side. Even if
userland depends on changing cgroup membership of a kthread, it either
has to be synchronized with kthread_create() or periodically repeat,
so it's unlikely that this would break anything.
v2: Switch to a simpler implementation using a new task_struct bit
field suggested by Oleg.
Signed-off-by: Tejun Heo <tj@kernel.org>
Suggested-by: Oleg Nesterov <oleg@redhat.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Reported-and-debugged-by: Chris Mason <clm@fb.com>
Cc: stable@vger.kernel.org # v4.3+ (we can't close the race on < v4.3)
Signed-off-by: Tejun Heo <tj@kernel.org>
Fix typos and add the following to the scripts/spelling.txt:
disble||disable
disbled||disabled
I kept the TSL2563_INT_DISBLED in /drivers/iio/light/tsl2563.c
untouched. The macro is not referenced at all, but this commit is
touching only comment blocks just in case.
Link: http://lkml.kernel.org/r/1481573103-11329-20-git-send-email-yamada.masahiro@socionext.com
Signed-off-by: Masahiro Yamada <yamada.masahiro@socionext.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
refcount_t type and corresponding API should be
used instead of atomic_t when the variable is used as
a reference counter. This allows to avoid accidental
refcounter overflows that might lead to use-after-free
situations.
Signed-off-by: Elena Reshetova <elena.reshetova@intel.com>
Signed-off-by: Hans Liljestrand <ishkamiel@gmail.com>
Signed-off-by: Kees Cook <keescook@chromium.org>
Signed-off-by: David Windsor <dwindsor@gmail.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
As found in grsecurity, this avoids exposing a kernel pointer through
the cgroup debug entries.
Signed-off-by: Kees Cook <keescook@chromium.org>
Signed-off-by: Tejun Heo <tj@kernel.org>
pids_can_fork() is special in that the css association is guaranteed
to be stable throughout the function and thus doesn't need RCU
protection around task_css access. When determining the css to charge
the pid, task_css_check() is used to override the RCU sanity check.
While adding a warning message on fork rejection from pids limit,
135b8b37bd ("cgroup: Add pids controller event when fork fails
because of pid limit") incorrectly added a task_css access which is
neither RCU protected or explicitly annotated. This triggers the
following suspicious RCU usage warning when RCU debugging is enabled.
cgroup: fork rejected by pids controller in
===============================
[ ERR: suspicious RCU usage. ]
4.10.0-work+ #1 Not tainted
-------------------------------
./include/linux/cgroup.h:435 suspicious rcu_dereference_check() usage!
other info that might help us debug this:
rcu_scheduler_active = 2, debug_locks = 0
1 lock held by bash/1748:
#0: (&cgroup_threadgroup_rwsem){+++++.}, at: [<ffffffff81052c96>] _do_fork+0xe6/0x6e0
stack backtrace:
CPU: 3 PID: 1748 Comm: bash Not tainted 4.10.0-work+ #1
Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.9.3-1.fc25 04/01/2014
Call Trace:
dump_stack+0x68/0x93
lockdep_rcu_suspicious+0xd7/0x110
pids_can_fork+0x1c7/0x1d0
cgroup_can_fork+0x67/0xc0
copy_process.part.58+0x1709/0x1e90
_do_fork+0xe6/0x6e0
SyS_clone+0x19/0x20
do_syscall_64+0x5c/0x140
entry_SYSCALL64_slow_path+0x25/0x25
RIP: 0033:0x7f7853fab93a
RSP: 002b:00007ffc12d05c90 EFLAGS: 00000246 ORIG_RAX: 0000000000000038
RAX: ffffffffffffffda RBX: 0000000000000000 RCX: 00007f7853fab93a
RDX: 0000000000000000 RSI: 0000000000000000 RDI: 0000000001200011
RBP: 00007ffc12d05cc0 R08: 0000000000000000 R09: 00007f78548db700
R10: 00007f78548db9d0 R11: 0000000000000246 R12: 00000000000006d4
R13: 0000000000000001 R14: 0000000000000000 R15: 000055e3ebe2c04d
/asdf
There's no reason to dereference task_css again here when the
associated css is already available. Fix it by replacing the
task_cgroup() call with css->cgroup.
Signed-off-by: Tejun Heo <tj@kernel.org>
Reported-by: Mike Galbraith <efault@gmx.de>
Fixes: 135b8b37bd ("cgroup: Add pids controller event when fork fails because of pid limit")
Cc: Kenny Yu <kennyyu@fb.com>
Cc: stable@vger.kernel.org # v4.8+
Signed-off-by: Tejun Heo <tj@kernel.org>
refcount_t type and corresponding API should be
used instead of atomic_t when the variable is used as
a reference counter. This allows to avoid accidental
refcounter overflows that might lead to use-after-free
situations.
Signed-off-by: Elena Reshetova <elena.reshetova@intel.com>
Signed-off-by: Hans Liljestrand <ishkamiel@gmail.com>
Signed-off-by: Kees Cook <keescook@chromium.org>
Signed-off-by: David Windsor <dwindsor@gmail.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
It's not used by any of the scheduler methods, but <linux/sched/task_stack.h>
needs it to pick up STACK_END_MAGIC.
Acked-by: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Mike Galbraith <efault@gmx.de>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
The task_lock()/task_unlock() APIs are not realated to core scheduling,
they are task lifetime APIs, i.e. they belong into <linux/sched/task.h>.
Move them.
Acked-by: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Mike Galbraith <efault@gmx.de>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: linux-kernel@vger.kernel.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
task_struct::signal and task_struct::sighand are pointers, which would normally make it
straightforward to not define those types in sched.h.
That is not so, because the types are accompanied by a myriad of APIs (macros and inline
functions) that dereference them.
Split the types and the APIs out of sched.h and move them into a new header, <linux/sched/signal.h>.
With this change sched.h does not know about 'struct signal' and 'struct sighand' anymore,
trying to put accessors into sched.h as a test fails the following way:
./include/linux/sched.h: In function ‘test_signal_types’:
./include/linux/sched.h:2461:18: error: dereferencing pointer to incomplete type ‘struct signal_struct’
^
This reduces the size and complexity of sched.h significantly.
Update all headers and .c code that relied on getting the signal handling
functionality from <linux/sched.h> to include <linux/sched/signal.h>.
The list of affected files in the preparatory patch was partly generated by
grepping for the APIs, and partly by doing coverage build testing, both
all[yes|mod|def|no]config builds on 64-bit and 32-bit x86, and an array of
cross-architecture builds.
Nevertheless some (trivial) build breakage is still expected related to rare
Kconfig combinations and in-flight patches to various kernel code, but most
of it should be handled by this patch.
Acked-by: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Mike Galbraith <efault@gmx.de>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: linux-kernel@vger.kernel.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
But first update the code that uses these facilities with the
new header.
Acked-by: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Mike Galbraith <efault@gmx.de>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: linux-kernel@vger.kernel.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
We are going to split <linux/sched/task.h> out of <linux/sched.h>, which
will have to be picked up from other headers and a couple of .c files.
Create a trivial placeholder <linux/sched/task.h> file that just
maps to <linux/sched.h> to make this patch obviously correct and
bisectable.
Include the new header in the files that are going to need it.
Acked-by: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Mike Galbraith <efault@gmx.de>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: linux-kernel@vger.kernel.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
We are going to split <linux/sched/mm.h> out of <linux/sched.h>, which
will have to be picked up from other headers and a couple of .c files.
Create a trivial placeholder <linux/sched/mm.h> file that just
maps to <linux/sched.h> to make this patch obviously correct and
bisectable.
The APIs that are going to be moved first are:
mm_alloc()
__mmdrop()
mmdrop()
mmdrop_async_fn()
mmdrop_async()
mmget_not_zero()
mmput()
mmput_async()
get_task_mm()
mm_access()
mm_release()
Include the new header in the files that are going to need it.
Acked-by: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Mike Galbraith <efault@gmx.de>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: linux-kernel@vger.kernel.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
threadgroup_change_begin()/end() is a pointless wrapper around
cgroup_threadgroup_change_begin()/end(), minus a might_sleep()
in the !CONFIG_CGROUPS=y case.
Remove the wrappery, move the might_sleep() (the down_read()
already has a might_sleep() check).
This debloats <linux/sched.h> a bit and simplifies this API.
Update all call sites.
No change in functionality.
Acked-by: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Mike Galbraith <efault@gmx.de>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: linux-kernel@vger.kernel.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Pull cgroup updates from Tejun Heo:
"Several noteworthy changes.
- Parav's rdma controller is finally merged. It is very straight
forward and can limit the abosolute numbers of common rdma
constructs used by different cgroups.
- kernel/cgroup.c got too chubby and disorganized. Created
kernel/cgroup/ subdirectory and moved all cgroup related files
under kernel/ there and reorganized the core code. This hurts for
backporting patches but was long overdue.
- cgroup v2 process listing reimplemented so that it no longer
depends on allocating a buffer large enough to cache the entire
result to sort and uniq the output. v2 has always mangled the sort
order to ensure that users don't depend on the sorted output, so
this shouldn't surprise anybody. This makes the pid listing
functions use the same iterators that are used internally, which
have to have the same iterating capabilities anyway.
- perf cgroup filtering now works automatically on cgroup v2. This
patch was posted a long time ago but somehow fell through the
cracks.
- misc fixes asnd documentation updates"
* 'for-4.11' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/cgroup: (27 commits)
kernfs: fix locking around kernfs_ops->release() callback
cgroup: drop the matching uid requirement on migration for cgroup v2
cgroup, perf_event: make perf_event controller work on cgroup2 hierarchy
cgroup: misc cleanups
cgroup: call subsys->*attach() only for subsystems which are actually affected by migration
cgroup: track migration context in cgroup_mgctx
cgroup: cosmetic update to cgroup_taskset_add()
rdmacg: Fixed uninitialized current resource usage
cgroup: Add missing cgroup-v2 PID controller documentation.
rdmacg: Added documentation for rdmacg
IB/core: added support to use rdma cgroup controller
rdmacg: Added rdma cgroup controller
cgroup: fix a comment typo
cgroup: fix RCU related sparse warnings
cgroup: move namespace code to kernel/cgroup/namespace.c
cgroup: rename functions for consistency
cgroup: move v1 mount functions to kernel/cgroup/cgroup-v1.c
cgroup: separate out cgroup1_kf_syscall_ops
cgroup: refactor mount path and clearly distinguish v1 and v2 paths
cgroup: move cgroup v1 specific code to kernel/cgroup/cgroup-v1.c
...
Merge in to resolve conflicts in Documentation/cgroup-v2.txt. The
conflicts are from multiple section additions and trivial to resolve.
Signed-off-by: Tejun Heo <tj@kernel.org>
Along with the write access to the cgroup.procs or tasks file, cgroup
has required the writer's euid, unless root, to match [s]uid of the
target process or task. On cgroup v1, this is necessary because
there's nothing preventing a delegatee from pulling in tasks or
processes from all over the system.
If a user has a cgroup subdirectory delegated to it, the user would
have write access to the cgroup.procs or tasks file. If there are no
further checks than file write access check, the user would be able to
pull processes from all over the system into its subhierarchy which is
clearly not the intended behavior. The matching [s]uid requirement
partially prevents this problem by allowing a delegatee to pull in the
processes that belongs to it. This isn't a sufficient protection
however, because a user would still be able to jump processes across
two disjoint sub-hierarchies that has been delegated to them.
cgroup v2 resolves the issue by requiring the writer to have access to
the common ancestor of the cgroup.procs file of the source and target
cgroups. This confines each delegatee to their own sub-hierarchy
proper and bases all permission decisions on the cgroup filesystem
rather than having to pull in explicit uid matching.
cgroup v2 has still been applying the matching [s]uid requirement just
for historical reasons. On cgroup2, the requirement doesn't serve any
purpose while unnecessarily complicating the permission model. Let's
drop it.
Signed-off-by: Tejun Heo <tj@kernel.org>
* cgrp_dfl_implicit_ss_mask is ulong instead of u16 unlike other
ss_masks. Make it a u16.
* Move have_canfork_callback together with other callback ss_masks.
Signed-off-by: Tejun Heo <tj@kernel.org>
Currently, subsys->*attach() callbacks are called for all subsystems
which are attached to the hierarchy on which the migration is taking
place.
With cgroup_migrate_prepare_dst() filtering out identity migrations,
v1 hierarchies can avoid spurious ->*attach() callback invocations
where the source and destination csses are identical; however, this
isn't enough on v2 as only a subset of the attached controllers can be
affected on controller enable/disable.
While spurious ->*attach() invocations aren't critically broken,
they're unnecessary overhead and can lead to temporary overcharges on
certain controllers. Fix it by tracking which subsystems are affected
by a migration and invoking ->*attach() callbacks only on those
subsystems.
Signed-off-by: Tejun Heo <tj@kernel.org>
Acked-by: Zefan Li <lizefan@huawei.com>
cgroup migration is performed in four steps - css_set preloading,
addition of target tasks, actual migration, and clean up. A list
named preloaded_csets is used to track the preloading. This is a bit
too restricted and the code is already depending on the subtlety that
all source css_sets appear before destination ones.
Let's create struct cgroup_mgctx which keeps track of everything
during migration. Currently, it has separate preload lists for source
and destination csets and also embeds cgroup_taskset which is used
during the actual migration. This moves struct cgroup_taskset
definition to cgroup-internal.h.
This patch doesn't cause any functional changes.
Signed-off-by: Tejun Heo <tj@kernel.org>
Acked-by: Zefan Li <lizefan@huawei.com>
cgroup_taskset_add() was using list_add_tail() when for source csets
but list_move_tail() for destination. As the operations are gated by
list_empty() test, list_move_tail() is equivalent to list_add_tail()
here. Use list_add_tail() too for destination csets too.
This doesn't cause any functional changes.
Signed-off-by: Tejun Heo <tj@kernel.org>
Acked-by: Zefan Li <lizefan@huawei.com>
Fixed warning reported by kbuild test robot.
When reading current resource usage value, when no resources are
allocated, its possible that it can report a uninitialized value
for current resource usage.
This fix avoids it by initializing it to zero as no resource is
allocated.
Signed-off-by: Parav Pandit <pandit.parav@gmail.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
Added rdma cgroup controller that does accounting, limit enforcement
on rdma/IB resources.
Added rdma cgroup header file which defines its APIs to perform
charging/uncharging functionality. It also defined APIs for RDMA/IB
stack for device registration. Devices which are registered will
participate in controller functions of accounting and limit
enforcements. It define rdmacg_device structure to bind IB stack
and RDMA cgroup controller.
RDMA resources are tracked using resource pool. Resource pool is per
device, per cgroup entity which allows setting up accounting limits
on per device basis.
Currently resources are defined by the RDMA cgroup.
Resource pool is created/destroyed dynamically whenever
charging/uncharging occurs respectively and whenever user
configuration is done. Its a tradeoff of memory vs little more code
space that creates resource pool object whenever necessary, instead of
creating them during cgroup creation and device registration time.
Signed-off-by: Parav Pandit <pandit.parav@gmail.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
kn->priv which is a void * is used as a RCU pointer by cgroup. When
dereferencing it, it was passing kn->priv to rcu_derefreence() without
casting it into a RCU pointer triggering address space mismatch
warning from sparse. Fix them.
Signed-off-by: Tejun Heo <tj@kernel.org>
Reported-by: Fengguang Wu <fengguang.wu@intel.com>
Acked-by: Acked-by: Zefan Li <lizefan@huawei.com>
get/put_css_set() get exposed in cgroup-internal.h in the process.
Signed-off-by: Tejun Heo <tj@kernel.org>
Acked-by: Acked-by: Zefan Li <lizefan@huawei.com>
Now that the v1 mount code is split into separate functions, move them
to kernel/cgroup/cgroup-v1.c along with the mount option handling
code. As this puts all v1-only kernfs_syscall_ops in cgroup-v1.c,
move cgroup1_kf_syscall_ops to cgroup-v1.c too.
Signed-off-by: Tejun Heo <tj@kernel.org>
Acked-by: Acked-by: Zefan Li <lizefan@huawei.com>
Currently, cgroup_kf_syscall_ops is shared by v1 and v2 and the
specific methods test the version and take different actions. Split
out v1 functions and put them in cgroup1_kf_syscall_ops and remove the
now unnecessary explicit branches in specific methods.
Signed-off-by: Tejun Heo <tj@kernel.org>
Acked-by: Acked-by: Zefan Li <lizefan@huawei.com>
While sharing some mechanisms, the mount paths of v1 and v2 are
substantially different. Their implementations were mixed in
cgroup_mount(). This patch splits them out so that they're easier to
follow and organize.
This patch causes one functional change - the WARN_ON(new_sb) gets
lost. This is because the actual mounting gets moved to
cgroup_do_mount() and thus @new_sb is no longer accessible by default
to cgroup1_mount(). While we can add it as an explicit out parameter
to cgroup_do_mount(), this part of code hasn't changed and the warning
hasn't triggered for quite a while. Dropping it should be fine.
Signed-off-by: Tejun Heo <tj@kernel.org>
Acked-by: Acked-by: Zefan Li <lizefan@huawei.com>
cgroup.c is getting too unwieldy. Let's move out cgroup v1 specific
code along with the debug controller into kernel/cgroup/cgroup-v1.c.
v2: cgroup_mutex and css_set_lock made available in cgroup-internal.h
regardless of CONFIG_PROVE_RCU.
Signed-off-by: Tejun Heo <tj@kernel.org>
Acked-by: Acked-by: Zefan Li <lizefan@huawei.com>
They're growing to be too many and planned to get split further. Move
them under their own directory.
kernel/cgroup.c -> kernel/cgroup/cgroup.c
kernel/cgroup_freezer.c -> kernel/cgroup/freezer.c
kernel/cgroup_pids.c -> kernel/cgroup/pids.c
kernel/cpuset.c -> kernel/cgroup/cpuset.c
Signed-off-by: Tejun Heo <tj@kernel.org>
Acked-by: Acked-by: Zefan Li <lizefan@huawei.com>