Cgroup unified hierarchy

April, 2014		Tejun Heo <tj@kernel.org>

This document describes the changes made by unified hierarchy and
their rationales. It will eventually be merged into the main cgroup
documentation.

CONTENTS

1. Background
2. Basic Operation
  2-1. Mounting
  2-2. cgroup.subtree_control
  2-3. cgroup.controllers
3. Structural Constraints
  3-1. Top-down
  3-2. No internal tasks
4. Other Changes
  4-1. [Un]populated Notification
  4-2. Other Core Changes
  4-3. Per-Controller Changes
    4-3-1. blkio
    4-3-2. cpuset
    4-3-3. memory
5. Planned Changes
  5-1. CAP for resource control


1. Background

cgroup allows an arbitrary number of hierarchies and each hierarchy
can host any number of controllers. While this seems to provide a
high level of flexibility, it isn't quite useful in practice.

For example, as there is only one instance of each controller, utility
type controllers such as freezer which can be useful in all
hierarchies can only be used in one. The issue is exacerbated by the
fact that controllers can't be moved around once hierarchies are
populated. Another issue is that all controllers bound to a hierarchy
are forced to have exactly the same view of the hierarchy. It isn't
possible to vary the granularity depending on the specific controller.

In practice, these issues heavily limit which controllers can be put
on the same hierarchy and most configurations resort to putting each
controller on its own hierarchy. Only closely related ones, such as
the cpu and cpuacct controllers, make sense to put on the same
hierarchy. This often means that userland ends up managing multiple
similar hierarchies, repeating the same steps on each hierarchy
whenever a hierarchy management operation is necessary.

Unfortunately, support for multiple hierarchies comes at a steep cost.
Internal implementation in cgroup core proper is dazzlingly
complicated but more importantly the support for multiple hierarchies
restricts how cgroup is used in general and what controllers can do.

There's no limit on how many hierarchies there may be, which means
that a task's cgroup membership can't be described in finite length.
The key may contain any number of entries and is unlimited in length,
which makes it highly awkward to handle and leads to the addition of
controllers which exist only to identify membership, which in turn
exacerbates the original problem.

Also, as a controller can't have any expectation regarding what shape
of hierarchies other controllers would be on, each controller has to
assume that all other controllers are operating on completely
orthogonal hierarchies. This makes it impossible, or at least very
cumbersome, for controllers to cooperate with each other.

In most use cases, putting controllers on hierarchies which are
completely orthogonal to each other isn't necessary. What usually is
called for is the ability to have differing levels of granularity
depending on the specific controller. In other words, hierarchy may
be collapsed from leaf towards root when viewed from specific
controllers. For example, a given configuration might not care about
how memory is distributed beyond a certain level while still wanting
to control how CPU cycles are distributed.

Unified hierarchy is the next version of cgroup interface. It aims to
address the aforementioned issues by having more structure while
retaining enough flexibility for most use cases. Various other
general and controller-specific interface issues are also addressed in
the process.


2. Basic Operation

2-1. Mounting

Currently, unified hierarchy can be mounted with the following mount
command. Note that this is still under development and scheduled to
change soon.

 mount -t cgroup -o __DEVEL__sane_behavior cgroup $MOUNT_POINT

All controllers which support the unified hierarchy and are not bound
to other hierarchies are automatically bound to the unified hierarchy
and show up at the root of it. Controllers which are enabled only in
the root of the unified hierarchy can be bound to other hierarchies.
This allows mixing unified hierarchy with the traditional multiple
hierarchies in a fully backward compatible way.

For development purposes, the following boot parameter makes all
controllers appear on the unified hierarchy whether supported or not.

 cgroup__DEVEL__legacy_files_on_dfl

A controller can be moved across hierarchies only after the controller
is no longer referenced in its current hierarchy. Because per-cgroup
controller states are destroyed asynchronously and controllers may
have lingering references, a controller may not show up immediately on
the unified hierarchy after the final umount of the previous
hierarchy. Similarly, a controller should be fully disabled to be
moved out of the unified hierarchy and it may take some time for the
disabled controller to become available for other hierarchies;
furthermore, due to dependencies among controllers, other controllers
may need to be disabled too.

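For development and manual configuration, a possible sequence for
rebinding a controller is sketched below; the legacy mount point, the
freezer controller and the $MOUNT_POINT location of the unified
hierarchy are illustrative only, and, as described above, the
controller may take a while to reappear after the final umount.

 # release the controller from its legacy hierarchy
 umount /sys/fs/cgroup/freezer

 # once the asynchronous cleanup has finished, the controller shows up
 # in the unified hierarchy's root "cgroup.controllers" file
 cat $MOUNT_POINT/cgroup.controllers
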
While useful for development and manual configurations, dynamically
moving controllers between the unified and other hierarchies is
strongly discouraged for production use. It is recommended to decide
the hierarchies and controller associations before starting to use the
controllers.


2-2. cgroup.subtree_control

All cgroups on unified hierarchy have a "cgroup.subtree_control" file
which governs which controllers are enabled on the children of the
cgroup. Let's assume a hierarchy like the following.

  root - A - B - C
               \ D

root's "cgroup.subtree_control" file determines which controllers are
enabled on A. A's on B. B's on C and D. This coincides with the
fact that controllers on the immediate sub-level are used to
distribute the resources of the parent. In fact, it's natural to
assume that resource control knobs of a child belong to its parent.
Enabling a controller in a "cgroup.subtree_control" file declares that
distribution of the respective resources of the cgroup will be
controlled. Note that this means that controller enable states are
shared among siblings.

When read, the file contains a space-separated list of currently
enabled controllers. A write to the file should contain a
space-separated list of controllers with '+' or '-' prefixed (without
the quotes). Controllers prefixed with '+' are enabled and '-'
disabled. If a controller is listed multiple times, the last entry
wins. The specified operations are executed atomically - either all
succeed or all fail.

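For example, assuming the unified hierarchy is mounted at $MOUNT_POINT
and a child cgroup A exists (both hypothetical), the following single
write enables the memory controller and disables the cpu controller
for A's children; reading the file back would then list "memory" among
the enabled controllers.

 echo "+memory -cpu" > $MOUNT_POINT/A/cgroup.subtree_control
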
2-3. cgroup.controllers

Read-only "cgroup.controllers" file contains a space-separated list of
controllers which can be enabled in the cgroup's
"cgroup.subtree_control" file.

In the root cgroup, this lists controllers which are not bound to
other hierarchies and the content changes as controllers are bound to
and unbound from other hierarchies.

In non-root cgroups, the content of this file equals that of the
parent's "cgroup.subtree_control" file as only controllers enabled
from the parent can be used in its children.

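To illustrate the relationship (the mount point and the cgroup name A
are hypothetical, and only the memory controller is assumed to be
enabled in root's "cgroup.subtree_control"), the child's
"cgroup.controllers" would mirror the parent's setting:

 cat $MOUNT_POINT/cgroup.subtree_control
 memory
 cat $MOUNT_POINT/A/cgroup.controllers
 memory

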
3. Structural Constraints

3-1. Top-down

As it doesn't make sense to nest control of an uncontrolled resource,
all non-root "cgroup.subtree_control" files can only contain
controllers which are enabled in the parent's "cgroup.subtree_control"
file. A controller can be enabled only if the parent has the
controller enabled and a controller can't be disabled if one or more
children have it enabled.

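A sketch of the resulting ordering, using hypothetical cgroups A and
A/B under $MOUNT_POINT and assuming the memory controller is already
enabled in root's "cgroup.subtree_control": enabling has to proceed
top-down and disabling bottom-up.

 echo +memory > $MOUNT_POINT/A/cgroup.subtree_control
 echo +memory > $MOUNT_POINT/A/B/cgroup.subtree_control

 # disabling must happen in the reverse order
 echo -memory > $MOUNT_POINT/A/B/cgroup.subtree_control
 echo -memory > $MOUNT_POINT/A/cgroup.subtree_control

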
3-2. No internal tasks

One long-standing issue that cgroup faces is the competition between
tasks belonging to the parent cgroup and its children cgroups. This
is inherently nasty as two different types of entities compete and
there is no agreed-upon obvious way to handle it. Different
controllers are doing different things.

The cpu controller considers tasks and cgroups as equivalents and maps
nice levels to cgroup weights. This works for some cases but falls
flat when children should be allocated specific ratios of CPU cycles
and the number of internal tasks fluctuates - the ratios constantly
change as the number of competing entities fluctuates. There also are
other issues. The mapping from nice level to weight isn't obvious or
universal, and there are various other knobs which simply aren't
available for tasks.

The blkio controller implicitly creates a hidden leaf node for each
cgroup to host the tasks. The hidden leaf has its own copies of all
the knobs with "leaf_" prefixed. While this allows equivalent control
over internal tasks, it comes with serious drawbacks. It always adds
an extra layer of nesting which may not be necessary, makes the
interface messy and significantly complicates the implementation.

The memory controller currently doesn't have a way to control what
happens between internal tasks and child cgroups and the behavior is
not clearly defined. There have been attempts to add ad-hoc behaviors
and knobs to tailor the behavior to specific workloads. Continuing
this direction will lead to problems which will be extremely difficult
to resolve in the long term.

Multiple controllers struggle with internal tasks and have come up
with different ways to deal with it; unfortunately, all the approaches
in use now are severely flawed and, furthermore, the widely different
behaviors make cgroup as a whole highly inconsistent.

It is clear that this is something which needs to be addressed from
cgroup core proper in a uniform way so that controllers don't need to
worry about it and cgroup as a whole shows a consistent and logical
behavior. To achieve that, unified hierarchy enforces the following
structural constraint:

  Except for the root, only cgroups which don't contain any task may
  have controllers enabled in their "cgroup.subtree_control" files.

Combined with other properties, this guarantees that, when a
controller is looking at the part of the hierarchy which has it
enabled, tasks are always only on the leaves. This rules out
situations where child cgroups compete against internal tasks of the
parent.

There are two things to note. Firstly, the root cgroup is exempt from
the restriction. Root contains tasks and anonymous resource
consumption which can't be associated with any other cgroup and
requires special treatment from most controllers. How resource
consumption in the root cgroup is governed is up to each controller.

Secondly, the restriction doesn't take effect if there is no enabled
controller in the cgroup's "cgroup.subtree_control" file. This is
important as otherwise it wouldn't be possible to create children of a
populated cgroup. To control resource distribution of a cgroup, the
cgroup must create children and transfer all its tasks to the children
before enabling controllers in its "cgroup.subtree_control" file.

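A minimal sketch of that sequence, assuming the unified hierarchy is
mounted at $MOUNT_POINT, a populated hypothetical cgroup A with the
memory controller available to it, and ignoring races with new
processes entering A:

 # create a leaf child and move every process currently in A into it
 mkdir $MOUNT_POINT/A/leaf
 for pid in $(cat $MOUNT_POINT/A/cgroup.procs); do
         echo $pid > $MOUNT_POINT/A/leaf/cgroup.procs
 done

 # A no longer contains tasks, so controllers can now be enabled for
 # its children
 echo +memory > $MOUNT_POINT/A/cgroup.subtree_control

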
4. Other Changes

4-1. [Un]populated Notification

cgroup users often need a way to determine when a cgroup's
subhierarchy becomes empty so that it can be cleaned up. cgroup
currently provides release_agent for it; unfortunately, this mechanism
is riddled with issues.

- It delivers events by forking and execing a userland binary
  specified as the release_agent. This is a long deprecated method of
  notification delivery. It's extremely heavy, slow and cumbersome to
  integrate with larger infrastructure.

- There is a single monitoring point at the root. There's no way to
  delegate management of a subtree.

- The event isn't recursive. It triggers when a cgroup doesn't have
  any tasks or child cgroups. Events for internal nodes trigger only
  after all children are removed. This again makes it impossible to
  delegate management of a subtree.

- Events are filtered from the kernel side. A "notify_on_release"
  file is used to subscribe to or suppress release events. This is
  unnecessarily complicated and probably done this way because event
  delivery itself was expensive.

Unified hierarchy implements an interface file "cgroup.populated"
which can be used to monitor whether the cgroup's subhierarchy has
tasks in it or not. Its value is 0 if there is no task in the cgroup
and its descendants; otherwise, 1. poll and [id]notify events are
triggered when the value changes.

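As a sketch (the mount point and cgroup name A are hypothetical), a
manager that has been delegated a subtree can wait for it to become
empty with ordinary file-notification tools, for example inotifywait
from inotify-tools:

 # block until A's "cgroup.populated" changes, then read the new state
 inotifywait -e modify $MOUNT_POINT/A/cgroup.populated
 cat $MOUNT_POINT/A/cgroup.populated
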
This is significantly lighter and simpler and trivially allows
delegating management of a subhierarchy - a monitor inside a
subhierarchy can block further propagation simply by putting itself or
another process in the subhierarchy and watching for the events it's
interested in from there, without interfering with monitoring higher
in the tree.

In unified hierarchy, the release_agent mechanism is no longer
supported and the interface files "release_agent" and
"notify_on_release" do not exist.


4-2. Other Core Changes

- None of the mount options is allowed.

- remount is disallowed.

- rename(2) is disallowed.

- The "tasks" file is removed. Everything should be at process
  granularity. Use the "cgroup.procs" file instead; see the example
  after this list.

- The "cgroup.procs" file is not sorted. pids will be unique unless
  they got recycled in-between reads.

- The "cgroup.clone_children" file is removed.

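For instance, a process (together with all of its threads) is migrated
by writing its PID to the destination cgroup's "cgroup.procs"; the
cgroup A below is hypothetical.

 # move the current shell into cgroup A
 echo $$ > $MOUNT_POINT/A/cgroup.procs

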
4-3. Per-Controller Changes

4-3-1. blkio

- blk-throttle becomes properly hierarchical.


4-3-2. cpuset

- Tasks are kept in empty cpusets after hotplug and take on the masks
  of the nearest non-empty ancestor, instead of being moved to it.

- A task can be moved into an empty cpuset, and again it takes on the
  masks of the nearest non-empty ancestor.


4-3-3. memory

- use_hierarchy is on by default and the cgroup file for the flag is
  not created.

- The original lower boundary, the soft limit, is defined as a limit
  that is per default unset. As a result, the set of cgroups that
  global reclaim prefers is opt-in, rather than opt-out. The costs
  for optimizing these mostly negative lookups are so high that the
  implementation, despite its enormous size, does not even provide the
  basic desirable behavior. First off, the soft limit has no
  hierarchical meaning. All configured groups are organized in a
  global rbtree and treated like equal peers, regardless of where they
  are located in the hierarchy. This makes subtree delegation
  impossible. Second, the soft limit reclaim pass is so aggressive
  that it not just introduces high allocation latencies into the
  system, but also impacts system performance due to overreclaim, to
  the point where the feature becomes self-defeating.

  The memory.low boundary on the other hand is a top-down allocated
  reserve. A cgroup enjoys reclaim protection when it and all its
  ancestors are below their low boundaries, which makes delegation of
  subtrees possible. Secondly, new cgroups have no reserve per
  default and in the common case most cgroups are eligible for the
  preferred reclaim pass. This allows the new low boundary to be
  efficiently implemented with just a minor addition to the generic
  reclaim code, without the need for out-of-band data structures and
  reclaim passes. Because the generic reclaim code considers all
  cgroups except for the ones running low in the preferred first
  reclaim pass, overreclaim of individual groups is eliminated as
  well, resulting in much better overall workload performance.

- The original high boundary, the hard limit, is defined as a strict
  limit that can not budge, even if the OOM killer has to be called.
  But this generally goes against the goal of making the most out of
  the available memory. The memory consumption of workloads varies
  during runtime, and that requires users to overcommit. But doing
  that with a strict upper limit requires either a fairly accurate
  prediction of the working set size or adding slack to the limit.
  Since working set size estimation is hard and error prone, and
  getting it wrong results in OOM kills, most users tend to err on the
  side of a looser limit and end up wasting precious resources.

  The memory.high boundary on the other hand can be set much more
  conservatively. When hit, it throttles allocations by forcing them
  into direct reclaim to work off the excess, but it never invokes the
  OOM killer. As a result, a high boundary that is chosen too
  aggressively will not terminate the processes, but instead it will
  lead to gradual performance degradation. The user can monitor this
  and make corrections until the minimal memory footprint that still
  gives acceptable performance is found.

  In extreme cases, with many concurrent allocations and a complete
  breakdown of reclaim progress within the group, the high boundary
  can be exceeded. But even then it's mostly better to satisfy the
  allocation from the slack available in other groups or the rest of
  the system than killing the group. Otherwise, memory.max is there
  to limit this type of spillover and ultimately contain buggy or even
  malicious applications.

- The original control file names are unwieldy and inconsistent in
  many different ways. For example, the upper boundary hit count is
  exported in the memory.failcnt file, but an OOM event count has to
  be manually counted by listening to memory.oom_control events, and
  lower boundary / soft limit events have to be counted by first
  setting a threshold for that value and then counting those events.
  Also, usage and limit files encode their units in the filename.
  That makes the filenames very long, even though this is not
  information that a user needs to be reminded of every time they type
  out those names.

  To address these naming issues, as well as to signal clearly that
  the new interface carries a new configuration model, the naming
  conventions in it necessarily differ from the old interface.

- The original limit files indicate the state of an unset limit with a
  Very High Number, and a configured limit can be unset by echoing -1
  into those files. But that very high number is implementation and
  architecture dependent and not very descriptive. And while -1 can
  be understood as an underflow into the highest possible value, -2 or
  -10M etc. do not work, so it's not consistent.

  memory.low, memory.high, and memory.max will use the string "max" to
  indicate and set the highest possible value.

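As a sketch of how these knobs fit together (the cgroup A and the
values are hypothetical), a conservative memory.high is combined with
a memory.max cap, and writing the string "max" returns a knob to the
no-limit state:

 echo 512M > $MOUNT_POINT/A/memory.high
 echo 1G > $MOUNT_POINT/A/memory.max

 # "max" sets the highest possible value, i.e. it removes the limit
 echo max > $MOUNT_POINT/A/memory.max

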
5. Planned Changes

5-1. CAP for resource control

Unified hierarchy will require one of the capabilities(7), which is
yet to be decided, for all resource control related knobs. Process
organization operations - creation of sub-cgroups and migration of
processes in sub-hierarchies - may be delegated by changing the
ownership and/or permissions on the cgroup directory and
"cgroup.procs" interface file; however, all operations which affect
resource control - writes to a "cgroup.subtree_control" file or any
controller-specific knobs - will require an explicit CAP privilege.

This, in part, is to prevent the cgroup interface from being
inadvertently promoted to a programmable API used by non-privileged
binaries. cgroup exposes various aspects of the system in ways which
aren't properly abstracted for direct consumption by regular programs.
This is an administration interface much closer to sysctl knobs than
system calls. Even the basic access model, being filesystem path
based, isn't suitable for direct consumption. There's no way to
access "my cgroup" in a race-free way or make multiple operations
atomic against migration to another cgroup.

Another aspect is that, for better or for worse, the cgroup interface
goes through far less scrutiny than regular interfaces for
unprivileged userland. The upside is that cgroup is able to expose
useful features which may not be suitable for general consumption in a
reasonable time frame. It provides a relatively short path between
internal details and userland-visible interface. Of course, this
shortcut comes with high risk. We go through what we go through for
general kernel APIs for good reasons. It may end up leaking internal
details in a way which can exert significant pain by locking the
kernel into a contract that can't be maintained in a reasonable
manner.

Also, due to its specific nature, cgroup and its controllers don't
tend to attract attention from a wide scope of developers. cgroup's
short history is already fraught with severely mis-designed
interfaces, unnecessary commitments to and exposing of internal
details, and broken and dangerous implementations of various features.

Keeping cgroup as an administration interface is both advantageous for
its role and imperative given its nature. Some of the cgroup features
may make sense for unprivileged access. If deemed justified, those
must be further abstracted and implemented as a different interface,
be it a system call or process-private filesystem, and survive through
the scrutiny that any interface for general consumption is required to
go through.

Requiring CAP is not a complete solution but should serve as a
significant deterrent against spraying cgroup usages in non-privileged
programs.