cgroup: replace unified-hierarchy.txt with a proper cgroup v2 documentation
Now that cgroup v2 is almost out of the door, replace the development documentation unified-hierarchy.txt with Documentation/cgroup.txt which is a superset of unified-hierarchy.txt and authoritatively describes all userland-visible aspects of cgroup. v2: Updated to include all information from blkio-controller.txt and list filesystems which support cgroup writeback as suggested by Vivek. Signed-off-by: Tejun Heo <tj@kernel.org> Acked-by: Li Zefan <lizefan@huawei.com> Cc: Vivek Goyal <vgoyal@redhat.com>
This commit is contained in:
parent
0d94276645
commit
6c2920926b
|
@ -374,82 +374,3 @@ One can experience an overall throughput drop if you have created multiple
|
|||
groups and put applications in that group which are not driving enough
|
||||
IO to keep disk busy. In that case set group_idle=0, and CFQ will not idle
|
||||
on individual groups and throughput should improve.
|
||||
|
||||
Writeback
|
||||
=========
|
||||
|
||||
Page cache is dirtied through buffered writes and shared mmaps and
|
||||
written asynchronously to the backing filesystem by the writeback
|
||||
mechanism. Writeback sits between the memory and IO domains and
|
||||
regulates the proportion of dirty memory by balancing dirtying and
|
||||
write IOs.
|
||||
|
||||
On traditional cgroup hierarchies, relationships between different
|
||||
controllers cannot be established making it impossible for writeback
|
||||
to operate accounting for cgroup resource restrictions and all
|
||||
writeback IOs are attributed to the root cgroup.
|
||||
|
||||
If both the blkio and memory controllers are used on the v2 hierarchy
|
||||
and the filesystem supports cgroup writeback, writeback operations
|
||||
correctly follow the resource restrictions imposed by both memory and
|
||||
blkio controllers.
|
||||
|
||||
Writeback examines both system-wide and per-cgroup dirty memory status
|
||||
and enforces the more restrictive of the two. Also, writeback control
|
||||
parameters which are absolute values - vm.dirty_bytes and
|
||||
vm.dirty_background_bytes - are distributed across cgroups according
|
||||
to their current writeback bandwidth.
|
||||
|
||||
There's a peculiarity stemming from the discrepancy in ownership
|
||||
granularity between memory controller and writeback. While memory
|
||||
controller tracks ownership per page, writeback operates on inode
|
||||
basis. cgroup writeback bridges the gap by tracking ownership by
|
||||
inode but migrating ownership if too many foreign pages, pages which
|
||||
don't match the current inode ownership, have been encountered while
|
||||
writing back the inode.
|
||||
|
||||
This is a conscious design choice as writeback operations are
|
||||
inherently tied to inodes making strictly following page ownership
|
||||
complicated and inefficient. The only use case which suffers from
|
||||
this compromise is multiple cgroups concurrently dirtying disjoint
|
||||
regions of the same inode, which is an unlikely use case and decided
|
||||
to be unsupported. Note that as memory controller assigns page
|
||||
ownership on the first use and doesn't update it until the page is
|
||||
released, even if cgroup writeback strictly follows page ownership,
|
||||
multiple cgroups dirtying overlapping areas wouldn't work as expected.
|
||||
In general, write-sharing an inode across multiple cgroups is not well
|
||||
supported.
|
||||
|
||||
Filesystem support for cgroup writeback
|
||||
---------------------------------------
|
||||
|
||||
A filesystem can make writeback IOs cgroup-aware by updating
|
||||
address_space_operations->writepage[s]() to annotate bio's using the
|
||||
following two functions.
|
||||
|
||||
* wbc_init_bio(@wbc, @bio)
|
||||
|
||||
Should be called for each bio carrying writeback data and associates
|
||||
the bio with the inode's owner cgroup. Can be called anytime
|
||||
between bio allocation and submission.
|
||||
|
||||
* wbc_account_io(@wbc, @page, @bytes)
|
||||
|
||||
Should be called for each data segment being written out. While
|
||||
this function doesn't care exactly when it's called during the
|
||||
writeback session, it's the easiest and most natural to call it as
|
||||
data segments are added to a bio.
|
||||
|
||||
With writeback bio's annotated, cgroup support can be enabled per
|
||||
super_block by setting MS_CGROUPWB in ->s_flags. This allows for
|
||||
selective disabling of cgroup writeback support which is helpful when
|
||||
certain filesystem features, e.g. journaled data mode, are
|
||||
incompatible.
|
||||
|
||||
wbc_init_bio() binds the specified bio to its cgroup. Depending on
|
||||
the configuration, the bio may be executed at a lower priority and if
|
||||
the writeback session is holding shared resources, e.g. a journal
|
||||
entry, may lead to priority inversion. There is no one easy solution
|
||||
for the problem. Filesystems can try to work around specific problem
|
||||
cases by skipping wbc_init_bio() or using bio_associate_blkcg()
|
||||
directly.
|
||||
|
|
|
@ -1,645 +0,0 @@
|
|||
|
||||
Cgroup unified hierarchy
|
||||
|
||||
April, 2014 Tejun Heo <tj@kernel.org>
|
||||
|
||||
This document describes the changes made by unified hierarchy and
|
||||
their rationales. It will eventually be merged into the main cgroup
|
||||
documentation.
|
||||
|
||||
CONTENTS
|
||||
|
||||
1. Background
|
||||
2. Basic Operation
|
||||
2-1. Mounting
|
||||
2-2. cgroup.subtree_control
|
||||
2-3. cgroup.controllers
|
||||
3. Structural Constraints
|
||||
3-1. Top-down
|
||||
3-2. No internal tasks
|
||||
4. Delegation
|
||||
4-1. Model of delegation
|
||||
4-2. Common ancestor rule
|
||||
5. Other Changes
|
||||
5-1. [Un]populated Notification
|
||||
5-2. Other Core Changes
|
||||
5-3. Controller File Conventions
|
||||
5-3-1. Format
|
||||
5-3-2. Control Knobs
|
||||
5-4. Per-Controller Changes
|
||||
5-4-1. io
|
||||
5-4-2. cpuset
|
||||
5-4-3. memory
|
||||
6. Planned Changes
|
||||
6-1. CAP for resource control
|
||||
|
||||
|
||||
1. Background
|
||||
|
||||
cgroup allows an arbitrary number of hierarchies and each hierarchy
|
||||
can host any number of controllers. While this seems to provide a
|
||||
high level of flexibility, it isn't quite useful in practice.
|
||||
|
||||
For example, as there is only one instance of each controller, utility
|
||||
type controllers such as freezer which can be useful in all
|
||||
hierarchies can only be used in one. The issue is exacerbated by the
|
||||
fact that controllers can't be moved around once hierarchies are
|
||||
populated. Another issue is that all controllers bound to a hierarchy
|
||||
are forced to have exactly the same view of the hierarchy. It isn't
|
||||
possible to vary the granularity depending on the specific controller.
|
||||
|
||||
In practice, these issues heavily limit which controllers can be put
|
||||
on the same hierarchy and most configurations resort to putting each
|
||||
controller on its own hierarchy. Only closely related ones, such as
|
||||
the cpu and cpuacct controllers, make sense to put on the same
|
||||
hierarchy. This often means that userland ends up managing multiple
|
||||
similar hierarchies repeating the same steps on each hierarchy
|
||||
whenever a hierarchy management operation is necessary.
|
||||
|
||||
Unfortunately, support for multiple hierarchies comes at a steep cost.
|
||||
Internal implementation in cgroup core proper is dazzlingly
|
||||
complicated but more importantly the support for multiple hierarchies
|
||||
restricts how cgroup is used in general and what controllers can do.
|
||||
|
||||
There's no limit on how many hierarchies there may be, which means
|
||||
that a task's cgroup membership can't be described in finite length.
|
||||
The key may contain any varying number of entries and is unlimited in
|
||||
length, which makes it highly awkward to handle and leads to addition
|
||||
of controllers which exist only to identify membership, which in turn
|
||||
exacerbates the original problem.
|
||||
|
||||
Also, as a controller can't have any expectation regarding what shape
|
||||
of hierarchies other controllers would be on, each controller has to
|
||||
assume that all other controllers are operating on completely
|
||||
orthogonal hierarchies. This makes it impossible, or at least very
|
||||
cumbersome, for controllers to cooperate with each other.
|
||||
|
||||
In most use cases, putting controllers on hierarchies which are
|
||||
completely orthogonal to each other isn't necessary. What usually is
|
||||
called for is the ability to have differing levels of granularity
|
||||
depending on the specific controller. In other words, hierarchy may
|
||||
be collapsed from leaf towards root when viewed from specific
|
||||
controllers. For example, a given configuration might not care about
|
||||
how memory is distributed beyond a certain level while still wanting
|
||||
to control how CPU cycles are distributed.
|
||||
|
||||
Unified hierarchy is the next version of cgroup interface. It aims to
|
||||
address the aforementioned issues by having more structure while
|
||||
retaining enough flexibility for most use cases. Various other
|
||||
general and controller-specific interface issues are also addressed in
|
||||
the process.
|
||||
|
||||
|
||||
2. Basic Operation
|
||||
|
||||
2-1. Mounting
|
||||
|
||||
Unified hierarchy can be mounted with the following mount command.
|
||||
|
||||
mount -t cgroup2 none $MOUNT_POINT
|
||||
|
||||
All controllers which support the unified hierarchy and are not bound
|
||||
to other hierarchies are automatically bound to unified hierarchy and
|
||||
show up at the root of it. Controllers which are enabled only in the
|
||||
root of unified hierarchy can be bound to other hierarchies. This
|
||||
allows mixing unified hierarchy with the traditional multiple
|
||||
hierarchies in a fully backward compatible way.
|
||||
|
||||
A controller can be moved across hierarchies only after the controller
|
||||
is no longer referenced in its current hierarchy. Because per-cgroup
|
||||
controller states are destroyed asynchronously and controllers may
|
||||
have lingering references, a controller may not show up immediately on
|
||||
the unified hierarchy after the final umount of the previous
|
||||
hierarchy. Similarly, a controller should be fully disabled to be
|
||||
moved out of the unified hierarchy and it may take some time for the
|
||||
disabled controller to become available for other hierarchies;
|
||||
furthermore, due to dependencies among controllers, other controllers
|
||||
may need to be disabled too.
|
||||
|
||||
While useful for development and manual configurations, dynamically
|
||||
moving controllers between the unified and other hierarchies is
|
||||
strongly discouraged for production use. It is recommended to decide
|
||||
the hierarchies and controller associations before starting using the
|
||||
controllers.
|
||||
|
||||
|
||||
2-2. cgroup.subtree_control
|
||||
|
||||
All cgroups on unified hierarchy have a "cgroup.subtree_control" file
|
||||
which governs which controllers are enabled on the children of the
|
||||
cgroup. Let's assume a hierarchy like the following.
|
||||
|
||||
root - A - B - C
|
||||
\ D
|
||||
|
||||
root's "cgroup.subtree_control" file determines which controllers are
|
||||
enabled on A. A's on B. B's on C and D. This coincides with the
|
||||
fact that controllers on the immediate sub-level are used to
|
||||
distribute the resources of the parent. In fact, it's natural to
|
||||
assume that resource control knobs of a child belong to its parent.
|
||||
Enabling a controller in a "cgroup.subtree_control" file declares that
|
||||
distribution of the respective resources of the cgroup will be
|
||||
controlled. Note that this means that controller enable states are
|
||||
shared among siblings.
|
||||
|
||||
When read, the file contains a space-separated list of currently
|
||||
enabled controllers. A write to the file should contain a
|
||||
space-separated list of controllers with '+' or '-' prefixed (without
|
||||
the quotes). Controllers prefixed with '+' are enabled and '-'
|
||||
disabled. If a controller is listed multiple times, the last entry
|
||||
wins. The specific operations are executed atomically - either all
|
||||
succeed or fail.
|
||||
|
||||
|
||||
2-3. cgroup.controllers
|
||||
|
||||
Read-only "cgroup.controllers" file contains a space-separated list of
|
||||
controllers which can be enabled in the cgroup's
|
||||
"cgroup.subtree_control" file.
|
||||
|
||||
In the root cgroup, this lists controllers which are not bound to
|
||||
other hierarchies and the content changes as controllers are bound to
|
||||
and unbound from other hierarchies.
|
||||
|
||||
In non-root cgroups, the content of this file equals that of the
|
||||
parent's "cgroup.subtree_control" file as only controllers enabled
|
||||
from the parent can be used in its children.
|
||||
|
||||
|
||||
3. Structural Constraints
|
||||
|
||||
3-1. Top-down
|
||||
|
||||
As it doesn't make sense to nest control of an uncontrolled resource,
|
||||
all non-root "cgroup.subtree_control" files can only contain
|
||||
controllers which are enabled in the parent's "cgroup.subtree_control"
|
||||
file. A controller can be enabled only if the parent has the
|
||||
controller enabled and a controller can't be disabled if one or more
|
||||
children have it enabled.
|
||||
|
||||
|
||||
3-2. No internal tasks
|
||||
|
||||
One long-standing issue that cgroup faces is the competition between
|
||||
tasks belonging to the parent cgroup and its children cgroups. This
|
||||
is inherently nasty as two different types of entities compete and
|
||||
there is no agreed-upon obvious way to handle it. Different
|
||||
controllers are doing different things.
|
||||
|
||||
The cpu controller considers tasks and cgroups as equivalents and maps
|
||||
nice levels to cgroup weights. This works for some cases but falls
|
||||
flat when children should be allocated specific ratios of CPU cycles
|
||||
and the number of internal tasks fluctuates - the ratios constantly
|
||||
change as the number of competing entities fluctuates. There also are
|
||||
other issues. The mapping from nice level to weight isn't obvious or
|
||||
universal, and there are various other knobs which simply aren't
|
||||
available for tasks.
|
||||
|
||||
The io controller implicitly creates a hidden leaf node for each
|
||||
cgroup to host the tasks. The hidden leaf has its own copies of all
|
||||
the knobs with "leaf_" prefixed. While this allows equivalent control
|
||||
over internal tasks, it's with serious drawbacks. It always adds an
|
||||
extra layer of nesting which may not be necessary, makes the interface
|
||||
messy and significantly complicates the implementation.
|
||||
|
||||
The memory controller currently doesn't have a way to control what
|
||||
happens between internal tasks and child cgroups and the behavior is
|
||||
not clearly defined. There have been attempts to add ad-hoc behaviors
|
||||
and knobs to tailor the behavior to specific workloads. Continuing
|
||||
this direction will lead to problems which will be extremely difficult
|
||||
to resolve in the long term.
|
||||
|
||||
Multiple controllers struggle with internal tasks and came up with
|
||||
different ways to deal with it; unfortunately, all the approaches in
|
||||
use now are severely flawed and, furthermore, the widely different
|
||||
behaviors make cgroup as whole highly inconsistent.
|
||||
|
||||
It is clear that this is something which needs to be addressed from
|
||||
cgroup core proper in a uniform way so that controllers don't need to
|
||||
worry about it and cgroup as a whole shows a consistent and logical
|
||||
behavior. To achieve that, unified hierarchy enforces the following
|
||||
structural constraint:
|
||||
|
||||
Except for the root, only cgroups which don't contain any task may
|
||||
have controllers enabled in their "cgroup.subtree_control" files.
|
||||
|
||||
Combined with other properties, this guarantees that, when a
|
||||
controller is looking at the part of the hierarchy which has it
|
||||
enabled, tasks are always only on the leaves. This rules out
|
||||
situations where child cgroups compete against internal tasks of the
|
||||
parent.
|
||||
|
||||
There are two things to note. Firstly, the root cgroup is exempt from
|
||||
the restriction. Root contains tasks and anonymous resource
|
||||
consumption which can't be associated with any other cgroup and
|
||||
requires special treatment from most controllers. How resource
|
||||
consumption in the root cgroup is governed is up to each controller.
|
||||
|
||||
Secondly, the restriction doesn't take effect if there is no enabled
|
||||
controller in the cgroup's "cgroup.subtree_control" file. This is
|
||||
important as otherwise it wouldn't be possible to create children of a
|
||||
populated cgroup. To control resource distribution of a cgroup, the
|
||||
cgroup must create children and transfer all its tasks to the children
|
||||
before enabling controllers in its "cgroup.subtree_control" file.
|
||||
|
||||
|
||||
4. Delegation
|
||||
|
||||
4-1. Model of delegation
|
||||
|
||||
A cgroup can be delegated to a less privileged user by granting write
|
||||
access of the directory and its "cgroup.procs" file to the user. Note
|
||||
that the resource control knobs in a given directory concern the
|
||||
resources of the parent and thus must not be delegated along with the
|
||||
directory.
|
||||
|
||||
Once delegated, the user can build sub-hierarchy under the directory,
|
||||
organize processes as it sees fit and further distribute the resources
|
||||
it got from the parent. The limits and other settings of all resource
|
||||
controllers are hierarchical and regardless of what happens in the
|
||||
delegated sub-hierarchy, nothing can escape the resource restrictions
|
||||
imposed by the parent.
|
||||
|
||||
Currently, cgroup doesn't impose any restrictions on the number of
|
||||
cgroups in or nesting depth of a delegated sub-hierarchy; however,
|
||||
this may in the future be limited explicitly.
|
||||
|
||||
|
||||
4-2. Common ancestor rule
|
||||
|
||||
On the unified hierarchy, to write to a "cgroup.procs" file, in
|
||||
addition to the usual write permission to the file and uid match, the
|
||||
writer must also have write access to the "cgroup.procs" file of the
|
||||
common ancestor of the source and destination cgroups. This prevents
|
||||
delegatees from smuggling processes across disjoint sub-hierarchies.
|
||||
|
||||
Let's say cgroups C0 and C1 have been delegated to user U0 who created
|
||||
C00, C01 under C0 and C10 under C1 as follows.
|
||||
|
||||
~~~~~~~~~~~~~ - C0 - C00
|
||||
~ cgroup ~ \ C01
|
||||
~ hierarchy ~
|
||||
~~~~~~~~~~~~~ - C1 - C10
|
||||
|
||||
C0 and C1 are separate entities in terms of resource distribution
|
||||
regardless of their relative positions in the hierarchy. The
|
||||
resources the processes under C0 are entitled to are controlled by
|
||||
C0's ancestors and may be completely different from C1. It's clear
|
||||
that the intention of delegating C0 to U0 is allowing U0 to organize
|
||||
the processes under C0 and further control the distribution of C0's
|
||||
resources.
|
||||
|
||||
On traditional hierarchies, if a task has write access to "tasks" or
|
||||
"cgroup.procs" file of a cgroup and its uid agrees with the target, it
|
||||
can move the target to the cgroup. In the above example, U0 will not
|
||||
only be able to move processes in each sub-hierarchy but also across
|
||||
the two sub-hierarchies, effectively allowing it to violate the
|
||||
organizational and resource restrictions implied by the hierarchical
|
||||
structure above C0 and C1.
|
||||
|
||||
On the unified hierarchy, let's say U0 wants to write the pid of a
|
||||
process which has a matching uid and is currently in C10 into
|
||||
"C00/cgroup.procs". U0 obviously has write access to the file and
|
||||
migration permission on the process; however, the common ancestor of
|
||||
the source cgroup C10 and the destination cgroup C00 is above the
|
||||
points of delegation and U0 would not have write access to its
|
||||
"cgroup.procs" and thus be denied with -EACCES.
|
||||
|
||||
|
||||
5. Other Changes
|
||||
|
||||
5-1. [Un]populated Notification
|
||||
|
||||
cgroup users often need a way to determine when a cgroup's
|
||||
subhierarchy becomes empty so that it can be cleaned up. cgroup
|
||||
currently provides release_agent for it; unfortunately, this mechanism
|
||||
is riddled with issues.
|
||||
|
||||
- It delivers events by forking and execing a userland binary
|
||||
specified as the release_agent. This is a long deprecated method of
|
||||
notification delivery. It's extremely heavy, slow and cumbersome to
|
||||
integrate with larger infrastructure.
|
||||
|
||||
- There is single monitoring point at the root. There's no way to
|
||||
delegate management of a subtree.
|
||||
|
||||
- The event isn't recursive. It triggers when a cgroup doesn't have
|
||||
any tasks or child cgroups. Events for internal nodes trigger only
|
||||
after all children are removed. This again makes it impossible to
|
||||
delegate management of a subtree.
|
||||
|
||||
- Events are filtered from the kernel side. A "notify_on_release"
|
||||
file is used to subscribe to or suppress release events. This is
|
||||
unnecessarily complicated and probably done this way because event
|
||||
delivery itself was expensive.
|
||||
|
||||
Unified hierarchy implements "populated" field in "cgroup.events"
|
||||
interface file which can be used to monitor whether the cgroup's
|
||||
subhierarchy has tasks in it or not. Its value is 0 if there is no
|
||||
task in the cgroup and its descendants; otherwise, 1. poll and
|
||||
[id]notify events are triggered when the value changes.
|
||||
|
||||
This is significantly lighter and simpler and trivially allows
|
||||
delegating management of subhierarchy - subhierarchy monitoring can
|
||||
block further propagation simply by putting itself or another process
|
||||
in the subhierarchy and monitor events that it's interested in from
|
||||
there without interfering with monitoring higher in the tree.
|
||||
|
||||
In unified hierarchy, the release_agent mechanism is no longer
|
||||
supported and the interface files "release_agent" and
|
||||
"notify_on_release" do not exist.
|
||||
|
||||
|
||||
5-2. Other Core Changes
|
||||
|
||||
- None of the mount options is allowed.
|
||||
|
||||
- remount is disallowed.
|
||||
|
||||
- rename(2) is disallowed.
|
||||
|
||||
- The "tasks" file is removed. Everything should at process
|
||||
granularity. Use the "cgroup.procs" file instead.
|
||||
|
||||
- The "cgroup.procs" file is not sorted. pids will be unique unless
|
||||
they got recycled in-between reads.
|
||||
|
||||
- The "cgroup.clone_children" file is removed.
|
||||
|
||||
- /proc/PID/cgroup keeps reporting the cgroup that a zombie belonged
|
||||
to before exiting. If the cgroup is removed before the zombie is
|
||||
reaped, " (deleted)" is appeneded to the path.
|
||||
|
||||
|
||||
5-3. Controller File Conventions
|
||||
|
||||
5-3-1. Format
|
||||
|
||||
In general, all controller files should be in one of the following
|
||||
formats whenever possible.
|
||||
|
||||
- Values only files
|
||||
|
||||
VAL0 VAL1...\n
|
||||
|
||||
- Flat keyed files
|
||||
|
||||
KEY0 VAL0\n
|
||||
KEY1 VAL1\n
|
||||
...
|
||||
|
||||
- Nested keyed files
|
||||
|
||||
KEY0 SUB_KEY0=VAL00 SUB_KEY1=VAL01...
|
||||
KEY1 SUB_KEY0=VAL10 SUB_KEY1=VAL11...
|
||||
...
|
||||
|
||||
For a writeable file, the format for writing should generally match
|
||||
reading; however, controllers may allow omitting later fields or
|
||||
implement restricted shortcuts for most common use cases.
|
||||
|
||||
For both flat and nested keyed files, only the values for a single key
|
||||
can be written at a time. For nested keyed files, the sub key pairs
|
||||
may be specified in any order and not all pairs have to be specified.
|
||||
|
||||
|
||||
5-3-2. Control Knobs
|
||||
|
||||
- Settings for a single feature should generally be implemented in a
|
||||
single file.
|
||||
|
||||
- In general, the root cgroup should be exempt from resource control
|
||||
and thus shouldn't have resource control knobs.
|
||||
|
||||
- If a controller implements ratio based resource distribution, the
|
||||
control knob should be named "weight" and have the range [1, 10000]
|
||||
and 100 should be the default value. The values are chosen to allow
|
||||
enough and symmetric bias in both directions while keeping it
|
||||
intuitive (the default is 100%).
|
||||
|
||||
- If a controller implements an absolute resource guarantee and/or
|
||||
limit, the control knobs should be named "min" and "max"
|
||||
respectively. If a controller implements best effort resource
|
||||
gurantee and/or limit, the control knobs should be named "low" and
|
||||
"high" respectively.
|
||||
|
||||
In the above four control files, the special token "max" should be
|
||||
used to represent upward infinity for both reading and writing.
|
||||
|
||||
- If a setting has configurable default value and specific overrides,
|
||||
the default settings should be keyed with "default" and appear as
|
||||
the first entry in the file. Specific entries can use "default" as
|
||||
its value to indicate inheritance of the default value.
|
||||
|
||||
- For events which are not very high frequency, an interface file
|
||||
"events" should be created which lists event key value pairs.
|
||||
Whenever a notifiable event happens, file modified event should be
|
||||
generated on the file.
|
||||
|
||||
|
||||
5-4. Per-Controller Changes
|
||||
|
||||
5-4-1. io
|
||||
|
||||
- blkio is renamed to io. The interface is overhauled anyway. The
|
||||
new name is more in line with the other two major controllers, cpu
|
||||
and memory, and better suited given that it may be used for cgroup
|
||||
writeback without involving block layer.
|
||||
|
||||
- Everything including stat is always hierarchical making separate
|
||||
recursive stat files pointless and, as no internal node can have
|
||||
tasks, leaf weights are meaningless. The operation model is
|
||||
simplified and the interface is overhauled accordingly.
|
||||
|
||||
io.stat
|
||||
|
||||
The stat file. The reported stats are from the point where
|
||||
bio's are issued to request_queue. The stats are counted
|
||||
independent of which policies are enabled. Each line in the
|
||||
file follows the following format. More fields may later be
|
||||
added at the end.
|
||||
|
||||
$MAJ:$MIN rbytes=$RBYTES wbytes=$WBYTES rios=$RIOS wrios=$WIOS
|
||||
|
||||
io.weight
|
||||
|
||||
The weight setting, currently only available and effective if
|
||||
cfq-iosched is in use for the target device. The weight is
|
||||
between 1 and 10000 and defaults to 100. The first line
|
||||
always contains the default weight in the following format to
|
||||
use when per-device setting is missing.
|
||||
|
||||
default $WEIGHT
|
||||
|
||||
Subsequent lines list per-device weights of the following
|
||||
format.
|
||||
|
||||
$MAJ:$MIN $WEIGHT
|
||||
|
||||
Writing "$WEIGHT" or "default $WEIGHT" changes the default
|
||||
setting. Writing "$MAJ:$MIN $WEIGHT" sets per-device weight
|
||||
while "$MAJ:$MIN default" clears it.
|
||||
|
||||
This file is available only on non-root cgroups.
|
||||
|
||||
io.max
|
||||
|
||||
The maximum bandwidth and/or iops setting, only available if
|
||||
blk-throttle is enabled. The file is of the following format.
|
||||
|
||||
$MAJ:$MIN rbps=$RBPS wbps=$WBPS riops=$RIOPS wiops=$WIOPS
|
||||
|
||||
${R|W}BPS are read/write bytes per second and ${R|W}IOPS are
|
||||
read/write IOs per second. "max" indicates no limit. Writing
|
||||
to the file follows the same format but the individual
|
||||
settings may be omitted or specified in any order.
|
||||
|
||||
This file is available only on non-root cgroups.
|
||||
|
||||
|
||||
5-4-2. cpuset
|
||||
|
||||
- Tasks are kept in empty cpusets after hotplug and take on the masks
|
||||
of the nearest non-empty ancestor, instead of being moved to it.
|
||||
|
||||
- A task can be moved into an empty cpuset, and again it takes on the
|
||||
masks of the nearest non-empty ancestor.
|
||||
|
||||
|
||||
5-4-3. memory
|
||||
|
||||
- use_hierarchy is on by default and the cgroup file for the flag is
|
||||
not created.
|
||||
|
||||
- The original lower boundary, the soft limit, is defined as a limit
|
||||
that is per default unset. As a result, the set of cgroups that
|
||||
global reclaim prefers is opt-in, rather than opt-out. The costs
|
||||
for optimizing these mostly negative lookups are so high that the
|
||||
implementation, despite its enormous size, does not even provide the
|
||||
basic desirable behavior. First off, the soft limit has no
|
||||
hierarchical meaning. All configured groups are organized in a
|
||||
global rbtree and treated like equal peers, regardless where they
|
||||
are located in the hierarchy. This makes subtree delegation
|
||||
impossible. Second, the soft limit reclaim pass is so aggressive
|
||||
that it not just introduces high allocation latencies into the
|
||||
system, but also impacts system performance due to overreclaim, to
|
||||
the point where the feature becomes self-defeating.
|
||||
|
||||
The memory.low boundary on the other hand is a top-down allocated
|
||||
reserve. A cgroup enjoys reclaim protection when it and all its
|
||||
ancestors are below their low boundaries, which makes delegation of
|
||||
subtrees possible. Secondly, new cgroups have no reserve per
|
||||
default and in the common case most cgroups are eligible for the
|
||||
preferred reclaim pass. This allows the new low boundary to be
|
||||
efficiently implemented with just a minor addition to the generic
|
||||
reclaim code, without the need for out-of-band data structures and
|
||||
reclaim passes. Because the generic reclaim code considers all
|
||||
cgroups except for the ones running low in the preferred first
|
||||
reclaim pass, overreclaim of individual groups is eliminated as
|
||||
well, resulting in much better overall workload performance.
|
||||
|
||||
- The original high boundary, the hard limit, is defined as a strict
|
||||
limit that can not budge, even if the OOM killer has to be called.
|
||||
But this generally goes against the goal of making the most out of
|
||||
the available memory. The memory consumption of workloads varies
|
||||
during runtime, and that requires users to overcommit. But doing
|
||||
that with a strict upper limit requires either a fairly accurate
|
||||
prediction of the working set size or adding slack to the limit.
|
||||
Since working set size estimation is hard and error prone, and
|
||||
getting it wrong results in OOM kills, most users tend to err on the
|
||||
side of a looser limit and end up wasting precious resources.
|
||||
|
||||
The memory.high boundary on the other hand can be set much more
|
||||
conservatively. When hit, it throttles allocations by forcing them
|
||||
into direct reclaim to work off the excess, but it never invokes the
|
||||
OOM killer. As a result, a high boundary that is chosen too
|
||||
aggressively will not terminate the processes, but instead it will
|
||||
lead to gradual performance degradation. The user can monitor this
|
||||
and make corrections until the minimal memory footprint that still
|
||||
gives acceptable performance is found.
|
||||
|
||||
In extreme cases, with many concurrent allocations and a complete
|
||||
breakdown of reclaim progress within the group, the high boundary
|
||||
can be exceeded. But even then it's mostly better to satisfy the
|
||||
allocation from the slack available in other groups or the rest of
|
||||
the system than killing the group. Otherwise, memory.max is there
|
||||
to limit this type of spillover and ultimately contain buggy or even
|
||||
malicious applications.
|
||||
|
||||
- The original control file names are unwieldy and inconsistent in
|
||||
many different ways. For example, the upper boundary hit count is
|
||||
exported in the memory.failcnt file, but an OOM event count has to
|
||||
be manually counted by listening to memory.oom_control events, and
|
||||
lower boundary / soft limit events have to be counted by first
|
||||
setting a threshold for that value and then counting those events.
|
||||
Also, usage and limit files encode their units in the filename.
|
||||
That makes the filenames very long, even though this is not
|
||||
information that a user needs to be reminded of every time they type
|
||||
out those names.
|
||||
|
||||
To address these naming issues, as well as to signal clearly that
|
||||
the new interface carries a new configuration model, the naming
|
||||
conventions in it necessarily differ from the old interface.
|
||||
|
||||
- The original limit files indicate the state of an unset limit with a
|
||||
Very High Number, and a configured limit can be unset by echoing -1
|
||||
into those files. But that very high number is implementation and
|
||||
architecture dependent and not very descriptive. And while -1 can
|
||||
be understood as an underflow into the highest possible value, -2 or
|
||||
-10M etc. do not work, so it's not consistent.
|
||||
|
||||
memory.low, memory.high, and memory.max will use the string "max" to
|
||||
indicate and set the highest possible value.
|
||||
|
||||
6. Planned Changes
|
||||
|
||||
6-1. CAP for resource control
|
||||
|
||||
Unified hierarchy will require one of the capabilities(7), which is
|
||||
yet to be decided, for all resource control related knobs. Process
|
||||
organization operations - creation of sub-cgroups and migration of
|
||||
processes in sub-hierarchies may be delegated by changing the
|
||||
ownership and/or permissions on the cgroup directory and
|
||||
"cgroup.procs" interface file; however, all operations which affect
|
||||
resource control - writes to a "cgroup.subtree_control" file or any
|
||||
controller-specific knobs - will require an explicit CAP privilege.
|
||||
|
||||
This, in part, is to prevent the cgroup interface from being
|
||||
inadvertently promoted to programmable API used by non-privileged
|
||||
binaries. cgroup exposes various aspects of the system in ways which
|
||||
aren't properly abstracted for direct consumption by regular programs.
|
||||
This is an administration interface much closer to sysctl knobs than
|
||||
system calls. Even the basic access model, being filesystem path
|
||||
based, isn't suitable for direct consumption. There's no way to
|
||||
access "my cgroup" in a race-free way or make multiple operations
|
||||
atomic against migration to another cgroup.
|
||||
|
||||
Another aspect is that, for better or for worse, the cgroup interface
|
||||
goes through far less scrutiny than regular interfaces for
|
||||
unprivileged userland. The upside is that cgroup is able to expose
|
||||
useful features which may not be suitable for general consumption in a
|
||||
reasonable time frame. It provides a relatively short path between
|
||||
internal details and userland-visible interface. Of course, this
|
||||
shortcut comes with high risk. We go through what we go through for
|
||||
general kernel APIs for good reasons. It may end up leaking internal
|
||||
details in a way which can exert significant pain by locking the
|
||||
kernel into a contract that can't be maintained in a reasonable
|
||||
manner.
|
||||
|
||||
Also, due to the specific nature, cgroup and its controllers don't
|
||||
tend to attract attention from a wide scope of developers. cgroup's
|
||||
short history is already fraught with severely mis-designed
|
||||
interfaces, unnecessary commitments to and exposing of internal
|
||||
details, broken and dangerous implementations of various features.
|
||||
|
||||
Keeping cgroup as an administration interface is both advantageous for
|
||||
its role and imperative given its nature. Some of the cgroup features
|
||||
may make sense for unprivileged access. If deemed justified, those
|
||||
must be further abstracted and implemented as a different interface,
|
||||
be it a system call or process-private filesystem, and survive through
|
||||
the scrutiny that any interface for general consumption is required to
|
||||
go through.
|
||||
|
||||
Requiring CAP is not a complete solution but should serve as a
|
||||
significant deterrent against spraying cgroup usages in non-privileged
|
||||
programs.
|
File diff suppressed because it is too large
Load Diff
Loading…
Reference in New Issue