mirror of https://gitee.com/openkylin/linux.git

commit 92c1d65221

Merge branch 'for-5.3' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/cgroup

Pull cgroup updates from Tejun Heo:
 "Documentation updates and the addition of cgroup_parse_float() which
  will be used by new controllers including blk-iocost"

* 'for-5.3' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/cgroup:
  docs: cgroup-v1: convert docs to ReST and rename to *.rst
  cgroup: Move cgroup_parse_float() implementation out of CONFIG_SYSFS
  cgroup: add cgroup_parse_float()
@@ -705,6 +705,12 @@ Conventions
   informational files on the root cgroup which end up showing global
   information available elsewhere shouldn't exist.
 
+- The default time unit is microseconds.  If a different unit is ever
+  used, an explicit unit suffix must be present.
+
+- A parts-per quantity should use a percentage decimal with at least
+  two digit fractional part - e.g. 13.40.
+
 - If a controller implements weight based resource distribution, its
   interface file should be named "weight" and have the range [1,
   10000] with 100 as the default.  The values are chosen to allow
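Both conventions above are visible in any cgroup v2 weight file. A minimal
shell sketch, assuming cgroup2 is mounted at /sys/fs/cgroup, the cpu
controller is enabled for the subtree, and "demo" is a made-up group name::

  # mkdir /sys/fs/cgroup/demo                  # "demo" is hypothetical
  # cat /sys/fs/cgroup/demo/cpu.weight         # default weight
  100
  # echo 200 > /sys/fs/cgroup/demo/cpu.weight  # accepts any value in [1, 10000]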
@@ -241,7 +241,7 @@ Guest mitigation mechanisms
 For further information about confining guests to a single or to a group
 of cores consult the cpusets documentation:
 
-https://www.kernel.org/doc/Documentation/cgroup-v1/cpusets.txt
+https://www.kernel.org/doc/Documentation/cgroup-v1/cpusets.rst
 
 .. _interrupt_isolation:
@@ -4084,7 +4084,7 @@
 
 	relax_domain_level=
 			[KNL, SMP] Set scheduler's default relax_domain_level.
-			See Documentation/cgroup-v1/cpusets.txt.
+			See Documentation/cgroup-v1/cpusets.rst.
 
 	reserve=	[KNL,BUGS] Force kernel to ignore I/O ports or memory
 			Format: <base1>,<size1>[,<base2>,<size2>,...]
@@ -4594,7 +4594,7 @@
 	swapaccount=[0|1]
 			[KNL] Enable accounting of swap in memory resource
 			controller if no parameter or 1 is given or disable
-			it if 0 is given (See Documentation/cgroup-v1/memory.txt)
+			it if 0 is given (See Documentation/cgroup-v1/memory.rst)
 
 	swiotlb=	[ARM,IA-64,PPC,MIPS,X86]
 			Format: { <int> | force | noforce }
@@ -15,7 +15,7 @@ document attempts to describe the concepts and APIs of the 2.6 memory policy
 support.
 
 Memory policies should not be confused with cpusets
-(``Documentation/cgroup-v1/cpusets.txt``)
+(``Documentation/cgroup-v1/cpusets.rst``)
 which is an administrative mechanism for restricting the nodes from which
 memory may be allocated by a set of processes. Memory policies are a
 programming interface that a NUMA-aware application can take advantage of. When
@@ -539,7 +539,7 @@ As for cgroups-v1 (blkio controller), the exact set of stat files
 created, and kept up-to-date by bfq, depends on whether
 CONFIG_DEBUG_BLK_CGROUP is set. If it is set, then bfq creates all
 the stat files documented in
-Documentation/cgroup-v1/blkio-controller.txt. If, instead,
+Documentation/cgroup-v1/blkio-controller.rst. If, instead,
 CONFIG_DEBUG_BLK_CGROUP is not set, then bfq creates only the files
 	blkio.bfq.io_service_bytes
 	blkio.bfq.io_service_bytes_recursive
@@ -1,5 +1,7 @@
-Block IO Controller
-===================
+===================
+Block IO Controller
+===================
 
 Overview
 ========
 cgroup subsys "blkio" implements the block io controller. There seems to be
@@ -17,24 +19,27 @@ HOWTO
 =====
 Throttling/Upper Limit policy
 -----------------------------
-- Enable Block IO controller
+- Enable Block IO controller::
 
 	CONFIG_BLK_CGROUP=y
 
-- Enable throttling in block layer
+- Enable throttling in block layer::
 
 	CONFIG_BLK_DEV_THROTTLING=y
 
-- Mount blkio controller (see cgroups.txt, Why are cgroups needed?)
+- Mount blkio controller (see cgroups.txt, Why are cgroups needed?)::
 
 	mount -t cgroup -o blkio none /sys/fs/cgroup/blkio
 
 - Specify a bandwidth rate on particular device for root group. The format
-  for policy is "<major>:<minor>  <bytes_per_second>".
+  for policy is "<major>:<minor>  <bytes_per_second>"::
 
 	echo "8:16  1048576" > /sys/fs/cgroup/blkio/blkio.throttle.read_bps_device
 
   Above will put a limit of 1MB/second on reads happening for root group
   on device having major/minor number 8:16.
 
-- Run dd to read a file and see if rate is throttled to 1MB/s or not.
+- Run dd to read a file and see if rate is throttled to 1MB/s or not::
 
	# dd iflag=direct if=/mnt/common/zerofile of=/dev/null bs=4K count=1024
	1024+0 records in
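The write side of the throttle works symmetrically to the read example in
this hunk; a rough sketch reusing the document's device 8:16 and mount point
(the target file path is illustrative)::

  # echo "8:16  1048576" > /sys/fs/cgroup/blkio/blkio.throttle.write_bps_device
  # dd oflag=direct if=/dev/zero of=/mnt/common/testfile bs=4K count=1024

With the rule in place, the dd above should settle at roughly 1MB/s.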
@@ -51,7 +56,7 @@ throttling's hierarchy support is enabled iff "sane_behavior" is
 enabled from cgroup side, which currently is a development option and
 not publicly available.
 
-If somebody created a hierarchy like as follows.
+If somebody created a hierarchy like as follows::
 
 			root
 			/  \
@@ -66,7 +71,7 @@ directly generated by tasks in that cgroup.
 
 Throttling without "sane_behavior" enabled from cgroup side will
 practically treat all groups at same level as if it looks like the
-following.
+following::
 
 			pivot
 		     /  /   \  \
@@ -99,27 +104,31 @@ Proportional weight policy files
 	  These rules override the default value of group weight as specified
 	  by blkio.weight.
 
-	  Following is the format.
+	  Following is the format::
 
 	  # echo dev_maj:dev_minor weight > blkio.weight_device
-	  Configure weight=300 on /dev/sdb (8:16) in this cgroup
-	  # echo 8:16 300 > blkio.weight_device
-	  # cat blkio.weight_device
-	  dev     weight
-	  8:16    300
+
+	  Configure weight=300 on /dev/sdb (8:16) in this cgroup::
+
+	    # echo 8:16 300 > blkio.weight_device
+	    # cat blkio.weight_device
+	    dev     weight
+	    8:16    300
 
-	  Configure weight=500 on /dev/sda (8:0) in this cgroup
-	  # echo 8:0 500 > blkio.weight_device
-	  # cat blkio.weight_device
-	  dev     weight
-	  8:0     500
-	  8:16    300
+	  Configure weight=500 on /dev/sda (8:0) in this cgroup::
+
+	    # echo 8:0 500 > blkio.weight_device
+	    # cat blkio.weight_device
+	    dev     weight
+	    8:0     500
+	    8:16    300
 
-	  Remove specific weight for /dev/sda in this cgroup
-	  # echo 8:0 0 > blkio.weight_device
-	  # cat blkio.weight_device
-	  dev     weight
-	  8:16    300
+	  Remove specific weight for /dev/sda in this cgroup::
+
+	    # echo 8:0 0 > blkio.weight_device
+	    # cat blkio.weight_device
+	    dev     weight
+	    8:16    300
 
 - blkio.leaf_weight[_device]
 	- Equivalents of blkio.weight[_device] for the purpose of
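As a usage sketch for the weight files documented in this hunk (group names
are made up, 8:16 is the document's example device, and proportional
weighting only takes effect under an IO scheduler that honours it)::

  # mkdir /sys/fs/cgroup/blkio/fast /sys/fs/cgroup/blkio/slow
  # echo 800 > /sys/fs/cgroup/blkio/fast/blkio.weight   # ~8x the share of "slow"
  # echo 100 > /sys/fs/cgroup/blkio/slow/blkio.weight
  # echo "8:16 300" > /sys/fs/cgroup/blkio/fast/blkio.weight_device  # per-device override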
@@ -244,30 +253,30 @@ Throttling/Upper limit policy files
 - blkio.throttle.read_bps_device
 	- Specifies upper limit on READ rate from the device. IO rate is
 	  specified in bytes per second. Rules are per device. Following is
-	  the format.
+	  the format::
 
 	  echo "<major>:<minor>  <rate_bytes_per_second>" > /cgrp/blkio.throttle.read_bps_device
 
 - blkio.throttle.write_bps_device
 	- Specifies upper limit on WRITE rate to the device. IO rate is
 	  specified in bytes per second. Rules are per device. Following is
-	  the format.
+	  the format::
 
 	  echo "<major>:<minor>  <rate_bytes_per_second>" > /cgrp/blkio.throttle.write_bps_device
 
 - blkio.throttle.read_iops_device
 	- Specifies upper limit on READ rate from the device. IO rate is
 	  specified in IO per second. Rules are per device. Following is
-	  the format.
+	  the format::
 
 	  echo "<major>:<minor>  <rate_io_per_second>" > /cgrp/blkio.throttle.read_iops_device
 
 - blkio.throttle.write_iops_device
 	- Specifies upper limit on WRITE rate to the device. IO rate is
 	  specified in io per second. Rules are per device. Following is
-	  the format.
+	  the format::
 
 	  echo "<major>:<minor>  <rate_io_per_second>" > /cgrp/blkio.throttle.write_iops_device
 
 Note: If both BW and IOPS rules are specified for a device, then IO is
       subjected to both the constraints.
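Because BW and IOPS rules combine, one device can be capped on both axes at
once; a sketch using the document's /cgrp placeholder and arbitrary numbers::

  # echo "8:16  2097152" > /cgrp/blkio.throttle.read_bps_device   # 2MB/s ceiling
  # echo "8:16  120"     > /cgrp/blkio.throttle.read_iops_device  # and 120 IO/s ceiling

Reads are then throttled by whichever limit bites first.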
@@ -1,35 +1,39 @@
-CGROUPS
--------
+==============
+Control Groups
+==============
 
 Written by Paul Menage <menage@google.com> based on
-Documentation/cgroup-v1/cpusets.txt
+Documentation/cgroup-v1/cpusets.rst
 
 Original copyright statements from cpusets.txt:
 
 Portions Copyright (C) 2004 BULL SA.
 
 Portions Copyright (c) 2004-2006 Silicon Graphics, Inc.
 
 Modified by Paul Jackson <pj@sgi.com>
 
 Modified by Christoph Lameter <cl@linux.com>
 
-CONTENTS:
-=========
+.. CONTENTS:
 
 1. Control Groups
   1.1 What are cgroups ?
   1.2 Why are cgroups needed ?
   1.3 How are cgroups implemented ?
   1.4 What does notify_on_release do ?
   1.5 What does clone_children do ?
   1.6 How do I use cgroups ?
 2. Usage Examples and Syntax
   2.1 Basic Usage
   2.2 Attaching processes
   2.3 Mounting hierarchies by name
 3. Kernel API
   3.1 Overview
   3.2 Synchronization
   3.3 Subsystem API
 4. Extended attributes usage
 5. Questions
 
 1. Control Groups
 =================
@@ -72,7 +76,7 @@ On their own, the only use for cgroups is for simple job
 tracking. The intention is that other subsystems hook into the generic
 cgroup support to provide new attributes for cgroups, such as
 accounting/limiting the resources which processes in a cgroup can
-access. For example, cpusets (see Documentation/cgroup-v1/cpusets.txt) allow
+access. For example, cpusets (see Documentation/cgroup-v1/cpusets.rst) allow
 you to associate a set of CPUs and a set of memory nodes with the
 tasks in each cgroup.
@@ -108,7 +112,7 @@ As an example of a scenario (originally proposed by vatsa@in.ibm.com)
 that can benefit from multiple hierarchies, consider a large
 university server with various users - students, professors, system
 tasks etc. The resource planning for this server could be along the
-following lines:
+following lines::
 
        CPU :          "Top cpuset"
                        /       \
@@ -136,7 +140,7 @@ depending on who launched it (prof/student).
 With the ability to classify tasks differently for different resources
 (by putting those resource subsystems in different hierarchies),
 the admin can easily set up a script which receives exec notifications
-and depending on who is launching the browser he can
+and depending on who is launching the browser he can::
 
     # echo browser_pid > /sys/fs/cgroup/<restype>/<userclass>/tasks
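The notification script is left abstract in the text; one minimal sketch of
such a classifier, with every path, class name, and calling convention
assumed purely for illustration::

  #!/bin/sh
  # assumed convention: invoked with the PID of the newly exec'd browser
  pid=$1
  user=$(stat -c %U "/proc/$pid")
  case "$user" in
      prof*) class=wwwbrowser_prof ;;     # made-up class names
      *)     class=wwwbrowser_student ;;
  esac
  echo "$pid" > "/sys/fs/cgroup/cpu/$class/tasks"
  echo "$pid" > "/sys/fs/cgroup/network/$class/tasks"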
@@ -151,7 +155,7 @@ wants to do online gaming :)) OR give one of the student's simulation
 apps enhanced CPU power.
 
 With ability to write PIDs directly to resource classes, it's just a
-matter of:
+matter of::
 
     # echo pid > /sys/fs/cgroup/network/<new_class>/tasks
     (after some time)
@@ -306,7 +310,7 @@ configuration from the parent during initialization.
 --------------------------
 
 To start a new job that is to be contained within a cgroup, using
-the "cpuset" cgroup subsystem, the steps are something like:
+the "cpuset" cgroup subsystem, the steps are something like::
 
  1) mount -t tmpfs cgroup_root /sys/fs/cgroup
  2) mkdir /sys/fs/cgroup/cpuset
@@ -320,7 +324,7 @@ the "cpuset" cgroup subsystem, the steps are something like:
 
 For example, the following sequence of commands will setup a cgroup
 named "Charlie", containing just CPUs 2 and 3, and Memory Node 1,
-and then start a subshell 'sh' in that cgroup:
+and then start a subshell 'sh' in that cgroup::
 
  mount -t tmpfs cgroup_root /sys/fs/cgroup
  mkdir /sys/fs/cgroup/cpuset
@@ -345,8 +349,9 @@ and then start a subshell 'sh' in that cgroup:
 Creating, modifying, using cgroups can be done through the cgroup
 virtual filesystem.
 
-To mount a cgroup hierarchy with all available subsystems, type:
-# mount -t cgroup xxx /sys/fs/cgroup
+To mount a cgroup hierarchy with all available subsystems, type::
+
+  # mount -t cgroup xxx /sys/fs/cgroup
 
 The "xxx" is not interpreted by the cgroup code, but will appear in
 /proc/mounts so may be any useful identifying string that you like.
@@ -355,18 +360,19 @@ Note: Some subsystems do not work without some user input first. For instance,
 if cpusets are enabled the user will have to populate the cpus and mems files
 for each new cgroup created before that group can be used.
 
-As explained in section `1.2 Why are cgroups needed?' you should create
+As explained in section `1.2 Why are cgroups needed?` you should create
 different hierarchies of cgroups for each single resource or group of
 resources you want to control. Therefore, you should mount a tmpfs on
 /sys/fs/cgroup and create directories for each cgroup resource or resource
-group.
+group::
 
   # mount -t tmpfs cgroup_root /sys/fs/cgroup
   # mkdir /sys/fs/cgroup/rg1
 
 To mount a cgroup hierarchy with just the cpuset and memory
-subsystems, type:
-# mount -t cgroup -o cpuset,memory hier1 /sys/fs/cgroup/rg1
+subsystems, type::
+
+  # mount -t cgroup -o cpuset,memory hier1 /sys/fs/cgroup/rg1
 
 While remounting cgroups is currently supported, it is not recommend
 to use it. Remounting allows changing bound subsystems and
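Putting the pieces of this hunk together, a plausible end-to-end session
looks like this (rg1/rg2 and the hierarchy names are only examples)::

  # mount -t tmpfs cgroup_root /sys/fs/cgroup
  # mkdir /sys/fs/cgroup/rg1 /sys/fs/cgroup/rg2
  # mount -t cgroup -o cpuset,memory hier1 /sys/fs/cgroup/rg1
  # mount -t cgroup -o blkio hier2 /sys/fs/cgroup/rg2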
@@ -375,9 +381,10 @@ hierarchy is empty and release_agent itself should be replaced with
 conventional fsnotify. The support for remounting will be removed in
 the future.
 
-To Specify a hierarchy's release_agent:
-# mount -t cgroup -o cpuset,release_agent="/sbin/cpuset_release_agent" \
-  xxx /sys/fs/cgroup/rg1
+To Specify a hierarchy's release_agent::
+
+  # mount -t cgroup -o cpuset,release_agent="/sbin/cpuset_release_agent" \
+    xxx /sys/fs/cgroup/rg1
 
 Note that specifying 'release_agent' more than once will return failure.
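The release agent itself is just an executable that the kernel runs with a
single argument: the path of the abandoned cgroup relative to the hierarchy
root. A minimal sketch, assuming the hierarchy is mounted at
/sys/fs/cgroup/rg1 as in the example above::

  #!/bin/sh
  # /sbin/cpuset_release_agent - remove the now-empty cgroup; $1 is the
  # cgroup path relative to the hierarchy root (mount point is assumed)
  rmdir "/sys/fs/cgroup/rg1$1"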
@@ -390,32 +397,39 @@ Then under /sys/fs/cgroup/rg1 you can find a tree that corresponds to the
 tree of the cgroups in the system. For instance, /sys/fs/cgroup/rg1
 is the cgroup that holds the whole system.
 
-If you want to change the value of release_agent:
-# echo "/sbin/new_release_agent" > /sys/fs/cgroup/rg1/release_agent
+If you want to change the value of release_agent::
+
+  # echo "/sbin/new_release_agent" > /sys/fs/cgroup/rg1/release_agent
 
 It can also be changed via remount.
 
-If you want to create a new cgroup under /sys/fs/cgroup/rg1:
-# cd /sys/fs/cgroup/rg1
-# mkdir my_cgroup
+If you want to create a new cgroup under /sys/fs/cgroup/rg1::
+
+  # cd /sys/fs/cgroup/rg1
+  # mkdir my_cgroup
 
-Now you want to do something with this cgroup.
-# cd my_cgroup
+Now you want to do something with this cgroup:
+
+  # cd my_cgroup
 
-In this directory you can find several files:
-# ls
-cgroup.procs notify_on_release tasks
-(plus whatever files added by the attached subsystems)
+In this directory you can find several files::
+
+  # ls
+  cgroup.procs notify_on_release tasks
+  (plus whatever files added by the attached subsystems)
 
-Now attach your shell to this cgroup:
-# /bin/echo $$ > tasks
+Now attach your shell to this cgroup::
+
+  # /bin/echo $$ > tasks
 
 You can also create cgroups inside your cgroup by using mkdir in this
-directory.
-# mkdir my_sub_cs
+directory::
+
+  # mkdir my_sub_cs
 
-To remove a cgroup, just use rmdir:
-# rmdir my_sub_cs
+To remove a cgroup, just use rmdir::
+
+  # rmdir my_sub_cs
 
 This will fail if the cgroup is in use (has cgroups inside, or
 has processes attached, or is held alive by other subsystem-specific
@@ -424,19 +438,21 @@ reference).
 2.2 Attaching processes
 -----------------------
 
-# /bin/echo PID > tasks
+::
+
+	# /bin/echo PID > tasks
 
 Note that it is PID, not PIDs. You can only attach ONE task at a time.
-If you have several tasks to attach, you have to do it one after another:
+If you have several tasks to attach, you have to do it one after another::
 
 	# /bin/echo PID1 > tasks
 	# /bin/echo PID2 > tasks
 	...
 	# /bin/echo PIDn > tasks
 
-You can attach the current shell task by echoing 0:
+You can attach the current shell task by echoing 0::
 
 	# echo 0 > tasks
 
 You can use the cgroup.procs file instead of the tasks file to move all
 threads in a threadgroup at once. Echoing the PID of any task in a
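In practice the one-PID-per-write rule makes a loop natural for the tasks
file, while cgroup.procs moves a whole thread group in one write; a sketch
with placeholder PIDs::

  # for pid in 101 102 103; do /bin/echo $pid > tasks; done   # one task per write
  # /bin/echo 101 > cgroup.procs                              # whole thread group of 101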
@@ -529,7 +545,7 @@ Each subsystem may export the following methods. The only mandatory
 methods are css_alloc/free. Any others that are null are presumed to
 be successful no-ops.
 
-struct cgroup_subsys_state *css_alloc(struct cgroup *cgrp)
+``struct cgroup_subsys_state *css_alloc(struct cgroup *cgrp)``
 (cgroup_mutex held by caller)
 
 Called to allocate a subsystem state object for a cgroup. The
@@ -544,7 +560,7 @@ identified by the passed cgroup object having a NULL parent (since
 it's the root of the hierarchy) and may be an appropriate place for
 initialization code.
 
-int css_online(struct cgroup *cgrp)
+``int css_online(struct cgroup *cgrp)``
 (cgroup_mutex held by caller)
 
 Called after @cgrp successfully completed all allocations and made
@@ -554,7 +570,7 @@ callback can be used to implement reliable state sharing and
 propagation along the hierarchy. See the comment on
 cgroup_for_each_descendant_pre() for details.
 
-void css_offline(struct cgroup *cgrp);
+``void css_offline(struct cgroup *cgrp);``
 (cgroup_mutex held by caller)
 
 This is the counterpart of css_online() and called iff css_online()
@@ -564,7 +580,7 @@ all references it's holding on @cgrp. When all references are dropped,
 cgroup removal will proceed to the next step - css_free(). After this
 callback, @cgrp should be considered dead to the subsystem.
 
-void css_free(struct cgroup *cgrp)
+``void css_free(struct cgroup *cgrp)``
 (cgroup_mutex held by caller)
 
 The cgroup system is about to free @cgrp; the subsystem should free
@@ -573,7 +589,7 @@ is completely unused; @cgrp->parent is still valid. (Note - can also
 be called for a newly-created cgroup if an error occurs after this
 subsystem's create() method has been called for the new cgroup).
 
-int can_attach(struct cgroup *cgrp, struct cgroup_taskset *tset)
+``int can_attach(struct cgroup *cgrp, struct cgroup_taskset *tset)``
 (cgroup_mutex held by caller)
 
 Called prior to moving one or more tasks into a cgroup; if the
@@ -594,7 +610,7 @@ fork. If this method returns 0 (success) then this should remain valid
 while the caller holds cgroup_mutex and it is ensured that either
 attach() or cancel_attach() will be called in future.
 
-void css_reset(struct cgroup_subsys_state *css)
+``void css_reset(struct cgroup_subsys_state *css)``
 (cgroup_mutex held by caller)
 
 An optional operation which should restore @css's configuration to the
@@ -608,7 +624,7 @@ This prevents unexpected resource control from a hidden css and
 ensures that the configuration is in the initial state when it is made
 visible again later.
 
-void cancel_attach(struct cgroup *cgrp, struct cgroup_taskset *tset)
+``void cancel_attach(struct cgroup *cgrp, struct cgroup_taskset *tset)``
 (cgroup_mutex held by caller)
 
 Called when a task attach operation has failed after can_attach() has succeeded.
@@ -617,26 +633,26 @@ function, so that the subsystem can implement a rollback. If not, not necessary.
 This will be called only about subsystems whose can_attach() operation have
 succeeded. The parameters are identical to can_attach().
 
-void attach(struct cgroup *cgrp, struct cgroup_taskset *tset)
+``void attach(struct cgroup *cgrp, struct cgroup_taskset *tset)``
 (cgroup_mutex held by caller)
 
 Called after the task has been attached to the cgroup, to allow any
 post-attachment activity that requires memory allocations or blocking.
 The parameters are identical to can_attach().
 
-void fork(struct task_struct *task)
+``void fork(struct task_struct *task)``
 
 Called when a task is forked into a cgroup.
 
-void exit(struct task_struct *task)
+``void exit(struct task_struct *task)``
 
 Called during task exit.
 
-void free(struct task_struct *task)
+``void free(struct task_struct *task)``
 
 Called when the task_struct is freed.
 
-void bind(struct cgroup *root)
+``void bind(struct cgroup *root)``
 (cgroup_mutex held by caller)
 
 Called when a cgroup subsystem is rebound to a different hierarchy
@@ -649,6 +665,7 @@ that is being created/destroyed (and hence has no sub-cgroups).
 
 cgroup filesystem supports certain types of extended attributes in its
 directories and files. The current supported types are:
+
 	- Trusted (XATTR_TRUSTED)
 	- Security (XATTR_SECURITY)
@@ -666,12 +683,13 @@ in containers and systemd for assorted meta data like main PID in a cgroup
 5. Questions
 ============
 
-Q: what's up with this '/bin/echo' ?
-A: bash's builtin 'echo' command does not check calls to write() against
-   errors. If you use it in the cgroup file system, you won't be
-   able to tell whether a command succeeded or failed.
+::
 
-Q: When I attach processes, only the first of the line gets really attached !
-A: We can only return one error code per call to write(). So you should also
-   put only ONE PID.
+  Q: what's up with this '/bin/echo' ?
+  A: bash's builtin 'echo' command does not check calls to write() against
+     errors. If you use it in the cgroup file system, you won't be
+     able to tell whether a command succeeded or failed.
+
+  Q: When I attach processes, only the first of the line gets really attached !
+  A: We can only return one error code per call to write(). So you should also
+     put only ONE PID.
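The /bin/echo point is easy to check interactively: an external command's
exit status reflects the failed write(), so the error becomes visible. A
sketch (the PID is deliberately bogus)::

  # /bin/echo 999999 > tasks || echo "attach failed"   # failure is reported

With the shell builtin, per the text above, the same failure may go unnoticed.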
@@ -1,5 +1,6 @@
+=========================
 CPU Accounting Controller
--------------------------
+=========================
 
 The CPU accounting controller is used to group tasks using cgroups and
 account the CPU usage of these groups of tasks.
@@ -8,9 +9,9 @@ The CPU accounting controller supports multi-hierarchy groups. An accounting
 group accumulates the CPU usage of all of its child groups and the tasks
 directly present in its group.
 
-Accounting groups can be created by first mounting the cgroup filesystem.
+Accounting groups can be created by first mounting the cgroup filesystem::
 
 	# mount -t cgroup -ocpuacct none /sys/fs/cgroup
 
 With the above step, the initial or the parent accounting group becomes
 visible at /sys/fs/cgroup. At bootup, this group includes all the tasks in
@@ -19,11 +20,11 @@ the system. /sys/fs/cgroup/tasks lists the tasks in this cgroup.
 by this group which is essentially the CPU time obtained by all the tasks
 in the system.
 
-New accounting groups can be created under the parent group /sys/fs/cgroup.
+New accounting groups can be created under the parent group /sys/fs/cgroup::
 
 	# cd /sys/fs/cgroup
 	# mkdir g1
 	# echo $$ > g1/tasks
 
 The above steps create a new group g1 and move the current shell
 process (bash) into it. CPU time consumed by this bash and its children
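To read the accounting back, each group exposes cpuacct.usage, the
cumulative CPU time in nanoseconds; continuing the g1 example::

  # cat /sys/fs/cgroup/g1/cpuacct.usage   # total CPU time of g1, in ns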
@@ -1,35 +1,36 @@
-CPUSETS
--------
+=======
+CPUSETS
+=======
 
 Copyright (C) 2004 BULL SA.
 
 Written by Simon.Derr@bull.net
 
-Portions Copyright (c) 2004-2006 Silicon Graphics, Inc.
-Modified by Paul Jackson <pj@sgi.com>
-Modified by Christoph Lameter <cl@linux.com>
-Modified by Paul Menage <menage@google.com>
-Modified by Hidetoshi Seto <seto.hidetoshi@jp.fujitsu.com>
+- Portions Copyright (c) 2004-2006 Silicon Graphics, Inc.
+- Modified by Paul Jackson <pj@sgi.com>
+- Modified by Christoph Lameter <cl@linux.com>
+- Modified by Paul Menage <menage@google.com>
+- Modified by Hidetoshi Seto <seto.hidetoshi@jp.fujitsu.com>
 
-CONTENTS:
-=========
+.. CONTENTS:
 
 1. Cpusets
   1.1 What are cpusets ?
   1.2 Why are cpusets needed ?
   1.3 How are cpusets implemented ?
   1.4 What are exclusive cpusets ?
   1.5 What is memory_pressure ?
   1.6 What is memory spread ?
   1.7 What is sched_load_balance ?
   1.8 What is sched_relax_domain_level ?
   1.9 How do I use cpusets ?
 2. Usage Examples and Syntax
   2.1 Basic Usage
   2.2 Adding/removing cpus
   2.3 Setting flags
   2.4 Attaching processes
 3. Questions
 4. Contact
 
 1. Cpusets
 ==========
@@ -48,7 +49,7 @@ hooks, beyond what is already present, required to manage dynamic
 job placement on large systems.
 
 Cpusets use the generic cgroup subsystem described in
-Documentation/cgroup-v1/cgroups.txt.
+Documentation/cgroup-v1/cgroups.rst.
 
 Requests by a task, using the sched_setaffinity(2) system call to
 include CPUs in its CPU affinity mask, and using the mbind(2) and
@@ -157,7 +158,7 @@ modifying cpusets is via this cpuset file system.
 The /proc/<pid>/status file for each task has four added lines,
 displaying the task's cpus_allowed (on which CPUs it may be scheduled)
 and mems_allowed (on which Memory Nodes it may obtain memory),
-in the two formats seen in the following example:
+in the two formats seen in the following example::
 
   Cpus_allowed:   ffffffff,ffffffff,ffffffff,ffffffff
   Cpus_allowed_list:      0-127
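Those added status lines can be inspected directly for the current shell::

  # grep -E 'Cpus_allowed|Mems_allowed' /proc/self/status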
@@ -181,6 +182,7 @@ files describing that cpuset:
  - cpuset.sched_relax_domain_level: the searching range when migrating tasks
 
 In addition, only the root cpuset has the following file:
+
  - cpuset.memory_pressure_enabled flag: compute memory_pressure?
 
 New cpusets are created using the mkdir system call or shell
@@ -266,7 +268,8 @@ to monitor a cpuset for signs of memory pressure. It's up to the
 batch manager or other user code to decide what to do about it and
 take action.
 
-==> Unless this feature is enabled by writing "1" to the special file
+==>
+    Unless this feature is enabled by writing "1" to the special file
     /dev/cpuset/memory_pressure_enabled, the hook in the rebalance
     code of __alloc_pages() for this metric reduces to simply noticing
     that the cpuset_memory_pressure_enabled flag is zero. So only
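A sketch of enabling the metric and sampling it (paths are illustrative;
per the text, the enable flag exists only in the root cpuset)::

  # echo 1 > /sys/fs/cgroup/cpuset/cpuset.memory_pressure_enabled
  # cat /sys/fs/cgroup/cpuset/Charlie/cpuset.memory_pressure   # recent reclaim rate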
@@ -399,6 +402,7 @@ have tasks running on them unless explicitly assigned.
 
 This default load balancing across all CPUs is not well suited for
 the following two situations:
+
  1) On large systems, load balancing across many CPUs is expensive.
     If the system is managed using cpusets to place independent jobs
     on separate sets of CPUs, full load balancing is unnecessary.
@@ -501,6 +505,7 @@ all the CPUs that must be load balanced.
 The cpuset code builds a new such partition and passes it to the
 scheduler sched domain setup code, to have the sched domains rebuilt
 as necessary, whenever:
+
  - the 'cpuset.sched_load_balance' flag of a cpuset with non-empty CPUs changes,
  - or CPUs come or go from a cpuset with this flag enabled,
  - or 'cpuset.sched_relax_domain_level' value of a cpuset with non-empty CPUs
@@ -553,13 +558,15 @@ this searching range as you like. This file takes int value which
 indicates size of searching range in levels ideally as follows,
 otherwise initial value -1 that indicates the cpuset has no request.
 
-  -1  : no request. use system default or follow request of others.
-   0  : no search.
-   1  : search siblings (hyperthreads in a core).
-   2  : search cores in a package.
-   3  : search cpus in a node [= system wide on non-NUMA system]
-   4  : search nodes in a chunk of node [on NUMA system]
-   5  : search system wide [on NUMA system]
+====== ===========================================================
+  -1   no request. use system default or follow request of others.
+   0   no search.
+   1   search siblings (hyperthreads in a core).
+   2   search cores in a package.
+   3   search cpus in a node [= system wide on non-NUMA system]
+   4   search nodes in a chunk of node [on NUMA system]
+   5   search system wide [on NUMA system]
+====== ===========================================================
 
 The system default is architecture dependent. The system default
 can be changed using the relax_domain_level= boot parameter.
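Restricting the idle-balance search of one cpuset to hyperthread siblings,
for instance, is then a single write (the cpuset path is illustrative)::

  # echo 1 > /sys/fs/cgroup/cpuset/Charlie/cpuset.sched_relax_domain_level   # 1 = siblings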
@@ -578,13 +585,14 @@ and whether it is acceptable or not depends on your situation.
 Don't modify this file if you are not sure.
 
 If your situation is:
+
  - The migration costs between each cpu can be assumed considerably
    small(for you) due to your special application's behavior or
    special hardware support for CPU cache etc.
  - The searching cost doesn't have impact(for you) or you can make
    the searching cost enough small by managing cpuset to compact etc.
  - The latency is required even it sacrifices cache hit rate etc.
 then increasing 'sched_relax_domain_level' would benefit you.
 
 
 1.9 How do I use cpusets ?
@@ -678,7 +686,7 @@ To start a new job that is to be contained within a cpuset, the steps are:
 
 For example, the following sequence of commands will setup a cpuset
 named "Charlie", containing just CPUs 2 and 3, and Memory Node 1,
-and then start a subshell 'sh' in that cpuset:
+and then start a subshell 'sh' in that cpuset::
 
   mount -t cgroup -ocpuset cpuset /sys/fs/cgroup/cpuset
   cd /sys/fs/cgroup/cpuset
@@ -693,6 +701,7 @@ and then start a subshell 'sh' in that cpuset:
   cat /proc/self/cpuset
 
 There are ways to query or modify cpusets:
+
  - via the cpuset file system directly, using the various cd, mkdir, echo,
    cat, rmdir commands from the shell, or their equivalent from C.
  - via the C library libcpuset.
@@ -722,115 +731,133 @@ Then under /sys/fs/cgroup/cpuset you can find a tree that corresponds to the
 tree of the cpusets in the system. For instance, /sys/fs/cgroup/cpuset
 is the cpuset that holds the whole system.
 
-If you want to create a new cpuset under /sys/fs/cgroup/cpuset:
-# cd /sys/fs/cgroup/cpuset
-# mkdir my_cpuset
+If you want to create a new cpuset under /sys/fs/cgroup/cpuset::
+
+  # cd /sys/fs/cgroup/cpuset
+  # mkdir my_cpuset
 
-Now you want to do something with this cpuset.
-# cd my_cpuset
+Now you want to do something with this cpuset::
+
+  # cd my_cpuset
 
-In this directory you can find several files:
-# ls
-cgroup.clone_children  cpuset.memory_pressure
-cgroup.event_control   cpuset.memory_spread_page
-cgroup.procs           cpuset.memory_spread_slab
-cpuset.cpu_exclusive   cpuset.mems
-cpuset.cpus            cpuset.sched_load_balance
-cpuset.mem_exclusive   cpuset.sched_relax_domain_level
-cpuset.mem_hardwall    notify_on_release
-cpuset.memory_migrate  tasks
+In this directory you can find several files::
+
+  # ls
+  cgroup.clone_children  cpuset.memory_pressure
+  cgroup.event_control   cpuset.memory_spread_page
+  cgroup.procs           cpuset.memory_spread_slab
+  cpuset.cpu_exclusive   cpuset.mems
+  cpuset.cpus            cpuset.sched_load_balance
+  cpuset.mem_exclusive   cpuset.sched_relax_domain_level
+  cpuset.mem_hardwall    notify_on_release
+  cpuset.memory_migrate  tasks
 
 Reading them will give you information about the state of this cpuset:
 the CPUs and Memory Nodes it can use, the processes that are using
 it, its properties. By writing to these files you can manipulate
 the cpuset.
 
-Set some flags:
-# /bin/echo 1 > cpuset.cpu_exclusive
+Set some flags::
+
+  # /bin/echo 1 > cpuset.cpu_exclusive
 
-Add some cpus:
-# /bin/echo 0-7 > cpuset.cpus
+Add some cpus::
+
+  # /bin/echo 0-7 > cpuset.cpus
 
-Add some mems:
-# /bin/echo 0-7 > cpuset.mems
+Add some mems::
+
+  # /bin/echo 0-7 > cpuset.mems
 
-Now attach your shell to this cpuset:
-# /bin/echo $$ > tasks
+Now attach your shell to this cpuset::
+
+  # /bin/echo $$ > tasks
 
 You can also create cpusets inside your cpuset by using mkdir in this
-directory.
-# mkdir my_sub_cs
+directory::
+
+  # mkdir my_sub_cs
 
-To remove a cpuset, just use rmdir:
-# rmdir my_sub_cs
+To remove a cpuset, just use rmdir::
+
+  # rmdir my_sub_cs
+
 This will fail if the cpuset is in use (has cpusets inside, or has
 processes attached).
 
 Note that for legacy reasons, the "cpuset" filesystem exists as a
 wrapper around the cgroup filesystem.
 
-The command
+The command::
 
 	mount -t cpuset X /sys/fs/cgroup/cpuset
 
-is equivalent to
+is equivalent to::
 
 	mount -t cgroup -ocpuset,noprefix X /sys/fs/cgroup/cpuset
 	echo "/sbin/cpuset_release_agent" > /sys/fs/cgroup/cpuset/release_agent
 
 2.2 Adding/removing cpus
 ------------------------
 
 This is the syntax to use when writing in the cpus or mems files
-in cpuset directories:
+in cpuset directories::
 
 	# /bin/echo 1-4 > cpuset.cpus		-> set cpus list to cpus 1,2,3,4
 	# /bin/echo 1,2,3,4 > cpuset.cpus	-> set cpus list to cpus 1,2,3,4
 
 To add a CPU to a cpuset, write the new list of CPUs including the
-CPU to be added. To add 6 to the above cpuset:
+CPU to be added. To add 6 to the above cpuset::
 
 	# /bin/echo 1-4,6 > cpuset.cpus	-> set cpus list to cpus 1,2,3,4,6
 
 Similarly to remove a CPU from a cpuset, write the new list of CPUs
 without the CPU to be removed.
 
-To remove all the CPUs:
+To remove all the CPUs::
 
 	# /bin/echo "" > cpuset.cpus		-> clear cpus list
 
 2.3 Setting flags
 -----------------
 
-The syntax is very simple:
+The syntax is very simple::
 
 	# /bin/echo 1 > cpuset.cpu_exclusive	-> set flag 'cpuset.cpu_exclusive'
 	# /bin/echo 0 > cpuset.cpu_exclusive	-> unset flag 'cpuset.cpu_exclusive'
 
 2.4 Attaching processes
 -----------------------
 
-# /bin/echo PID > tasks
+::
+
+	# /bin/echo PID > tasks
 
 Note that it is PID, not PIDs. You can only attach ONE task at a time.
-If you have several tasks to attach, you have to do it one after another:
+If you have several tasks to attach, you have to do it one after another::
 
 	# /bin/echo PID1 > tasks
 	# /bin/echo PID2 > tasks
 	...
 	# /bin/echo PIDn > tasks
 
 
 3. Questions
 ============
 
-Q: what's up with this '/bin/echo' ?
-A: bash's builtin 'echo' command does not check calls to write() against
+Q:
+   what's up with this '/bin/echo' ?
+
+A:
+   bash's builtin 'echo' command does not check calls to write() against
    errors. If you use it in the cpuset file system, you won't be
    able to tell whether a command succeeded or failed.
 
-Q: When I attach processes, only the first of the line gets really attached !
-A: We can only return one error code per call to write(). So you should also
+Q:
+   When I attach processes, only the first of the line gets really attached !
+
+A:
+   We can only return one error code per call to write(). So you should also
    put only ONE pid.
 
 4. Contact
@@ -1,6 +1,9 @@
+===========================
 Device Whitelist Controller
+===========================
 
-1. Description:
+1. Description
+==============
 
 Implement a cgroup to track and enforce open and mknod restrictions
 on device files. A device cgroup associates a device access
@ -16,24 +19,26 @@ devices from the whitelist or add new entries. A child cgroup can
|
||||||
never receive a device access which is denied by its parent.
|
never receive a device access which is denied by its parent.
|
||||||
|
|
||||||
2. User Interface
|
2. User Interface
|
||||||
|
=================
|
||||||
|
|
||||||
An entry is added using devices.allow, and removed using
|
An entry is added using devices.allow, and removed using
|
||||||
devices.deny. For instance
|
devices.deny. For instance::
|
||||||
|
|
||||||
echo 'c 1:3 mr' > /sys/fs/cgroup/1/devices.allow
|
echo 'c 1:3 mr' > /sys/fs/cgroup/1/devices.allow
|
||||||
|
|
||||||
allows cgroup 1 to read and mknod the device usually known as
|
allows cgroup 1 to read and mknod the device usually known as
|
||||||
/dev/null. Doing
|
/dev/null. Doing::
|
||||||
|
|
||||||
echo a > /sys/fs/cgroup/1/devices.deny
|
echo a > /sys/fs/cgroup/1/devices.deny
|
||||||
|
|
||||||
will remove the default 'a *:* rwm' entry. Doing
|
will remove the default 'a *:* rwm' entry. Doing::
|
||||||
|
|
||||||
echo a > /sys/fs/cgroup/1/devices.allow
|
echo a > /sys/fs/cgroup/1/devices.allow
|
||||||
|
|
||||||
will add the 'a *:* rwm' entry to the whitelist.
|
will add the 'a *:* rwm' entry to the whitelist.
|
||||||
|
|
||||||
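
The resulting whitelist can be inspected at any point through the read-only
devices.list file; a minimal check, assuming the cgroup path used above::

	# cat /sys/fs/cgroup/1/devices.list    # one entry per line, e.g. 'a *:* rwm'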

3. Security
===========

Any task can move itself between cgroups. This clearly won't
suffice, but we can decide the best way to adequately restrict

@ -50,6 +55,7 @@ A cgroup may not be granted more permissions than the cgroup's
parent has.

4. Hierarchy
============

device cgroups maintain hierarchy by making sure a cgroup never has more
access permissions than its parent. Every time an entry is written to

@ -58,7 +64,8 @@ from their whitelist and all the locally set whitelist entries will be
re-evaluated. In case one of the locally set whitelist entries would provide
more access than the cgroup's parent, it'll be removed from the whitelist.

Example::

		A
	       / \
	      B

@ -67,10 +74,12 @@ Example:
	A allow "b 8:* rwm", "c 116:1 rw"
	B deny "c 1:3 rwm", "c 116:2 rwm", "b 3:* rwm"

If a device is denied in group A::

	# echo "c 116:* r" > A/devices.deny

it'll propagate down and after revalidating B's entries, the whitelist entry
"c 116:2 rwm" will be removed::

	group		whitelist entries			denied devices
	A		all					"b 8:* rwm", "c 116:* rw"

@ -79,7 +88,8 @@ it'll propagate down and after revalidating B's entries, the whitelist entry
In case parent's exceptions change and local exceptions are not allowed
anymore, they'll be deleted.

Notice that new whitelist entries will not be propagated::

		A
	       / \
	      B

@ -88,24 +98,30 @@ Notice that new whitelist entries will not be propagated:
	A	"c 1:3 rwm", "c 1:5 r"		all the rest
	B	"c 1:3 rwm", "c 1:5 r"		all the rest

when adding ``c *:3 rwm``::

	# echo "c *:3 rwm" >A/devices.allow

the result::

	group		whitelist entries			denied devices
	A		"c *:3 rwm", "c 1:5 r"		all the rest
	B		"c 1:3 rwm", "c 1:5 r"		all the rest

but now it'll be possible to add new entries to B::

	# echo "c 2:3 rwm" >B/devices.allow
	# echo "c 50:3 r" >B/devices.allow

or even::

	# echo "c *:3 rwm" >B/devices.allow

Allowing or denying all by writing 'a' to devices.allow or devices.deny will
not be possible once the device cgroup has children.
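
The propagation described above can be reproduced by hand; a minimal sketch,
assuming a devices hierarchy mounted at /sys/fs/cgroup/devices::

	# cd /sys/fs/cgroup/devices
	# mkdir A A/B                          # B is a child of A
	# echo "c 116:* r" > A/devices.deny    # deny read on char 116:* in A
	# cat A/B/devices.list                 # B's entries after re-evaluation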

4.1 Hierarchy (internal implementation)
---------------------------------------

device cgroups is implemented internally using a behavior (ALLOW, DENY) and a
list of exceptions. The internal state is controlled using the same user

@ -1,3 +1,7 @@

==============
Cgroup Freezer
==============

The cgroup freezer is useful to batch job management systems which start
and stop sets of tasks in order to schedule the resources of a machine
according to the desires of a system administrator. This sort of program

@ -23,7 +27,7 @@ blocked, or ignored it can be seen by waiting or ptracing parent tasks.
SIGCONT is especially unsuitable since it can be caught by the task. Any
programs designed to watch for SIGSTOP and SIGCONT could be broken by
attempting to use SIGSTOP and SIGCONT to stop and resume tasks. We can
demonstrate this problem using nested bash shells::

	$ echo $$
	16644

@ -93,19 +97,19 @@ The following cgroupfs files are created by cgroup freezer.
The root cgroup is non-freezable and the above interface files don't
exist.

* Examples of usage::

   # mkdir /sys/fs/cgroup/freezer
   # mount -t cgroup -ofreezer freezer /sys/fs/cgroup/freezer
   # mkdir /sys/fs/cgroup/freezer/0
   # echo $some_pid > /sys/fs/cgroup/freezer/0/tasks

to get status of the freezer subsystem::

   # cat /sys/fs/cgroup/freezer/0/freezer.state
   THAWED

to freeze all tasks in the container::

   # echo FROZEN > /sys/fs/cgroup/freezer/0/freezer.state
   # cat /sys/fs/cgroup/freezer/0/freezer.state

@ -113,7 +117,7 @@ to freeze all tasks in the container :
   # cat /sys/fs/cgroup/freezer/0/freezer.state
   FROZEN

to unfreeze all tasks in the container::

   # echo THAWED > /sys/fs/cgroup/freezer/0/freezer.state
   # cat /sys/fs/cgroup/freezer/0/freezer.state
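
Freezing is not instantaneous: the state file reads FREEZING until every task
in the cgroup has actually stopped. A small polling sketch, assuming the
cgroup created in the examples above::

   # echo FROZEN > /sys/fs/cgroup/freezer/0/freezer.state
   # while [ "$(cat /sys/fs/cgroup/freezer/0/freezer.state)" != "FROZEN" ]; do sleep 0.1; done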

@ -1,5 +1,6 @@

==================
HugeTLB Controller
==================

The HugeTLB controller allows limiting the HugeTLB usage per control group and
enforces the controller limit during page fault. Since HugeTLB doesn't

@ -16,16 +17,16 @@ With the above step, the initial or the parent HugeTLB group becomes
visible at /sys/fs/cgroup. At bootup, this group includes all the tasks in
the system. /sys/fs/cgroup/tasks lists the tasks in this cgroup.

New groups can be created under the parent group /sys/fs/cgroup::

  # cd /sys/fs/cgroup
  # mkdir g1
  # echo $$ > g1/tasks

The above steps create a new group g1 and move the current shell
process (bash) into it.

Brief summary of control files::

 hugetlb.<hugepagesize>.limit_in_bytes     # set/show limit of "hugepagesize" hugetlb usage
 hugetlb.<hugepagesize>.max_usage_in_bytes # show max "hugepagesize" hugetlb usage recorded

@ -33,17 +34,17 @@ Brief summary of control files
 hugetlb.<hugepagesize>.failcnt            # show the number of allocation failure due to HugeTLB limit

For a system supporting three hugepage sizes (64k, 32M and 1G), the control
files include::

  hugetlb.1GB.limit_in_bytes
  hugetlb.1GB.max_usage_in_bytes
  hugetlb.1GB.usage_in_bytes
  hugetlb.1GB.failcnt
  hugetlb.64KB.limit_in_bytes
  hugetlb.64KB.max_usage_in_bytes
  hugetlb.64KB.usage_in_bytes
  hugetlb.64KB.failcnt
  hugetlb.32MB.limit_in_bytes
  hugetlb.32MB.max_usage_in_bytes
  hugetlb.32MB.usage_in_bytes
  hugetlb.32MB.failcnt
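
Putting it together, a minimal sketch of capping 1GB huge page usage for the
g1 group created earlier (the exact <hugepagesize> names depend on the page
sizes your system supports)::

  # echo 10G > g1/hugetlb.1GB.limit_in_bytes
  # cat g1/hugetlb.1GB.usage_in_bytes
  # cat g1/hugetlb.1GB.failcnt             # allocations rejected by the limit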

@ -0,0 +1,30 @@

:orphan:

========================
Control Groups version 1
========================

.. toctree::
   :maxdepth: 1

   cgroups

   blkio-controller
   cpuacct
   cpusets
   devices
   freezer-subsystem
   hugetlb
   memcg_test
   memory
   net_cls
   net_prio
   pids
   rdma

.. only:: subproject and html

   Indices
   =======

   * :ref:`genindex`

@ -1,32 +1,43 @@

=====================================================
Memory Resource Controller(Memcg) Implementation Memo
=====================================================

Last Updated: 2010/2

Base Kernel Version: based on 2.6.33-rc7-mm(candidate for 34).

Because VM is getting complex (one of the reasons is memcg...), memcg's
behavior is complex. This is a document for memcg's internal behavior.
Please note that implementation details can be changed.

(*) Topics on API should be in Documentation/cgroup-v1/memory.rst

0. How to record usage ?
========================

   2 objects are used.

   page_cgroup ....an object per page.

	Allocated at boot or memory hotplug. Freed at memory hot removal.

   swap_cgroup ... an entry per swp_entry.

	Allocated at swapon(). Freed at swapoff().

   The page_cgroup has USED bit and double count against a page_cgroup never
   occurs. swap_cgroup is used only when a charged page is swapped-out.

1. Charge
=========

   a page/swp_entry may be charged (usage += PAGE_SIZE) at

	mem_cgroup_try_charge()

2. Uncharge
===========

   a page/swp_entry may be uncharged (usage -= PAGE_SIZE) by

	mem_cgroup_uncharge()

@ -37,9 +48,12 @@ Please note that implementation details can be changed.
     disappears.

3. charge-commit-cancel
=======================

	Memcg pages are charged in two steps:

	- mem_cgroup_try_charge()
	- mem_cgroup_commit_charge() or mem_cgroup_cancel_charge()

	At try_charge(), there are no flags to say "this page is charged".
	at this point, usage += PAGE_SIZE.

@ -51,6 +65,8 @@ Please note that implementation details can be changed.
Under below explanation, we assume CONFIG_MEM_RES_CTRL_SWAP=y.

4. Anonymous
============

	Anonymous page is newly allocated at
		  - page fault into MAP_ANONYMOUS mapping.
		  - Copy-On-Write.

@ -78,34 +94,45 @@ Under below explanation, we assume CONFIG_MEM_RES_CTRL_SWAP=y.
	(e) zap_pte() is called and swp_entry's refcnt -=1 -> 0.

5. Page Cache
=============

	Page Cache is charged at
	- add_to_page_cache_locked().

	The logic is very clear. (About migration, see below)

	Note:
	  __remove_from_page_cache() is called by remove_from_page_cache()
	  and __remove_mapping().

6. Shmem(tmpfs) Page Cache
==========================

	The best way to understand shmem's page state transition is to read
	mm/shmem.c.

	But a brief explanation of the behavior of memcg around shmem will be
	helpful to understand the logic.

	Shmem's page (just leaf page, not direct/indirect block) can be on

	- radix-tree of shmem's inode.
	- SwapCache.
	- Both on radix-tree and SwapCache. This happens at swap-in
	  and swap-out.

	It's charged when...

	- A new page is added to shmem's radix-tree.
	- A swp page is read. (move a charge from swap_cgroup to page_cgroup)

7. Page Migration
=================

	mem_cgroup_migrate()

8. LRU
======

	Each memcg has its own private LRU. Now, its handling is under global
	VM's control (means that it's handled under global pgdat->lru_lock).
	Almost all routines around memcg's LRU is called by global LRU's

@ -114,163 +141,211 @@ Under below explanation, we assume CONFIG_MEM_RES_CTRL_SWAP=y.
	A special function is mem_cgroup_isolate_pages(). This scans
	memcg's private LRU and call __isolate_lru_page() to extract a page
	from LRU.

	(By __isolate_lru_page(), the page is removed from both of global and
	private LRU.)


9. Typical Tests.
=================

	Tests for racy cases.

9.1 Small limit to memcg.
-------------------------

	When testing racy cases, it's good to set memcg's limit to be very
	small rather than GB. Many races were found in tests under xKB or
	xxMB limits.

	(Memory behavior under GB and under MB shows a very different
	situation.)
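
	A minimal sketch of such a test, assuming a memory hierarchy mounted at
	/cgroup (any memory-hungry workload can stand in for dd)::

		# mkdir /cgroup/tiny
		# echo 4M > /cgroup/tiny/memory.limit_in_bytes
		# echo $$ > /cgroup/tiny/tasks
		# dd if=/dev/zero of=/tmp/big bs=1M count=64   # page cache forces reclaim under the tiny limit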

9.2 Shmem
---------

	Historically, memcg's shmem handling was poor and we saw some amount
	of troubles here. This is because shmem is page-cache but can be
	SwapCache. Testing with shmem/tmpfs is always a good test.

9.3 Migration
-------------

	For NUMA, migration is another special case. To do an easy test, cpuset
	is useful. The following is a sample script to do migration::

	mount -t cgroup -o cpuset none /opt/cpuset

	mkdir /opt/cpuset/01
	echo 1 > /opt/cpuset/01/cpuset.cpus
	echo 0 > /opt/cpuset/01/cpuset.mems
	echo 1 > /opt/cpuset/01/cpuset.memory_migrate
	mkdir /opt/cpuset/02
	echo 1 > /opt/cpuset/02/cpuset.cpus
	echo 1 > /opt/cpuset/02/cpuset.mems
	echo 1 > /opt/cpuset/02/cpuset.memory_migrate

	In the above set, when you move a task from 01 to 02, page migration
	from node 0 to node 1 will occur. The following is a script to migrate
	all tasks under a cpuset::

	--
	move_task()
	{
	for pid in $1
	do
		/bin/echo $pid >$2/tasks 2>/dev/null
		echo -n $pid
		echo -n " "
	done
	echo END
	}

	G1_TASK=`cat ${G1}/tasks`
	G2_TASK=`cat ${G2}/tasks`
	move_task "${G1_TASK}" ${G2} &
	--

9.4 Memory hotplug
------------------

	memory hotplug test is one of the good tests.

	to offline memory, do the following::

	# echo offline > /sys/devices/system/memory/memoryXXX/state

	(XXX is the place of memory)

	This is an easy way to test page migration, too.

9.5 mkdir/rmdir
---------------

	When using hierarchy, mkdir/rmdir test should be done.
	Use tests like the following::

		echo 1 >/opt/cgroup/01/memory/use_hierarchy
		mkdir /opt/cgroup/01/child_a
		mkdir /opt/cgroup/01/child_b

		set limit to 01.
		add limit to 01/child_b
		run jobs under child_a and child_b

	create/delete following groups at random while jobs are running::

		/opt/cgroup/01/child_a/child_aa
		/opt/cgroup/01/child_b/child_bb
		/opt/cgroup/01/child_c

	running new jobs in a new group is also good.

9.6 Mount with other subsystems
-------------------------------

	Mounting with other subsystems is a good test because there is a
	race and lock dependency with other cgroup subsystems.

	example::

	# mount -t cgroup none /cgroup -o cpuset,memory,cpu,devices

	and do task move, mkdir, rmdir etc...under this.

9.7 swapoff
-----------

	Besides management of swap being one of the complicated parts of memcg,
	the call path of swap-in at swapoff is not the same as the usual
	swap-in path. It's worth testing explicitly.

	For example, a test like the following is good:

	(Shell-A)::

		# mount -t cgroup none /cgroup -o memory
		# mkdir /cgroup/test
		# echo 40M > /cgroup/test/memory.limit_in_bytes
		# echo 0 > /cgroup/test/tasks

	Run a malloc(100M) program under this. You'll see 60M of swaps.

	(Shell-B)::

		# move all tasks in /cgroup/test to /cgroup
		# /sbin/swapoff -a
		# rmdir /cgroup/test
		# kill malloc task.

	Of course, the tmpfs v.s. swapoff test should be run, too.

9.8 OOM-Killer
--------------

	Out-of-memory caused by memcg's limit will kill tasks under
	the memcg. When hierarchy is used, a task under the hierarchy
	will be killed by the kernel.

	In this case, panic_on_oom shouldn't be invoked and tasks
	in other groups shouldn't be killed.

	It's not difficult to cause OOM under memcg as follows.

	Case A) when you can swapoff::

		#swapoff -a
		#echo 50M > /memory.limit_in_bytes

	run 51M of malloc

	Case B) when you use mem+swap limitation::

		#echo 50M > memory.limit_in_bytes
		#echo 50M > memory.memsw.limit_in_bytes

	run 51M of malloc

9.9 Move charges at task migration
----------------------------------

	Charges associated with a task can be moved along with task migration.

	(Shell-A)::

		#mkdir /cgroup/A
		#echo $$ >/cgroup/A/tasks

	run some programs which use some amount of memory in /cgroup/A.

	(Shell-B)::

		#mkdir /cgroup/B
		#echo 1 >/cgroup/B/memory.move_charge_at_immigrate
		#echo "pid of the program running in group A" >/cgroup/B/tasks

	You can see charges have been moved by reading ``*.usage_in_bytes`` or
	memory.stat of both A and B.

	See 8.2 of Documentation/cgroup-v1/memory.rst to see what value should
	be written to move_charge_at_immigrate.

9.10 Memory thresholds
----------------------

	Memory controller implements memory thresholds using the cgroups
	notification API. You can use tools/cgroup/cgroup_event_listener.c to
	test it.

	(Shell-A) Create cgroup and run event listener::

		# mkdir /cgroup/A
		# ./cgroup_event_listener /cgroup/A/memory.usage_in_bytes 5M

	(Shell-B) Add task to cgroup and try to allocate and free memory::

		# echo $$ >/cgroup/A/tasks
		# a="$(dd if=/dev/zero bs=1M count=10)"
		# a=

	You will see a message from cgroup_event_listener every time you cross
	the thresholds.

@ -1,22 +1,26 @@

==========================
Memory Resource Controller
==========================

NOTE:
      This document is hopelessly outdated and it asks for a complete
      rewrite. It still contains useful information so we are keeping it
      here, but make sure to check the current code if you need a deeper
      understanding.

NOTE:
      The Memory Resource Controller has generically been referred to as the
      memory controller in this document. Do not confuse the memory controller
      used here with the memory controller that is used in hardware.

(For editors) In this document:
      When we mention a cgroup (cgroupfs's directory) with memory controller,
      we call it "memory cgroup". When you see git-log and source code, you'll
      see patch titles and function names tend to use "memcg".
      In this document, we avoid using it.

Benefits and Purpose of the memory controller
=============================================

The memory controller isolates the memory behaviour of a group of tasks
from the rest of the system. The article on LWN [12] mentions some probable

@ -38,6 +42,7 @@ e. There are several other use cases; find one or use the controller just

Current Status: linux-2.6.34-mmotm(development version of 2010/April)

Features:

 - accounting anonymous pages, file caches, swap caches usage and limiting them.
 - pages are linked to per-memcg LRU exclusively, and there is no global LRU.
 - optionally, memory+swap usage can be accounted and limited.

@ -54,41 +59,48 @@ Features:

Brief summary of control files.

==================================== ==========================================
 tasks                               attach a task(thread) and show list of
                                     threads
 cgroup.procs                        show list of processes
 cgroup.event_control                an interface for event_fd()
 memory.usage_in_bytes               show current usage for memory
                                     (See 5.5 for details)
 memory.memsw.usage_in_bytes         show current usage for memory+Swap
                                     (See 5.5 for details)
 memory.limit_in_bytes               set/show limit of memory usage
 memory.memsw.limit_in_bytes         set/show limit of memory+Swap usage
 memory.failcnt                      show the number of memory usage hits limits
 memory.memsw.failcnt                show the number of memory+Swap hits limits
 memory.max_usage_in_bytes           show max memory usage recorded
 memory.memsw.max_usage_in_bytes     show max memory+Swap usage recorded
 memory.soft_limit_in_bytes          set/show soft limit of memory usage
 memory.stat                         show various statistics
 memory.use_hierarchy                set/show hierarchical account enabled
 memory.force_empty                  trigger forced page reclaim
 memory.pressure_level               set memory pressure notifications
 memory.swappiness                   set/show swappiness parameter of vmscan
                                     (See sysctl's vm.swappiness)
 memory.move_charge_at_immigrate     set/show controls of moving charges
 memory.oom_control                  set/show oom controls.
 memory.numa_stat                    show the number of memory usage per numa
                                     node

 memory.kmem.limit_in_bytes          set/show hard limit for kernel memory
 memory.kmem.usage_in_bytes          show current kernel memory allocation
 memory.kmem.failcnt                 show the number of kernel memory usage
                                     hits limits
 memory.kmem.max_usage_in_bytes      show max kernel memory usage recorded

 memory.kmem.tcp.limit_in_bytes      set/show hard limit for tcp buf memory
 memory.kmem.tcp.usage_in_bytes      show current tcp buf memory allocation
 memory.kmem.tcp.failcnt             show the number of tcp buf memory usage
                                     hits limits
 memory.kmem.tcp.max_usage_in_bytes  show max tcp buf memory usage recorded
==================================== ==========================================

1. History
==========

The memory controller has a long history. A request for comments for the memory
controller was posted by Balbir Singh [1]. At the time the RFC was posted

@ -103,6 +115,7 @@ at version 6; it combines both mapped (RSS) and unmapped Page
Cache Control [11].

2. Memory Control
=================

Memory is a unique resource in the sense that it is present in a limited
amount. If a task requires a lot of CPU processing, the task can spread

@ -120,6 +133,7 @@ are:
The memory controller is the first controller developed.

2.1. Design
-----------

The core of the design is a counter called the page_counter. The
page_counter tracks the current memory usage and limit of the group of

@ -127,6 +141,9 @@ processes associated with the controller. Each cgroup has a memory controller
specific data structure (mem_cgroup) associated with it.

2.2. Accounting
---------------

::

		+--------------------+
		|  mem_cgroup        |

@ -165,6 +182,7 @@ updated. page_cgroup has its own LRU on cgroup.
(*) page_cgroup structure is allocated at boot/memory-hotplug time.

2.2.1 Accounting details
------------------------

All mapped anon pages (RSS) and cache pages (Page Cache) are accounted.
Some pages which are never reclaimable and will not be on the LRU

@ -191,6 +209,7 @@ Note: we just account pages-on-LRU because our purpose is to control amount
of used pages; not-on-LRU pages tend to be out-of-control from VM view.

2.3 Shared Page Accounting
--------------------------

Shared pages are accounted on the basis of the first touch approach. The
cgroup that first touches a page is accounted for the page. The principle

@ -207,11 +226,13 @@ be backed into memory in force, charges for pages are accounted against the
caller of swapoff rather than the users of shmem.

2.4 Swap Extension (CONFIG_MEMCG_SWAP)
--------------------------------------

Swap Extension allows you to record charge for swap. A swapped-in page is
charged back to the original page allocator if possible.

When swap is accounted, the following files are added.

 - memory.memsw.usage_in_bytes.
 - memory.memsw.limit_in_bytes.
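
A minimal sketch of using them, assuming the cgroup set up in section 3.2 and
the 3G value from the example below::

	# echo 3G > /sys/fs/cgroup/memory/0/memory.memsw.limit_in_bytes
	# cat /sys/fs/cgroup/memory/0/memory.memsw.usage_in_bytes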

@ -224,14 +245,16 @@ In this case, setting memsw.limit_in_bytes=3G will prevent bad use of swap.
By using the memsw limit, you can avoid system OOM which can be caused by swap
shortage.

**why 'memory+swap' rather than swap**

The global LRU(kswapd) can swap out arbitrary pages. Swap-out means
to move account from memory to swap...there is no change in usage of
memory+swap. In other words, when we want to limit the usage of swap without
affecting global LRU, memory+swap limit is better than just limiting swap from
an OS point of view.

**What happens when a cgroup hits memory.memsw.limit_in_bytes**

When a cgroup hits memory.memsw.limit_in_bytes, it's useless to do swap-out
in this cgroup. Then, swap-out will not be done by cgroup routine and file
caches are dropped. But as mentioned above, global LRU can do swapout memory

@ -239,6 +262,7 @@ from it for sanity of the system's memory management state. You can't forbid
it by cgroup.

2.5 Reclaim
-----------

Each cgroup maintains a per cgroup LRU which has the same structure as
global VM. When a cgroup goes over its limit, we first try

@ -251,29 +275,36 @@ The reclaim algorithm has not been modified for cgroups, except that
pages that are selected for reclaiming come from the per-cgroup LRU
list.

NOTE:
  Reclaim does not work for the root cgroup, since we cannot set any
  limits on the root cgroup.

Note2:
  When panic_on_oom is set to "2", the whole system will panic.

When oom event notifier is registered, event will be delivered.
(See oom_control section)
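
One way to watch per-cgroup reclaim at work is to read memory.stat before and
after pushing the group over its limit; a minimal sketch, assuming the cgroup
from section 3.2 and the pgpgin/pgpgout event counters of memory.stat::

	# grep -E '^pgpg(in|out)' /sys/fs/cgroup/memory/0/memory.stat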

2.6 Locking
-----------

lock_page_cgroup()/unlock_page_cgroup() should not be called under
the i_pages lock.

Other lock order is following:

  PG_locked.
  mm->page_table_lock
  pgdat->lru_lock
  lock_page_cgroup.

In many cases, just lock_page_cgroup() is called.

per-zone-per-cgroup LRU (cgroup's private LRU) is just guarded by
pgdat->lru_lock, it has no lock of its own.

2.7 Kernel Memory Extension (CONFIG_MEMCG_KMEM)
-----------------------------------------------

With the Kernel memory extension, the Memory Controller is able to limit
the amount of kernel memory used by the system. Kernel memory is fundamentally

@ -288,6 +319,7 @@ Kernel memory limits are not imposed for the root cgroup. Usage for the root
cgroup may or may not be accounted. The memory used is accumulated into
memory.kmem.usage_in_bytes, or in a separate counter when it makes sense.
(currently only for tcp).

The main "kmem" counter is fed into the main counter, so kmem charges will
also be visible from the user counter.

@ -295,36 +327,42 @@ Currently no soft limit is implemented for kernel memory. It is future work
to trigger slab reclaim when those limits are reached.

2.7.1 Current Kernel Memory resources accounted
-----------------------------------------------

stack pages:
  every process consumes some stack pages. By accounting into
  kernel memory, we prevent new processes from being created when the kernel
  memory usage is too high.

slab pages:
  pages allocated by the SLAB or SLUB allocator are tracked. A copy
  of each kmem_cache is created every time the cache is touched for the first
  time from inside the memcg. The creation is done lazily, so some objects can
  still be skipped while the cache is being created. All objects in a slab page
  should belong to the same memcg. This only fails to hold when a task is
  migrated to a different memcg during the page allocation by the cache.

sockets memory pressure:
  some socket protocols have memory pressure
  thresholds. The Memory Controller allows them to be controlled individually
  per cgroup, instead of globally.

tcp memory pressure:
  sockets memory pressure for the tcp protocol.

2.7.2 Common use cases
----------------------

Because the "kmem" counter is fed to the main user counter, kernel memory can
never be limited completely independently of user memory. Say "U" is the user
limit, and "K" the kernel limit. There are three possible ways limits can be
set:

U != 0, K = unlimited:
    This is the standard memcg limitation mechanism already present before kmem
    accounting. Kernel memory is completely ignored.

U != 0, K < U:
    Kernel memory is a subset of the user memory. This setup is useful in
    deployments where the total amount of memory per-cgroup is overcommitted.
    Overcommitting kernel memory limits is definitely not recommended, since the

@ -332,19 +370,23 @@ set:
    In this case, the admin could set up K so that the sum of all groups is
    never greater than the total memory, and freely set U at the cost of his
    QoS.

    WARNING:
        In the current implementation, memory reclaim will NOT be
        triggered for a cgroup when it hits K while staying below U, which makes
        this setup impractical.

U != 0, K >= U:
    Since kmem charges will also be fed to the user counter and reclaim will be
    triggered for the cgroup for both kinds of memory. This setup gives the
    admin a unified view of memory, and it is also useful for people who just
    want to track kernel memory usage.
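
For instance, a minimal sketch of the "U != 0, K >= U" setup from the list
above, using the control files summarized earlier (paths as in section 3)::

    # echo 500M > /sys/fs/cgroup/memory/0/memory.limit_in_bytes       # U
    # echo 1G > /sys/fs/cgroup/memory/0/memory.kmem.limit_in_bytes    # K >= U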
3. User Interface
|
3. User Interface
|
||||||
|
=================
|
||||||
|
|
||||||
3.0. Configuration
|
3.0. Configuration
|
||||||
|
------------------
|
||||||
|
|
||||||
a. Enable CONFIG_CGROUPS
|
a. Enable CONFIG_CGROUPS
|
||||||
b. Enable CONFIG_MEMCG
|
b. Enable CONFIG_MEMCG
|
||||||
|
@ -352,39 +394,53 @@ c. Enable CONFIG_MEMCG_SWAP (to use swap extension)
|
||||||
d. Enable CONFIG_MEMCG_KMEM (to use kmem extension)
|
d. Enable CONFIG_MEMCG_KMEM (to use kmem extension)
|
||||||
|
|
||||||
3.1. Prepare the cgroups (see cgroups.txt, Why are cgroups needed?)
|
3.1. Prepare the cgroups (see cgroups.txt, Why are cgroups needed?)
|
||||||
# mount -t tmpfs none /sys/fs/cgroup
|
-------------------------------------------------------------------
|
||||||
# mkdir /sys/fs/cgroup/memory
|
|
||||||
# mount -t cgroup none /sys/fs/cgroup/memory -o memory
|
|
||||||
|
|
||||||
3.2. Make the new group and move bash into it
|
::
|
||||||
# mkdir /sys/fs/cgroup/memory/0
|
|
||||||
# echo $$ > /sys/fs/cgroup/memory/0/tasks
|
|
||||||
|
|
||||||
Since now we're in the 0 cgroup, we can alter the memory limit:
|
# mount -t tmpfs none /sys/fs/cgroup
|
||||||
# echo 4M > /sys/fs/cgroup/memory/0/memory.limit_in_bytes
|
# mkdir /sys/fs/cgroup/memory
|
||||||
|
# mount -t cgroup none /sys/fs/cgroup/memory -o memory
|
||||||
|
|
||||||
NOTE: We can use a suffix (k, K, m, M, g or G) to indicate values in kilo,
|
3.2. Make the new group and move bash into it::
|
||||||
mega or gigabytes. (Here, Kilo, Mega, Giga are Kibibytes, Mebibytes, Gibibytes.)
|
|
||||||
|
|
||||||
NOTE: We can write "-1" to reset the *.limit_in_bytes(unlimited).
|
# mkdir /sys/fs/cgroup/memory/0
|
||||||
NOTE: We cannot set limits on the root cgroup any more.
|
# echo $$ > /sys/fs/cgroup/memory/0/tasks
|
||||||
|
|
||||||
# cat /sys/fs/cgroup/memory/0/memory.limit_in_bytes
|
Since now we're in the 0 cgroup, we can alter the memory limit::
|
||||||
4194304
|
|
||||||
|
|
||||||
We can check the usage:
|
# echo 4M > /sys/fs/cgroup/memory/0/memory.limit_in_bytes
|
||||||
# cat /sys/fs/cgroup/memory/0/memory.usage_in_bytes
|
|
||||||
1216512
|
NOTE:
|
||||||
|
We can use a suffix (k, K, m, M, g or G) to indicate values in kilo,
|
||||||
|
mega or gigabytes. (Here, Kilo, Mega, Giga are Kibibytes, Mebibytes,
|
||||||
|
Gibibytes.)
|
||||||
|
|
||||||
|
NOTE:
|
||||||
|
We can write "-1" to reset the ``*.limit_in_bytes(unlimited)``.
|
||||||
|
|
||||||
|
NOTE:
|
||||||
|
We cannot set limits on the root cgroup any more.
|
||||||
|
|
||||||
|
::
|
||||||
|
|
||||||
|
# cat /sys/fs/cgroup/memory/0/memory.limit_in_bytes
|
||||||
|
4194304
|
||||||
|
|
||||||
|
We can check the usage::
|
||||||
|
|
||||||
|
# cat /sys/fs/cgroup/memory/0/memory.usage_in_bytes
|
||||||
|
1216512
|
||||||
|
|
||||||
A successful write to this file does not guarantee a successful setting of
this limit to the value written into the file. This can be due to a
number of factors, such as rounding up to page boundaries or the total
availability of memory on the system. The user is required to re-read
this file after a write to guarantee the value committed by the kernel::

  # echo 1 > memory.limit_in_bytes
  # cat memory.limit_in_bytes
  4096

The memory.failcnt field gives the number of times that the cgroup limit was
exceeded.
@@ -393,6 +449,7 @@ The memory.stat file gives accounting information. Now, the number of
caches, RSS and Active pages/Inactive pages are shown.

4. Testing
==========

For testing features and implementation, see memcg_test.txt.

@@ -408,6 +465,7 @@ But the above two are testing extreme situations.
Trying usual test under memory controller is always helpful.

4.1 Troubleshooting
-------------------

Sometimes a user might find that the application under a cgroup is
terminated by the OOM killer. There are several causes for this:

@@ -422,6 +480,7 @@ To know what happens, disabling OOM_Kill as per "10. OOM Control" (below) and
seeing what happens will be helpful.

4.2 Task migration
------------------

When a task migrates from one cgroup to another, its charge is not
carried forward by default. The pages allocated from the original cgroup still

@@ -432,6 +491,7 @@ You can move charges of a task along with task migration.
See 8. "Move charges at task migration"

4.3 Removing a cgroup
---------------------

A cgroup can be removed by rmdir, but as discussed in sections 4.1 and 4.2, a
cgroup might have some charge associated with it, even though all

@@ -448,13 +508,15 @@ will be charged as a new owner of it.

About use_hierarchy, see Section 6.

5. Misc. interfaces
===================

5.1 force_empty
---------------
memory.force_empty interface is provided to make cgroup's memory usage empty.
When writing anything to this::

  # echo 0 > memory.force_empty

the cgroup will be reclaimed and as many pages reclaimed as possible.

@@ -471,50 +533,61 @@ About use_hierarchy, see Section 6.
About use_hierarchy, see Section 6.

5.2 stat file
-------------

memory.stat file includes following statistics

per-memory cgroup local status
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

=============== ===============================================================
cache           # of bytes of page cache memory.
rss             # of bytes of anonymous and swap cache memory (includes
                transparent hugepages).
rss_huge        # of bytes of anonymous transparent hugepages.
mapped_file     # of bytes of mapped file (includes tmpfs/shmem)
pgpgin          # of charging events to the memory cgroup. The charging
                event happens each time a page is accounted as either mapped
                anon page(RSS) or cache page(Page Cache) to the cgroup.
pgpgout         # of uncharging events to the memory cgroup. The uncharging
                event happens each time a page is unaccounted from the cgroup.
swap            # of bytes of swap usage
dirty           # of bytes that are waiting to get written back to the disk.
writeback       # of bytes of file/anon cache that are queued for syncing to
                disk.
inactive_anon   # of bytes of anonymous and swap cache memory on inactive
                LRU list.
active_anon     # of bytes of anonymous and swap cache memory on active
                LRU list.
inactive_file   # of bytes of file-backed memory on inactive LRU list.
active_file     # of bytes of file-backed memory on active LRU list.
unevictable     # of bytes of memory that cannot be reclaimed (mlocked etc).
=============== ===============================================================

status considering hierarchy (see memory.use_hierarchy settings)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

========================= ===================================================
hierarchical_memory_limit # of bytes of memory limit with regard to hierarchy
                          under which the memory cgroup is
hierarchical_memsw_limit  # of bytes of memory+swap limit with regard to
                          hierarchy under which memory cgroup is.

total_<counter>           # hierarchical version of <counter>, which in
                          addition to the cgroup's own value includes the
                          sum of all hierarchical children's values of
                          <counter>, i.e. total_cache
========================= ===================================================

The following additional stats are dependent on CONFIG_DEBUG_VM
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

========================= ========================================
recent_rotated_anon       VM internal parameter. (see mm/vmscan.c)
recent_rotated_file       VM internal parameter. (see mm/vmscan.c)
recent_scanned_anon       VM internal parameter. (see mm/vmscan.c)
recent_scanned_file       VM internal parameter. (see mm/vmscan.c)
========================= ========================================

Memo:
  recent_rotated means recent frequency of LRU rotation.

@@ -525,12 +598,15 @@ Note:
  Only anonymous and swap cache memory is listed as part of 'rss' stat.
  This should not be confused with the true 'resident set size' or the
  amount of physical memory used by the cgroup.

  'rss + mapped_file' will give you resident set size of cgroup.

  (Note: file and shmem may be shared among other cgroups. In that case,
  mapped_file is accounted only when the memory cgroup is owner of page
  cache.)

5.3 swappiness
--------------

Overrides /proc/sys/vm/swappiness for the particular group. The tunable
in the root cgroup corresponds to the global swappiness setting.
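
For example (a sketch, using the group created in section 3; the file accepts
the same 0-100 range as the global knob)::

  # echo 100 > /sys/fs/cgroup/memory/0/memory.swappiness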

@@ -541,16 +617,19 @@ there is a swap storage available. This might lead to memcg OOM killer
if there are no file pages to reclaim.

5.4 failcnt
-----------

A memory cgroup provides memory.failcnt and memory.memsw.failcnt files.
This failcnt(== failure count) shows the number of times that a usage counter
hit its limit. When a memory cgroup hits a limit, failcnt increases and
memory under it will be reclaimed.

You can reset failcnt by writing 0 to failcnt file::

  # echo 0 > .../memory.failcnt

5.5 usage_in_bytes
------------------

For efficiency, as other kernel components, memory cgroup uses some optimization
to avoid unnecessary cacheline false sharing. usage_in_bytes is affected by the

@@ -560,6 +639,7 @@ If you want to know more exact memory usage, you should use RSS+CACHE(+SWAP)
value in memory.stat(see 5.2).

5.6 numa_stat
-------------

This is similar to numa_maps but operates on a per-memcg basis. This is
useful for providing visibility into the numa locality information within

@@ -571,22 +651,23 @@ Each memcg's numa_stat file includes "total", "file", "anon" and "unevictable"
per-node page counts including "hierarchical_<counter>" which sums up all
hierarchical children's values in addition to the memcg's own value.

The output format of memory.numa_stat is::

  total=<total pages> N0=<node 0 pages> N1=<node 1 pages> ...
  file=<total file pages> N0=<node 0 pages> N1=<node 1 pages> ...
  anon=<total anon pages> N0=<node 0 pages> N1=<node 1 pages> ...
  unevictable=<total anon pages> N0=<node 0 pages> N1=<node 1 pages> ...
  hierarchical_<counter>=<counter pages> N0=<node 0 pages> N1=<node 1 pages> ...

The "total" count is sum of file + anon + unevictable.

6. Hierarchy support
====================

The memory controller supports a deep hierarchy and hierarchical accounting.
The hierarchy is created by creating the appropriate cgroups in the
cgroup filesystem. Consider for example, the following cgroup filesystem
hierarchy::

	       root
	     /  |   \

@@ -603,24 +684,28 @@ limit, the reclaim algorithm reclaims from the tasks in the ancestor and the
children of the ancestor.

6.1 Enabling hierarchical accounting and reclaim
------------------------------------------------

A memory cgroup by default disables the hierarchy feature. Support
can be enabled by writing 1 to memory.use_hierarchy file of the root cgroup::

  # echo 1 > memory.use_hierarchy

The feature can be disabled by::

  # echo 0 > memory.use_hierarchy

NOTE1:
  Enabling/disabling will fail if either the cgroup already has other
  cgroups created below it, or if the parent cgroup has use_hierarchy
  enabled.

NOTE2:
  When panic_on_oom is set to "2", the whole system will panic in
  case of an OOM event in any cgroup.

7. Soft limits
==============

Soft limits allow for greater sharing of memory. The idea behind soft limits
is to allow control groups to use as much of the memory as needed, provided

@@ -640,22 +725,26 @@ hints/setup. Currently soft limit based reclaim is set up such that
it gets invoked from balance_pgdat (kswapd).

7.1 Interface
-------------

Soft limits can be setup by using the following commands (in this example we
assume a soft limit of 256 MiB)::

  # echo 256M > memory.soft_limit_in_bytes

If we want to change this to 1G, we can at any time use::

  # echo 1G > memory.soft_limit_in_bytes

NOTE1:
  Soft limits take effect over a long period of time, since they involve
  reclaiming memory for balancing between memory cgroups
NOTE2:
  It is recommended to set the soft limit always below the hard limit,
  otherwise the hard limit will take precedence.

8. Move charges at task migration
=================================

Users can move charges associated with a task along with task migration, that
is, uncharge task's pages from the old cgroup and charge them to the new cgroup.

@@ -663,60 +752,71 @@ This feature is not supported in !CONFIG_MMU environments because of lack of
page tables.

8.1 Interface
-------------

This feature is disabled by default. It can be enabled (and disabled again) by
writing to memory.move_charge_at_immigrate of the destination cgroup.

If you want to enable it::

  # echo (some positive value) > memory.move_charge_at_immigrate

Note:
  Each bit of move_charge_at_immigrate has its own meaning about what type
  of charges should be moved. See 8.2 for details.
Note:
  Charges are moved only when you move mm->owner, in other words,
  a leader of a thread group.
Note:
  If we cannot find enough space for the task in the destination cgroup, we
  try to make space by reclaiming memory. Task migration may fail if we
  cannot make enough space.
Note:
  It can take several seconds if you move charges much.

And if you want to disable it again::

  # echo 0 > memory.move_charge_at_immigrate

8.2 Type of charges which can be moved
--------------------------------------

Each bit in move_charge_at_immigrate has its own meaning about what type of
charges should be moved. But in any case, it must be noted that an account of
a page or a swap can be moved only when it is charged to the task's current
(old) memory cgroup.

+---+--------------------------------------------------------------------------+
|bit| what type of charges would be moved ?                                    |
+===+==========================================================================+
| 0 | A charge of an anonymous page (or swap of it) used by the target task.   |
|   | You must enable Swap Extension (see 2.4) to enable move of swap charges. |
+---+--------------------------------------------------------------------------+
| 1 | A charge of file pages (normal file, tmpfs file (e.g. ipc shared memory) |
|   | and swaps of tmpfs file) mmapped by the target task. Unlike the case of  |
|   | anonymous pages, file pages (and swaps) in the range mmapped by the task |
|   | will be moved even if the task hasn't done page fault, i.e. they might   |
|   | not be the task's "RSS", but other task's "RSS" that maps the same file. |
|   | And mapcount of the page is ignored (the page can be moved even if       |
|   | page_mapcount(page) > 1). You must enable Swap Extension (see 2.4) to    |
|   | enable move of swap charges.                                             |
+---+--------------------------------------------------------------------------+

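For instance, to let both row 0 (anonymous pages and their swap) and row 1
(mmapped file pages) of the table above be moved, set bits 0 and 1, i.e.
write the bitmask 3 (a sketch; the value follows from the bit numbers in
the table)::

  # echo 3 > memory.move_charge_at_immigrate
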
8.3 TODO
--------

- All of moving charge operations are done under cgroup_mutex. It's not good
  behavior to hold the mutex too long, so we may need some trick.

9. Memory thresholds
====================

Memory cgroup implements memory thresholds using the cgroups notification
API (see cgroups.txt). It allows to register multiple memory and memsw
thresholds and gets notifications when it crosses.

To register a threshold, an application must:

- create an eventfd using eventfd(2);
- open memory.usage_in_bytes or memory.memsw.usage_in_bytes;
- write string like "<event_fd> <fd of memory.usage_in_bytes> <threshold>" to

@@ -728,6 +828,7 @@ threshold in any direction.
It's applicable for root and non-root cgroup.

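As a sketch of the registration sequence above, the cgroup_event_listener
helper shipped in the kernel tree (tools/cgroup/) performs the eventfd(2),
open and cgroup.event_control write steps on the caller's behalf. Assuming a
memory cgroup prepared as in section 3 and an illustrative 5M threshold::

  # cd /sys/fs/cgroup/memory/0
  # cgroup_event_listener memory.usage_in_bytes 5M &

The same eventfd-based pattern is used to register memory.oom_control and
memory.pressure_level notifications (sections 10 and 11).
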
10. OOM Control
===============

memory.oom_control file is for OOM notification and other controls.

@@ -736,6 +837,7 @@ API (See cgroups.txt). It allows to register multiple OOM notification
delivery and gets notification when OOM happens.

To register a notifier, an application must:

- create an eventfd using eventfd(2)
- open memory.oom_control file
- write string like "<event_fd> <fd of memory.oom_control>" to

@@ -752,8 +854,11 @@ If OOM-killer is disabled, tasks under cgroup will hang/sleep
in memory cgroup's OOM-waitqueue when they request accountable memory.

For running them, you have to relax the memory cgroup's OOM status by

* enlarge limit or reduce usage.

To reduce usage,

* kill some tasks.
* move some tasks to other group with account migration.
* remove some files (on tmpfs?)

@@ -761,11 +866,14 @@ To reduce usage,
Then, stopped tasks will work again.

At reading, current status of OOM is shown.

- oom_kill_disable 0 or 1
  (if 1, oom-killer is disabled)
- under_oom 0 or 1
  (if 1, the memory cgroup is under OOM, tasks may be stopped.)

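A minimal sketch of inspecting and toggling this state (fields as listed
above; recent kernels also print an additional oom_kill counter)::

  # cat memory.oom_control
  oom_kill_disable 0
  under_oom 0
  # echo 1 > memory.oom_control

Writing 1 sets oom_kill_disable for the cgroup; writing 0 re-enables the
OOM killer.
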
11. Memory Pressure
===================

The pressure level notifications can be used to monitor the memory
allocation cost; based on the pressure, applications can implement

@@ -840,21 +948,22 @@ Test:

Here is a small script example that makes a new cgroup, sets up a
memory limit, sets up a notification in the cgroup and then makes child
cgroup experience a critical pressure::

  # cd /sys/fs/cgroup/memory/
  # mkdir foo
  # cd foo
  # cgroup_event_listener memory.pressure_level low,hierarchy &
  # echo 8000000 > memory.limit_in_bytes
  # echo 8000000 > memory.memsw.limit_in_bytes
  # echo $$ > tasks
  # dd if=/dev/zero | read x

(Expect a bunch of notifications, and eventually, the oom-killer will
trigger.)

12. TODO
========

1. Make per-cgroup scanner reclaim not-shared pages first
2. Teach controller to account for shared-pages

@@ -862,11 +971,13 @@ Test:
   not yet hit but the usage is getting closer

Summary
=======

Overall, the memory controller has been a stable controller and has been
commented and discussed quite extensively in the community.

References
==========

1. Singh, Balbir. RFC: Memory Controller, http://lwn.net/Articles/206697/
2. Singh, Balbir. Memory Controller (RSS Control),

@@ -1,5 +1,6 @@
=========================
Network classifier cgroup
=========================

The Network classifier cgroup provides an interface to
tag network packets with a class identifier (classid).

@@ -17,23 +18,27 @@ values is 0xAAAABBBB; AAAA is the major handle number and BBBB
is the minor handle number.
Reading net_cls.classid yields a decimal result.

Example::

  mkdir /sys/fs/cgroup/net_cls
  mount -t cgroup -onet_cls net_cls /sys/fs/cgroup/net_cls
  mkdir /sys/fs/cgroup/net_cls/0
  echo 0x100001 > /sys/fs/cgroup/net_cls/0/net_cls.classid

- setting a 10:1 handle::

  cat /sys/fs/cgroup/net_cls/0/net_cls.classid
  1048577

- configuring tc::

  tc qdisc add dev eth0 root handle 10: htb
  tc class add dev eth0 parent 10: classid 10:1 htb rate 40mbit

- creating traffic class 10:1::

  tc filter add dev eth0 parent 10: protocol ip prio 10 handle 1: cgroup

configuring iptables, basic example::

  iptables -A OUTPUT -m cgroup ! --cgroup 0x100001 -j DROP

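The decimal value read back is simply the packed 0xAAAABBBB handle described
above; note that tc handle numbers are hexadecimal, so "10:1" means major
0x10 and minor 0x1. An illustrative computation in the shell::

  $ printf '0x%x = %d\n' $(( (0x10 << 16) | 0x1 )) $(( (0x10 << 16) | 0x1 ))
  0x100001 = 1048577
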
@@ -1,5 +1,6 @@
=======================
Network priority cgroup
=======================

The Network priority cgroup provides an interface to allow an administrator to
dynamically set the priority of network traffic generated by various

@@ -14,9 +15,9 @@ SO_PRIORITY socket option. This however, is not always possible because:

This cgroup allows an administrator to assign a process to a group which defines
the priority of egress traffic on a given interface. Network priority groups can
be created by first mounting the cgroup filesystem::

  # mount -t cgroup -onet_prio none /sys/fs/cgroup/net_prio

With the above step, the initial group acting as the parent accounting group
becomes visible at '/sys/fs/cgroup/net_prio'. This group includes all tasks in

@@ -25,17 +26,18 @@ the system. '/sys/fs/cgroup/net_prio/tasks' lists the tasks in this cgroup.
Each net_prio cgroup contains two files that are subsystem specific

net_prio.prioidx
  This file is read-only, and is simply informative. It contains a unique
  integer value that the kernel uses as an internal representation of this
  cgroup.

net_prio.ifpriomap
  This file contains a map of the priorities assigned to traffic originating
  from processes in this group and egressing the system on various interfaces.
  It contains a list of tuples in the form <ifname priority>. Contents of this
  file can be modified by echoing a string into the file using the same tuple
  format. For example::

    echo "eth0 5" > /sys/fs/cgroups/net_prio/iscsi/net_prio.ifpriomap

This command would force any traffic originating from processes belonging to the
iscsi net_prio cgroup and egressing on interface eth0 to have the priority of

@@ -1,5 +1,6 @@
=========================
Process Number Controller
=========================

Abstract
--------

@@ -34,55 +35,58 @@ pids.current tracks all child cgroup hierarchies, so parent/pids.current is a
superset of parent/child/pids.current.

The pids.events file contains event counters:

- max: Number of times fork failed because limit was hit.

Example
-------

First, we mount the pids controller::

  # mkdir -p /sys/fs/cgroup/pids
  # mount -t cgroup -o pids none /sys/fs/cgroup/pids

Then we create a hierarchy, set limits and attach processes to it::

  # mkdir -p /sys/fs/cgroup/pids/parent/child
  # echo 2 > /sys/fs/cgroup/pids/parent/pids.max
  # echo $$ > /sys/fs/cgroup/pids/parent/cgroup.procs
  # cat /sys/fs/cgroup/pids/parent/pids.current
  2
  #

It should be noted that attempts to overcome the set limit (2 in this case) will
fail::

  # cat /sys/fs/cgroup/pids/parent/pids.current
  2
  # ( /bin/echo "Here's some processes for you." | cat )
  sh: fork: Resource temporarily unavailable
  #

Even if we migrate to a child cgroup (which doesn't have a set limit), we will
not be able to overcome the most stringent limit in the hierarchy (in this case,
parent's)::

  # echo $$ > /sys/fs/cgroup/pids/parent/child/cgroup.procs
  # cat /sys/fs/cgroup/pids/parent/pids.current
  2
  # cat /sys/fs/cgroup/pids/parent/child/pids.current
  2
  # cat /sys/fs/cgroup/pids/parent/child/pids.max
  max
  # ( /bin/echo "Here's some processes for you." | cat )
  sh: fork: Resource temporarily unavailable
  #

We can set a limit that is smaller than pids.current, which will stop any new
processes from being forked at all (note that the shell itself counts towards
pids.current)::

  # echo 1 > /sys/fs/cgroup/pids/parent/pids.max
  # /bin/echo "We can't even spawn a single process now."
  sh: fork: Resource temporarily unavailable
  # echo 0 > /sys/fs/cgroup/pids/parent/pids.max
  # /bin/echo "We can't even spawn a single process now."
  sh: fork: Resource temporarily unavailable
  #

@@ -1,16 +1,17 @@
===============
RDMA Controller
===============

.. Contents

   1. Overview
     1-1. What is RDMA controller?
     1-2. Why RDMA controller needed?
     1-3. How is RDMA controller implemented?
   2. Usage Examples

1. Overview
===========

1-1. What is RDMA controller?
-----------------------------

@@ -83,27 +84,34 @@ what is configured by user for a given cgroup and what is supported by
IB device.

Following resources can be accounted by rdma controller.

========== =============================
hca_handle Maximum number of HCA Handles
hca_object Maximum number of HCA Objects
========== =============================

2. Usage Examples
=================

(a) Configure resource limit::

  echo mlx4_0 hca_handle=2 hca_object=2000 > /sys/fs/cgroup/rdma/1/rdma.max
  echo ocrdma1 hca_handle=3 > /sys/fs/cgroup/rdma/2/rdma.max

(b) Query resource limit::

  cat /sys/fs/cgroup/rdma/2/rdma.max
  #Output:
  mlx4_0 hca_handle=2 hca_object=2000
  ocrdma1 hca_handle=3 hca_object=max

(c) Query current usage::

  cat /sys/fs/cgroup/rdma/2/rdma.current
  #Output:
  mlx4_0 hca_handle=1 hca_object=20
  ocrdma1 hca_handle=1 hca_object=23

(d) Delete resource limit::

  echo mlx4_0 hca_handle=max hca_object=max > /sys/fs/cgroup/rdma/1/rdma.max
@@ -98,7 +98,7 @@ A memory policy with a valid NodeList will be saved, as specified, for
use at file creation time. When a task allocates a file in the file
system, the mount option memory policy will be applied with a NodeList,
if any, modified by the calling task's cpuset constraints
[See Documentation/cgroup-v1/cpusets.rst] and any optional flags, listed
below. If the resulting NodeList is the empty set, the effective memory
policy for the file will revert to "default" policy.

@@ -652,7 +652,7 @@ CONTENTS

-deadline tasks cannot have an affinity mask smaller than the entire
root_domain they are created on. However, affinities can be specified
through the cpuset facility (Documentation/cgroup-v1/cpusets.rst).

5.1 SCHED_DEADLINE and cpusets HOWTO
------------------------------------

@@ -215,7 +215,7 @@ SCHED_BATCH) tasks.

These options need CONFIG_CGROUPS to be defined, and let the administrator
create arbitrary groups of tasks, using the "cgroup" pseudo filesystem. See
Documentation/cgroup-v1/cgroups.rst for more information about this filesystem.

When CONFIG_FAIR_GROUP_SCHED is defined, a "cpu.shares" file is created for each
group created using the pseudo filesystem. See example steps below to create

@@ -133,7 +133,7 @@ This uses the cgroup virtual file system and "<cgroup>/cpu.rt_runtime_us"
to control the CPU time reserved for each control group.

For more information on working with control groups, you should read
Documentation/cgroup-v1/cgroups.rst as well.

Group settings are checked against the following limits in order to keep the
configuration schedulable:

@@ -67,7 +67,7 @@ nodes. Each emulated node will manage a fraction of the underlying cells'
physical memory. NUMA emulation is useful for testing NUMA kernel and
application features on non-NUMA platforms, and as a sort of memory resource
management mechanism when used together with cpusets.
[see Documentation/cgroup-v1/cpusets.rst]

For each node with memory, Linux constructs an independent memory management
subsystem, complete with its own free page lists, in-use page lists, usage

@@ -114,7 +114,7 @@ allocation behavior using Linux NUMA memory policy. [see

System administrators can restrict the CPUs and nodes' memories that a non-
privileged user can specify in the scheduling or NUMA commands and functions
using control groups and CPUsets. [see Documentation/cgroup-v1/cpusets.rst]

On architectures that do not hide memoryless nodes, Linux will include only
zones [nodes] with memory in the zonelists. This means that for a memoryless

@@ -41,7 +41,7 @@ locations.
Larger installations usually partition the system using cpusets into
sections of nodes. Paul Jackson has equipped cpusets with the ability to
move pages when a task is moved to another cpuset (See
Documentation/cgroup-v1/cpusets.rst).
Cpusets allows the automation of process locality. If a task is moved to
a new cpuset then also all its pages are moved with it so that the
performance of the process does not sink dramatically. Also the pages

@@ -98,7 +98,7 @@ Memory Control Group Interaction
--------------------------------

The unevictable LRU facility interacts with the memory control group [aka
memory controller; see Documentation/cgroup-v1/memory.rst] by extending the
lru_list enum.

The memory controller data structure automatically gets a per-zone unevictable

@@ -15,7 +15,7 @@ assign them to cpusets and their attached tasks. This is a way of limiting the
amount of system memory that are available to a certain class of tasks.

For more information on the features of cpusets, see
Documentation/cgroup-v1/cpusets.rst.
There are a number of different configurations you can use for your needs. For
more information on the numa=fake command line option and its various ways of
configuring fake nodes, see Documentation/x86/x86_64/boot-options.txt.

@@ -40,7 +40,7 @@ A machine may be split as follows with "numa=fake=4*512," as reported by dmesg::
  On node 3 totalpages: 131072

Now following the instructions for mounting the cpusets filesystem from
Documentation/cgroup-v1/cpusets.rst, you can assign fake nodes (i.e. contiguous memory
address spaces) to individual cpusets::

  [root@xroads /]# mkdir exampleset

@@ -4122,7 +4122,7 @@ W: http://www.bullopensource.org/cpuset/
W: http://oss.sgi.com/projects/cpusets/
T: git git://git.kernel.org/pub/scm/linux/kernel/git/tj/cgroup.git
S: Maintained
F: Documentation/cgroup-v1/cpusets.rst
F: include/linux/cpuset.h
F: kernel/cgroup/cpuset.c

@@ -89,7 +89,7 @@ config BLK_DEV_THROTTLING
        one needs to mount and use blkio cgroup controller for creating
        cgroups and specifying per device IO rate policies.

        See Documentation/cgroup-v1/blkio-controller.rst for more information.

config BLK_DEV_THROTTLING_LOW
        bool "Block throttling .low limit interface support (EXPERIMENTAL)"

@@ -624,7 +624,7 @@ struct cftype {

/*
 * Control Group subsystem type.
 * See Documentation/cgroup-v1/cgroups.rst for details
 */
struct cgroup_subsys {
        struct cgroup_subsys_state *(*css_alloc)(struct cgroup_subsys_state *parent_css);

@@ -131,6 +131,8 @@ void cgroup_free(struct task_struct *p);
int cgroup_init_early(void);
int cgroup_init(void);

int cgroup_parse_float(const char *input, unsigned dec_shift, s64 *v);

/*
 * Iteration helpers and macros.
 */

@@ -785,7 +785,7 @@ union bpf_attr {
 *              based on a user-provided identifier for all traffic coming from
 *              the tasks belonging to the related cgroup. See also the related
 *              kernel documentation, available from the Linux sources in file
 *              *Documentation/cgroup-v1/net_cls.rst*.
 *
 *              The Linux kernel has two versions for cgroups: there are
 *              cgroups v1 and cgroups v2. Both are available to users, who can

@@ -850,7 +850,7 @@ config BLK_CGROUP
        CONFIG_CFQ_GROUP_IOSCHED=y; for enabling throttling policy, set
        CONFIG_BLK_DEV_THROTTLING=y.

        See Documentation/cgroup-v1/blkio-controller.rst for more information.

config DEBUG_BLK_CGROUP
        bool "IO controller debugging"

@@ -6240,6 +6240,48 @@ struct cgroup *cgroup_get_from_fd(int fd)
}
EXPORT_SYMBOL_GPL(cgroup_get_from_fd);

static u64 power_of_ten(int power)
{
        u64 v = 1;
        while (power--)
                v *= 10;
        return v;
}

/**
 * cgroup_parse_float - parse a floating number
 * @input: input string
 * @dec_shift: number of decimal digits to shift
 * @v: output
 *
 * Parse a decimal floating point number in @input and store the result in
 * @v with decimal point right shifted @dec_shift times. For example, if
 * @input is "12.3456" and @dec_shift is 3, *@v will be set to 12346.
 * Returns 0 on success, -errno otherwise.
 *
 * There's nothing cgroup specific about this function except that it's
 * currently the only user.
 */
int cgroup_parse_float(const char *input, unsigned dec_shift, s64 *v)
{
        s64 whole, frac = 0;
        int fstart = 0, fend = 0, flen;

        if (!sscanf(input, "%lld.%n%lld%n", &whole, &fstart, &frac, &fend))
                return -EINVAL;
        if (frac < 0)
                return -EINVAL;

        flen = fend > fstart ? fend - fstart : 0;
        if (flen < dec_shift)
                frac *= power_of_ten(dec_shift - flen);
        else
                frac = DIV_ROUND_CLOSEST_ULL(frac, power_of_ten(flen - dec_shift));

        *v = whole * power_of_ten(dec_shift) + frac;
        return 0;
}

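A short worked trace of the routine above (illustrative): the docstring's
"12.3456" case with dec_shift == 3, plus a case where the fraction is
shorter than dec_shift::

  "12.3456", dec_shift 3: whole = 12, frac = 3456, flen = 4
                          frac = DIV_ROUND_CLOSEST_ULL(3456, 10) = 346
                          *v = 12 * 1000 + 346 = 12346
  "7.5",     dec_shift 3: whole = 7, frac = 5, flen = 1
                          frac = 5 * power_of_ten(2) = 500
                          *v = 7 * 1000 + 500 = 7500
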
/*
 * sock->sk_cgrp_data handling. For more info, see sock_cgroup_data
 * definition in cgroup-defs.h.

@@ -6402,4 +6444,5 @@ static int __init cgroup_sysfs_init(void)
        return sysfs_create_group(kernel_kobj, &cgroup_sysfs_attr_group);
}
subsys_initcall(cgroup_sysfs_init);

#endif /* CONFIG_SYSFS */

@@ -729,7 +729,7 @@ static inline int nr_cpusets(void)
 * load balancing domains (sched domains) as specified by that partial
 * partition.
 *
 * See "What is sched_load_balance" in Documentation/cgroup-v1/cpusets.rst
 * for a background explanation of this.
 *
 * Does not return errors, on the theory that the callers of this

@@ -509,7 +509,7 @@ static inline int may_allow_all(struct dev_cgroup *parent)
 * This is one of the three key functions for hierarchy implementation.
 * This function is responsible for re-evaluating all the cgroup's active
 * exceptions due to a parent's exception change.
 * Refer to Documentation/cgroup-v1/devices.rst for more details.
 */
static void revalidate_active_exceptions(struct dev_cgroup *devcg)
{

@@ -785,7 +785,7 @@ union bpf_attr {
 *              based on a user-provided identifier for all traffic coming from
 *              the tasks belonging to the related cgroup. See also the related
 *              kernel documentation, available from the Linux sources in file
 *              *Documentation/cgroup-v1/net_cls.rst*.
 *
 *              The Linux kernel has two versions for cgroups: there are
 *              cgroups v1 and cgroups v2. Both are available to users, who can