linux/block/blk-cgroup.c

1563 lines
40 KiB
C
Raw Normal View History

/*
* Common Block IO controller cgroup interface
*
* Based on ideas and code from CFQ, CFS and BFQ:
* Copyright (C) 2003 Jens Axboe <axboe@kernel.dk>
*
* Copyright (C) 2008 Fabio Checconi <fabio@gandalf.sssup.it>
* Paolo Valente <paolo.valente@unimore.it>
*
* Copyright (C) 2009 Vivek Goyal <vgoyal@redhat.com>
* Nauman Rafique <nauman@google.com>
*/
#include <linux/ioprio.h>
#include <linux/seq_file.h>
#include <linux/kdev_t.h>
#include <linux/module.h>
#include <linux/err.h>
#include <linux/blkdev.h>
include cleanup: Update gfp.h and slab.h includes to prepare for breaking implicit slab.h inclusion from percpu.h percpu.h is included by sched.h and module.h and thus ends up being included when building most .c files. percpu.h includes slab.h which in turn includes gfp.h making everything defined by the two files universally available and complicating inclusion dependencies. percpu.h -> slab.h dependency is about to be removed. Prepare for this change by updating users of gfp and slab facilities include those headers directly instead of assuming availability. As this conversion needs to touch large number of source files, the following script is used as the basis of conversion. http://userweb.kernel.org/~tj/misc/slabh-sweep.py The script does the followings. * Scan files for gfp and slab usages and update includes such that only the necessary includes are there. ie. if only gfp is used, gfp.h, if slab is used, slab.h. * When the script inserts a new include, it looks at the include blocks and try to put the new include such that its order conforms to its surrounding. It's put in the include block which contains core kernel includes, in the same order that the rest are ordered - alphabetical, Christmas tree, rev-Xmas-tree or at the end if there doesn't seem to be any matching order. * If the script can't find a place to put a new include (mostly because the file doesn't have fitting include block), it prints out an error message indicating which .h file needs to be added to the file. The conversion was done in the following steps. 1. The initial automatic conversion of all .c files updated slightly over 4000 files, deleting around 700 includes and adding ~480 gfp.h and ~3000 slab.h inclusions. The script emitted errors for ~400 files. 2. Each error was manually checked. Some didn't need the inclusion, some needed manual addition while adding it to implementation .h or embedding .c file was more appropriate for others. This step added inclusions to around 150 files. 3. The script was run again and the output was compared to the edits from #2 to make sure no file was left behind. 4. Several build tests were done and a couple of problems were fixed. e.g. lib/decompress_*.c used malloc/free() wrappers around slab APIs requiring slab.h to be added manually. 5. The script was run on all .h files but without automatically editing them as sprinkling gfp.h and slab.h inclusions around .h files could easily lead to inclusion dependency hell. Most gfp.h inclusion directives were ignored as stuff from gfp.h was usually wildly available and often used in preprocessor macros. Each slab.h inclusion directive was examined and added manually as necessary. 6. percpu.h was updated not to include slab.h. 7. Build test were done on the following configurations and failures were fixed. CONFIG_GCOV_KERNEL was turned off for all tests (as my distributed build env didn't work with gcov compiles) and a few more options had to be turned off depending on archs to make things build (like ipr on powerpc/64 which failed due to missing writeq). * x86 and x86_64 UP and SMP allmodconfig and a custom test config. * powerpc and powerpc64 SMP allmodconfig * sparc and sparc64 SMP allmodconfig * ia64 SMP allmodconfig * s390 SMP allmodconfig * alpha SMP allmodconfig * um on x86_64 SMP allmodconfig 8. percpu.h modifications were reverted so that it could be applied as a separate patch and serve as bisection point. Given the fact that I had only a couple of failures from tests on step 6, I'm fairly confident about the coverage of this conversion patch. If there is a breakage, it's likely to be something in one of the arch headers which should be easily discoverable easily on most builds of the specific arch. Signed-off-by: Tejun Heo <tj@kernel.org> Guess-its-ok-by: Christoph Lameter <cl@linux-foundation.org> Cc: Ingo Molnar <mingo@redhat.com> Cc: Lee Schermerhorn <Lee.Schermerhorn@hp.com>
2010-03-24 16:04:11 +08:00
#include <linux/slab.h>
#include <linux/genhd.h>
blkcg: shoot down blkio_groups on elevator switch Elevator switch may involve changes to blkcg policies. Implement shoot down of blkio_groups. Combined with the previous bypass updates, the end goal is updating blkcg core such that it can ensure that blkcg's being affected become quiescent and don't have any per-blkg data hanging around before commencing any policy updates. Until queues are made aware of the policies that applies to them, as an interim step, all per-policy blkg data will be shot down. * blk-throtl doesn't need this change as it can't be disabled for a live queue; however, update it anyway as the scheduled blkg unification requires this behavior change. This means that blk-throtl configuration will be unnecessarily lost over elevator switch. This oddity will be removed after blkcg learns to associate individual policies with request_queues. * blk-throtl dosen't shoot down root_tg. This is to ease transition. Unified blkg will always have persistent root group and not shooting down root_tg for now eases transition to that point by avoiding having to update td->root_tg and is safe as blk-throtl can never be disabled -v2: Vivek pointed out that group list is not guaranteed to be empty on return from clear function if it raced cgroup removal and lost. Fix it by waiting a bit and retrying. This kludge will soon be removed once locking is updated such that blkg is never in limbo state between blkcg and request_queue locks. blk-throtl no longer shoots down root_tg to avoid breaking td->root_tg. Also, Nest queue_lock inside blkio_list_lock not the other way around to avoid introduce possible deadlock via blkcg lock. -v3: blkcg_clear_queue() repositioned and renamed to blkg_destroy_all() to increase consistency with later changes. cfq_clear_queue() updated to check q->elevator before dereferencing it to avoid NULL dereference on not fully initialized queues (used by later change). Signed-off-by: Tejun Heo <tj@kernel.org> Cc: Vivek Goyal <vgoyal@redhat.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2012-03-06 05:15:00 +08:00
#include <linux/delay.h>
#include <linux/atomic.h>
blkcg: shoot down blkio_groups on elevator switch Elevator switch may involve changes to blkcg policies. Implement shoot down of blkio_groups. Combined with the previous bypass updates, the end goal is updating blkcg core such that it can ensure that blkcg's being affected become quiescent and don't have any per-blkg data hanging around before commencing any policy updates. Until queues are made aware of the policies that applies to them, as an interim step, all per-policy blkg data will be shot down. * blk-throtl doesn't need this change as it can't be disabled for a live queue; however, update it anyway as the scheduled blkg unification requires this behavior change. This means that blk-throtl configuration will be unnecessarily lost over elevator switch. This oddity will be removed after blkcg learns to associate individual policies with request_queues. * blk-throtl dosen't shoot down root_tg. This is to ease transition. Unified blkg will always have persistent root group and not shooting down root_tg for now eases transition to that point by avoiding having to update td->root_tg and is safe as blk-throtl can never be disabled -v2: Vivek pointed out that group list is not guaranteed to be empty on return from clear function if it raced cgroup removal and lost. Fix it by waiting a bit and retrying. This kludge will soon be removed once locking is updated such that blkg is never in limbo state between blkcg and request_queue locks. blk-throtl no longer shoots down root_tg to avoid breaking td->root_tg. Also, Nest queue_lock inside blkio_list_lock not the other way around to avoid introduce possible deadlock via blkcg lock. -v3: blkcg_clear_queue() repositioned and renamed to blkg_destroy_all() to increase consistency with later changes. cfq_clear_queue() updated to check q->elevator before dereferencing it to avoid NULL dereference on not fully initialized queues (used by later change). Signed-off-by: Tejun Heo <tj@kernel.org> Cc: Vivek Goyal <vgoyal@redhat.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2012-03-06 05:15:00 +08:00
#include "blk-cgroup.h"
#include "blk.h"
2010-04-09 14:31:19 +08:00
#define MAX_KEY_LEN 100
static DEFINE_SPINLOCK(blkio_list_lock);
static LIST_HEAD(blkio_list);
static DEFINE_MUTEX(all_q_mutex);
static LIST_HEAD(all_q_list);
blkcg: alloc per cpu stats from worker thread in a delayed manner Current per cpu stat allocation assumes GFP_KERNEL allocation flag. But in IO path there are times when we want GFP_NOIO semantics. As there is no way to pass the allocation flags to alloc_percpu(), this patch delays the allocation of stats using a worker thread. v2-> tejun suggested following changes. Changed the patch accordingly. - move alloc_node location in structure - reduce the size of names of some of the fields - Reduce the scope of locking of alloc_list_lock - Simplified stat_alloc_fn() by allocating stats for all policies in one go and then assigning these to a group. v3 -> Andrew suggested to put some comments in the code. Also raised concerns about trying to allocate infinitely in case of allocation failure. I have changed the logic to sleep for 10ms before retrying. That should take care of non-preemptible UP kernels. v4 -> Tejun had more suggestions. - drop list_for_each_entry_all() - instead of msleep() use queue_delayed_work() - Some cleanups realted to more compact coding. v5-> tejun suggested more cleanups leading to more compact code. tj: - Relocated pcpu_stats into blkio_stat_alloc_fn(). - Minor comment update. - This also fixes suspicious RCU usage warning caused by invoking cgroup_path() from blkg_alloc() without holding RCU read lock. Now that blkg_alloc() doesn't require sleepable context, RCU read lock from blkg_lookup_create() is maintained throughout blkg_alloc(). Signed-off-by: Vivek Goyal <vgoyal@redhat.com> Signed-off-by: Tejun Heo <tj@kernel.org> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2012-03-09 02:53:56 +08:00
/* List of groups pending per cpu stats allocation */
static DEFINE_SPINLOCK(alloc_list_lock);
static LIST_HEAD(alloc_list);
static void blkio_stat_alloc_fn(struct work_struct *);
static DECLARE_DELAYED_WORK(blkio_stat_alloc_work, blkio_stat_alloc_fn);
struct blkio_cgroup blkio_root_cgroup = { .weight = 2*BLKIO_WEIGHT_DEFAULT };
EXPORT_SYMBOL_GPL(blkio_root_cgroup);
static struct blkio_policy_type *blkio_policy[BLKIO_NR_POLICIES];
struct blkio_cgroup *cgroup_to_blkio_cgroup(struct cgroup *cgroup)
{
return container_of(cgroup_subsys_state(cgroup, blkio_subsys_id),
struct blkio_cgroup, css);
}
EXPORT_SYMBOL_GPL(cgroup_to_blkio_cgroup);
static struct blkio_cgroup *task_blkio_cgroup(struct task_struct *tsk)
{
return container_of(task_subsys_state(tsk, blkio_subsys_id),
struct blkio_cgroup, css);
}
struct blkio_cgroup *bio_blkio_cgroup(struct bio *bio)
{
if (bio && bio->bi_css)
return container_of(bio->bi_css, struct blkio_cgroup, css);
return task_blkio_cgroup(current);
}
EXPORT_SYMBOL_GPL(bio_blkio_cgroup);
static inline void blkio_update_group_weight(struct blkio_group *blkg,
int plid, unsigned int weight)
{
struct blkio_policy_type *blkiop;
list_for_each_entry(blkiop, &blkio_list, list) {
/* If this policy does not own the blkg, do not send updates */
if (blkiop->plid != plid)
continue;
if (blkiop->ops.blkio_update_group_weight_fn)
blkiop->ops.blkio_update_group_weight_fn(blkg->q,
blkg, weight);
}
}
static inline void blkio_update_group_bps(struct blkio_group *blkg, int plid,
u64 bps, int rw)
{
struct blkio_policy_type *blkiop;
list_for_each_entry(blkiop, &blkio_list, list) {
/* If this policy does not own the blkg, do not send updates */
if (blkiop->plid != plid)
continue;
if (rw == READ && blkiop->ops.blkio_update_group_read_bps_fn)
blkiop->ops.blkio_update_group_read_bps_fn(blkg->q,
blkg, bps);
if (rw == WRITE && blkiop->ops.blkio_update_group_write_bps_fn)
blkiop->ops.blkio_update_group_write_bps_fn(blkg->q,
blkg, bps);
}
}
static inline void blkio_update_group_iops(struct blkio_group *blkg, int plid,
u64 iops, int rw)
{
struct blkio_policy_type *blkiop;
list_for_each_entry(blkiop, &blkio_list, list) {
/* If this policy does not own the blkg, do not send updates */
if (blkiop->plid != plid)
continue;
if (rw == READ && blkiop->ops.blkio_update_group_read_iops_fn)
blkiop->ops.blkio_update_group_read_iops_fn(blkg->q,
blkg, iops);
if (rw == WRITE && blkiop->ops.blkio_update_group_write_iops_fn)
blkiop->ops.blkio_update_group_write_iops_fn(blkg->q,
blkg,iops);
}
}
#ifdef CONFIG_DEBUG_BLK_CGROUP
/* This should be called with the queue_lock held. */
blkio: Add more debug-only per-cgroup stats 1) group_wait_time - This is the amount of time the cgroup had to wait to get a timeslice for one of its queues from when it became busy, i.e., went from 0 to 1 request queued. This is different from the io_wait_time which is the cumulative total of the amount of time spent by each IO in that cgroup waiting in the scheduler queue. This stat is a great way to find out any jobs in the fleet that are being starved or waiting for longer than what is expected (due to an IO controller bug or any other issue). 2) empty_time - This is the amount of time a cgroup spends w/o any pending requests. This stat is useful when a job does not seem to be able to use its assigned disk share by helping check if that is happening due to an IO controller bug or because the job is not submitting enough IOs. 3) idle_time - This is the amount of time spent by the IO scheduler idling for a given cgroup in anticipation of a better request than the exising ones from other queues/cgroups. All these stats are recorded using start and stop events. When reading these stats, we do not add the delta between the current time and the last start time if we're between the start and stop events. We avoid doing this to make sure that these numbers are always monotonically increasing when read. Since we're using sched_clock() which may use the tsc as its source, it may induce some inconsistency (due to tsc resync across cpus) if we included the current delta. Signed-off-by: Divyesh Shah<dpshah@google.com> Signed-off-by: Jens Axboe <jens.axboe@oracle.com>
2010-04-09 12:15:35 +08:00
static void blkio_set_start_group_wait_time(struct blkio_group *blkg,
struct blkio_policy_type *pol,
struct blkio_group *curr_blkg)
blkio: Add more debug-only per-cgroup stats 1) group_wait_time - This is the amount of time the cgroup had to wait to get a timeslice for one of its queues from when it became busy, i.e., went from 0 to 1 request queued. This is different from the io_wait_time which is the cumulative total of the amount of time spent by each IO in that cgroup waiting in the scheduler queue. This stat is a great way to find out any jobs in the fleet that are being starved or waiting for longer than what is expected (due to an IO controller bug or any other issue). 2) empty_time - This is the amount of time a cgroup spends w/o any pending requests. This stat is useful when a job does not seem to be able to use its assigned disk share by helping check if that is happening due to an IO controller bug or because the job is not submitting enough IOs. 3) idle_time - This is the amount of time spent by the IO scheduler idling for a given cgroup in anticipation of a better request than the exising ones from other queues/cgroups. All these stats are recorded using start and stop events. When reading these stats, we do not add the delta between the current time and the last start time if we're between the start and stop events. We avoid doing this to make sure that these numbers are always monotonically increasing when read. Since we're using sched_clock() which may use the tsc as its source, it may induce some inconsistency (due to tsc resync across cpus) if we included the current delta. Signed-off-by: Divyesh Shah<dpshah@google.com> Signed-off-by: Jens Axboe <jens.axboe@oracle.com>
2010-04-09 12:15:35 +08:00
{
struct blkg_policy_data *pd = blkg->pd[pol->plid];
if (blkio_blkg_waiting(&pd->stats))
blkio: Add more debug-only per-cgroup stats 1) group_wait_time - This is the amount of time the cgroup had to wait to get a timeslice for one of its queues from when it became busy, i.e., went from 0 to 1 request queued. This is different from the io_wait_time which is the cumulative total of the amount of time spent by each IO in that cgroup waiting in the scheduler queue. This stat is a great way to find out any jobs in the fleet that are being starved or waiting for longer than what is expected (due to an IO controller bug or any other issue). 2) empty_time - This is the amount of time a cgroup spends w/o any pending requests. This stat is useful when a job does not seem to be able to use its assigned disk share by helping check if that is happening due to an IO controller bug or because the job is not submitting enough IOs. 3) idle_time - This is the amount of time spent by the IO scheduler idling for a given cgroup in anticipation of a better request than the exising ones from other queues/cgroups. All these stats are recorded using start and stop events. When reading these stats, we do not add the delta between the current time and the last start time if we're between the start and stop events. We avoid doing this to make sure that these numbers are always monotonically increasing when read. Since we're using sched_clock() which may use the tsc as its source, it may induce some inconsistency (due to tsc resync across cpus) if we included the current delta. Signed-off-by: Divyesh Shah<dpshah@google.com> Signed-off-by: Jens Axboe <jens.axboe@oracle.com>
2010-04-09 12:15:35 +08:00
return;
if (blkg == curr_blkg)
return;
pd->stats.start_group_wait_time = sched_clock();
blkio_mark_blkg_waiting(&pd->stats);
blkio: Add more debug-only per-cgroup stats 1) group_wait_time - This is the amount of time the cgroup had to wait to get a timeslice for one of its queues from when it became busy, i.e., went from 0 to 1 request queued. This is different from the io_wait_time which is the cumulative total of the amount of time spent by each IO in that cgroup waiting in the scheduler queue. This stat is a great way to find out any jobs in the fleet that are being starved or waiting for longer than what is expected (due to an IO controller bug or any other issue). 2) empty_time - This is the amount of time a cgroup spends w/o any pending requests. This stat is useful when a job does not seem to be able to use its assigned disk share by helping check if that is happening due to an IO controller bug or because the job is not submitting enough IOs. 3) idle_time - This is the amount of time spent by the IO scheduler idling for a given cgroup in anticipation of a better request than the exising ones from other queues/cgroups. All these stats are recorded using start and stop events. When reading these stats, we do not add the delta between the current time and the last start time if we're between the start and stop events. We avoid doing this to make sure that these numbers are always monotonically increasing when read. Since we're using sched_clock() which may use the tsc as its source, it may induce some inconsistency (due to tsc resync across cpus) if we included the current delta. Signed-off-by: Divyesh Shah<dpshah@google.com> Signed-off-by: Jens Axboe <jens.axboe@oracle.com>
2010-04-09 12:15:35 +08:00
}
/* This should be called with the queue_lock held. */
blkio: Add more debug-only per-cgroup stats 1) group_wait_time - This is the amount of time the cgroup had to wait to get a timeslice for one of its queues from when it became busy, i.e., went from 0 to 1 request queued. This is different from the io_wait_time which is the cumulative total of the amount of time spent by each IO in that cgroup waiting in the scheduler queue. This stat is a great way to find out any jobs in the fleet that are being starved or waiting for longer than what is expected (due to an IO controller bug or any other issue). 2) empty_time - This is the amount of time a cgroup spends w/o any pending requests. This stat is useful when a job does not seem to be able to use its assigned disk share by helping check if that is happening due to an IO controller bug or because the job is not submitting enough IOs. 3) idle_time - This is the amount of time spent by the IO scheduler idling for a given cgroup in anticipation of a better request than the exising ones from other queues/cgroups. All these stats are recorded using start and stop events. When reading these stats, we do not add the delta between the current time and the last start time if we're between the start and stop events. We avoid doing this to make sure that these numbers are always monotonically increasing when read. Since we're using sched_clock() which may use the tsc as its source, it may induce some inconsistency (due to tsc resync across cpus) if we included the current delta. Signed-off-by: Divyesh Shah<dpshah@google.com> Signed-off-by: Jens Axboe <jens.axboe@oracle.com>
2010-04-09 12:15:35 +08:00
static void blkio_update_group_wait_time(struct blkio_group_stats *stats)
{
unsigned long long now;
if (!blkio_blkg_waiting(stats))
return;
now = sched_clock();
if (time_after64(now, stats->start_group_wait_time))
blkcg: introduce blkg_stat and blkg_rwstat blkcg uses u64_stats_sync to avoid reading wrong u64 statistic values on 32bit archs and some stat counters have subtypes to distinguish read/writes and sync/async IOs. The stat code paths are confusing and involve a lot of going back and forth between blkcg core and specific policy implementations, and synchronization and subtype handling are open coded in blkcg core. This patch introduces struct blkg_stat and blkg_rwstat which, with accompanying operations, encapsulate stat updating and accessing with proper synchronization. blkg_stat is simple u64 counter with 64bit read-access protection. blkg_rwstat is the one with rw and [a]sync subcounters and takes @rw flags to distinguish IO subtypes (%REQ_WRITE and %REQ_SYNC) and replaces stat_sub_type indexed arrays. All counters in blkio_group_stats and blkio_group_stats_cpu are replaced with either blkg_stat or blkg_rwstat along with all users. This does add one u64_stats_sync per counter and increase stats_sync operations but they're empty/noops on 64bit archs and blkcg doesn't have too many counters, especially with DEBUG_BLK_CGROUP off. While the currently resulting code isn't necessarily simpler at the moment, this will enable further clean up of blkcg stats code. - BLKIO_STAT_{READ|WRITE|SYNC|ASYNC|TOTAL} renamed to BLKG_RWSTAT_{READ|WRITE|SYNC|ASYNC|TOTAL}. - blkg_stat_add() replaces blkio_add_stat() and blkio_check_and_dec_stat(). Note that BUG_ON() on underflow in the latter function no longer exists. It's *way* better to have underflowed stat counters than oopsing. - blkio_group_stats->dequeue is now a proper u64 stat counter instead of ulong. - reset_stats() updated to clear each stat counters individually and BLKG_STATS_DEBUG_CLEAR_{START|SIZE} are removed. - Some functions reconstruct rw flags from direction and sync booleans. This will be removed by future patches. Signed-off-by: Tejun Heo <tj@kernel.org>
2012-04-02 05:38:42 +08:00
blkg_stat_add(&stats->group_wait_time,
now - stats->start_group_wait_time);
blkio: Add more debug-only per-cgroup stats 1) group_wait_time - This is the amount of time the cgroup had to wait to get a timeslice for one of its queues from when it became busy, i.e., went from 0 to 1 request queued. This is different from the io_wait_time which is the cumulative total of the amount of time spent by each IO in that cgroup waiting in the scheduler queue. This stat is a great way to find out any jobs in the fleet that are being starved or waiting for longer than what is expected (due to an IO controller bug or any other issue). 2) empty_time - This is the amount of time a cgroup spends w/o any pending requests. This stat is useful when a job does not seem to be able to use its assigned disk share by helping check if that is happening due to an IO controller bug or because the job is not submitting enough IOs. 3) idle_time - This is the amount of time spent by the IO scheduler idling for a given cgroup in anticipation of a better request than the exising ones from other queues/cgroups. All these stats are recorded using start and stop events. When reading these stats, we do not add the delta between the current time and the last start time if we're between the start and stop events. We avoid doing this to make sure that these numbers are always monotonically increasing when read. Since we're using sched_clock() which may use the tsc as its source, it may induce some inconsistency (due to tsc resync across cpus) if we included the current delta. Signed-off-by: Divyesh Shah<dpshah@google.com> Signed-off-by: Jens Axboe <jens.axboe@oracle.com>
2010-04-09 12:15:35 +08:00
blkio_clear_blkg_waiting(stats);
}
/* This should be called with the queue_lock held. */
blkio: Add more debug-only per-cgroup stats 1) group_wait_time - This is the amount of time the cgroup had to wait to get a timeslice for one of its queues from when it became busy, i.e., went from 0 to 1 request queued. This is different from the io_wait_time which is the cumulative total of the amount of time spent by each IO in that cgroup waiting in the scheduler queue. This stat is a great way to find out any jobs in the fleet that are being starved or waiting for longer than what is expected (due to an IO controller bug or any other issue). 2) empty_time - This is the amount of time a cgroup spends w/o any pending requests. This stat is useful when a job does not seem to be able to use its assigned disk share by helping check if that is happening due to an IO controller bug or because the job is not submitting enough IOs. 3) idle_time - This is the amount of time spent by the IO scheduler idling for a given cgroup in anticipation of a better request than the exising ones from other queues/cgroups. All these stats are recorded using start and stop events. When reading these stats, we do not add the delta between the current time and the last start time if we're between the start and stop events. We avoid doing this to make sure that these numbers are always monotonically increasing when read. Since we're using sched_clock() which may use the tsc as its source, it may induce some inconsistency (due to tsc resync across cpus) if we included the current delta. Signed-off-by: Divyesh Shah<dpshah@google.com> Signed-off-by: Jens Axboe <jens.axboe@oracle.com>
2010-04-09 12:15:35 +08:00
static void blkio_end_empty_time(struct blkio_group_stats *stats)
{
unsigned long long now;
if (!blkio_blkg_empty(stats))
return;
now = sched_clock();
if (time_after64(now, stats->start_empty_time))
blkcg: introduce blkg_stat and blkg_rwstat blkcg uses u64_stats_sync to avoid reading wrong u64 statistic values on 32bit archs and some stat counters have subtypes to distinguish read/writes and sync/async IOs. The stat code paths are confusing and involve a lot of going back and forth between blkcg core and specific policy implementations, and synchronization and subtype handling are open coded in blkcg core. This patch introduces struct blkg_stat and blkg_rwstat which, with accompanying operations, encapsulate stat updating and accessing with proper synchronization. blkg_stat is simple u64 counter with 64bit read-access protection. blkg_rwstat is the one with rw and [a]sync subcounters and takes @rw flags to distinguish IO subtypes (%REQ_WRITE and %REQ_SYNC) and replaces stat_sub_type indexed arrays. All counters in blkio_group_stats and blkio_group_stats_cpu are replaced with either blkg_stat or blkg_rwstat along with all users. This does add one u64_stats_sync per counter and increase stats_sync operations but they're empty/noops on 64bit archs and blkcg doesn't have too many counters, especially with DEBUG_BLK_CGROUP off. While the currently resulting code isn't necessarily simpler at the moment, this will enable further clean up of blkcg stats code. - BLKIO_STAT_{READ|WRITE|SYNC|ASYNC|TOTAL} renamed to BLKG_RWSTAT_{READ|WRITE|SYNC|ASYNC|TOTAL}. - blkg_stat_add() replaces blkio_add_stat() and blkio_check_and_dec_stat(). Note that BUG_ON() on underflow in the latter function no longer exists. It's *way* better to have underflowed stat counters than oopsing. - blkio_group_stats->dequeue is now a proper u64 stat counter instead of ulong. - reset_stats() updated to clear each stat counters individually and BLKG_STATS_DEBUG_CLEAR_{START|SIZE} are removed. - Some functions reconstruct rw flags from direction and sync booleans. This will be removed by future patches. Signed-off-by: Tejun Heo <tj@kernel.org>
2012-04-02 05:38:42 +08:00
blkg_stat_add(&stats->empty_time,
now - stats->start_empty_time);
blkio: Add more debug-only per-cgroup stats 1) group_wait_time - This is the amount of time the cgroup had to wait to get a timeslice for one of its queues from when it became busy, i.e., went from 0 to 1 request queued. This is different from the io_wait_time which is the cumulative total of the amount of time spent by each IO in that cgroup waiting in the scheduler queue. This stat is a great way to find out any jobs in the fleet that are being starved or waiting for longer than what is expected (due to an IO controller bug or any other issue). 2) empty_time - This is the amount of time a cgroup spends w/o any pending requests. This stat is useful when a job does not seem to be able to use its assigned disk share by helping check if that is happening due to an IO controller bug or because the job is not submitting enough IOs. 3) idle_time - This is the amount of time spent by the IO scheduler idling for a given cgroup in anticipation of a better request than the exising ones from other queues/cgroups. All these stats are recorded using start and stop events. When reading these stats, we do not add the delta between the current time and the last start time if we're between the start and stop events. We avoid doing this to make sure that these numbers are always monotonically increasing when read. Since we're using sched_clock() which may use the tsc as its source, it may induce some inconsistency (due to tsc resync across cpus) if we included the current delta. Signed-off-by: Divyesh Shah<dpshah@google.com> Signed-off-by: Jens Axboe <jens.axboe@oracle.com>
2010-04-09 12:15:35 +08:00
blkio_clear_blkg_empty(stats);
}
void blkiocg_update_set_idle_time_stats(struct blkio_group *blkg,
struct blkio_policy_type *pol)
blkio: Add more debug-only per-cgroup stats 1) group_wait_time - This is the amount of time the cgroup had to wait to get a timeslice for one of its queues from when it became busy, i.e., went from 0 to 1 request queued. This is different from the io_wait_time which is the cumulative total of the amount of time spent by each IO in that cgroup waiting in the scheduler queue. This stat is a great way to find out any jobs in the fleet that are being starved or waiting for longer than what is expected (due to an IO controller bug or any other issue). 2) empty_time - This is the amount of time a cgroup spends w/o any pending requests. This stat is useful when a job does not seem to be able to use its assigned disk share by helping check if that is happening due to an IO controller bug or because the job is not submitting enough IOs. 3) idle_time - This is the amount of time spent by the IO scheduler idling for a given cgroup in anticipation of a better request than the exising ones from other queues/cgroups. All these stats are recorded using start and stop events. When reading these stats, we do not add the delta between the current time and the last start time if we're between the start and stop events. We avoid doing this to make sure that these numbers are always monotonically increasing when read. Since we're using sched_clock() which may use the tsc as its source, it may induce some inconsistency (due to tsc resync across cpus) if we included the current delta. Signed-off-by: Divyesh Shah<dpshah@google.com> Signed-off-by: Jens Axboe <jens.axboe@oracle.com>
2010-04-09 12:15:35 +08:00
{
struct blkio_group_stats *stats = &blkg->pd[pol->plid]->stats;
blkio: Add more debug-only per-cgroup stats 1) group_wait_time - This is the amount of time the cgroup had to wait to get a timeslice for one of its queues from when it became busy, i.e., went from 0 to 1 request queued. This is different from the io_wait_time which is the cumulative total of the amount of time spent by each IO in that cgroup waiting in the scheduler queue. This stat is a great way to find out any jobs in the fleet that are being starved or waiting for longer than what is expected (due to an IO controller bug or any other issue). 2) empty_time - This is the amount of time a cgroup spends w/o any pending requests. This stat is useful when a job does not seem to be able to use its assigned disk share by helping check if that is happening due to an IO controller bug or because the job is not submitting enough IOs. 3) idle_time - This is the amount of time spent by the IO scheduler idling for a given cgroup in anticipation of a better request than the exising ones from other queues/cgroups. All these stats are recorded using start and stop events. When reading these stats, we do not add the delta between the current time and the last start time if we're between the start and stop events. We avoid doing this to make sure that these numbers are always monotonically increasing when read. Since we're using sched_clock() which may use the tsc as its source, it may induce some inconsistency (due to tsc resync across cpus) if we included the current delta. Signed-off-by: Divyesh Shah<dpshah@google.com> Signed-off-by: Jens Axboe <jens.axboe@oracle.com>
2010-04-09 12:15:35 +08:00
lockdep_assert_held(blkg->q->queue_lock);
BUG_ON(blkio_blkg_idling(stats));
stats->start_idle_time = sched_clock();
blkio_mark_blkg_idling(stats);
blkio: Add more debug-only per-cgroup stats 1) group_wait_time - This is the amount of time the cgroup had to wait to get a timeslice for one of its queues from when it became busy, i.e., went from 0 to 1 request queued. This is different from the io_wait_time which is the cumulative total of the amount of time spent by each IO in that cgroup waiting in the scheduler queue. This stat is a great way to find out any jobs in the fleet that are being starved or waiting for longer than what is expected (due to an IO controller bug or any other issue). 2) empty_time - This is the amount of time a cgroup spends w/o any pending requests. This stat is useful when a job does not seem to be able to use its assigned disk share by helping check if that is happening due to an IO controller bug or because the job is not submitting enough IOs. 3) idle_time - This is the amount of time spent by the IO scheduler idling for a given cgroup in anticipation of a better request than the exising ones from other queues/cgroups. All these stats are recorded using start and stop events. When reading these stats, we do not add the delta between the current time and the last start time if we're between the start and stop events. We avoid doing this to make sure that these numbers are always monotonically increasing when read. Since we're using sched_clock() which may use the tsc as its source, it may induce some inconsistency (due to tsc resync across cpus) if we included the current delta. Signed-off-by: Divyesh Shah<dpshah@google.com> Signed-off-by: Jens Axboe <jens.axboe@oracle.com>
2010-04-09 12:15:35 +08:00
}
EXPORT_SYMBOL_GPL(blkiocg_update_set_idle_time_stats);
void blkiocg_update_idle_time_stats(struct blkio_group *blkg,
struct blkio_policy_type *pol)
blkio: Add more debug-only per-cgroup stats 1) group_wait_time - This is the amount of time the cgroup had to wait to get a timeslice for one of its queues from when it became busy, i.e., went from 0 to 1 request queued. This is different from the io_wait_time which is the cumulative total of the amount of time spent by each IO in that cgroup waiting in the scheduler queue. This stat is a great way to find out any jobs in the fleet that are being starved or waiting for longer than what is expected (due to an IO controller bug or any other issue). 2) empty_time - This is the amount of time a cgroup spends w/o any pending requests. This stat is useful when a job does not seem to be able to use its assigned disk share by helping check if that is happening due to an IO controller bug or because the job is not submitting enough IOs. 3) idle_time - This is the amount of time spent by the IO scheduler idling for a given cgroup in anticipation of a better request than the exising ones from other queues/cgroups. All these stats are recorded using start and stop events. When reading these stats, we do not add the delta between the current time and the last start time if we're between the start and stop events. We avoid doing this to make sure that these numbers are always monotonically increasing when read. Since we're using sched_clock() which may use the tsc as its source, it may induce some inconsistency (due to tsc resync across cpus) if we included the current delta. Signed-off-by: Divyesh Shah<dpshah@google.com> Signed-off-by: Jens Axboe <jens.axboe@oracle.com>
2010-04-09 12:15:35 +08:00
{
struct blkio_group_stats *stats = &blkg->pd[pol->plid]->stats;
lockdep_assert_held(blkg->q->queue_lock);
blkio: Add more debug-only per-cgroup stats 1) group_wait_time - This is the amount of time the cgroup had to wait to get a timeslice for one of its queues from when it became busy, i.e., went from 0 to 1 request queued. This is different from the io_wait_time which is the cumulative total of the amount of time spent by each IO in that cgroup waiting in the scheduler queue. This stat is a great way to find out any jobs in the fleet that are being starved or waiting for longer than what is expected (due to an IO controller bug or any other issue). 2) empty_time - This is the amount of time a cgroup spends w/o any pending requests. This stat is useful when a job does not seem to be able to use its assigned disk share by helping check if that is happening due to an IO controller bug or because the job is not submitting enough IOs. 3) idle_time - This is the amount of time spent by the IO scheduler idling for a given cgroup in anticipation of a better request than the exising ones from other queues/cgroups. All these stats are recorded using start and stop events. When reading these stats, we do not add the delta between the current time and the last start time if we're between the start and stop events. We avoid doing this to make sure that these numbers are always monotonically increasing when read. Since we're using sched_clock() which may use the tsc as its source, it may induce some inconsistency (due to tsc resync across cpus) if we included the current delta. Signed-off-by: Divyesh Shah<dpshah@google.com> Signed-off-by: Jens Axboe <jens.axboe@oracle.com>
2010-04-09 12:15:35 +08:00
if (blkio_blkg_idling(stats)) {
unsigned long long now = sched_clock();
blkcg: introduce blkg_stat and blkg_rwstat blkcg uses u64_stats_sync to avoid reading wrong u64 statistic values on 32bit archs and some stat counters have subtypes to distinguish read/writes and sync/async IOs. The stat code paths are confusing and involve a lot of going back and forth between blkcg core and specific policy implementations, and synchronization and subtype handling are open coded in blkcg core. This patch introduces struct blkg_stat and blkg_rwstat which, with accompanying operations, encapsulate stat updating and accessing with proper synchronization. blkg_stat is simple u64 counter with 64bit read-access protection. blkg_rwstat is the one with rw and [a]sync subcounters and takes @rw flags to distinguish IO subtypes (%REQ_WRITE and %REQ_SYNC) and replaces stat_sub_type indexed arrays. All counters in blkio_group_stats and blkio_group_stats_cpu are replaced with either blkg_stat or blkg_rwstat along with all users. This does add one u64_stats_sync per counter and increase stats_sync operations but they're empty/noops on 64bit archs and blkcg doesn't have too many counters, especially with DEBUG_BLK_CGROUP off. While the currently resulting code isn't necessarily simpler at the moment, this will enable further clean up of blkcg stats code. - BLKIO_STAT_{READ|WRITE|SYNC|ASYNC|TOTAL} renamed to BLKG_RWSTAT_{READ|WRITE|SYNC|ASYNC|TOTAL}. - blkg_stat_add() replaces blkio_add_stat() and blkio_check_and_dec_stat(). Note that BUG_ON() on underflow in the latter function no longer exists. It's *way* better to have underflowed stat counters than oopsing. - blkio_group_stats->dequeue is now a proper u64 stat counter instead of ulong. - reset_stats() updated to clear each stat counters individually and BLKG_STATS_DEBUG_CLEAR_{START|SIZE} are removed. - Some functions reconstruct rw flags from direction and sync booleans. This will be removed by future patches. Signed-off-by: Tejun Heo <tj@kernel.org>
2012-04-02 05:38:42 +08:00
if (time_after64(now, stats->start_idle_time))
blkg_stat_add(&stats->idle_time,
now - stats->start_idle_time);
blkio: Add more debug-only per-cgroup stats 1) group_wait_time - This is the amount of time the cgroup had to wait to get a timeslice for one of its queues from when it became busy, i.e., went from 0 to 1 request queued. This is different from the io_wait_time which is the cumulative total of the amount of time spent by each IO in that cgroup waiting in the scheduler queue. This stat is a great way to find out any jobs in the fleet that are being starved or waiting for longer than what is expected (due to an IO controller bug or any other issue). 2) empty_time - This is the amount of time a cgroup spends w/o any pending requests. This stat is useful when a job does not seem to be able to use its assigned disk share by helping check if that is happening due to an IO controller bug or because the job is not submitting enough IOs. 3) idle_time - This is the amount of time spent by the IO scheduler idling for a given cgroup in anticipation of a better request than the exising ones from other queues/cgroups. All these stats are recorded using start and stop events. When reading these stats, we do not add the delta between the current time and the last start time if we're between the start and stop events. We avoid doing this to make sure that these numbers are always monotonically increasing when read. Since we're using sched_clock() which may use the tsc as its source, it may induce some inconsistency (due to tsc resync across cpus) if we included the current delta. Signed-off-by: Divyesh Shah<dpshah@google.com> Signed-off-by: Jens Axboe <jens.axboe@oracle.com>
2010-04-09 12:15:35 +08:00
blkio_clear_blkg_idling(stats);
}
}
EXPORT_SYMBOL_GPL(blkiocg_update_idle_time_stats);
void blkiocg_update_avg_queue_size_stats(struct blkio_group *blkg,
struct blkio_policy_type *pol)
{
struct blkio_group_stats *stats = &blkg->pd[pol->plid]->stats;
lockdep_assert_held(blkg->q->queue_lock);
blkcg: introduce blkg_stat and blkg_rwstat blkcg uses u64_stats_sync to avoid reading wrong u64 statistic values on 32bit archs and some stat counters have subtypes to distinguish read/writes and sync/async IOs. The stat code paths are confusing and involve a lot of going back and forth between blkcg core and specific policy implementations, and synchronization and subtype handling are open coded in blkcg core. This patch introduces struct blkg_stat and blkg_rwstat which, with accompanying operations, encapsulate stat updating and accessing with proper synchronization. blkg_stat is simple u64 counter with 64bit read-access protection. blkg_rwstat is the one with rw and [a]sync subcounters and takes @rw flags to distinguish IO subtypes (%REQ_WRITE and %REQ_SYNC) and replaces stat_sub_type indexed arrays. All counters in blkio_group_stats and blkio_group_stats_cpu are replaced with either blkg_stat or blkg_rwstat along with all users. This does add one u64_stats_sync per counter and increase stats_sync operations but they're empty/noops on 64bit archs and blkcg doesn't have too many counters, especially with DEBUG_BLK_CGROUP off. While the currently resulting code isn't necessarily simpler at the moment, this will enable further clean up of blkcg stats code. - BLKIO_STAT_{READ|WRITE|SYNC|ASYNC|TOTAL} renamed to BLKG_RWSTAT_{READ|WRITE|SYNC|ASYNC|TOTAL}. - blkg_stat_add() replaces blkio_add_stat() and blkio_check_and_dec_stat(). Note that BUG_ON() on underflow in the latter function no longer exists. It's *way* better to have underflowed stat counters than oopsing. - blkio_group_stats->dequeue is now a proper u64 stat counter instead of ulong. - reset_stats() updated to clear each stat counters individually and BLKG_STATS_DEBUG_CLEAR_{START|SIZE} are removed. - Some functions reconstruct rw flags from direction and sync booleans. This will be removed by future patches. Signed-off-by: Tejun Heo <tj@kernel.org>
2012-04-02 05:38:42 +08:00
blkg_stat_add(&stats->avg_queue_size_sum,
blkg_rwstat_sum(&stats->queued));
blkg_stat_add(&stats->avg_queue_size_samples, 1);
blkio: Add more debug-only per-cgroup stats 1) group_wait_time - This is the amount of time the cgroup had to wait to get a timeslice for one of its queues from when it became busy, i.e., went from 0 to 1 request queued. This is different from the io_wait_time which is the cumulative total of the amount of time spent by each IO in that cgroup waiting in the scheduler queue. This stat is a great way to find out any jobs in the fleet that are being starved or waiting for longer than what is expected (due to an IO controller bug or any other issue). 2) empty_time - This is the amount of time a cgroup spends w/o any pending requests. This stat is useful when a job does not seem to be able to use its assigned disk share by helping check if that is happening due to an IO controller bug or because the job is not submitting enough IOs. 3) idle_time - This is the amount of time spent by the IO scheduler idling for a given cgroup in anticipation of a better request than the exising ones from other queues/cgroups. All these stats are recorded using start and stop events. When reading these stats, we do not add the delta between the current time and the last start time if we're between the start and stop events. We avoid doing this to make sure that these numbers are always monotonically increasing when read. Since we're using sched_clock() which may use the tsc as its source, it may induce some inconsistency (due to tsc resync across cpus) if we included the current delta. Signed-off-by: Divyesh Shah<dpshah@google.com> Signed-off-by: Jens Axboe <jens.axboe@oracle.com>
2010-04-09 12:15:35 +08:00
blkio_update_group_wait_time(stats);
}
EXPORT_SYMBOL_GPL(blkiocg_update_avg_queue_size_stats);
void blkiocg_set_start_empty_time(struct blkio_group *blkg,
struct blkio_policy_type *pol)
{
struct blkio_group_stats *stats = &blkg->pd[pol->plid]->stats;
lockdep_assert_held(blkg->q->queue_lock);
blkcg: introduce blkg_stat and blkg_rwstat blkcg uses u64_stats_sync to avoid reading wrong u64 statistic values on 32bit archs and some stat counters have subtypes to distinguish read/writes and sync/async IOs. The stat code paths are confusing and involve a lot of going back and forth between blkcg core and specific policy implementations, and synchronization and subtype handling are open coded in blkcg core. This patch introduces struct blkg_stat and blkg_rwstat which, with accompanying operations, encapsulate stat updating and accessing with proper synchronization. blkg_stat is simple u64 counter with 64bit read-access protection. blkg_rwstat is the one with rw and [a]sync subcounters and takes @rw flags to distinguish IO subtypes (%REQ_WRITE and %REQ_SYNC) and replaces stat_sub_type indexed arrays. All counters in blkio_group_stats and blkio_group_stats_cpu are replaced with either blkg_stat or blkg_rwstat along with all users. This does add one u64_stats_sync per counter and increase stats_sync operations but they're empty/noops on 64bit archs and blkcg doesn't have too many counters, especially with DEBUG_BLK_CGROUP off. While the currently resulting code isn't necessarily simpler at the moment, this will enable further clean up of blkcg stats code. - BLKIO_STAT_{READ|WRITE|SYNC|ASYNC|TOTAL} renamed to BLKG_RWSTAT_{READ|WRITE|SYNC|ASYNC|TOTAL}. - blkg_stat_add() replaces blkio_add_stat() and blkio_check_and_dec_stat(). Note that BUG_ON() on underflow in the latter function no longer exists. It's *way* better to have underflowed stat counters than oopsing. - blkio_group_stats->dequeue is now a proper u64 stat counter instead of ulong. - reset_stats() updated to clear each stat counters individually and BLKG_STATS_DEBUG_CLEAR_{START|SIZE} are removed. - Some functions reconstruct rw flags from direction and sync booleans. This will be removed by future patches. Signed-off-by: Tejun Heo <tj@kernel.org>
2012-04-02 05:38:42 +08:00
if (blkg_rwstat_sum(&stats->queued))
return;
/*
blkio: Fix another BUG_ON() crash due to cfqq movement across groups o Once in a while, I was hitting a BUG_ON() in blkio code. empty_time was assuming that upon slice expiry, group can't be marked empty already (except forced dispatch). But this assumption is broken if cfqq can move (group_isolation=0) across groups after receiving a request. I think most likely in this case we got a request in a cfqq and accounted the rq in one group, later while adding the cfqq to tree, we moved the queue to a different group which was already marked empty and after dispatch from slice we found group already marked empty and raised alarm. This patch does not error out if group is already marked empty. This can introduce some empty_time stat error only in case of group_isolation=0. This is better than crashing. In case of group_isolation=1 we should still get same stats as before this patch. [ 222.308546] ------------[ cut here ]------------ [ 222.309311] kernel BUG at block/blk-cgroup.c:236! [ 222.309311] invalid opcode: 0000 [#1] SMP [ 222.309311] last sysfs file: /sys/devices/virtual/block/dm-3/queue/scheduler [ 222.309311] CPU 1 [ 222.309311] Modules linked in: dm_round_robin dm_multipath qla2xxx scsi_transport_fc dm_zero dm_mirror dm_region_hash dm_log dm_mod [last unloaded: scsi_wait_scan] [ 222.309311] [ 222.309311] Pid: 4780, comm: fio Not tainted 2.6.34-rc4-blkio-config #68 0A98h/HP xw8600 Workstation [ 222.309311] RIP: 0010:[<ffffffff8121ad88>] [<ffffffff8121ad88>] blkiocg_set_start_empty_time+0x50/0x83 [ 222.309311] RSP: 0018:ffff8800ba6e79f8 EFLAGS: 00010002 [ 222.309311] RAX: 0000000000000082 RBX: ffff8800a13b7990 RCX: ffff8800a13b7808 [ 222.309311] RDX: 0000000000002121 RSI: 0000000000000082 RDI: ffff8800a13b7a30 [ 222.309311] RBP: ffff8800ba6e7a18 R08: 0000000000000000 R09: 0000000000000001 [ 222.309311] R10: 000000000002f8c8 R11: ffff8800ba6e7ad8 R12: ffff8800a13b78ff [ 222.309311] R13: ffff8800a13b7990 R14: 0000000000000001 R15: ffff8800a13b7808 [ 222.309311] FS: 00007f3beec476f0(0000) GS:ffff880001e40000(0000) knlGS:0000000000000000 [ 222.309311] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 [ 222.309311] CR2: 000000000040e7f0 CR3: 00000000a12d5000 CR4: 00000000000006e0 [ 222.309311] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000 [ 222.309311] DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400 [ 222.309311] Process fio (pid: 4780, threadinfo ffff8800ba6e6000, task ffff8800b3d6bf00) [ 222.309311] Stack: [ 222.309311] 0000000000000001 ffff8800bab17a48 ffff8800bab17a48 ffff8800a13b7800 [ 222.309311] <0> ffff8800ba6e7a68 ffffffff8121da35 ffff880000000001 00ff8800ba5c5698 [ 222.309311] <0> ffff8800ba6e7a68 ffff8800a13b7800 0000000000000000 ffff8800bab17a48 [ 222.309311] Call Trace: [ 222.309311] [<ffffffff8121da35>] __cfq_slice_expired+0x2af/0x3ec [ 222.309311] [<ffffffff8121fd7b>] cfq_dispatch_requests+0x2c8/0x8e8 [ 222.309311] [<ffffffff8120f1cd>] ? spin_unlock_irqrestore+0xe/0x10 [ 222.309311] [<ffffffff8120fb1a>] ? blk_insert_cloned_request+0x70/0x7b [ 222.309311] [<ffffffff81210461>] blk_peek_request+0x191/0x1a7 [ 222.309311] [<ffffffffa0002799>] dm_request_fn+0x38/0x14c [dm_mod] [ 222.309311] [<ffffffff810ae61f>] ? sync_page_killable+0x0/0x35 [ 222.309311] [<ffffffff81210fd4>] __generic_unplug_device+0x32/0x37 [ 222.309311] [<ffffffff81211274>] generic_unplug_device+0x2e/0x3c [ 222.309311] [<ffffffffa00011a6>] dm_unplug_all+0x42/0x5b [dm_mod] [ 222.309311] [<ffffffff8120ca37>] blk_unplug+0x29/0x2d [ 222.309311] [<ffffffff8120ca4d>] blk_backing_dev_unplug+0x12/0x14 [ 222.309311] [<ffffffff81109a7a>] block_sync_page+0x35/0x39 [ 222.309311] [<ffffffff810ae616>] sync_page+0x41/0x4a [ 222.309311] [<ffffffff810ae62d>] sync_page_killable+0xe/0x35 [ 222.309311] [<ffffffff8158aa59>] __wait_on_bit_lock+0x46/0x8f [ 222.309311] [<ffffffff810ae4f5>] __lock_page_killable+0x66/0x6d [ 222.309311] [<ffffffff81056f9c>] ? wake_bit_function+0x0/0x33 [ 222.309311] [<ffffffff810ae528>] lock_page_killable+0x2c/0x2e [ 222.309311] [<ffffffff810afbc5>] generic_file_aio_read+0x361/0x4f0 [ 222.309311] [<ffffffff810ea044>] do_sync_read+0xcb/0x108 [ 222.309311] [<ffffffff811e42f7>] ? security_file_permission+0x16/0x18 [ 222.309311] [<ffffffff810ea6ab>] vfs_read+0xab/0x108 [ 222.309311] [<ffffffff810ea7c8>] sys_read+0x4a/0x6e [ 222.309311] [<ffffffff81002b5b>] system_call_fastpath+0x16/0x1b [ 222.309311] Code: 58 01 00 00 00 48 89 c6 75 0a 48 83 bb 60 01 00 00 00 74 09 48 8d bb a0 00 00 00 eb 35 41 fe cc 74 0d f6 83 c0 01 00 00 04 74 04 <0f> 0b eb fe 48 89 75 e8 e8 be e0 de ff 66 83 8b c0 01 00 00 04 [ 222.309311] RIP [<ffffffff8121ad88>] blkiocg_set_start_empty_time+0x50/0x83 [ 222.309311] RSP <ffff8800ba6e79f8> [ 222.309311] ---[ end trace 32b4f71dffc15712 ]--- Signed-off-by: Vivek Goyal <vgoyal@redhat.com> Acked-by: Divyesh Shah <dpshah@google.com> Signed-off-by: Jens Axboe <jens.axboe@oracle.com>
2010-04-27 01:25:11 +08:00
* group is already marked empty. This can happen if cfqq got new
* request in parent group and moved to this group while being added
* to service tree. Just ignore the event and move on.
*/
if (blkio_blkg_empty(stats))
blkio: Fix another BUG_ON() crash due to cfqq movement across groups o Once in a while, I was hitting a BUG_ON() in blkio code. empty_time was assuming that upon slice expiry, group can't be marked empty already (except forced dispatch). But this assumption is broken if cfqq can move (group_isolation=0) across groups after receiving a request. I think most likely in this case we got a request in a cfqq and accounted the rq in one group, later while adding the cfqq to tree, we moved the queue to a different group which was already marked empty and after dispatch from slice we found group already marked empty and raised alarm. This patch does not error out if group is already marked empty. This can introduce some empty_time stat error only in case of group_isolation=0. This is better than crashing. In case of group_isolation=1 we should still get same stats as before this patch. [ 222.308546] ------------[ cut here ]------------ [ 222.309311] kernel BUG at block/blk-cgroup.c:236! [ 222.309311] invalid opcode: 0000 [#1] SMP [ 222.309311] last sysfs file: /sys/devices/virtual/block/dm-3/queue/scheduler [ 222.309311] CPU 1 [ 222.309311] Modules linked in: dm_round_robin dm_multipath qla2xxx scsi_transport_fc dm_zero dm_mirror dm_region_hash dm_log dm_mod [last unloaded: scsi_wait_scan] [ 222.309311] [ 222.309311] Pid: 4780, comm: fio Not tainted 2.6.34-rc4-blkio-config #68 0A98h/HP xw8600 Workstation [ 222.309311] RIP: 0010:[<ffffffff8121ad88>] [<ffffffff8121ad88>] blkiocg_set_start_empty_time+0x50/0x83 [ 222.309311] RSP: 0018:ffff8800ba6e79f8 EFLAGS: 00010002 [ 222.309311] RAX: 0000000000000082 RBX: ffff8800a13b7990 RCX: ffff8800a13b7808 [ 222.309311] RDX: 0000000000002121 RSI: 0000000000000082 RDI: ffff8800a13b7a30 [ 222.309311] RBP: ffff8800ba6e7a18 R08: 0000000000000000 R09: 0000000000000001 [ 222.309311] R10: 000000000002f8c8 R11: ffff8800ba6e7ad8 R12: ffff8800a13b78ff [ 222.309311] R13: ffff8800a13b7990 R14: 0000000000000001 R15: ffff8800a13b7808 [ 222.309311] FS: 00007f3beec476f0(0000) GS:ffff880001e40000(0000) knlGS:0000000000000000 [ 222.309311] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 [ 222.309311] CR2: 000000000040e7f0 CR3: 00000000a12d5000 CR4: 00000000000006e0 [ 222.309311] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000 [ 222.309311] DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400 [ 222.309311] Process fio (pid: 4780, threadinfo ffff8800ba6e6000, task ffff8800b3d6bf00) [ 222.309311] Stack: [ 222.309311] 0000000000000001 ffff8800bab17a48 ffff8800bab17a48 ffff8800a13b7800 [ 222.309311] <0> ffff8800ba6e7a68 ffffffff8121da35 ffff880000000001 00ff8800ba5c5698 [ 222.309311] <0> ffff8800ba6e7a68 ffff8800a13b7800 0000000000000000 ffff8800bab17a48 [ 222.309311] Call Trace: [ 222.309311] [<ffffffff8121da35>] __cfq_slice_expired+0x2af/0x3ec [ 222.309311] [<ffffffff8121fd7b>] cfq_dispatch_requests+0x2c8/0x8e8 [ 222.309311] [<ffffffff8120f1cd>] ? spin_unlock_irqrestore+0xe/0x10 [ 222.309311] [<ffffffff8120fb1a>] ? blk_insert_cloned_request+0x70/0x7b [ 222.309311] [<ffffffff81210461>] blk_peek_request+0x191/0x1a7 [ 222.309311] [<ffffffffa0002799>] dm_request_fn+0x38/0x14c [dm_mod] [ 222.309311] [<ffffffff810ae61f>] ? sync_page_killable+0x0/0x35 [ 222.309311] [<ffffffff81210fd4>] __generic_unplug_device+0x32/0x37 [ 222.309311] [<ffffffff81211274>] generic_unplug_device+0x2e/0x3c [ 222.309311] [<ffffffffa00011a6>] dm_unplug_all+0x42/0x5b [dm_mod] [ 222.309311] [<ffffffff8120ca37>] blk_unplug+0x29/0x2d [ 222.309311] [<ffffffff8120ca4d>] blk_backing_dev_unplug+0x12/0x14 [ 222.309311] [<ffffffff81109a7a>] block_sync_page+0x35/0x39 [ 222.309311] [<ffffffff810ae616>] sync_page+0x41/0x4a [ 222.309311] [<ffffffff810ae62d>] sync_page_killable+0xe/0x35 [ 222.309311] [<ffffffff8158aa59>] __wait_on_bit_lock+0x46/0x8f [ 222.309311] [<ffffffff810ae4f5>] __lock_page_killable+0x66/0x6d [ 222.309311] [<ffffffff81056f9c>] ? wake_bit_function+0x0/0x33 [ 222.309311] [<ffffffff810ae528>] lock_page_killable+0x2c/0x2e [ 222.309311] [<ffffffff810afbc5>] generic_file_aio_read+0x361/0x4f0 [ 222.309311] [<ffffffff810ea044>] do_sync_read+0xcb/0x108 [ 222.309311] [<ffffffff811e42f7>] ? security_file_permission+0x16/0x18 [ 222.309311] [<ffffffff810ea6ab>] vfs_read+0xab/0x108 [ 222.309311] [<ffffffff810ea7c8>] sys_read+0x4a/0x6e [ 222.309311] [<ffffffff81002b5b>] system_call_fastpath+0x16/0x1b [ 222.309311] Code: 58 01 00 00 00 48 89 c6 75 0a 48 83 bb 60 01 00 00 00 74 09 48 8d bb a0 00 00 00 eb 35 41 fe cc 74 0d f6 83 c0 01 00 00 04 74 04 <0f> 0b eb fe 48 89 75 e8 e8 be e0 de ff 66 83 8b c0 01 00 00 04 [ 222.309311] RIP [<ffffffff8121ad88>] blkiocg_set_start_empty_time+0x50/0x83 [ 222.309311] RSP <ffff8800ba6e79f8> [ 222.309311] ---[ end trace 32b4f71dffc15712 ]--- Signed-off-by: Vivek Goyal <vgoyal@redhat.com> Acked-by: Divyesh Shah <dpshah@google.com> Signed-off-by: Jens Axboe <jens.axboe@oracle.com>
2010-04-27 01:25:11 +08:00
return;
stats->start_empty_time = sched_clock();
blkio_mark_blkg_empty(stats);
}
EXPORT_SYMBOL_GPL(blkiocg_set_start_empty_time);
void blkiocg_update_dequeue_stats(struct blkio_group *blkg,
struct blkio_policy_type *pol,
unsigned long dequeue)
{
struct blkg_policy_data *pd = blkg->pd[pol->plid];
lockdep_assert_held(blkg->q->queue_lock);
blkcg: introduce blkg_stat and blkg_rwstat blkcg uses u64_stats_sync to avoid reading wrong u64 statistic values on 32bit archs and some stat counters have subtypes to distinguish read/writes and sync/async IOs. The stat code paths are confusing and involve a lot of going back and forth between blkcg core and specific policy implementations, and synchronization and subtype handling are open coded in blkcg core. This patch introduces struct blkg_stat and blkg_rwstat which, with accompanying operations, encapsulate stat updating and accessing with proper synchronization. blkg_stat is simple u64 counter with 64bit read-access protection. blkg_rwstat is the one with rw and [a]sync subcounters and takes @rw flags to distinguish IO subtypes (%REQ_WRITE and %REQ_SYNC) and replaces stat_sub_type indexed arrays. All counters in blkio_group_stats and blkio_group_stats_cpu are replaced with either blkg_stat or blkg_rwstat along with all users. This does add one u64_stats_sync per counter and increase stats_sync operations but they're empty/noops on 64bit archs and blkcg doesn't have too many counters, especially with DEBUG_BLK_CGROUP off. While the currently resulting code isn't necessarily simpler at the moment, this will enable further clean up of blkcg stats code. - BLKIO_STAT_{READ|WRITE|SYNC|ASYNC|TOTAL} renamed to BLKG_RWSTAT_{READ|WRITE|SYNC|ASYNC|TOTAL}. - blkg_stat_add() replaces blkio_add_stat() and blkio_check_and_dec_stat(). Note that BUG_ON() on underflow in the latter function no longer exists. It's *way* better to have underflowed stat counters than oopsing. - blkio_group_stats->dequeue is now a proper u64 stat counter instead of ulong. - reset_stats() updated to clear each stat counters individually and BLKG_STATS_DEBUG_CLEAR_{START|SIZE} are removed. - Some functions reconstruct rw flags from direction and sync booleans. This will be removed by future patches. Signed-off-by: Tejun Heo <tj@kernel.org>
2012-04-02 05:38:42 +08:00
blkg_stat_add(&pd->stats.dequeue, dequeue);
}
EXPORT_SYMBOL_GPL(blkiocg_update_dequeue_stats);
blkio: Add more debug-only per-cgroup stats 1) group_wait_time - This is the amount of time the cgroup had to wait to get a timeslice for one of its queues from when it became busy, i.e., went from 0 to 1 request queued. This is different from the io_wait_time which is the cumulative total of the amount of time spent by each IO in that cgroup waiting in the scheduler queue. This stat is a great way to find out any jobs in the fleet that are being starved or waiting for longer than what is expected (due to an IO controller bug or any other issue). 2) empty_time - This is the amount of time a cgroup spends w/o any pending requests. This stat is useful when a job does not seem to be able to use its assigned disk share by helping check if that is happening due to an IO controller bug or because the job is not submitting enough IOs. 3) idle_time - This is the amount of time spent by the IO scheduler idling for a given cgroup in anticipation of a better request than the exising ones from other queues/cgroups. All these stats are recorded using start and stop events. When reading these stats, we do not add the delta between the current time and the last start time if we're between the start and stop events. We avoid doing this to make sure that these numbers are always monotonically increasing when read. Since we're using sched_clock() which may use the tsc as its source, it may induce some inconsistency (due to tsc resync across cpus) if we included the current delta. Signed-off-by: Divyesh Shah<dpshah@google.com> Signed-off-by: Jens Axboe <jens.axboe@oracle.com>
2010-04-09 12:15:35 +08:00
#else
static inline void blkio_set_start_group_wait_time(struct blkio_group *blkg,
struct blkio_policy_type *pol,
struct blkio_group *curr_blkg) { }
static inline void blkio_end_empty_time(struct blkio_group_stats *stats) { }
#endif
void blkiocg_update_io_add_stats(struct blkio_group *blkg,
struct blkio_policy_type *pol,
struct blkio_group *curr_blkg, bool direction,
bool sync)
{
struct blkio_group_stats *stats = &blkg->pd[pol->plid]->stats;
blkcg: introduce blkg_stat and blkg_rwstat blkcg uses u64_stats_sync to avoid reading wrong u64 statistic values on 32bit archs and some stat counters have subtypes to distinguish read/writes and sync/async IOs. The stat code paths are confusing and involve a lot of going back and forth between blkcg core and specific policy implementations, and synchronization and subtype handling are open coded in blkcg core. This patch introduces struct blkg_stat and blkg_rwstat which, with accompanying operations, encapsulate stat updating and accessing with proper synchronization. blkg_stat is simple u64 counter with 64bit read-access protection. blkg_rwstat is the one with rw and [a]sync subcounters and takes @rw flags to distinguish IO subtypes (%REQ_WRITE and %REQ_SYNC) and replaces stat_sub_type indexed arrays. All counters in blkio_group_stats and blkio_group_stats_cpu are replaced with either blkg_stat or blkg_rwstat along with all users. This does add one u64_stats_sync per counter and increase stats_sync operations but they're empty/noops on 64bit archs and blkcg doesn't have too many counters, especially with DEBUG_BLK_CGROUP off. While the currently resulting code isn't necessarily simpler at the moment, this will enable further clean up of blkcg stats code. - BLKIO_STAT_{READ|WRITE|SYNC|ASYNC|TOTAL} renamed to BLKG_RWSTAT_{READ|WRITE|SYNC|ASYNC|TOTAL}. - blkg_stat_add() replaces blkio_add_stat() and blkio_check_and_dec_stat(). Note that BUG_ON() on underflow in the latter function no longer exists. It's *way* better to have underflowed stat counters than oopsing. - blkio_group_stats->dequeue is now a proper u64 stat counter instead of ulong. - reset_stats() updated to clear each stat counters individually and BLKG_STATS_DEBUG_CLEAR_{START|SIZE} are removed. - Some functions reconstruct rw flags from direction and sync booleans. This will be removed by future patches. Signed-off-by: Tejun Heo <tj@kernel.org>
2012-04-02 05:38:42 +08:00
int rw = (direction ? REQ_WRITE : 0) | (sync ? REQ_SYNC : 0);
lockdep_assert_held(blkg->q->queue_lock);
blkcg: introduce blkg_stat and blkg_rwstat blkcg uses u64_stats_sync to avoid reading wrong u64 statistic values on 32bit archs and some stat counters have subtypes to distinguish read/writes and sync/async IOs. The stat code paths are confusing and involve a lot of going back and forth between blkcg core and specific policy implementations, and synchronization and subtype handling are open coded in blkcg core. This patch introduces struct blkg_stat and blkg_rwstat which, with accompanying operations, encapsulate stat updating and accessing with proper synchronization. blkg_stat is simple u64 counter with 64bit read-access protection. blkg_rwstat is the one with rw and [a]sync subcounters and takes @rw flags to distinguish IO subtypes (%REQ_WRITE and %REQ_SYNC) and replaces stat_sub_type indexed arrays. All counters in blkio_group_stats and blkio_group_stats_cpu are replaced with either blkg_stat or blkg_rwstat along with all users. This does add one u64_stats_sync per counter and increase stats_sync operations but they're empty/noops on 64bit archs and blkcg doesn't have too many counters, especially with DEBUG_BLK_CGROUP off. While the currently resulting code isn't necessarily simpler at the moment, this will enable further clean up of blkcg stats code. - BLKIO_STAT_{READ|WRITE|SYNC|ASYNC|TOTAL} renamed to BLKG_RWSTAT_{READ|WRITE|SYNC|ASYNC|TOTAL}. - blkg_stat_add() replaces blkio_add_stat() and blkio_check_and_dec_stat(). Note that BUG_ON() on underflow in the latter function no longer exists. It's *way* better to have underflowed stat counters than oopsing. - blkio_group_stats->dequeue is now a proper u64 stat counter instead of ulong. - reset_stats() updated to clear each stat counters individually and BLKG_STATS_DEBUG_CLEAR_{START|SIZE} are removed. - Some functions reconstruct rw flags from direction and sync booleans. This will be removed by future patches. Signed-off-by: Tejun Heo <tj@kernel.org>
2012-04-02 05:38:42 +08:00
blkg_rwstat_add(&stats->queued, rw, 1);
blkio_end_empty_time(stats);
blkio_set_start_group_wait_time(blkg, pol, curr_blkg);
}
EXPORT_SYMBOL_GPL(blkiocg_update_io_add_stats);
void blkiocg_update_io_remove_stats(struct blkio_group *blkg,
struct blkio_policy_type *pol,
bool direction, bool sync)
{
struct blkio_group_stats *stats = &blkg->pd[pol->plid]->stats;
blkcg: introduce blkg_stat and blkg_rwstat blkcg uses u64_stats_sync to avoid reading wrong u64 statistic values on 32bit archs and some stat counters have subtypes to distinguish read/writes and sync/async IOs. The stat code paths are confusing and involve a lot of going back and forth between blkcg core and specific policy implementations, and synchronization and subtype handling are open coded in blkcg core. This patch introduces struct blkg_stat and blkg_rwstat which, with accompanying operations, encapsulate stat updating and accessing with proper synchronization. blkg_stat is simple u64 counter with 64bit read-access protection. blkg_rwstat is the one with rw and [a]sync subcounters and takes @rw flags to distinguish IO subtypes (%REQ_WRITE and %REQ_SYNC) and replaces stat_sub_type indexed arrays. All counters in blkio_group_stats and blkio_group_stats_cpu are replaced with either blkg_stat or blkg_rwstat along with all users. This does add one u64_stats_sync per counter and increase stats_sync operations but they're empty/noops on 64bit archs and blkcg doesn't have too many counters, especially with DEBUG_BLK_CGROUP off. While the currently resulting code isn't necessarily simpler at the moment, this will enable further clean up of blkcg stats code. - BLKIO_STAT_{READ|WRITE|SYNC|ASYNC|TOTAL} renamed to BLKG_RWSTAT_{READ|WRITE|SYNC|ASYNC|TOTAL}. - blkg_stat_add() replaces blkio_add_stat() and blkio_check_and_dec_stat(). Note that BUG_ON() on underflow in the latter function no longer exists. It's *way* better to have underflowed stat counters than oopsing. - blkio_group_stats->dequeue is now a proper u64 stat counter instead of ulong. - reset_stats() updated to clear each stat counters individually and BLKG_STATS_DEBUG_CLEAR_{START|SIZE} are removed. - Some functions reconstruct rw flags from direction and sync booleans. This will be removed by future patches. Signed-off-by: Tejun Heo <tj@kernel.org>
2012-04-02 05:38:42 +08:00
int rw = (direction ? REQ_WRITE : 0) | (sync ? REQ_SYNC : 0);
lockdep_assert_held(blkg->q->queue_lock);
blkcg: introduce blkg_stat and blkg_rwstat blkcg uses u64_stats_sync to avoid reading wrong u64 statistic values on 32bit archs and some stat counters have subtypes to distinguish read/writes and sync/async IOs. The stat code paths are confusing and involve a lot of going back and forth between blkcg core and specific policy implementations, and synchronization and subtype handling are open coded in blkcg core. This patch introduces struct blkg_stat and blkg_rwstat which, with accompanying operations, encapsulate stat updating and accessing with proper synchronization. blkg_stat is simple u64 counter with 64bit read-access protection. blkg_rwstat is the one with rw and [a]sync subcounters and takes @rw flags to distinguish IO subtypes (%REQ_WRITE and %REQ_SYNC) and replaces stat_sub_type indexed arrays. All counters in blkio_group_stats and blkio_group_stats_cpu are replaced with either blkg_stat or blkg_rwstat along with all users. This does add one u64_stats_sync per counter and increase stats_sync operations but they're empty/noops on 64bit archs and blkcg doesn't have too many counters, especially with DEBUG_BLK_CGROUP off. While the currently resulting code isn't necessarily simpler at the moment, this will enable further clean up of blkcg stats code. - BLKIO_STAT_{READ|WRITE|SYNC|ASYNC|TOTAL} renamed to BLKG_RWSTAT_{READ|WRITE|SYNC|ASYNC|TOTAL}. - blkg_stat_add() replaces blkio_add_stat() and blkio_check_and_dec_stat(). Note that BUG_ON() on underflow in the latter function no longer exists. It's *way* better to have underflowed stat counters than oopsing. - blkio_group_stats->dequeue is now a proper u64 stat counter instead of ulong. - reset_stats() updated to clear each stat counters individually and BLKG_STATS_DEBUG_CLEAR_{START|SIZE} are removed. - Some functions reconstruct rw flags from direction and sync booleans. This will be removed by future patches. Signed-off-by: Tejun Heo <tj@kernel.org>
2012-04-02 05:38:42 +08:00
blkg_rwstat_add(&stats->queued, rw, -1);
}
EXPORT_SYMBOL_GPL(blkiocg_update_io_remove_stats);
void blkiocg_update_timeslice_used(struct blkio_group *blkg,
struct blkio_policy_type *pol,
unsigned long time,
unsigned long unaccounted_time)
{
struct blkio_group_stats *stats = &blkg->pd[pol->plid]->stats;
lockdep_assert_held(blkg->q->queue_lock);
blkcg: introduce blkg_stat and blkg_rwstat blkcg uses u64_stats_sync to avoid reading wrong u64 statistic values on 32bit archs and some stat counters have subtypes to distinguish read/writes and sync/async IOs. The stat code paths are confusing and involve a lot of going back and forth between blkcg core and specific policy implementations, and synchronization and subtype handling are open coded in blkcg core. This patch introduces struct blkg_stat and blkg_rwstat which, with accompanying operations, encapsulate stat updating and accessing with proper synchronization. blkg_stat is simple u64 counter with 64bit read-access protection. blkg_rwstat is the one with rw and [a]sync subcounters and takes @rw flags to distinguish IO subtypes (%REQ_WRITE and %REQ_SYNC) and replaces stat_sub_type indexed arrays. All counters in blkio_group_stats and blkio_group_stats_cpu are replaced with either blkg_stat or blkg_rwstat along with all users. This does add one u64_stats_sync per counter and increase stats_sync operations but they're empty/noops on 64bit archs and blkcg doesn't have too many counters, especially with DEBUG_BLK_CGROUP off. While the currently resulting code isn't necessarily simpler at the moment, this will enable further clean up of blkcg stats code. - BLKIO_STAT_{READ|WRITE|SYNC|ASYNC|TOTAL} renamed to BLKG_RWSTAT_{READ|WRITE|SYNC|ASYNC|TOTAL}. - blkg_stat_add() replaces blkio_add_stat() and blkio_check_and_dec_stat(). Note that BUG_ON() on underflow in the latter function no longer exists. It's *way* better to have underflowed stat counters than oopsing. - blkio_group_stats->dequeue is now a proper u64 stat counter instead of ulong. - reset_stats() updated to clear each stat counters individually and BLKG_STATS_DEBUG_CLEAR_{START|SIZE} are removed. - Some functions reconstruct rw flags from direction and sync booleans. This will be removed by future patches. Signed-off-by: Tejun Heo <tj@kernel.org>
2012-04-02 05:38:42 +08:00
blkg_stat_add(&stats->time, time);
#ifdef CONFIG_DEBUG_BLK_CGROUP
blkcg: introduce blkg_stat and blkg_rwstat blkcg uses u64_stats_sync to avoid reading wrong u64 statistic values on 32bit archs and some stat counters have subtypes to distinguish read/writes and sync/async IOs. The stat code paths are confusing and involve a lot of going back and forth between blkcg core and specific policy implementations, and synchronization and subtype handling are open coded in blkcg core. This patch introduces struct blkg_stat and blkg_rwstat which, with accompanying operations, encapsulate stat updating and accessing with proper synchronization. blkg_stat is simple u64 counter with 64bit read-access protection. blkg_rwstat is the one with rw and [a]sync subcounters and takes @rw flags to distinguish IO subtypes (%REQ_WRITE and %REQ_SYNC) and replaces stat_sub_type indexed arrays. All counters in blkio_group_stats and blkio_group_stats_cpu are replaced with either blkg_stat or blkg_rwstat along with all users. This does add one u64_stats_sync per counter and increase stats_sync operations but they're empty/noops on 64bit archs and blkcg doesn't have too many counters, especially with DEBUG_BLK_CGROUP off. While the currently resulting code isn't necessarily simpler at the moment, this will enable further clean up of blkcg stats code. - BLKIO_STAT_{READ|WRITE|SYNC|ASYNC|TOTAL} renamed to BLKG_RWSTAT_{READ|WRITE|SYNC|ASYNC|TOTAL}. - blkg_stat_add() replaces blkio_add_stat() and blkio_check_and_dec_stat(). Note that BUG_ON() on underflow in the latter function no longer exists. It's *way* better to have underflowed stat counters than oopsing. - blkio_group_stats->dequeue is now a proper u64 stat counter instead of ulong. - reset_stats() updated to clear each stat counters individually and BLKG_STATS_DEBUG_CLEAR_{START|SIZE} are removed. - Some functions reconstruct rw flags from direction and sync booleans. This will be removed by future patches. Signed-off-by: Tejun Heo <tj@kernel.org>
2012-04-02 05:38:42 +08:00
blkg_stat_add(&stats->unaccounted_time, unaccounted_time);
#endif
}
EXPORT_SYMBOL_GPL(blkiocg_update_timeslice_used);
/*
* should be called under rcu read lock or queue lock to make sure blkg pointer
* is valid.
*/
2010-04-09 14:31:19 +08:00
void blkiocg_update_dispatch_stats(struct blkio_group *blkg,
struct blkio_policy_type *pol,
uint64_t bytes, bool direction, bool sync)
{
blkcg: introduce blkg_stat and blkg_rwstat blkcg uses u64_stats_sync to avoid reading wrong u64 statistic values on 32bit archs and some stat counters have subtypes to distinguish read/writes and sync/async IOs. The stat code paths are confusing and involve a lot of going back and forth between blkcg core and specific policy implementations, and synchronization and subtype handling are open coded in blkcg core. This patch introduces struct blkg_stat and blkg_rwstat which, with accompanying operations, encapsulate stat updating and accessing with proper synchronization. blkg_stat is simple u64 counter with 64bit read-access protection. blkg_rwstat is the one with rw and [a]sync subcounters and takes @rw flags to distinguish IO subtypes (%REQ_WRITE and %REQ_SYNC) and replaces stat_sub_type indexed arrays. All counters in blkio_group_stats and blkio_group_stats_cpu are replaced with either blkg_stat or blkg_rwstat along with all users. This does add one u64_stats_sync per counter and increase stats_sync operations but they're empty/noops on 64bit archs and blkcg doesn't have too many counters, especially with DEBUG_BLK_CGROUP off. While the currently resulting code isn't necessarily simpler at the moment, this will enable further clean up of blkcg stats code. - BLKIO_STAT_{READ|WRITE|SYNC|ASYNC|TOTAL} renamed to BLKG_RWSTAT_{READ|WRITE|SYNC|ASYNC|TOTAL}. - blkg_stat_add() replaces blkio_add_stat() and blkio_check_and_dec_stat(). Note that BUG_ON() on underflow in the latter function no longer exists. It's *way* better to have underflowed stat counters than oopsing. - blkio_group_stats->dequeue is now a proper u64 stat counter instead of ulong. - reset_stats() updated to clear each stat counters individually and BLKG_STATS_DEBUG_CLEAR_{START|SIZE} are removed. - Some functions reconstruct rw flags from direction and sync booleans. This will be removed by future patches. Signed-off-by: Tejun Heo <tj@kernel.org>
2012-04-02 05:38:42 +08:00
int rw = (direction ? REQ_WRITE : 0) | (sync ? REQ_SYNC : 0);
struct blkg_policy_data *pd = blkg->pd[pol->plid];
struct blkio_group_stats_cpu *stats_cpu;
unsigned long flags;
blkcg: alloc per cpu stats from worker thread in a delayed manner Current per cpu stat allocation assumes GFP_KERNEL allocation flag. But in IO path there are times when we want GFP_NOIO semantics. As there is no way to pass the allocation flags to alloc_percpu(), this patch delays the allocation of stats using a worker thread. v2-> tejun suggested following changes. Changed the patch accordingly. - move alloc_node location in structure - reduce the size of names of some of the fields - Reduce the scope of locking of alloc_list_lock - Simplified stat_alloc_fn() by allocating stats for all policies in one go and then assigning these to a group. v3 -> Andrew suggested to put some comments in the code. Also raised concerns about trying to allocate infinitely in case of allocation failure. I have changed the logic to sleep for 10ms before retrying. That should take care of non-preemptible UP kernels. v4 -> Tejun had more suggestions. - drop list_for_each_entry_all() - instead of msleep() use queue_delayed_work() - Some cleanups realted to more compact coding. v5-> tejun suggested more cleanups leading to more compact code. tj: - Relocated pcpu_stats into blkio_stat_alloc_fn(). - Minor comment update. - This also fixes suspicious RCU usage warning caused by invoking cgroup_path() from blkg_alloc() without holding RCU read lock. Now that blkg_alloc() doesn't require sleepable context, RCU read lock from blkg_lookup_create() is maintained throughout blkg_alloc(). Signed-off-by: Vivek Goyal <vgoyal@redhat.com> Signed-off-by: Tejun Heo <tj@kernel.org> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2012-03-09 02:53:56 +08:00
/* If per cpu stats are not allocated yet, don't do any accounting. */
if (pd->stats_cpu == NULL)
return;
/*
* Disabling interrupts to provide mutual exclusion between two
* writes on same cpu. It probably is not needed for 64bit. Not
* optimizing that case yet.
*/
local_irq_save(flags);
stats_cpu = this_cpu_ptr(pd->stats_cpu);
blkcg: introduce blkg_stat and blkg_rwstat blkcg uses u64_stats_sync to avoid reading wrong u64 statistic values on 32bit archs and some stat counters have subtypes to distinguish read/writes and sync/async IOs. The stat code paths are confusing and involve a lot of going back and forth between blkcg core and specific policy implementations, and synchronization and subtype handling are open coded in blkcg core. This patch introduces struct blkg_stat and blkg_rwstat which, with accompanying operations, encapsulate stat updating and accessing with proper synchronization. blkg_stat is simple u64 counter with 64bit read-access protection. blkg_rwstat is the one with rw and [a]sync subcounters and takes @rw flags to distinguish IO subtypes (%REQ_WRITE and %REQ_SYNC) and replaces stat_sub_type indexed arrays. All counters in blkio_group_stats and blkio_group_stats_cpu are replaced with either blkg_stat or blkg_rwstat along with all users. This does add one u64_stats_sync per counter and increase stats_sync operations but they're empty/noops on 64bit archs and blkcg doesn't have too many counters, especially with DEBUG_BLK_CGROUP off. While the currently resulting code isn't necessarily simpler at the moment, this will enable further clean up of blkcg stats code. - BLKIO_STAT_{READ|WRITE|SYNC|ASYNC|TOTAL} renamed to BLKG_RWSTAT_{READ|WRITE|SYNC|ASYNC|TOTAL}. - blkg_stat_add() replaces blkio_add_stat() and blkio_check_and_dec_stat(). Note that BUG_ON() on underflow in the latter function no longer exists. It's *way* better to have underflowed stat counters than oopsing. - blkio_group_stats->dequeue is now a proper u64 stat counter instead of ulong. - reset_stats() updated to clear each stat counters individually and BLKG_STATS_DEBUG_CLEAR_{START|SIZE} are removed. - Some functions reconstruct rw flags from direction and sync booleans. This will be removed by future patches. Signed-off-by: Tejun Heo <tj@kernel.org>
2012-04-02 05:38:42 +08:00
blkg_stat_add(&stats_cpu->sectors, bytes >> 9);
blkg_rwstat_add(&stats_cpu->serviced, rw, 1);
blkg_rwstat_add(&stats_cpu->service_bytes, rw, bytes);
local_irq_restore(flags);
}
2010-04-09 14:31:19 +08:00
EXPORT_SYMBOL_GPL(blkiocg_update_dispatch_stats);
2010-04-09 14:31:19 +08:00
void blkiocg_update_completion_stats(struct blkio_group *blkg,
struct blkio_policy_type *pol,
uint64_t start_time,
uint64_t io_start_time, bool direction,
bool sync)
{
struct blkio_group_stats *stats = &blkg->pd[pol->plid]->stats;
unsigned long long now = sched_clock();
blkcg: introduce blkg_stat and blkg_rwstat blkcg uses u64_stats_sync to avoid reading wrong u64 statistic values on 32bit archs and some stat counters have subtypes to distinguish read/writes and sync/async IOs. The stat code paths are confusing and involve a lot of going back and forth between blkcg core and specific policy implementations, and synchronization and subtype handling are open coded in blkcg core. This patch introduces struct blkg_stat and blkg_rwstat which, with accompanying operations, encapsulate stat updating and accessing with proper synchronization. blkg_stat is simple u64 counter with 64bit read-access protection. blkg_rwstat is the one with rw and [a]sync subcounters and takes @rw flags to distinguish IO subtypes (%REQ_WRITE and %REQ_SYNC) and replaces stat_sub_type indexed arrays. All counters in blkio_group_stats and blkio_group_stats_cpu are replaced with either blkg_stat or blkg_rwstat along with all users. This does add one u64_stats_sync per counter and increase stats_sync operations but they're empty/noops on 64bit archs and blkcg doesn't have too many counters, especially with DEBUG_BLK_CGROUP off. While the currently resulting code isn't necessarily simpler at the moment, this will enable further clean up of blkcg stats code. - BLKIO_STAT_{READ|WRITE|SYNC|ASYNC|TOTAL} renamed to BLKG_RWSTAT_{READ|WRITE|SYNC|ASYNC|TOTAL}. - blkg_stat_add() replaces blkio_add_stat() and blkio_check_and_dec_stat(). Note that BUG_ON() on underflow in the latter function no longer exists. It's *way* better to have underflowed stat counters than oopsing. - blkio_group_stats->dequeue is now a proper u64 stat counter instead of ulong. - reset_stats() updated to clear each stat counters individually and BLKG_STATS_DEBUG_CLEAR_{START|SIZE} are removed. - Some functions reconstruct rw flags from direction and sync booleans. This will be removed by future patches. Signed-off-by: Tejun Heo <tj@kernel.org>
2012-04-02 05:38:42 +08:00
int rw = (direction ? REQ_WRITE : 0) | (sync ? REQ_SYNC : 0);
lockdep_assert_held(blkg->q->queue_lock);
2010-04-09 14:31:19 +08:00
if (time_after64(now, io_start_time))
blkcg: introduce blkg_stat and blkg_rwstat blkcg uses u64_stats_sync to avoid reading wrong u64 statistic values on 32bit archs and some stat counters have subtypes to distinguish read/writes and sync/async IOs. The stat code paths are confusing and involve a lot of going back and forth between blkcg core and specific policy implementations, and synchronization and subtype handling are open coded in blkcg core. This patch introduces struct blkg_stat and blkg_rwstat which, with accompanying operations, encapsulate stat updating and accessing with proper synchronization. blkg_stat is simple u64 counter with 64bit read-access protection. blkg_rwstat is the one with rw and [a]sync subcounters and takes @rw flags to distinguish IO subtypes (%REQ_WRITE and %REQ_SYNC) and replaces stat_sub_type indexed arrays. All counters in blkio_group_stats and blkio_group_stats_cpu are replaced with either blkg_stat or blkg_rwstat along with all users. This does add one u64_stats_sync per counter and increase stats_sync operations but they're empty/noops on 64bit archs and blkcg doesn't have too many counters, especially with DEBUG_BLK_CGROUP off. While the currently resulting code isn't necessarily simpler at the moment, this will enable further clean up of blkcg stats code. - BLKIO_STAT_{READ|WRITE|SYNC|ASYNC|TOTAL} renamed to BLKG_RWSTAT_{READ|WRITE|SYNC|ASYNC|TOTAL}. - blkg_stat_add() replaces blkio_add_stat() and blkio_check_and_dec_stat(). Note that BUG_ON() on underflow in the latter function no longer exists. It's *way* better to have underflowed stat counters than oopsing. - blkio_group_stats->dequeue is now a proper u64 stat counter instead of ulong. - reset_stats() updated to clear each stat counters individually and BLKG_STATS_DEBUG_CLEAR_{START|SIZE} are removed. - Some functions reconstruct rw flags from direction and sync booleans. This will be removed by future patches. Signed-off-by: Tejun Heo <tj@kernel.org>
2012-04-02 05:38:42 +08:00
blkg_rwstat_add(&stats->service_time, rw, now - io_start_time);
2010-04-09 14:31:19 +08:00
if (time_after64(io_start_time, start_time))
blkcg: introduce blkg_stat and blkg_rwstat blkcg uses u64_stats_sync to avoid reading wrong u64 statistic values on 32bit archs and some stat counters have subtypes to distinguish read/writes and sync/async IOs. The stat code paths are confusing and involve a lot of going back and forth between blkcg core and specific policy implementations, and synchronization and subtype handling are open coded in blkcg core. This patch introduces struct blkg_stat and blkg_rwstat which, with accompanying operations, encapsulate stat updating and accessing with proper synchronization. blkg_stat is simple u64 counter with 64bit read-access protection. blkg_rwstat is the one with rw and [a]sync subcounters and takes @rw flags to distinguish IO subtypes (%REQ_WRITE and %REQ_SYNC) and replaces stat_sub_type indexed arrays. All counters in blkio_group_stats and blkio_group_stats_cpu are replaced with either blkg_stat or blkg_rwstat along with all users. This does add one u64_stats_sync per counter and increase stats_sync operations but they're empty/noops on 64bit archs and blkcg doesn't have too many counters, especially with DEBUG_BLK_CGROUP off. While the currently resulting code isn't necessarily simpler at the moment, this will enable further clean up of blkcg stats code. - BLKIO_STAT_{READ|WRITE|SYNC|ASYNC|TOTAL} renamed to BLKG_RWSTAT_{READ|WRITE|SYNC|ASYNC|TOTAL}. - blkg_stat_add() replaces blkio_add_stat() and blkio_check_and_dec_stat(). Note that BUG_ON() on underflow in the latter function no longer exists. It's *way* better to have underflowed stat counters than oopsing. - blkio_group_stats->dequeue is now a proper u64 stat counter instead of ulong. - reset_stats() updated to clear each stat counters individually and BLKG_STATS_DEBUG_CLEAR_{START|SIZE} are removed. - Some functions reconstruct rw flags from direction and sync booleans. This will be removed by future patches. Signed-off-by: Tejun Heo <tj@kernel.org>
2012-04-02 05:38:42 +08:00
blkg_rwstat_add(&stats->wait_time, rw,
io_start_time - start_time);
}
2010-04-09 14:31:19 +08:00
EXPORT_SYMBOL_GPL(blkiocg_update_completion_stats);
/* Merged stats are per cpu. */
void blkiocg_update_io_merged_stats(struct blkio_group *blkg,
struct blkio_policy_type *pol,
bool direction, bool sync)
{
struct blkio_group_stats *stats = &blkg->pd[pol->plid]->stats;
blkcg: introduce blkg_stat and blkg_rwstat blkcg uses u64_stats_sync to avoid reading wrong u64 statistic values on 32bit archs and some stat counters have subtypes to distinguish read/writes and sync/async IOs. The stat code paths are confusing and involve a lot of going back and forth between blkcg core and specific policy implementations, and synchronization and subtype handling are open coded in blkcg core. This patch introduces struct blkg_stat and blkg_rwstat which, with accompanying operations, encapsulate stat updating and accessing with proper synchronization. blkg_stat is simple u64 counter with 64bit read-access protection. blkg_rwstat is the one with rw and [a]sync subcounters and takes @rw flags to distinguish IO subtypes (%REQ_WRITE and %REQ_SYNC) and replaces stat_sub_type indexed arrays. All counters in blkio_group_stats and blkio_group_stats_cpu are replaced with either blkg_stat or blkg_rwstat along with all users. This does add one u64_stats_sync per counter and increase stats_sync operations but they're empty/noops on 64bit archs and blkcg doesn't have too many counters, especially with DEBUG_BLK_CGROUP off. While the currently resulting code isn't necessarily simpler at the moment, this will enable further clean up of blkcg stats code. - BLKIO_STAT_{READ|WRITE|SYNC|ASYNC|TOTAL} renamed to BLKG_RWSTAT_{READ|WRITE|SYNC|ASYNC|TOTAL}. - blkg_stat_add() replaces blkio_add_stat() and blkio_check_and_dec_stat(). Note that BUG_ON() on underflow in the latter function no longer exists. It's *way* better to have underflowed stat counters than oopsing. - blkio_group_stats->dequeue is now a proper u64 stat counter instead of ulong. - reset_stats() updated to clear each stat counters individually and BLKG_STATS_DEBUG_CLEAR_{START|SIZE} are removed. - Some functions reconstruct rw flags from direction and sync booleans. This will be removed by future patches. Signed-off-by: Tejun Heo <tj@kernel.org>
2012-04-02 05:38:42 +08:00
int rw = (direction ? REQ_WRITE : 0) | (sync ? REQ_SYNC : 0);
lockdep_assert_held(blkg->q->queue_lock);
blkcg: introduce blkg_stat and blkg_rwstat blkcg uses u64_stats_sync to avoid reading wrong u64 statistic values on 32bit archs and some stat counters have subtypes to distinguish read/writes and sync/async IOs. The stat code paths are confusing and involve a lot of going back and forth between blkcg core and specific policy implementations, and synchronization and subtype handling are open coded in blkcg core. This patch introduces struct blkg_stat and blkg_rwstat which, with accompanying operations, encapsulate stat updating and accessing with proper synchronization. blkg_stat is simple u64 counter with 64bit read-access protection. blkg_rwstat is the one with rw and [a]sync subcounters and takes @rw flags to distinguish IO subtypes (%REQ_WRITE and %REQ_SYNC) and replaces stat_sub_type indexed arrays. All counters in blkio_group_stats and blkio_group_stats_cpu are replaced with either blkg_stat or blkg_rwstat along with all users. This does add one u64_stats_sync per counter and increase stats_sync operations but they're empty/noops on 64bit archs and blkcg doesn't have too many counters, especially with DEBUG_BLK_CGROUP off. While the currently resulting code isn't necessarily simpler at the moment, this will enable further clean up of blkcg stats code. - BLKIO_STAT_{READ|WRITE|SYNC|ASYNC|TOTAL} renamed to BLKG_RWSTAT_{READ|WRITE|SYNC|ASYNC|TOTAL}. - blkg_stat_add() replaces blkio_add_stat() and blkio_check_and_dec_stat(). Note that BUG_ON() on underflow in the latter function no longer exists. It's *way* better to have underflowed stat counters than oopsing. - blkio_group_stats->dequeue is now a proper u64 stat counter instead of ulong. - reset_stats() updated to clear each stat counters individually and BLKG_STATS_DEBUG_CLEAR_{START|SIZE} are removed. - Some functions reconstruct rw flags from direction and sync booleans. This will be removed by future patches. Signed-off-by: Tejun Heo <tj@kernel.org>
2012-04-02 05:38:42 +08:00
blkg_rwstat_add(&stats->merged, rw, 1);
}
EXPORT_SYMBOL_GPL(blkiocg_update_io_merged_stats);
blkcg: alloc per cpu stats from worker thread in a delayed manner Current per cpu stat allocation assumes GFP_KERNEL allocation flag. But in IO path there are times when we want GFP_NOIO semantics. As there is no way to pass the allocation flags to alloc_percpu(), this patch delays the allocation of stats using a worker thread. v2-> tejun suggested following changes. Changed the patch accordingly. - move alloc_node location in structure - reduce the size of names of some of the fields - Reduce the scope of locking of alloc_list_lock - Simplified stat_alloc_fn() by allocating stats for all policies in one go and then assigning these to a group. v3 -> Andrew suggested to put some comments in the code. Also raised concerns about trying to allocate infinitely in case of allocation failure. I have changed the logic to sleep for 10ms before retrying. That should take care of non-preemptible UP kernels. v4 -> Tejun had more suggestions. - drop list_for_each_entry_all() - instead of msleep() use queue_delayed_work() - Some cleanups realted to more compact coding. v5-> tejun suggested more cleanups leading to more compact code. tj: - Relocated pcpu_stats into blkio_stat_alloc_fn(). - Minor comment update. - This also fixes suspicious RCU usage warning caused by invoking cgroup_path() from blkg_alloc() without holding RCU read lock. Now that blkg_alloc() doesn't require sleepable context, RCU read lock from blkg_lookup_create() is maintained throughout blkg_alloc(). Signed-off-by: Vivek Goyal <vgoyal@redhat.com> Signed-off-by: Tejun Heo <tj@kernel.org> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2012-03-09 02:53:56 +08:00
/*
* Worker for allocating per cpu stat for blk groups. This is scheduled on
* the system_nrt_wq once there are some groups on the alloc_list waiting
* for allocation.
*/
static void blkio_stat_alloc_fn(struct work_struct *work)
{
static void *pcpu_stats[BLKIO_NR_POLICIES];
struct delayed_work *dwork = to_delayed_work(work);
struct blkio_group *blkg;
int i;
bool empty = false;
alloc_stats:
for (i = 0; i < BLKIO_NR_POLICIES; i++) {
if (pcpu_stats[i] != NULL)
continue;
pcpu_stats[i] = alloc_percpu(struct blkio_group_stats_cpu);
/* Allocation failed. Try again after some time. */
if (pcpu_stats[i] == NULL) {
queue_delayed_work(system_nrt_wq, dwork,
msecs_to_jiffies(10));
return;
}
}
spin_lock_irq(&blkio_list_lock);
spin_lock(&alloc_list_lock);
/* cgroup got deleted or queue exited. */
if (!list_empty(&alloc_list)) {
blkg = list_first_entry(&alloc_list, struct blkio_group,
alloc_node);
for (i = 0; i < BLKIO_NR_POLICIES; i++) {
struct blkg_policy_data *pd = blkg->pd[i];
if (blkio_policy[i] && pd && !pd->stats_cpu)
swap(pd->stats_cpu, pcpu_stats[i]);
}
list_del_init(&blkg->alloc_node);
}
empty = list_empty(&alloc_list);
spin_unlock(&alloc_list_lock);
spin_unlock_irq(&blkio_list_lock);
if (!empty)
goto alloc_stats;
}
/**
* blkg_free - free a blkg
* @blkg: blkg to free
*
* Free @blkg which may be partially allocated.
*/
static void blkg_free(struct blkio_group *blkg)
{
blkcg: unify blkg's for blkcg policies Currently, blkg is per cgroup-queue-policy combination. This is unnatural and leads to various convolutions in partially used duplicate fields in blkg, config / stat access, and general management of blkgs. This patch make blkg's per cgroup-queue and let them serve all policies. blkgs are now created and destroyed by blkcg core proper. This will allow further consolidation of common management logic into blkcg core and API with better defined semantics and layering. As a transitional step to untangle blkg management, elvswitch and policy [de]registration, all blkgs except the root blkg are being shot down during elvswitch and bypass. This patch adds blkg_root_update() to update root blkg in place on policy change. This is hacky and racy but should be good enough as interim step until we get locking simplified and switch over to proper in-place update for all blkgs. -v2: Root blkgs need to be updated on elvswitch too and blkg_alloc() comment wasn't updated according to the function change. Fixed. Both pointed out by Vivek. -v3: v2 updated blkg_destroy_all() to invoke update_root_blkg_pd() for all policies. This freed root pd during elvswitch before the last queue finished exiting and led to oops. Directly invoke update_root_blkg_pd() only on BLKIO_POLICY_PROP from cfq_exit_queue(). This also is closer to what will be done with proper in-place blkg update. Reported by Vivek. Signed-off-by: Tejun Heo <tj@kernel.org> Cc: Vivek Goyal <vgoyal@redhat.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2012-03-06 05:15:20 +08:00
int i;
if (!blkg)
return;
blkcg: unify blkg's for blkcg policies Currently, blkg is per cgroup-queue-policy combination. This is unnatural and leads to various convolutions in partially used duplicate fields in blkg, config / stat access, and general management of blkgs. This patch make blkg's per cgroup-queue and let them serve all policies. blkgs are now created and destroyed by blkcg core proper. This will allow further consolidation of common management logic into blkcg core and API with better defined semantics and layering. As a transitional step to untangle blkg management, elvswitch and policy [de]registration, all blkgs except the root blkg are being shot down during elvswitch and bypass. This patch adds blkg_root_update() to update root blkg in place on policy change. This is hacky and racy but should be good enough as interim step until we get locking simplified and switch over to proper in-place update for all blkgs. -v2: Root blkgs need to be updated on elvswitch too and blkg_alloc() comment wasn't updated according to the function change. Fixed. Both pointed out by Vivek. -v3: v2 updated blkg_destroy_all() to invoke update_root_blkg_pd() for all policies. This freed root pd during elvswitch before the last queue finished exiting and led to oops. Directly invoke update_root_blkg_pd() only on BLKIO_POLICY_PROP from cfq_exit_queue(). This also is closer to what will be done with proper in-place blkg update. Reported by Vivek. Signed-off-by: Tejun Heo <tj@kernel.org> Cc: Vivek Goyal <vgoyal@redhat.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2012-03-06 05:15:20 +08:00
for (i = 0; i < BLKIO_NR_POLICIES; i++) {
struct blkg_policy_data *pd = blkg->pd[i];
if (pd) {
free_percpu(pd->stats_cpu);
kfree(pd);
}
}
blkcg: unify blkg's for blkcg policies Currently, blkg is per cgroup-queue-policy combination. This is unnatural and leads to various convolutions in partially used duplicate fields in blkg, config / stat access, and general management of blkgs. This patch make blkg's per cgroup-queue and let them serve all policies. blkgs are now created and destroyed by blkcg core proper. This will allow further consolidation of common management logic into blkcg core and API with better defined semantics and layering. As a transitional step to untangle blkg management, elvswitch and policy [de]registration, all blkgs except the root blkg are being shot down during elvswitch and bypass. This patch adds blkg_root_update() to update root blkg in place on policy change. This is hacky and racy but should be good enough as interim step until we get locking simplified and switch over to proper in-place update for all blkgs. -v2: Root blkgs need to be updated on elvswitch too and blkg_alloc() comment wasn't updated according to the function change. Fixed. Both pointed out by Vivek. -v3: v2 updated blkg_destroy_all() to invoke update_root_blkg_pd() for all policies. This freed root pd during elvswitch before the last queue finished exiting and led to oops. Directly invoke update_root_blkg_pd() only on BLKIO_POLICY_PROP from cfq_exit_queue(). This also is closer to what will be done with proper in-place blkg update. Reported by Vivek. Signed-off-by: Tejun Heo <tj@kernel.org> Cc: Vivek Goyal <vgoyal@redhat.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2012-03-06 05:15:20 +08:00
kfree(blkg);
}
/**
* blkg_alloc - allocate a blkg
* @blkcg: block cgroup the new blkg is associated with
* @q: request_queue the new blkg is associated with
*
blkcg: unify blkg's for blkcg policies Currently, blkg is per cgroup-queue-policy combination. This is unnatural and leads to various convolutions in partially used duplicate fields in blkg, config / stat access, and general management of blkgs. This patch make blkg's per cgroup-queue and let them serve all policies. blkgs are now created and destroyed by blkcg core proper. This will allow further consolidation of common management logic into blkcg core and API with better defined semantics and layering. As a transitional step to untangle blkg management, elvswitch and policy [de]registration, all blkgs except the root blkg are being shot down during elvswitch and bypass. This patch adds blkg_root_update() to update root blkg in place on policy change. This is hacky and racy but should be good enough as interim step until we get locking simplified and switch over to proper in-place update for all blkgs. -v2: Root blkgs need to be updated on elvswitch too and blkg_alloc() comment wasn't updated according to the function change. Fixed. Both pointed out by Vivek. -v3: v2 updated blkg_destroy_all() to invoke update_root_blkg_pd() for all policies. This freed root pd during elvswitch before the last queue finished exiting and led to oops. Directly invoke update_root_blkg_pd() only on BLKIO_POLICY_PROP from cfq_exit_queue(). This also is closer to what will be done with proper in-place blkg update. Reported by Vivek. Signed-off-by: Tejun Heo <tj@kernel.org> Cc: Vivek Goyal <vgoyal@redhat.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2012-03-06 05:15:20 +08:00
* Allocate a new blkg assocating @blkcg and @q.
*/
static struct blkio_group *blkg_alloc(struct blkio_cgroup *blkcg,
blkcg: unify blkg's for blkcg policies Currently, blkg is per cgroup-queue-policy combination. This is unnatural and leads to various convolutions in partially used duplicate fields in blkg, config / stat access, and general management of blkgs. This patch make blkg's per cgroup-queue and let them serve all policies. blkgs are now created and destroyed by blkcg core proper. This will allow further consolidation of common management logic into blkcg core and API with better defined semantics and layering. As a transitional step to untangle blkg management, elvswitch and policy [de]registration, all blkgs except the root blkg are being shot down during elvswitch and bypass. This patch adds blkg_root_update() to update root blkg in place on policy change. This is hacky and racy but should be good enough as interim step until we get locking simplified and switch over to proper in-place update for all blkgs. -v2: Root blkgs need to be updated on elvswitch too and blkg_alloc() comment wasn't updated according to the function change. Fixed. Both pointed out by Vivek. -v3: v2 updated blkg_destroy_all() to invoke update_root_blkg_pd() for all policies. This freed root pd during elvswitch before the last queue finished exiting and led to oops. Directly invoke update_root_blkg_pd() only on BLKIO_POLICY_PROP from cfq_exit_queue(). This also is closer to what will be done with proper in-place blkg update. Reported by Vivek. Signed-off-by: Tejun Heo <tj@kernel.org> Cc: Vivek Goyal <vgoyal@redhat.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2012-03-06 05:15:20 +08:00
struct request_queue *q)
{
struct blkio_group *blkg;
blkcg: unify blkg's for blkcg policies Currently, blkg is per cgroup-queue-policy combination. This is unnatural and leads to various convolutions in partially used duplicate fields in blkg, config / stat access, and general management of blkgs. This patch make blkg's per cgroup-queue and let them serve all policies. blkgs are now created and destroyed by blkcg core proper. This will allow further consolidation of common management logic into blkcg core and API with better defined semantics and layering. As a transitional step to untangle blkg management, elvswitch and policy [de]registration, all blkgs except the root blkg are being shot down during elvswitch and bypass. This patch adds blkg_root_update() to update root blkg in place on policy change. This is hacky and racy but should be good enough as interim step until we get locking simplified and switch over to proper in-place update for all blkgs. -v2: Root blkgs need to be updated on elvswitch too and blkg_alloc() comment wasn't updated according to the function change. Fixed. Both pointed out by Vivek. -v3: v2 updated blkg_destroy_all() to invoke update_root_blkg_pd() for all policies. This freed root pd during elvswitch before the last queue finished exiting and led to oops. Directly invoke update_root_blkg_pd() only on BLKIO_POLICY_PROP from cfq_exit_queue(). This also is closer to what will be done with proper in-place blkg update. Reported by Vivek. Signed-off-by: Tejun Heo <tj@kernel.org> Cc: Vivek Goyal <vgoyal@redhat.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2012-03-06 05:15:20 +08:00
int i;
/* alloc and init base part */
blkg = kzalloc_node(sizeof(*blkg), GFP_ATOMIC, q->node);
if (!blkg)
return NULL;
blkg->q = q;
blkcg: unify blkg's for blkcg policies Currently, blkg is per cgroup-queue-policy combination. This is unnatural and leads to various convolutions in partially used duplicate fields in blkg, config / stat access, and general management of blkgs. This patch make blkg's per cgroup-queue and let them serve all policies. blkgs are now created and destroyed by blkcg core proper. This will allow further consolidation of common management logic into blkcg core and API with better defined semantics and layering. As a transitional step to untangle blkg management, elvswitch and policy [de]registration, all blkgs except the root blkg are being shot down during elvswitch and bypass. This patch adds blkg_root_update() to update root blkg in place on policy change. This is hacky and racy but should be good enough as interim step until we get locking simplified and switch over to proper in-place update for all blkgs. -v2: Root blkgs need to be updated on elvswitch too and blkg_alloc() comment wasn't updated according to the function change. Fixed. Both pointed out by Vivek. -v3: v2 updated blkg_destroy_all() to invoke update_root_blkg_pd() for all policies. This freed root pd during elvswitch before the last queue finished exiting and led to oops. Directly invoke update_root_blkg_pd() only on BLKIO_POLICY_PROP from cfq_exit_queue(). This also is closer to what will be done with proper in-place blkg update. Reported by Vivek. Signed-off-by: Tejun Heo <tj@kernel.org> Cc: Vivek Goyal <vgoyal@redhat.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2012-03-06 05:15:20 +08:00
INIT_LIST_HEAD(&blkg->q_node);
blkcg: alloc per cpu stats from worker thread in a delayed manner Current per cpu stat allocation assumes GFP_KERNEL allocation flag. But in IO path there are times when we want GFP_NOIO semantics. As there is no way to pass the allocation flags to alloc_percpu(), this patch delays the allocation of stats using a worker thread. v2-> tejun suggested following changes. Changed the patch accordingly. - move alloc_node location in structure - reduce the size of names of some of the fields - Reduce the scope of locking of alloc_list_lock - Simplified stat_alloc_fn() by allocating stats for all policies in one go and then assigning these to a group. v3 -> Andrew suggested to put some comments in the code. Also raised concerns about trying to allocate infinitely in case of allocation failure. I have changed the logic to sleep for 10ms before retrying. That should take care of non-preemptible UP kernels. v4 -> Tejun had more suggestions. - drop list_for_each_entry_all() - instead of msleep() use queue_delayed_work() - Some cleanups realted to more compact coding. v5-> tejun suggested more cleanups leading to more compact code. tj: - Relocated pcpu_stats into blkio_stat_alloc_fn(). - Minor comment update. - This also fixes suspicious RCU usage warning caused by invoking cgroup_path() from blkg_alloc() without holding RCU read lock. Now that blkg_alloc() doesn't require sleepable context, RCU read lock from blkg_lookup_create() is maintained throughout blkg_alloc(). Signed-off-by: Vivek Goyal <vgoyal@redhat.com> Signed-off-by: Tejun Heo <tj@kernel.org> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2012-03-09 02:53:56 +08:00
INIT_LIST_HEAD(&blkg->alloc_node);
blkg->blkcg = blkcg;
blkg->refcnt = 1;
cgroup_path(blkcg->css.cgroup, blkg->path, sizeof(blkg->path));
blkcg: unify blkg's for blkcg policies Currently, blkg is per cgroup-queue-policy combination. This is unnatural and leads to various convolutions in partially used duplicate fields in blkg, config / stat access, and general management of blkgs. This patch make blkg's per cgroup-queue and let them serve all policies. blkgs are now created and destroyed by blkcg core proper. This will allow further consolidation of common management logic into blkcg core and API with better defined semantics and layering. As a transitional step to untangle blkg management, elvswitch and policy [de]registration, all blkgs except the root blkg are being shot down during elvswitch and bypass. This patch adds blkg_root_update() to update root blkg in place on policy change. This is hacky and racy but should be good enough as interim step until we get locking simplified and switch over to proper in-place update for all blkgs. -v2: Root blkgs need to be updated on elvswitch too and blkg_alloc() comment wasn't updated according to the function change. Fixed. Both pointed out by Vivek. -v3: v2 updated blkg_destroy_all() to invoke update_root_blkg_pd() for all policies. This freed root pd during elvswitch before the last queue finished exiting and led to oops. Directly invoke update_root_blkg_pd() only on BLKIO_POLICY_PROP from cfq_exit_queue(). This also is closer to what will be done with proper in-place blkg update. Reported by Vivek. Signed-off-by: Tejun Heo <tj@kernel.org> Cc: Vivek Goyal <vgoyal@redhat.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2012-03-06 05:15:20 +08:00
for (i = 0; i < BLKIO_NR_POLICIES; i++) {
struct blkio_policy_type *pol = blkio_policy[i];
struct blkg_policy_data *pd;
blkcg: unify blkg's for blkcg policies Currently, blkg is per cgroup-queue-policy combination. This is unnatural and leads to various convolutions in partially used duplicate fields in blkg, config / stat access, and general management of blkgs. This patch make blkg's per cgroup-queue and let them serve all policies. blkgs are now created and destroyed by blkcg core proper. This will allow further consolidation of common management logic into blkcg core and API with better defined semantics and layering. As a transitional step to untangle blkg management, elvswitch and policy [de]registration, all blkgs except the root blkg are being shot down during elvswitch and bypass. This patch adds blkg_root_update() to update root blkg in place on policy change. This is hacky and racy but should be good enough as interim step until we get locking simplified and switch over to proper in-place update for all blkgs. -v2: Root blkgs need to be updated on elvswitch too and blkg_alloc() comment wasn't updated according to the function change. Fixed. Both pointed out by Vivek. -v3: v2 updated blkg_destroy_all() to invoke update_root_blkg_pd() for all policies. This freed root pd during elvswitch before the last queue finished exiting and led to oops. Directly invoke update_root_blkg_pd() only on BLKIO_POLICY_PROP from cfq_exit_queue(). This also is closer to what will be done with proper in-place blkg update. Reported by Vivek. Signed-off-by: Tejun Heo <tj@kernel.org> Cc: Vivek Goyal <vgoyal@redhat.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2012-03-06 05:15:20 +08:00
if (!pol)
continue;
/* alloc per-policy data and attach it to blkg */
pd = kzalloc_node(sizeof(*pd) + pol->pdata_size, GFP_ATOMIC,
q->node);
if (!pd) {
blkg_free(blkg);
return NULL;
}
blkcg: unify blkg's for blkcg policies Currently, blkg is per cgroup-queue-policy combination. This is unnatural and leads to various convolutions in partially used duplicate fields in blkg, config / stat access, and general management of blkgs. This patch make blkg's per cgroup-queue and let them serve all policies. blkgs are now created and destroyed by blkcg core proper. This will allow further consolidation of common management logic into blkcg core and API with better defined semantics and layering. As a transitional step to untangle blkg management, elvswitch and policy [de]registration, all blkgs except the root blkg are being shot down during elvswitch and bypass. This patch adds blkg_root_update() to update root blkg in place on policy change. This is hacky and racy but should be good enough as interim step until we get locking simplified and switch over to proper in-place update for all blkgs. -v2: Root blkgs need to be updated on elvswitch too and blkg_alloc() comment wasn't updated according to the function change. Fixed. Both pointed out by Vivek. -v3: v2 updated blkg_destroy_all() to invoke update_root_blkg_pd() for all policies. This freed root pd during elvswitch before the last queue finished exiting and led to oops. Directly invoke update_root_blkg_pd() only on BLKIO_POLICY_PROP from cfq_exit_queue(). This also is closer to what will be done with proper in-place blkg update. Reported by Vivek. Signed-off-by: Tejun Heo <tj@kernel.org> Cc: Vivek Goyal <vgoyal@redhat.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2012-03-06 05:15:20 +08:00
blkg->pd[i] = pd;
pd->blkg = blkg;
}
/* invoke per-policy init */
blkcg: unify blkg's for blkcg policies Currently, blkg is per cgroup-queue-policy combination. This is unnatural and leads to various convolutions in partially used duplicate fields in blkg, config / stat access, and general management of blkgs. This patch make blkg's per cgroup-queue and let them serve all policies. blkgs are now created and destroyed by blkcg core proper. This will allow further consolidation of common management logic into blkcg core and API with better defined semantics and layering. As a transitional step to untangle blkg management, elvswitch and policy [de]registration, all blkgs except the root blkg are being shot down during elvswitch and bypass. This patch adds blkg_root_update() to update root blkg in place on policy change. This is hacky and racy but should be good enough as interim step until we get locking simplified and switch over to proper in-place update for all blkgs. -v2: Root blkgs need to be updated on elvswitch too and blkg_alloc() comment wasn't updated according to the function change. Fixed. Both pointed out by Vivek. -v3: v2 updated blkg_destroy_all() to invoke update_root_blkg_pd() for all policies. This freed root pd during elvswitch before the last queue finished exiting and led to oops. Directly invoke update_root_blkg_pd() only on BLKIO_POLICY_PROP from cfq_exit_queue(). This also is closer to what will be done with proper in-place blkg update. Reported by Vivek. Signed-off-by: Tejun Heo <tj@kernel.org> Cc: Vivek Goyal <vgoyal@redhat.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2012-03-06 05:15:20 +08:00
for (i = 0; i < BLKIO_NR_POLICIES; i++) {
struct blkio_policy_type *pol = blkio_policy[i];
if (pol)
pol->ops.blkio_init_group_fn(blkg);
}
return blkg;
}
blkcg: factor out blkio_group creation Currently both blk-throttle and cfq-iosched implement their own blkio_group creation code in throtl_get_tg() and cfq_get_cfqg(). This patch factors out the common code into blkg_lookup_create(), which returns ERR_PTR value so that transitional failures due to queue bypass can be distinguished from other failures. * New plkio_policy_ops methods blkio_alloc_group_fn() and blkio_link_group_fn added. Both are transitional and will be removed once the blkg management code is fully moved into blk-cgroup.c. * blkio_alloc_group_fn() allocates policy-specific blkg which is usually a larger data structure with blkg as the first entry and intiailizes it. Note that initialization of blkg proper, including percpu stats, is responsibility of blk-cgroup proper. Note that default config (weight, bps...) initialization is done from this method; otherwise, we end up violating locking order between blkcg and q locks via blkcg_get_CONF() functions. * blkio_link_group_fn() is called under queue_lock and responsible for linking the blkg to the queue. blkcg side is handled by blk-cgroup proper. * The common blkg creation function is named blkg_lookup_create() and blkiocg_lookup_group() is renamed to blkg_lookup() for consistency. Also, throtl / cfq related functions are similarly [re]named for consistency. This simplifies blkcg policy implementations and enables further cleanup. -v2: Vivek noticed that blkg_lookup_create() incorrectly tested blk_queue_dead() instead of blk_queue_bypass() leading a user of the function ending up creating a new blkg on bypassing queue. This is a bug introduced while relocating bypass patches before this one. Fixed. -v3: ERR_PTR patch folded into this one. @for_root added to blkg_lookup_create() to allow creating root group on a bypassed queue during elevator switch. Signed-off-by: Tejun Heo <tj@kernel.org> Cc: Vivek Goyal <vgoyal@redhat.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2012-03-06 05:15:06 +08:00
struct blkio_group *blkg_lookup_create(struct blkio_cgroup *blkcg,
struct request_queue *q,
bool for_root)
__releases(q->queue_lock) __acquires(q->queue_lock)
{
blkcg: alloc per cpu stats from worker thread in a delayed manner Current per cpu stat allocation assumes GFP_KERNEL allocation flag. But in IO path there are times when we want GFP_NOIO semantics. As there is no way to pass the allocation flags to alloc_percpu(), this patch delays the allocation of stats using a worker thread. v2-> tejun suggested following changes. Changed the patch accordingly. - move alloc_node location in structure - reduce the size of names of some of the fields - Reduce the scope of locking of alloc_list_lock - Simplified stat_alloc_fn() by allocating stats for all policies in one go and then assigning these to a group. v3 -> Andrew suggested to put some comments in the code. Also raised concerns about trying to allocate infinitely in case of allocation failure. I have changed the logic to sleep for 10ms before retrying. That should take care of non-preemptible UP kernels. v4 -> Tejun had more suggestions. - drop list_for_each_entry_all() - instead of msleep() use queue_delayed_work() - Some cleanups realted to more compact coding. v5-> tejun suggested more cleanups leading to more compact code. tj: - Relocated pcpu_stats into blkio_stat_alloc_fn(). - Minor comment update. - This also fixes suspicious RCU usage warning caused by invoking cgroup_path() from blkg_alloc() without holding RCU read lock. Now that blkg_alloc() doesn't require sleepable context, RCU read lock from blkg_lookup_create() is maintained throughout blkg_alloc(). Signed-off-by: Vivek Goyal <vgoyal@redhat.com> Signed-off-by: Tejun Heo <tj@kernel.org> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2012-03-09 02:53:56 +08:00
struct blkio_group *blkg;
blkcg: factor out blkio_group creation Currently both blk-throttle and cfq-iosched implement their own blkio_group creation code in throtl_get_tg() and cfq_get_cfqg(). This patch factors out the common code into blkg_lookup_create(), which returns ERR_PTR value so that transitional failures due to queue bypass can be distinguished from other failures. * New plkio_policy_ops methods blkio_alloc_group_fn() and blkio_link_group_fn added. Both are transitional and will be removed once the blkg management code is fully moved into blk-cgroup.c. * blkio_alloc_group_fn() allocates policy-specific blkg which is usually a larger data structure with blkg as the first entry and intiailizes it. Note that initialization of blkg proper, including percpu stats, is responsibility of blk-cgroup proper. Note that default config (weight, bps...) initialization is done from this method; otherwise, we end up violating locking order between blkcg and q locks via blkcg_get_CONF() functions. * blkio_link_group_fn() is called under queue_lock and responsible for linking the blkg to the queue. blkcg side is handled by blk-cgroup proper. * The common blkg creation function is named blkg_lookup_create() and blkiocg_lookup_group() is renamed to blkg_lookup() for consistency. Also, throtl / cfq related functions are similarly [re]named for consistency. This simplifies blkcg policy implementations and enables further cleanup. -v2: Vivek noticed that blkg_lookup_create() incorrectly tested blk_queue_dead() instead of blk_queue_bypass() leading a user of the function ending up creating a new blkg on bypassing queue. This is a bug introduced while relocating bypass patches before this one. Fixed. -v3: ERR_PTR patch folded into this one. @for_root added to blkg_lookup_create() to allow creating root group on a bypassed queue during elevator switch. Signed-off-by: Tejun Heo <tj@kernel.org> Cc: Vivek Goyal <vgoyal@redhat.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2012-03-06 05:15:06 +08:00
WARN_ON_ONCE(!rcu_read_lock_held());
lockdep_assert_held(q->queue_lock);
/*
* This could be the first entry point of blkcg implementation and
* we shouldn't allow anything to go through for a bypassing queue.
* The following can be removed if blkg lookup is guaranteed to
* fail on a bypassing queue.
*/
if (unlikely(blk_queue_bypass(q)) && !for_root)
return ERR_PTR(blk_queue_dead(q) ? -EINVAL : -EBUSY);
blkcg: unify blkg's for blkcg policies Currently, blkg is per cgroup-queue-policy combination. This is unnatural and leads to various convolutions in partially used duplicate fields in blkg, config / stat access, and general management of blkgs. This patch make blkg's per cgroup-queue and let them serve all policies. blkgs are now created and destroyed by blkcg core proper. This will allow further consolidation of common management logic into blkcg core and API with better defined semantics and layering. As a transitional step to untangle blkg management, elvswitch and policy [de]registration, all blkgs except the root blkg are being shot down during elvswitch and bypass. This patch adds blkg_root_update() to update root blkg in place on policy change. This is hacky and racy but should be good enough as interim step until we get locking simplified and switch over to proper in-place update for all blkgs. -v2: Root blkgs need to be updated on elvswitch too and blkg_alloc() comment wasn't updated according to the function change. Fixed. Both pointed out by Vivek. -v3: v2 updated blkg_destroy_all() to invoke update_root_blkg_pd() for all policies. This freed root pd during elvswitch before the last queue finished exiting and led to oops. Directly invoke update_root_blkg_pd() only on BLKIO_POLICY_PROP from cfq_exit_queue(). This also is closer to what will be done with proper in-place blkg update. Reported by Vivek. Signed-off-by: Tejun Heo <tj@kernel.org> Cc: Vivek Goyal <vgoyal@redhat.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2012-03-06 05:15:20 +08:00
blkg = blkg_lookup(blkcg, q);
blkcg: factor out blkio_group creation Currently both blk-throttle and cfq-iosched implement their own blkio_group creation code in throtl_get_tg() and cfq_get_cfqg(). This patch factors out the common code into blkg_lookup_create(), which returns ERR_PTR value so that transitional failures due to queue bypass can be distinguished from other failures. * New plkio_policy_ops methods blkio_alloc_group_fn() and blkio_link_group_fn added. Both are transitional and will be removed once the blkg management code is fully moved into blk-cgroup.c. * blkio_alloc_group_fn() allocates policy-specific blkg which is usually a larger data structure with blkg as the first entry and intiailizes it. Note that initialization of blkg proper, including percpu stats, is responsibility of blk-cgroup proper. Note that default config (weight, bps...) initialization is done from this method; otherwise, we end up violating locking order between blkcg and q locks via blkcg_get_CONF() functions. * blkio_link_group_fn() is called under queue_lock and responsible for linking the blkg to the queue. blkcg side is handled by blk-cgroup proper. * The common blkg creation function is named blkg_lookup_create() and blkiocg_lookup_group() is renamed to blkg_lookup() for consistency. Also, throtl / cfq related functions are similarly [re]named for consistency. This simplifies blkcg policy implementations and enables further cleanup. -v2: Vivek noticed that blkg_lookup_create() incorrectly tested blk_queue_dead() instead of blk_queue_bypass() leading a user of the function ending up creating a new blkg on bypassing queue. This is a bug introduced while relocating bypass patches before this one. Fixed. -v3: ERR_PTR patch folded into this one. @for_root added to blkg_lookup_create() to allow creating root group on a bypassed queue during elevator switch. Signed-off-by: Tejun Heo <tj@kernel.org> Cc: Vivek Goyal <vgoyal@redhat.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2012-03-06 05:15:06 +08:00
if (blkg)
return blkg;
/* blkg holds a reference to blkcg */
blkcg: factor out blkio_group creation Currently both blk-throttle and cfq-iosched implement their own blkio_group creation code in throtl_get_tg() and cfq_get_cfqg(). This patch factors out the common code into blkg_lookup_create(), which returns ERR_PTR value so that transitional failures due to queue bypass can be distinguished from other failures. * New plkio_policy_ops methods blkio_alloc_group_fn() and blkio_link_group_fn added. Both are transitional and will be removed once the blkg management code is fully moved into blk-cgroup.c. * blkio_alloc_group_fn() allocates policy-specific blkg which is usually a larger data structure with blkg as the first entry and intiailizes it. Note that initialization of blkg proper, including percpu stats, is responsibility of blk-cgroup proper. Note that default config (weight, bps...) initialization is done from this method; otherwise, we end up violating locking order between blkcg and q locks via blkcg_get_CONF() functions. * blkio_link_group_fn() is called under queue_lock and responsible for linking the blkg to the queue. blkcg side is handled by blk-cgroup proper. * The common blkg creation function is named blkg_lookup_create() and blkiocg_lookup_group() is renamed to blkg_lookup() for consistency. Also, throtl / cfq related functions are similarly [re]named for consistency. This simplifies blkcg policy implementations and enables further cleanup. -v2: Vivek noticed that blkg_lookup_create() incorrectly tested blk_queue_dead() instead of blk_queue_bypass() leading a user of the function ending up creating a new blkg on bypassing queue. This is a bug introduced while relocating bypass patches before this one. Fixed. -v3: ERR_PTR patch folded into this one. @for_root added to blkg_lookup_create() to allow creating root group on a bypassed queue during elevator switch. Signed-off-by: Tejun Heo <tj@kernel.org> Cc: Vivek Goyal <vgoyal@redhat.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2012-03-06 05:15:06 +08:00
if (!css_tryget(&blkcg->css))
return ERR_PTR(-EINVAL);
/*
* Allocate and initialize.
*/
blkcg: alloc per cpu stats from worker thread in a delayed manner Current per cpu stat allocation assumes GFP_KERNEL allocation flag. But in IO path there are times when we want GFP_NOIO semantics. As there is no way to pass the allocation flags to alloc_percpu(), this patch delays the allocation of stats using a worker thread. v2-> tejun suggested following changes. Changed the patch accordingly. - move alloc_node location in structure - reduce the size of names of some of the fields - Reduce the scope of locking of alloc_list_lock - Simplified stat_alloc_fn() by allocating stats for all policies in one go and then assigning these to a group. v3 -> Andrew suggested to put some comments in the code. Also raised concerns about trying to allocate infinitely in case of allocation failure. I have changed the logic to sleep for 10ms before retrying. That should take care of non-preemptible UP kernels. v4 -> Tejun had more suggestions. - drop list_for_each_entry_all() - instead of msleep() use queue_delayed_work() - Some cleanups realted to more compact coding. v5-> tejun suggested more cleanups leading to more compact code. tj: - Relocated pcpu_stats into blkio_stat_alloc_fn(). - Minor comment update. - This also fixes suspicious RCU usage warning caused by invoking cgroup_path() from blkg_alloc() without holding RCU read lock. Now that blkg_alloc() doesn't require sleepable context, RCU read lock from blkg_lookup_create() is maintained throughout blkg_alloc(). Signed-off-by: Vivek Goyal <vgoyal@redhat.com> Signed-off-by: Tejun Heo <tj@kernel.org> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2012-03-09 02:53:56 +08:00
blkg = blkg_alloc(blkcg, q);
blkcg: factor out blkio_group creation Currently both blk-throttle and cfq-iosched implement their own blkio_group creation code in throtl_get_tg() and cfq_get_cfqg(). This patch factors out the common code into blkg_lookup_create(), which returns ERR_PTR value so that transitional failures due to queue bypass can be distinguished from other failures. * New plkio_policy_ops methods blkio_alloc_group_fn() and blkio_link_group_fn added. Both are transitional and will be removed once the blkg management code is fully moved into blk-cgroup.c. * blkio_alloc_group_fn() allocates policy-specific blkg which is usually a larger data structure with blkg as the first entry and intiailizes it. Note that initialization of blkg proper, including percpu stats, is responsibility of blk-cgroup proper. Note that default config (weight, bps...) initialization is done from this method; otherwise, we end up violating locking order between blkcg and q locks via blkcg_get_CONF() functions. * blkio_link_group_fn() is called under queue_lock and responsible for linking the blkg to the queue. blkcg side is handled by blk-cgroup proper. * The common blkg creation function is named blkg_lookup_create() and blkiocg_lookup_group() is renamed to blkg_lookup() for consistency. Also, throtl / cfq related functions are similarly [re]named for consistency. This simplifies blkcg policy implementations and enables further cleanup. -v2: Vivek noticed that blkg_lookup_create() incorrectly tested blk_queue_dead() instead of blk_queue_bypass() leading a user of the function ending up creating a new blkg on bypassing queue. This is a bug introduced while relocating bypass patches before this one. Fixed. -v3: ERR_PTR patch folded into this one. @for_root added to blkg_lookup_create() to allow creating root group on a bypassed queue during elevator switch. Signed-off-by: Tejun Heo <tj@kernel.org> Cc: Vivek Goyal <vgoyal@redhat.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2012-03-06 05:15:06 +08:00
/* did alloc fail? */
blkcg: alloc per cpu stats from worker thread in a delayed manner Current per cpu stat allocation assumes GFP_KERNEL allocation flag. But in IO path there are times when we want GFP_NOIO semantics. As there is no way to pass the allocation flags to alloc_percpu(), this patch delays the allocation of stats using a worker thread. v2-> tejun suggested following changes. Changed the patch accordingly. - move alloc_node location in structure - reduce the size of names of some of the fields - Reduce the scope of locking of alloc_list_lock - Simplified stat_alloc_fn() by allocating stats for all policies in one go and then assigning these to a group. v3 -> Andrew suggested to put some comments in the code. Also raised concerns about trying to allocate infinitely in case of allocation failure. I have changed the logic to sleep for 10ms before retrying. That should take care of non-preemptible UP kernels. v4 -> Tejun had more suggestions. - drop list_for_each_entry_all() - instead of msleep() use queue_delayed_work() - Some cleanups realted to more compact coding. v5-> tejun suggested more cleanups leading to more compact code. tj: - Relocated pcpu_stats into blkio_stat_alloc_fn(). - Minor comment update. - This also fixes suspicious RCU usage warning caused by invoking cgroup_path() from blkg_alloc() without holding RCU read lock. Now that blkg_alloc() doesn't require sleepable context, RCU read lock from blkg_lookup_create() is maintained throughout blkg_alloc(). Signed-off-by: Vivek Goyal <vgoyal@redhat.com> Signed-off-by: Tejun Heo <tj@kernel.org> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2012-03-09 02:53:56 +08:00
if (unlikely(!blkg)) {
blkcg: factor out blkio_group creation Currently both blk-throttle and cfq-iosched implement their own blkio_group creation code in throtl_get_tg() and cfq_get_cfqg(). This patch factors out the common code into blkg_lookup_create(), which returns ERR_PTR value so that transitional failures due to queue bypass can be distinguished from other failures. * New plkio_policy_ops methods blkio_alloc_group_fn() and blkio_link_group_fn added. Both are transitional and will be removed once the blkg management code is fully moved into blk-cgroup.c. * blkio_alloc_group_fn() allocates policy-specific blkg which is usually a larger data structure with blkg as the first entry and intiailizes it. Note that initialization of blkg proper, including percpu stats, is responsibility of blk-cgroup proper. Note that default config (weight, bps...) initialization is done from this method; otherwise, we end up violating locking order between blkcg and q locks via blkcg_get_CONF() functions. * blkio_link_group_fn() is called under queue_lock and responsible for linking the blkg to the queue. blkcg side is handled by blk-cgroup proper. * The common blkg creation function is named blkg_lookup_create() and blkiocg_lookup_group() is renamed to blkg_lookup() for consistency. Also, throtl / cfq related functions are similarly [re]named for consistency. This simplifies blkcg policy implementations and enables further cleanup. -v2: Vivek noticed that blkg_lookup_create() incorrectly tested blk_queue_dead() instead of blk_queue_bypass() leading a user of the function ending up creating a new blkg on bypassing queue. This is a bug introduced while relocating bypass patches before this one. Fixed. -v3: ERR_PTR patch folded into this one. @for_root added to blkg_lookup_create() to allow creating root group on a bypassed queue during elevator switch. Signed-off-by: Tejun Heo <tj@kernel.org> Cc: Vivek Goyal <vgoyal@redhat.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2012-03-06 05:15:06 +08:00
blkg = ERR_PTR(-ENOMEM);
goto out;
}
/* insert */
spin_lock(&blkcg->lock);
hlist_add_head_rcu(&blkg->blkcg_node, &blkcg->blkg_list);
blkcg: unify blkg's for blkcg policies Currently, blkg is per cgroup-queue-policy combination. This is unnatural and leads to various convolutions in partially used duplicate fields in blkg, config / stat access, and general management of blkgs. This patch make blkg's per cgroup-queue and let them serve all policies. blkgs are now created and destroyed by blkcg core proper. This will allow further consolidation of common management logic into blkcg core and API with better defined semantics and layering. As a transitional step to untangle blkg management, elvswitch and policy [de]registration, all blkgs except the root blkg are being shot down during elvswitch and bypass. This patch adds blkg_root_update() to update root blkg in place on policy change. This is hacky and racy but should be good enough as interim step until we get locking simplified and switch over to proper in-place update for all blkgs. -v2: Root blkgs need to be updated on elvswitch too and blkg_alloc() comment wasn't updated according to the function change. Fixed. Both pointed out by Vivek. -v3: v2 updated blkg_destroy_all() to invoke update_root_blkg_pd() for all policies. This freed root pd during elvswitch before the last queue finished exiting and led to oops. Directly invoke update_root_blkg_pd() only on BLKIO_POLICY_PROP from cfq_exit_queue(). This also is closer to what will be done with proper in-place blkg update. Reported by Vivek. Signed-off-by: Tejun Heo <tj@kernel.org> Cc: Vivek Goyal <vgoyal@redhat.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2012-03-06 05:15:20 +08:00
list_add(&blkg->q_node, &q->blkg_list);
blkcg: factor out blkio_group creation Currently both blk-throttle and cfq-iosched implement their own blkio_group creation code in throtl_get_tg() and cfq_get_cfqg(). This patch factors out the common code into blkg_lookup_create(), which returns ERR_PTR value so that transitional failures due to queue bypass can be distinguished from other failures. * New plkio_policy_ops methods blkio_alloc_group_fn() and blkio_link_group_fn added. Both are transitional and will be removed once the blkg management code is fully moved into blk-cgroup.c. * blkio_alloc_group_fn() allocates policy-specific blkg which is usually a larger data structure with blkg as the first entry and intiailizes it. Note that initialization of blkg proper, including percpu stats, is responsibility of blk-cgroup proper. Note that default config (weight, bps...) initialization is done from this method; otherwise, we end up violating locking order between blkcg and q locks via blkcg_get_CONF() functions. * blkio_link_group_fn() is called under queue_lock and responsible for linking the blkg to the queue. blkcg side is handled by blk-cgroup proper. * The common blkg creation function is named blkg_lookup_create() and blkiocg_lookup_group() is renamed to blkg_lookup() for consistency. Also, throtl / cfq related functions are similarly [re]named for consistency. This simplifies blkcg policy implementations and enables further cleanup. -v2: Vivek noticed that blkg_lookup_create() incorrectly tested blk_queue_dead() instead of blk_queue_bypass() leading a user of the function ending up creating a new blkg on bypassing queue. This is a bug introduced while relocating bypass patches before this one. Fixed. -v3: ERR_PTR patch folded into this one. @for_root added to blkg_lookup_create() to allow creating root group on a bypassed queue during elevator switch. Signed-off-by: Tejun Heo <tj@kernel.org> Cc: Vivek Goyal <vgoyal@redhat.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2012-03-06 05:15:06 +08:00
spin_unlock(&blkcg->lock);
blkcg: alloc per cpu stats from worker thread in a delayed manner Current per cpu stat allocation assumes GFP_KERNEL allocation flag. But in IO path there are times when we want GFP_NOIO semantics. As there is no way to pass the allocation flags to alloc_percpu(), this patch delays the allocation of stats using a worker thread. v2-> tejun suggested following changes. Changed the patch accordingly. - move alloc_node location in structure - reduce the size of names of some of the fields - Reduce the scope of locking of alloc_list_lock - Simplified stat_alloc_fn() by allocating stats for all policies in one go and then assigning these to a group. v3 -> Andrew suggested to put some comments in the code. Also raised concerns about trying to allocate infinitely in case of allocation failure. I have changed the logic to sleep for 10ms before retrying. That should take care of non-preemptible UP kernels. v4 -> Tejun had more suggestions. - drop list_for_each_entry_all() - instead of msleep() use queue_delayed_work() - Some cleanups realted to more compact coding. v5-> tejun suggested more cleanups leading to more compact code. tj: - Relocated pcpu_stats into blkio_stat_alloc_fn(). - Minor comment update. - This also fixes suspicious RCU usage warning caused by invoking cgroup_path() from blkg_alloc() without holding RCU read lock. Now that blkg_alloc() doesn't require sleepable context, RCU read lock from blkg_lookup_create() is maintained throughout blkg_alloc(). Signed-off-by: Vivek Goyal <vgoyal@redhat.com> Signed-off-by: Tejun Heo <tj@kernel.org> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2012-03-09 02:53:56 +08:00
spin_lock(&alloc_list_lock);
list_add(&blkg->alloc_node, &alloc_list);
/* Queue per cpu stat allocation from worker thread. */
queue_delayed_work(system_nrt_wq, &blkio_stat_alloc_work, 0);
spin_unlock(&alloc_list_lock);
blkcg: factor out blkio_group creation Currently both blk-throttle and cfq-iosched implement their own blkio_group creation code in throtl_get_tg() and cfq_get_cfqg(). This patch factors out the common code into blkg_lookup_create(), which returns ERR_PTR value so that transitional failures due to queue bypass can be distinguished from other failures. * New plkio_policy_ops methods blkio_alloc_group_fn() and blkio_link_group_fn added. Both are transitional and will be removed once the blkg management code is fully moved into blk-cgroup.c. * blkio_alloc_group_fn() allocates policy-specific blkg which is usually a larger data structure with blkg as the first entry and intiailizes it. Note that initialization of blkg proper, including percpu stats, is responsibility of blk-cgroup proper. Note that default config (weight, bps...) initialization is done from this method; otherwise, we end up violating locking order between blkcg and q locks via blkcg_get_CONF() functions. * blkio_link_group_fn() is called under queue_lock and responsible for linking the blkg to the queue. blkcg side is handled by blk-cgroup proper. * The common blkg creation function is named blkg_lookup_create() and blkiocg_lookup_group() is renamed to blkg_lookup() for consistency. Also, throtl / cfq related functions are similarly [re]named for consistency. This simplifies blkcg policy implementations and enables further cleanup. -v2: Vivek noticed that blkg_lookup_create() incorrectly tested blk_queue_dead() instead of blk_queue_bypass() leading a user of the function ending up creating a new blkg on bypassing queue. This is a bug introduced while relocating bypass patches before this one. Fixed. -v3: ERR_PTR patch folded into this one. @for_root added to blkg_lookup_create() to allow creating root group on a bypassed queue during elevator switch. Signed-off-by: Tejun Heo <tj@kernel.org> Cc: Vivek Goyal <vgoyal@redhat.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2012-03-06 05:15:06 +08:00
out:
return blkg;
}
blkcg: factor out blkio_group creation Currently both blk-throttle and cfq-iosched implement their own blkio_group creation code in throtl_get_tg() and cfq_get_cfqg(). This patch factors out the common code into blkg_lookup_create(), which returns ERR_PTR value so that transitional failures due to queue bypass can be distinguished from other failures. * New plkio_policy_ops methods blkio_alloc_group_fn() and blkio_link_group_fn added. Both are transitional and will be removed once the blkg management code is fully moved into blk-cgroup.c. * blkio_alloc_group_fn() allocates policy-specific blkg which is usually a larger data structure with blkg as the first entry and intiailizes it. Note that initialization of blkg proper, including percpu stats, is responsibility of blk-cgroup proper. Note that default config (weight, bps...) initialization is done from this method; otherwise, we end up violating locking order between blkcg and q locks via blkcg_get_CONF() functions. * blkio_link_group_fn() is called under queue_lock and responsible for linking the blkg to the queue. blkcg side is handled by blk-cgroup proper. * The common blkg creation function is named blkg_lookup_create() and blkiocg_lookup_group() is renamed to blkg_lookup() for consistency. Also, throtl / cfq related functions are similarly [re]named for consistency. This simplifies blkcg policy implementations and enables further cleanup. -v2: Vivek noticed that blkg_lookup_create() incorrectly tested blk_queue_dead() instead of blk_queue_bypass() leading a user of the function ending up creating a new blkg on bypassing queue. This is a bug introduced while relocating bypass patches before this one. Fixed. -v3: ERR_PTR patch folded into this one. @for_root added to blkg_lookup_create() to allow creating root group on a bypassed queue during elevator switch. Signed-off-by: Tejun Heo <tj@kernel.org> Cc: Vivek Goyal <vgoyal@redhat.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2012-03-06 05:15:06 +08:00
EXPORT_SYMBOL_GPL(blkg_lookup_create);
/* called under rcu_read_lock(). */
blkcg: factor out blkio_group creation Currently both blk-throttle and cfq-iosched implement their own blkio_group creation code in throtl_get_tg() and cfq_get_cfqg(). This patch factors out the common code into blkg_lookup_create(), which returns ERR_PTR value so that transitional failures due to queue bypass can be distinguished from other failures. * New plkio_policy_ops methods blkio_alloc_group_fn() and blkio_link_group_fn added. Both are transitional and will be removed once the blkg management code is fully moved into blk-cgroup.c. * blkio_alloc_group_fn() allocates policy-specific blkg which is usually a larger data structure with blkg as the first entry and intiailizes it. Note that initialization of blkg proper, including percpu stats, is responsibility of blk-cgroup proper. Note that default config (weight, bps...) initialization is done from this method; otherwise, we end up violating locking order between blkcg and q locks via blkcg_get_CONF() functions. * blkio_link_group_fn() is called under queue_lock and responsible for linking the blkg to the queue. blkcg side is handled by blk-cgroup proper. * The common blkg creation function is named blkg_lookup_create() and blkiocg_lookup_group() is renamed to blkg_lookup() for consistency. Also, throtl / cfq related functions are similarly [re]named for consistency. This simplifies blkcg policy implementations and enables further cleanup. -v2: Vivek noticed that blkg_lookup_create() incorrectly tested blk_queue_dead() instead of blk_queue_bypass() leading a user of the function ending up creating a new blkg on bypassing queue. This is a bug introduced while relocating bypass patches before this one. Fixed. -v3: ERR_PTR patch folded into this one. @for_root added to blkg_lookup_create() to allow creating root group on a bypassed queue during elevator switch. Signed-off-by: Tejun Heo <tj@kernel.org> Cc: Vivek Goyal <vgoyal@redhat.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2012-03-06 05:15:06 +08:00
struct blkio_group *blkg_lookup(struct blkio_cgroup *blkcg,
blkcg: unify blkg's for blkcg policies Currently, blkg is per cgroup-queue-policy combination. This is unnatural and leads to various convolutions in partially used duplicate fields in blkg, config / stat access, and general management of blkgs. This patch make blkg's per cgroup-queue and let them serve all policies. blkgs are now created and destroyed by blkcg core proper. This will allow further consolidation of common management logic into blkcg core and API with better defined semantics and layering. As a transitional step to untangle blkg management, elvswitch and policy [de]registration, all blkgs except the root blkg are being shot down during elvswitch and bypass. This patch adds blkg_root_update() to update root blkg in place on policy change. This is hacky and racy but should be good enough as interim step until we get locking simplified and switch over to proper in-place update for all blkgs. -v2: Root blkgs need to be updated on elvswitch too and blkg_alloc() comment wasn't updated according to the function change. Fixed. Both pointed out by Vivek. -v3: v2 updated blkg_destroy_all() to invoke update_root_blkg_pd() for all policies. This freed root pd during elvswitch before the last queue finished exiting and led to oops. Directly invoke update_root_blkg_pd() only on BLKIO_POLICY_PROP from cfq_exit_queue(). This also is closer to what will be done with proper in-place blkg update. Reported by Vivek. Signed-off-by: Tejun Heo <tj@kernel.org> Cc: Vivek Goyal <vgoyal@redhat.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2012-03-06 05:15:20 +08:00
struct request_queue *q)
{
struct blkio_group *blkg;
struct hlist_node *n;
hlist_for_each_entry_rcu(blkg, n, &blkcg->blkg_list, blkcg_node)
blkcg: unify blkg's for blkcg policies Currently, blkg is per cgroup-queue-policy combination. This is unnatural and leads to various convolutions in partially used duplicate fields in blkg, config / stat access, and general management of blkgs. This patch make blkg's per cgroup-queue and let them serve all policies. blkgs are now created and destroyed by blkcg core proper. This will allow further consolidation of common management logic into blkcg core and API with better defined semantics and layering. As a transitional step to untangle blkg management, elvswitch and policy [de]registration, all blkgs except the root blkg are being shot down during elvswitch and bypass. This patch adds blkg_root_update() to update root blkg in place on policy change. This is hacky and racy but should be good enough as interim step until we get locking simplified and switch over to proper in-place update for all blkgs. -v2: Root blkgs need to be updated on elvswitch too and blkg_alloc() comment wasn't updated according to the function change. Fixed. Both pointed out by Vivek. -v3: v2 updated blkg_destroy_all() to invoke update_root_blkg_pd() for all policies. This freed root pd during elvswitch before the last queue finished exiting and led to oops. Directly invoke update_root_blkg_pd() only on BLKIO_POLICY_PROP from cfq_exit_queue(). This also is closer to what will be done with proper in-place blkg update. Reported by Vivek. Signed-off-by: Tejun Heo <tj@kernel.org> Cc: Vivek Goyal <vgoyal@redhat.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2012-03-06 05:15:20 +08:00
if (blkg->q == q)
return blkg;
return NULL;
}
blkcg: factor out blkio_group creation Currently both blk-throttle and cfq-iosched implement their own blkio_group creation code in throtl_get_tg() and cfq_get_cfqg(). This patch factors out the common code into blkg_lookup_create(), which returns ERR_PTR value so that transitional failures due to queue bypass can be distinguished from other failures. * New plkio_policy_ops methods blkio_alloc_group_fn() and blkio_link_group_fn added. Both are transitional and will be removed once the blkg management code is fully moved into blk-cgroup.c. * blkio_alloc_group_fn() allocates policy-specific blkg which is usually a larger data structure with blkg as the first entry and intiailizes it. Note that initialization of blkg proper, including percpu stats, is responsibility of blk-cgroup proper. Note that default config (weight, bps...) initialization is done from this method; otherwise, we end up violating locking order between blkcg and q locks via blkcg_get_CONF() functions. * blkio_link_group_fn() is called under queue_lock and responsible for linking the blkg to the queue. blkcg side is handled by blk-cgroup proper. * The common blkg creation function is named blkg_lookup_create() and blkiocg_lookup_group() is renamed to blkg_lookup() for consistency. Also, throtl / cfq related functions are similarly [re]named for consistency. This simplifies blkcg policy implementations and enables further cleanup. -v2: Vivek noticed that blkg_lookup_create() incorrectly tested blk_queue_dead() instead of blk_queue_bypass() leading a user of the function ending up creating a new blkg on bypassing queue. This is a bug introduced while relocating bypass patches before this one. Fixed. -v3: ERR_PTR patch folded into this one. @for_root added to blkg_lookup_create() to allow creating root group on a bypassed queue during elevator switch. Signed-off-by: Tejun Heo <tj@kernel.org> Cc: Vivek Goyal <vgoyal@redhat.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2012-03-06 05:15:06 +08:00
EXPORT_SYMBOL_GPL(blkg_lookup);
blkcg: unify blkg's for blkcg policies Currently, blkg is per cgroup-queue-policy combination. This is unnatural and leads to various convolutions in partially used duplicate fields in blkg, config / stat access, and general management of blkgs. This patch make blkg's per cgroup-queue and let them serve all policies. blkgs are now created and destroyed by blkcg core proper. This will allow further consolidation of common management logic into blkcg core and API with better defined semantics and layering. As a transitional step to untangle blkg management, elvswitch and policy [de]registration, all blkgs except the root blkg are being shot down during elvswitch and bypass. This patch adds blkg_root_update() to update root blkg in place on policy change. This is hacky and racy but should be good enough as interim step until we get locking simplified and switch over to proper in-place update for all blkgs. -v2: Root blkgs need to be updated on elvswitch too and blkg_alloc() comment wasn't updated according to the function change. Fixed. Both pointed out by Vivek. -v3: v2 updated blkg_destroy_all() to invoke update_root_blkg_pd() for all policies. This freed root pd during elvswitch before the last queue finished exiting and led to oops. Directly invoke update_root_blkg_pd() only on BLKIO_POLICY_PROP from cfq_exit_queue(). This also is closer to what will be done with proper in-place blkg update. Reported by Vivek. Signed-off-by: Tejun Heo <tj@kernel.org> Cc: Vivek Goyal <vgoyal@redhat.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2012-03-06 05:15:20 +08:00
static void blkg_destroy(struct blkio_group *blkg)
{
struct request_queue *q = blkg->q;
struct blkio_cgroup *blkcg = blkg->blkcg;
lockdep_assert_held(q->queue_lock);
lockdep_assert_held(&blkcg->lock);
/* Something wrong if we are trying to remove same group twice */
blkcg: unify blkg's for blkcg policies Currently, blkg is per cgroup-queue-policy combination. This is unnatural and leads to various convolutions in partially used duplicate fields in blkg, config / stat access, and general management of blkgs. This patch make blkg's per cgroup-queue and let them serve all policies. blkgs are now created and destroyed by blkcg core proper. This will allow further consolidation of common management logic into blkcg core and API with better defined semantics and layering. As a transitional step to untangle blkg management, elvswitch and policy [de]registration, all blkgs except the root blkg are being shot down during elvswitch and bypass. This patch adds blkg_root_update() to update root blkg in place on policy change. This is hacky and racy but should be good enough as interim step until we get locking simplified and switch over to proper in-place update for all blkgs. -v2: Root blkgs need to be updated on elvswitch too and blkg_alloc() comment wasn't updated according to the function change. Fixed. Both pointed out by Vivek. -v3: v2 updated blkg_destroy_all() to invoke update_root_blkg_pd() for all policies. This freed root pd during elvswitch before the last queue finished exiting and led to oops. Directly invoke update_root_blkg_pd() only on BLKIO_POLICY_PROP from cfq_exit_queue(). This also is closer to what will be done with proper in-place blkg update. Reported by Vivek. Signed-off-by: Tejun Heo <tj@kernel.org> Cc: Vivek Goyal <vgoyal@redhat.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2012-03-06 05:15:20 +08:00
WARN_ON_ONCE(list_empty(&blkg->q_node));
WARN_ON_ONCE(hlist_unhashed(&blkg->blkcg_node));
blkcg: unify blkg's for blkcg policies Currently, blkg is per cgroup-queue-policy combination. This is unnatural and leads to various convolutions in partially used duplicate fields in blkg, config / stat access, and general management of blkgs. This patch make blkg's per cgroup-queue and let them serve all policies. blkgs are now created and destroyed by blkcg core proper. This will allow further consolidation of common management logic into blkcg core and API with better defined semantics and layering. As a transitional step to untangle blkg management, elvswitch and policy [de]registration, all blkgs except the root blkg are being shot down during elvswitch and bypass. This patch adds blkg_root_update() to update root blkg in place on policy change. This is hacky and racy but should be good enough as interim step until we get locking simplified and switch over to proper in-place update for all blkgs. -v2: Root blkgs need to be updated on elvswitch too and blkg_alloc() comment wasn't updated according to the function change. Fixed. Both pointed out by Vivek. -v3: v2 updated blkg_destroy_all() to invoke update_root_blkg_pd() for all policies. This freed root pd during elvswitch before the last queue finished exiting and led to oops. Directly invoke update_root_blkg_pd() only on BLKIO_POLICY_PROP from cfq_exit_queue(). This also is closer to what will be done with proper in-place blkg update. Reported by Vivek. Signed-off-by: Tejun Heo <tj@kernel.org> Cc: Vivek Goyal <vgoyal@redhat.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2012-03-06 05:15:20 +08:00
list_del_init(&blkg->q_node);
hlist_del_init_rcu(&blkg->blkcg_node);
blkcg: alloc per cpu stats from worker thread in a delayed manner Current per cpu stat allocation assumes GFP_KERNEL allocation flag. But in IO path there are times when we want GFP_NOIO semantics. As there is no way to pass the allocation flags to alloc_percpu(), this patch delays the allocation of stats using a worker thread. v2-> tejun suggested following changes. Changed the patch accordingly. - move alloc_node location in structure - reduce the size of names of some of the fields - Reduce the scope of locking of alloc_list_lock - Simplified stat_alloc_fn() by allocating stats for all policies in one go and then assigning these to a group. v3 -> Andrew suggested to put some comments in the code. Also raised concerns about trying to allocate infinitely in case of allocation failure. I have changed the logic to sleep for 10ms before retrying. That should take care of non-preemptible UP kernels. v4 -> Tejun had more suggestions. - drop list_for_each_entry_all() - instead of msleep() use queue_delayed_work() - Some cleanups realted to more compact coding. v5-> tejun suggested more cleanups leading to more compact code. tj: - Relocated pcpu_stats into blkio_stat_alloc_fn(). - Minor comment update. - This also fixes suspicious RCU usage warning caused by invoking cgroup_path() from blkg_alloc() without holding RCU read lock. Now that blkg_alloc() doesn't require sleepable context, RCU read lock from blkg_lookup_create() is maintained throughout blkg_alloc(). Signed-off-by: Vivek Goyal <vgoyal@redhat.com> Signed-off-by: Tejun Heo <tj@kernel.org> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2012-03-09 02:53:56 +08:00
spin_lock(&alloc_list_lock);
list_del_init(&blkg->alloc_node);
spin_unlock(&alloc_list_lock);
/*
* Put the reference taken at the time of creation so that when all
* queues are gone, group can be destroyed.
*/
blkg_put(blkg);
}
blkcg: unify blkg's for blkcg policies Currently, blkg is per cgroup-queue-policy combination. This is unnatural and leads to various convolutions in partially used duplicate fields in blkg, config / stat access, and general management of blkgs. This patch make blkg's per cgroup-queue and let them serve all policies. blkgs are now created and destroyed by blkcg core proper. This will allow further consolidation of common management logic into blkcg core and API with better defined semantics and layering. As a transitional step to untangle blkg management, elvswitch and policy [de]registration, all blkgs except the root blkg are being shot down during elvswitch and bypass. This patch adds blkg_root_update() to update root blkg in place on policy change. This is hacky and racy but should be good enough as interim step until we get locking simplified and switch over to proper in-place update for all blkgs. -v2: Root blkgs need to be updated on elvswitch too and blkg_alloc() comment wasn't updated according to the function change. Fixed. Both pointed out by Vivek. -v3: v2 updated blkg_destroy_all() to invoke update_root_blkg_pd() for all policies. This freed root pd during elvswitch before the last queue finished exiting and led to oops. Directly invoke update_root_blkg_pd() only on BLKIO_POLICY_PROP from cfq_exit_queue(). This also is closer to what will be done with proper in-place blkg update. Reported by Vivek. Signed-off-by: Tejun Heo <tj@kernel.org> Cc: Vivek Goyal <vgoyal@redhat.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2012-03-06 05:15:20 +08:00
/*
* XXX: This updates blkg policy data in-place for root blkg, which is
* necessary across elevator switch and policy registration as root blkgs
* aren't shot down. This broken and racy implementation is temporary.
* Eventually, blkg shoot down will be replaced by proper in-place update.
*/
void update_root_blkg_pd(struct request_queue *q, enum blkio_policy_id plid)
{
struct blkio_policy_type *pol = blkio_policy[plid];
struct blkio_group *blkg = blkg_lookup(&blkio_root_cgroup, q);
struct blkg_policy_data *pd;
if (!blkg)
return;
kfree(blkg->pd[plid]);
blkg->pd[plid] = NULL;
if (!pol)
return;
pd = kzalloc(sizeof(*pd) + pol->pdata_size, GFP_KERNEL);
WARN_ON_ONCE(!pd);
pd->stats_cpu = alloc_percpu(struct blkio_group_stats_cpu);
WARN_ON_ONCE(!pd->stats_cpu);
blkg->pd[plid] = pd;
pd->blkg = blkg;
pol->ops.blkio_init_group_fn(blkg);
}
EXPORT_SYMBOL_GPL(update_root_blkg_pd);
/**
* blkg_destroy_all - destroy all blkgs associated with a request_queue
* @q: request_queue of interest
* @destroy_root: whether to destroy root blkg or not
*
* Destroy blkgs associated with @q. If @destroy_root is %true, all are
* destroyed; otherwise, root blkg is left alone.
*/
blkcg: unify blkg's for blkcg policies Currently, blkg is per cgroup-queue-policy combination. This is unnatural and leads to various convolutions in partially used duplicate fields in blkg, config / stat access, and general management of blkgs. This patch make blkg's per cgroup-queue and let them serve all policies. blkgs are now created and destroyed by blkcg core proper. This will allow further consolidation of common management logic into blkcg core and API with better defined semantics and layering. As a transitional step to untangle blkg management, elvswitch and policy [de]registration, all blkgs except the root blkg are being shot down during elvswitch and bypass. This patch adds blkg_root_update() to update root blkg in place on policy change. This is hacky and racy but should be good enough as interim step until we get locking simplified and switch over to proper in-place update for all blkgs. -v2: Root blkgs need to be updated on elvswitch too and blkg_alloc() comment wasn't updated according to the function change. Fixed. Both pointed out by Vivek. -v3: v2 updated blkg_destroy_all() to invoke update_root_blkg_pd() for all policies. This freed root pd during elvswitch before the last queue finished exiting and led to oops. Directly invoke update_root_blkg_pd() only on BLKIO_POLICY_PROP from cfq_exit_queue(). This also is closer to what will be done with proper in-place blkg update. Reported by Vivek. Signed-off-by: Tejun Heo <tj@kernel.org> Cc: Vivek Goyal <vgoyal@redhat.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2012-03-06 05:15:20 +08:00
void blkg_destroy_all(struct request_queue *q, bool destroy_root)
blkcg: shoot down blkio_groups on elevator switch Elevator switch may involve changes to blkcg policies. Implement shoot down of blkio_groups. Combined with the previous bypass updates, the end goal is updating blkcg core such that it can ensure that blkcg's being affected become quiescent and don't have any per-blkg data hanging around before commencing any policy updates. Until queues are made aware of the policies that applies to them, as an interim step, all per-policy blkg data will be shot down. * blk-throtl doesn't need this change as it can't be disabled for a live queue; however, update it anyway as the scheduled blkg unification requires this behavior change. This means that blk-throtl configuration will be unnecessarily lost over elevator switch. This oddity will be removed after blkcg learns to associate individual policies with request_queues. * blk-throtl dosen't shoot down root_tg. This is to ease transition. Unified blkg will always have persistent root group and not shooting down root_tg for now eases transition to that point by avoiding having to update td->root_tg and is safe as blk-throtl can never be disabled -v2: Vivek pointed out that group list is not guaranteed to be empty on return from clear function if it raced cgroup removal and lost. Fix it by waiting a bit and retrying. This kludge will soon be removed once locking is updated such that blkg is never in limbo state between blkcg and request_queue locks. blk-throtl no longer shoots down root_tg to avoid breaking td->root_tg. Also, Nest queue_lock inside blkio_list_lock not the other way around to avoid introduce possible deadlock via blkcg lock. -v3: blkcg_clear_queue() repositioned and renamed to blkg_destroy_all() to increase consistency with later changes. cfq_clear_queue() updated to check q->elevator before dereferencing it to avoid NULL dereference on not fully initialized queues (used by later change). Signed-off-by: Tejun Heo <tj@kernel.org> Cc: Vivek Goyal <vgoyal@redhat.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2012-03-06 05:15:00 +08:00
{
struct blkio_group *blkg, *n;
blkcg: shoot down blkio_groups on elevator switch Elevator switch may involve changes to blkcg policies. Implement shoot down of blkio_groups. Combined with the previous bypass updates, the end goal is updating blkcg core such that it can ensure that blkcg's being affected become quiescent and don't have any per-blkg data hanging around before commencing any policy updates. Until queues are made aware of the policies that applies to them, as an interim step, all per-policy blkg data will be shot down. * blk-throtl doesn't need this change as it can't be disabled for a live queue; however, update it anyway as the scheduled blkg unification requires this behavior change. This means that blk-throtl configuration will be unnecessarily lost over elevator switch. This oddity will be removed after blkcg learns to associate individual policies with request_queues. * blk-throtl dosen't shoot down root_tg. This is to ease transition. Unified blkg will always have persistent root group and not shooting down root_tg for now eases transition to that point by avoiding having to update td->root_tg and is safe as blk-throtl can never be disabled -v2: Vivek pointed out that group list is not guaranteed to be empty on return from clear function if it raced cgroup removal and lost. Fix it by waiting a bit and retrying. This kludge will soon be removed once locking is updated such that blkg is never in limbo state between blkcg and request_queue locks. blk-throtl no longer shoots down root_tg to avoid breaking td->root_tg. Also, Nest queue_lock inside blkio_list_lock not the other way around to avoid introduce possible deadlock via blkcg lock. -v3: blkcg_clear_queue() repositioned and renamed to blkg_destroy_all() to increase consistency with later changes. cfq_clear_queue() updated to check q->elevator before dereferencing it to avoid NULL dereference on not fully initialized queues (used by later change). Signed-off-by: Tejun Heo <tj@kernel.org> Cc: Vivek Goyal <vgoyal@redhat.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2012-03-06 05:15:00 +08:00
spin_lock_irq(q->queue_lock);
blkcg: shoot down blkio_groups on elevator switch Elevator switch may involve changes to blkcg policies. Implement shoot down of blkio_groups. Combined with the previous bypass updates, the end goal is updating blkcg core such that it can ensure that blkcg's being affected become quiescent and don't have any per-blkg data hanging around before commencing any policy updates. Until queues are made aware of the policies that applies to them, as an interim step, all per-policy blkg data will be shot down. * blk-throtl doesn't need this change as it can't be disabled for a live queue; however, update it anyway as the scheduled blkg unification requires this behavior change. This means that blk-throtl configuration will be unnecessarily lost over elevator switch. This oddity will be removed after blkcg learns to associate individual policies with request_queues. * blk-throtl dosen't shoot down root_tg. This is to ease transition. Unified blkg will always have persistent root group and not shooting down root_tg for now eases transition to that point by avoiding having to update td->root_tg and is safe as blk-throtl can never be disabled -v2: Vivek pointed out that group list is not guaranteed to be empty on return from clear function if it raced cgroup removal and lost. Fix it by waiting a bit and retrying. This kludge will soon be removed once locking is updated such that blkg is never in limbo state between blkcg and request_queue locks. blk-throtl no longer shoots down root_tg to avoid breaking td->root_tg. Also, Nest queue_lock inside blkio_list_lock not the other way around to avoid introduce possible deadlock via blkcg lock. -v3: blkcg_clear_queue() repositioned and renamed to blkg_destroy_all() to increase consistency with later changes. cfq_clear_queue() updated to check q->elevator before dereferencing it to avoid NULL dereference on not fully initialized queues (used by later change). Signed-off-by: Tejun Heo <tj@kernel.org> Cc: Vivek Goyal <vgoyal@redhat.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2012-03-06 05:15:00 +08:00
list_for_each_entry_safe(blkg, n, &q->blkg_list, q_node) {
struct blkio_cgroup *blkcg = blkg->blkcg;
blkcg: shoot down blkio_groups on elevator switch Elevator switch may involve changes to blkcg policies. Implement shoot down of blkio_groups. Combined with the previous bypass updates, the end goal is updating blkcg core such that it can ensure that blkcg's being affected become quiescent and don't have any per-blkg data hanging around before commencing any policy updates. Until queues are made aware of the policies that applies to them, as an interim step, all per-policy blkg data will be shot down. * blk-throtl doesn't need this change as it can't be disabled for a live queue; however, update it anyway as the scheduled blkg unification requires this behavior change. This means that blk-throtl configuration will be unnecessarily lost over elevator switch. This oddity will be removed after blkcg learns to associate individual policies with request_queues. * blk-throtl dosen't shoot down root_tg. This is to ease transition. Unified blkg will always have persistent root group and not shooting down root_tg for now eases transition to that point by avoiding having to update td->root_tg and is safe as blk-throtl can never be disabled -v2: Vivek pointed out that group list is not guaranteed to be empty on return from clear function if it raced cgroup removal and lost. Fix it by waiting a bit and retrying. This kludge will soon be removed once locking is updated such that blkg is never in limbo state between blkcg and request_queue locks. blk-throtl no longer shoots down root_tg to avoid breaking td->root_tg. Also, Nest queue_lock inside blkio_list_lock not the other way around to avoid introduce possible deadlock via blkcg lock. -v3: blkcg_clear_queue() repositioned and renamed to blkg_destroy_all() to increase consistency with later changes. cfq_clear_queue() updated to check q->elevator before dereferencing it to avoid NULL dereference on not fully initialized queues (used by later change). Signed-off-by: Tejun Heo <tj@kernel.org> Cc: Vivek Goyal <vgoyal@redhat.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2012-03-06 05:15:00 +08:00
/* skip root? */
if (!destroy_root && blkg->blkcg == &blkio_root_cgroup)
continue;
blkcg: shoot down blkio_groups on elevator switch Elevator switch may involve changes to blkcg policies. Implement shoot down of blkio_groups. Combined with the previous bypass updates, the end goal is updating blkcg core such that it can ensure that blkcg's being affected become quiescent and don't have any per-blkg data hanging around before commencing any policy updates. Until queues are made aware of the policies that applies to them, as an interim step, all per-policy blkg data will be shot down. * blk-throtl doesn't need this change as it can't be disabled for a live queue; however, update it anyway as the scheduled blkg unification requires this behavior change. This means that blk-throtl configuration will be unnecessarily lost over elevator switch. This oddity will be removed after blkcg learns to associate individual policies with request_queues. * blk-throtl dosen't shoot down root_tg. This is to ease transition. Unified blkg will always have persistent root group and not shooting down root_tg for now eases transition to that point by avoiding having to update td->root_tg and is safe as blk-throtl can never be disabled -v2: Vivek pointed out that group list is not guaranteed to be empty on return from clear function if it raced cgroup removal and lost. Fix it by waiting a bit and retrying. This kludge will soon be removed once locking is updated such that blkg is never in limbo state between blkcg and request_queue locks. blk-throtl no longer shoots down root_tg to avoid breaking td->root_tg. Also, Nest queue_lock inside blkio_list_lock not the other way around to avoid introduce possible deadlock via blkcg lock. -v3: blkcg_clear_queue() repositioned and renamed to blkg_destroy_all() to increase consistency with later changes. cfq_clear_queue() updated to check q->elevator before dereferencing it to avoid NULL dereference on not fully initialized queues (used by later change). Signed-off-by: Tejun Heo <tj@kernel.org> Cc: Vivek Goyal <vgoyal@redhat.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2012-03-06 05:15:00 +08:00
spin_lock(&blkcg->lock);
blkg_destroy(blkg);
spin_unlock(&blkcg->lock);
blkcg: shoot down blkio_groups on elevator switch Elevator switch may involve changes to blkcg policies. Implement shoot down of blkio_groups. Combined with the previous bypass updates, the end goal is updating blkcg core such that it can ensure that blkcg's being affected become quiescent and don't have any per-blkg data hanging around before commencing any policy updates. Until queues are made aware of the policies that applies to them, as an interim step, all per-policy blkg data will be shot down. * blk-throtl doesn't need this change as it can't be disabled for a live queue; however, update it anyway as the scheduled blkg unification requires this behavior change. This means that blk-throtl configuration will be unnecessarily lost over elevator switch. This oddity will be removed after blkcg learns to associate individual policies with request_queues. * blk-throtl dosen't shoot down root_tg. This is to ease transition. Unified blkg will always have persistent root group and not shooting down root_tg for now eases transition to that point by avoiding having to update td->root_tg and is safe as blk-throtl can never be disabled -v2: Vivek pointed out that group list is not guaranteed to be empty on return from clear function if it raced cgroup removal and lost. Fix it by waiting a bit and retrying. This kludge will soon be removed once locking is updated such that blkg is never in limbo state between blkcg and request_queue locks. blk-throtl no longer shoots down root_tg to avoid breaking td->root_tg. Also, Nest queue_lock inside blkio_list_lock not the other way around to avoid introduce possible deadlock via blkcg lock. -v3: blkcg_clear_queue() repositioned and renamed to blkg_destroy_all() to increase consistency with later changes. cfq_clear_queue() updated to check q->elevator before dereferencing it to avoid NULL dereference on not fully initialized queues (used by later change). Signed-off-by: Tejun Heo <tj@kernel.org> Cc: Vivek Goyal <vgoyal@redhat.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2012-03-06 05:15:00 +08:00
}
spin_unlock_irq(q->queue_lock);
blkcg: shoot down blkio_groups on elevator switch Elevator switch may involve changes to blkcg policies. Implement shoot down of blkio_groups. Combined with the previous bypass updates, the end goal is updating blkcg core such that it can ensure that blkcg's being affected become quiescent and don't have any per-blkg data hanging around before commencing any policy updates. Until queues are made aware of the policies that applies to them, as an interim step, all per-policy blkg data will be shot down. * blk-throtl doesn't need this change as it can't be disabled for a live queue; however, update it anyway as the scheduled blkg unification requires this behavior change. This means that blk-throtl configuration will be unnecessarily lost over elevator switch. This oddity will be removed after blkcg learns to associate individual policies with request_queues. * blk-throtl dosen't shoot down root_tg. This is to ease transition. Unified blkg will always have persistent root group and not shooting down root_tg for now eases transition to that point by avoiding having to update td->root_tg and is safe as blk-throtl can never be disabled -v2: Vivek pointed out that group list is not guaranteed to be empty on return from clear function if it raced cgroup removal and lost. Fix it by waiting a bit and retrying. This kludge will soon be removed once locking is updated such that blkg is never in limbo state between blkcg and request_queue locks. blk-throtl no longer shoots down root_tg to avoid breaking td->root_tg. Also, Nest queue_lock inside blkio_list_lock not the other way around to avoid introduce possible deadlock via blkcg lock. -v3: blkcg_clear_queue() repositioned and renamed to blkg_destroy_all() to increase consistency with later changes. cfq_clear_queue() updated to check q->elevator before dereferencing it to avoid NULL dereference on not fully initialized queues (used by later change). Signed-off-by: Tejun Heo <tj@kernel.org> Cc: Vivek Goyal <vgoyal@redhat.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2012-03-06 05:15:00 +08:00
}
EXPORT_SYMBOL_GPL(blkg_destroy_all);
blkcg: shoot down blkio_groups on elevator switch Elevator switch may involve changes to blkcg policies. Implement shoot down of blkio_groups. Combined with the previous bypass updates, the end goal is updating blkcg core such that it can ensure that blkcg's being affected become quiescent and don't have any per-blkg data hanging around before commencing any policy updates. Until queues are made aware of the policies that applies to them, as an interim step, all per-policy blkg data will be shot down. * blk-throtl doesn't need this change as it can't be disabled for a live queue; however, update it anyway as the scheduled blkg unification requires this behavior change. This means that blk-throtl configuration will be unnecessarily lost over elevator switch. This oddity will be removed after blkcg learns to associate individual policies with request_queues. * blk-throtl dosen't shoot down root_tg. This is to ease transition. Unified blkg will always have persistent root group and not shooting down root_tg for now eases transition to that point by avoiding having to update td->root_tg and is safe as blk-throtl can never be disabled -v2: Vivek pointed out that group list is not guaranteed to be empty on return from clear function if it raced cgroup removal and lost. Fix it by waiting a bit and retrying. This kludge will soon be removed once locking is updated such that blkg is never in limbo state between blkcg and request_queue locks. blk-throtl no longer shoots down root_tg to avoid breaking td->root_tg. Also, Nest queue_lock inside blkio_list_lock not the other way around to avoid introduce possible deadlock via blkcg lock. -v3: blkcg_clear_queue() repositioned and renamed to blkg_destroy_all() to increase consistency with later changes. cfq_clear_queue() updated to check q->elevator before dereferencing it to avoid NULL dereference on not fully initialized queues (used by later change). Signed-off-by: Tejun Heo <tj@kernel.org> Cc: Vivek Goyal <vgoyal@redhat.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2012-03-06 05:15:00 +08:00
static void blkg_rcu_free(struct rcu_head *rcu_head)
{
blkg_free(container_of(rcu_head, struct blkio_group, rcu_head));
}
void __blkg_release(struct blkio_group *blkg)
{
/* release the extra blkcg reference this blkg has been holding */
css_put(&blkg->blkcg->css);
/*
* A group is freed in rcu manner. But having an rcu lock does not
* mean that one can access all the fields of blkg and assume these
* are valid. For example, don't try to follow throtl_data and
* request queue links.
*
* Having a reference to blkg under an rcu allows acess to only
* values local to groups like group stats and group rate limits
*/
call_rcu(&blkg->rcu_head, blkg_rcu_free);
}
EXPORT_SYMBOL_GPL(__blkg_release);
static void blkio_reset_stats_cpu(struct blkio_group *blkg, int plid)
{
struct blkg_policy_data *pd = blkg->pd[plid];
int cpu;
blkcg: alloc per cpu stats from worker thread in a delayed manner Current per cpu stat allocation assumes GFP_KERNEL allocation flag. But in IO path there are times when we want GFP_NOIO semantics. As there is no way to pass the allocation flags to alloc_percpu(), this patch delays the allocation of stats using a worker thread. v2-> tejun suggested following changes. Changed the patch accordingly. - move alloc_node location in structure - reduce the size of names of some of the fields - Reduce the scope of locking of alloc_list_lock - Simplified stat_alloc_fn() by allocating stats for all policies in one go and then assigning these to a group. v3 -> Andrew suggested to put some comments in the code. Also raised concerns about trying to allocate infinitely in case of allocation failure. I have changed the logic to sleep for 10ms before retrying. That should take care of non-preemptible UP kernels. v4 -> Tejun had more suggestions. - drop list_for_each_entry_all() - instead of msleep() use queue_delayed_work() - Some cleanups realted to more compact coding. v5-> tejun suggested more cleanups leading to more compact code. tj: - Relocated pcpu_stats into blkio_stat_alloc_fn(). - Minor comment update. - This also fixes suspicious RCU usage warning caused by invoking cgroup_path() from blkg_alloc() without holding RCU read lock. Now that blkg_alloc() doesn't require sleepable context, RCU read lock from blkg_lookup_create() is maintained throughout blkg_alloc(). Signed-off-by: Vivek Goyal <vgoyal@redhat.com> Signed-off-by: Tejun Heo <tj@kernel.org> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2012-03-09 02:53:56 +08:00
if (pd->stats_cpu == NULL)
return;
for_each_possible_cpu(cpu) {
struct blkio_group_stats_cpu *sc =
per_cpu_ptr(pd->stats_cpu, cpu);
blkcg: introduce blkg_stat and blkg_rwstat blkcg uses u64_stats_sync to avoid reading wrong u64 statistic values on 32bit archs and some stat counters have subtypes to distinguish read/writes and sync/async IOs. The stat code paths are confusing and involve a lot of going back and forth between blkcg core and specific policy implementations, and synchronization and subtype handling are open coded in blkcg core. This patch introduces struct blkg_stat and blkg_rwstat which, with accompanying operations, encapsulate stat updating and accessing with proper synchronization. blkg_stat is simple u64 counter with 64bit read-access protection. blkg_rwstat is the one with rw and [a]sync subcounters and takes @rw flags to distinguish IO subtypes (%REQ_WRITE and %REQ_SYNC) and replaces stat_sub_type indexed arrays. All counters in blkio_group_stats and blkio_group_stats_cpu are replaced with either blkg_stat or blkg_rwstat along with all users. This does add one u64_stats_sync per counter and increase stats_sync operations but they're empty/noops on 64bit archs and blkcg doesn't have too many counters, especially with DEBUG_BLK_CGROUP off. While the currently resulting code isn't necessarily simpler at the moment, this will enable further clean up of blkcg stats code. - BLKIO_STAT_{READ|WRITE|SYNC|ASYNC|TOTAL} renamed to BLKG_RWSTAT_{READ|WRITE|SYNC|ASYNC|TOTAL}. - blkg_stat_add() replaces blkio_add_stat() and blkio_check_and_dec_stat(). Note that BUG_ON() on underflow in the latter function no longer exists. It's *way* better to have underflowed stat counters than oopsing. - blkio_group_stats->dequeue is now a proper u64 stat counter instead of ulong. - reset_stats() updated to clear each stat counters individually and BLKG_STATS_DEBUG_CLEAR_{START|SIZE} are removed. - Some functions reconstruct rw flags from direction and sync booleans. This will be removed by future patches. Signed-off-by: Tejun Heo <tj@kernel.org>
2012-04-02 05:38:42 +08:00
blkg_rwstat_reset(&sc->service_bytes);
blkg_rwstat_reset(&sc->serviced);
blkg_stat_reset(&sc->sectors);
}
}
static int
2010-04-09 14:31:19 +08:00
blkiocg_reset_stats(struct cgroup *cgroup, struct cftype *cftype, u64 val)
{
struct blkio_cgroup *blkcg = cgroup_to_blkio_cgroup(cgroup);
struct blkio_group *blkg;
struct hlist_node *n;
blkcg: unify blkg's for blkcg policies Currently, blkg is per cgroup-queue-policy combination. This is unnatural and leads to various convolutions in partially used duplicate fields in blkg, config / stat access, and general management of blkgs. This patch make blkg's per cgroup-queue and let them serve all policies. blkgs are now created and destroyed by blkcg core proper. This will allow further consolidation of common management logic into blkcg core and API with better defined semantics and layering. As a transitional step to untangle blkg management, elvswitch and policy [de]registration, all blkgs except the root blkg are being shot down during elvswitch and bypass. This patch adds blkg_root_update() to update root blkg in place on policy change. This is hacky and racy but should be good enough as interim step until we get locking simplified and switch over to proper in-place update for all blkgs. -v2: Root blkgs need to be updated on elvswitch too and blkg_alloc() comment wasn't updated according to the function change. Fixed. Both pointed out by Vivek. -v3: v2 updated blkg_destroy_all() to invoke update_root_blkg_pd() for all policies. This freed root pd during elvswitch before the last queue finished exiting and led to oops. Directly invoke update_root_blkg_pd() only on BLKIO_POLICY_PROP from cfq_exit_queue(). This also is closer to what will be done with proper in-place blkg update. Reported by Vivek. Signed-off-by: Tejun Heo <tj@kernel.org> Cc: Vivek Goyal <vgoyal@redhat.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2012-03-06 05:15:20 +08:00
spin_lock(&blkio_list_lock);
spin_lock_irq(&blkcg->lock);
/*
* Note that stat reset is racy - it doesn't synchronize against
* stat updates. This is a debug feature which shouldn't exist
* anyway. If you get hit by a race, retry.
*/
hlist_for_each_entry(blkg, n, &blkcg->blkg_list, blkcg_node) {
blkcg: unify blkg's for blkcg policies Currently, blkg is per cgroup-queue-policy combination. This is unnatural and leads to various convolutions in partially used duplicate fields in blkg, config / stat access, and general management of blkgs. This patch make blkg's per cgroup-queue and let them serve all policies. blkgs are now created and destroyed by blkcg core proper. This will allow further consolidation of common management logic into blkcg core and API with better defined semantics and layering. As a transitional step to untangle blkg management, elvswitch and policy [de]registration, all blkgs except the root blkg are being shot down during elvswitch and bypass. This patch adds blkg_root_update() to update root blkg in place on policy change. This is hacky and racy but should be good enough as interim step until we get locking simplified and switch over to proper in-place update for all blkgs. -v2: Root blkgs need to be updated on elvswitch too and blkg_alloc() comment wasn't updated according to the function change. Fixed. Both pointed out by Vivek. -v3: v2 updated blkg_destroy_all() to invoke update_root_blkg_pd() for all policies. This freed root pd during elvswitch before the last queue finished exiting and led to oops. Directly invoke update_root_blkg_pd() only on BLKIO_POLICY_PROP from cfq_exit_queue(). This also is closer to what will be done with proper in-place blkg update. Reported by Vivek. Signed-off-by: Tejun Heo <tj@kernel.org> Cc: Vivek Goyal <vgoyal@redhat.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2012-03-06 05:15:20 +08:00
struct blkio_policy_type *pol;
blkcg: unify blkg's for blkcg policies Currently, blkg is per cgroup-queue-policy combination. This is unnatural and leads to various convolutions in partially used duplicate fields in blkg, config / stat access, and general management of blkgs. This patch make blkg's per cgroup-queue and let them serve all policies. blkgs are now created and destroyed by blkcg core proper. This will allow further consolidation of common management logic into blkcg core and API with better defined semantics and layering. As a transitional step to untangle blkg management, elvswitch and policy [de]registration, all blkgs except the root blkg are being shot down during elvswitch and bypass. This patch adds blkg_root_update() to update root blkg in place on policy change. This is hacky and racy but should be good enough as interim step until we get locking simplified and switch over to proper in-place update for all blkgs. -v2: Root blkgs need to be updated on elvswitch too and blkg_alloc() comment wasn't updated according to the function change. Fixed. Both pointed out by Vivek. -v3: v2 updated blkg_destroy_all() to invoke update_root_blkg_pd() for all policies. This freed root pd during elvswitch before the last queue finished exiting and led to oops. Directly invoke update_root_blkg_pd() only on BLKIO_POLICY_PROP from cfq_exit_queue(). This also is closer to what will be done with proper in-place blkg update. Reported by Vivek. Signed-off-by: Tejun Heo <tj@kernel.org> Cc: Vivek Goyal <vgoyal@redhat.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2012-03-06 05:15:20 +08:00
list_for_each_entry(pol, &blkio_list, list) {
struct blkg_policy_data *pd = blkg->pd[pol->plid];
struct blkio_group_stats *stats = &pd->stats;
/* queued stats shouldn't be cleared */
blkcg: introduce blkg_stat and blkg_rwstat blkcg uses u64_stats_sync to avoid reading wrong u64 statistic values on 32bit archs and some stat counters have subtypes to distinguish read/writes and sync/async IOs. The stat code paths are confusing and involve a lot of going back and forth between blkcg core and specific policy implementations, and synchronization and subtype handling are open coded in blkcg core. This patch introduces struct blkg_stat and blkg_rwstat which, with accompanying operations, encapsulate stat updating and accessing with proper synchronization. blkg_stat is simple u64 counter with 64bit read-access protection. blkg_rwstat is the one with rw and [a]sync subcounters and takes @rw flags to distinguish IO subtypes (%REQ_WRITE and %REQ_SYNC) and replaces stat_sub_type indexed arrays. All counters in blkio_group_stats and blkio_group_stats_cpu are replaced with either blkg_stat or blkg_rwstat along with all users. This does add one u64_stats_sync per counter and increase stats_sync operations but they're empty/noops on 64bit archs and blkcg doesn't have too many counters, especially with DEBUG_BLK_CGROUP off. While the currently resulting code isn't necessarily simpler at the moment, this will enable further clean up of blkcg stats code. - BLKIO_STAT_{READ|WRITE|SYNC|ASYNC|TOTAL} renamed to BLKG_RWSTAT_{READ|WRITE|SYNC|ASYNC|TOTAL}. - blkg_stat_add() replaces blkio_add_stat() and blkio_check_and_dec_stat(). Note that BUG_ON() on underflow in the latter function no longer exists. It's *way* better to have underflowed stat counters than oopsing. - blkio_group_stats->dequeue is now a proper u64 stat counter instead of ulong. - reset_stats() updated to clear each stat counters individually and BLKG_STATS_DEBUG_CLEAR_{START|SIZE} are removed. - Some functions reconstruct rw flags from direction and sync booleans. This will be removed by future patches. Signed-off-by: Tejun Heo <tj@kernel.org>
2012-04-02 05:38:42 +08:00
blkg_rwstat_reset(&stats->merged);
blkg_rwstat_reset(&stats->service_time);
blkg_rwstat_reset(&stats->wait_time);
blkg_stat_reset(&stats->time);
blkio: Add more debug-only per-cgroup stats 1) group_wait_time - This is the amount of time the cgroup had to wait to get a timeslice for one of its queues from when it became busy, i.e., went from 0 to 1 request queued. This is different from the io_wait_time which is the cumulative total of the amount of time spent by each IO in that cgroup waiting in the scheduler queue. This stat is a great way to find out any jobs in the fleet that are being starved or waiting for longer than what is expected (due to an IO controller bug or any other issue). 2) empty_time - This is the amount of time a cgroup spends w/o any pending requests. This stat is useful when a job does not seem to be able to use its assigned disk share by helping check if that is happening due to an IO controller bug or because the job is not submitting enough IOs. 3) idle_time - This is the amount of time spent by the IO scheduler idling for a given cgroup in anticipation of a better request than the exising ones from other queues/cgroups. All these stats are recorded using start and stop events. When reading these stats, we do not add the delta between the current time and the last start time if we're between the start and stop events. We avoid doing this to make sure that these numbers are always monotonically increasing when read. Since we're using sched_clock() which may use the tsc as its source, it may induce some inconsistency (due to tsc resync across cpus) if we included the current delta. Signed-off-by: Divyesh Shah<dpshah@google.com> Signed-off-by: Jens Axboe <jens.axboe@oracle.com>
2010-04-09 12:15:35 +08:00
#ifdef CONFIG_DEBUG_BLK_CGROUP
blkcg: introduce blkg_stat and blkg_rwstat blkcg uses u64_stats_sync to avoid reading wrong u64 statistic values on 32bit archs and some stat counters have subtypes to distinguish read/writes and sync/async IOs. The stat code paths are confusing and involve a lot of going back and forth between blkcg core and specific policy implementations, and synchronization and subtype handling are open coded in blkcg core. This patch introduces struct blkg_stat and blkg_rwstat which, with accompanying operations, encapsulate stat updating and accessing with proper synchronization. blkg_stat is simple u64 counter with 64bit read-access protection. blkg_rwstat is the one with rw and [a]sync subcounters and takes @rw flags to distinguish IO subtypes (%REQ_WRITE and %REQ_SYNC) and replaces stat_sub_type indexed arrays. All counters in blkio_group_stats and blkio_group_stats_cpu are replaced with either blkg_stat or blkg_rwstat along with all users. This does add one u64_stats_sync per counter and increase stats_sync operations but they're empty/noops on 64bit archs and blkcg doesn't have too many counters, especially with DEBUG_BLK_CGROUP off. While the currently resulting code isn't necessarily simpler at the moment, this will enable further clean up of blkcg stats code. - BLKIO_STAT_{READ|WRITE|SYNC|ASYNC|TOTAL} renamed to BLKG_RWSTAT_{READ|WRITE|SYNC|ASYNC|TOTAL}. - blkg_stat_add() replaces blkio_add_stat() and blkio_check_and_dec_stat(). Note that BUG_ON() on underflow in the latter function no longer exists. It's *way* better to have underflowed stat counters than oopsing. - blkio_group_stats->dequeue is now a proper u64 stat counter instead of ulong. - reset_stats() updated to clear each stat counters individually and BLKG_STATS_DEBUG_CLEAR_{START|SIZE} are removed. - Some functions reconstruct rw flags from direction and sync booleans. This will be removed by future patches. Signed-off-by: Tejun Heo <tj@kernel.org>
2012-04-02 05:38:42 +08:00
blkg_stat_reset(&stats->unaccounted_time);
blkg_stat_reset(&stats->avg_queue_size_sum);
blkg_stat_reset(&stats->avg_queue_size_samples);
blkg_stat_reset(&stats->dequeue);
blkg_stat_reset(&stats->group_wait_time);
blkg_stat_reset(&stats->idle_time);
blkg_stat_reset(&stats->empty_time);
blkio: Add more debug-only per-cgroup stats 1) group_wait_time - This is the amount of time the cgroup had to wait to get a timeslice for one of its queues from when it became busy, i.e., went from 0 to 1 request queued. This is different from the io_wait_time which is the cumulative total of the amount of time spent by each IO in that cgroup waiting in the scheduler queue. This stat is a great way to find out any jobs in the fleet that are being starved or waiting for longer than what is expected (due to an IO controller bug or any other issue). 2) empty_time - This is the amount of time a cgroup spends w/o any pending requests. This stat is useful when a job does not seem to be able to use its assigned disk share by helping check if that is happening due to an IO controller bug or because the job is not submitting enough IOs. 3) idle_time - This is the amount of time spent by the IO scheduler idling for a given cgroup in anticipation of a better request than the exising ones from other queues/cgroups. All these stats are recorded using start and stop events. When reading these stats, we do not add the delta between the current time and the last start time if we're between the start and stop events. We avoid doing this to make sure that these numbers are always monotonically increasing when read. Since we're using sched_clock() which may use the tsc as its source, it may induce some inconsistency (due to tsc resync across cpus) if we included the current delta. Signed-off-by: Divyesh Shah<dpshah@google.com> Signed-off-by: Jens Axboe <jens.axboe@oracle.com>
2010-04-09 12:15:35 +08:00
#endif
blkcg: unify blkg's for blkcg policies Currently, blkg is per cgroup-queue-policy combination. This is unnatural and leads to various convolutions in partially used duplicate fields in blkg, config / stat access, and general management of blkgs. This patch make blkg's per cgroup-queue and let them serve all policies. blkgs are now created and destroyed by blkcg core proper. This will allow further consolidation of common management logic into blkcg core and API with better defined semantics and layering. As a transitional step to untangle blkg management, elvswitch and policy [de]registration, all blkgs except the root blkg are being shot down during elvswitch and bypass. This patch adds blkg_root_update() to update root blkg in place on policy change. This is hacky and racy but should be good enough as interim step until we get locking simplified and switch over to proper in-place update for all blkgs. -v2: Root blkgs need to be updated on elvswitch too and blkg_alloc() comment wasn't updated according to the function change. Fixed. Both pointed out by Vivek. -v3: v2 updated blkg_destroy_all() to invoke update_root_blkg_pd() for all policies. This freed root pd during elvswitch before the last queue finished exiting and led to oops. Directly invoke update_root_blkg_pd() only on BLKIO_POLICY_PROP from cfq_exit_queue(). This also is closer to what will be done with proper in-place blkg update. Reported by Vivek. Signed-off-by: Tejun Heo <tj@kernel.org> Cc: Vivek Goyal <vgoyal@redhat.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2012-03-06 05:15:20 +08:00
blkio_reset_stats_cpu(blkg, pol->plid);
}
}
spin_unlock_irq(&blkcg->lock);
blkcg: unify blkg's for blkcg policies Currently, blkg is per cgroup-queue-policy combination. This is unnatural and leads to various convolutions in partially used duplicate fields in blkg, config / stat access, and general management of blkgs. This patch make blkg's per cgroup-queue and let them serve all policies. blkgs are now created and destroyed by blkcg core proper. This will allow further consolidation of common management logic into blkcg core and API with better defined semantics and layering. As a transitional step to untangle blkg management, elvswitch and policy [de]registration, all blkgs except the root blkg are being shot down during elvswitch and bypass. This patch adds blkg_root_update() to update root blkg in place on policy change. This is hacky and racy but should be good enough as interim step until we get locking simplified and switch over to proper in-place update for all blkgs. -v2: Root blkgs need to be updated on elvswitch too and blkg_alloc() comment wasn't updated according to the function change. Fixed. Both pointed out by Vivek. -v3: v2 updated blkg_destroy_all() to invoke update_root_blkg_pd() for all policies. This freed root pd during elvswitch before the last queue finished exiting and led to oops. Directly invoke update_root_blkg_pd() only on BLKIO_POLICY_PROP from cfq_exit_queue(). This also is closer to what will be done with proper in-place blkg update. Reported by Vivek. Signed-off-by: Tejun Heo <tj@kernel.org> Cc: Vivek Goyal <vgoyal@redhat.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2012-03-06 05:15:20 +08:00
spin_unlock(&blkio_list_lock);
return 0;
}
static const char *blkg_dev_name(struct blkio_group *blkg)
{
/* some drivers (floppy) instantiate a queue w/o disk registered */
if (blkg->q->backing_dev_info.dev)
return dev_name(blkg->q->backing_dev_info.dev);
return NULL;
}
/**
* blkcg_print_blkgs - helper for printing per-blkg data
* @sf: seq_file to print to
* @blkcg: blkcg of interest
* @prfill: fill function to print out a blkg
* @pol: policy in question
* @data: data to be passed to @prfill
* @show_total: to print out sum of prfill return values or not
*
* This function invokes @prfill on each blkg of @blkcg if pd for the
* policy specified by @pol exists. @prfill is invoked with @sf, the
* policy data and @data. If @show_total is %true, the sum of the return
* values from @prfill is printed with "Total" label at the end.
*
* This is to be used to construct print functions for
* cftype->read_seq_string method.
*/
static void blkcg_print_blkgs(struct seq_file *sf, struct blkio_cgroup *blkcg,
u64 (*prfill)(struct seq_file *,
struct blkg_policy_data *, int),
int pol, int data, bool show_total)
{
struct blkio_group *blkg;
struct hlist_node *n;
u64 total = 0;
spin_lock_irq(&blkcg->lock);
hlist_for_each_entry(blkg, n, &blkcg->blkg_list, blkcg_node)
if (blkg->pd[pol])
total += prfill(sf, blkg->pd[pol], data);
spin_unlock_irq(&blkcg->lock);
if (show_total)
seq_printf(sf, "Total %llu\n", (unsigned long long)total);
}
/**
* __blkg_prfill_u64 - prfill helper for a single u64 value
* @sf: seq_file to print to
* @pd: policy data of interest
* @v: value to print
*
* Print @v to @sf for the device assocaited with @pd.
*/
static u64 __blkg_prfill_u64(struct seq_file *sf, struct blkg_policy_data *pd,
u64 v)
{
const char *dname = blkg_dev_name(pd->blkg);
if (!dname)
return 0;
seq_printf(sf, "%s %llu\n", dname, (unsigned long long)v);
return v;
}
/**
* __blkg_prfill_rwstat - prfill helper for a blkg_rwstat
* @sf: seq_file to print to
* @pd: policy data of interest
* @rwstat: rwstat to print
*
* Print @rwstat to @sf for the device assocaited with @pd.
*/
static u64 __blkg_prfill_rwstat(struct seq_file *sf,
struct blkg_policy_data *pd,
const struct blkg_rwstat *rwstat)
{
static const char *rwstr[] = {
[BLKG_RWSTAT_READ] = "Read",
[BLKG_RWSTAT_WRITE] = "Write",
[BLKG_RWSTAT_SYNC] = "Sync",
[BLKG_RWSTAT_ASYNC] = "Async",
};
const char *dname = blkg_dev_name(pd->blkg);
u64 v;
int i;
if (!dname)
return 0;
for (i = 0; i < BLKG_RWSTAT_NR; i++)
seq_printf(sf, "%s %s %llu\n", dname, rwstr[i],
(unsigned long long)rwstat->cnt[i]);
v = rwstat->cnt[BLKG_RWSTAT_READ] + rwstat->cnt[BLKG_RWSTAT_WRITE];
seq_printf(sf, "%s Total %llu\n", dname, (unsigned long long)v);
return v;
}
static u64 blkg_prfill_stat(struct seq_file *sf, struct blkg_policy_data *pd,
int off)
{
return __blkg_prfill_u64(sf, pd,
blkg_stat_read((void *)&pd->stats + off));
}
static u64 blkg_prfill_rwstat(struct seq_file *sf, struct blkg_policy_data *pd,
int off)
{
struct blkg_rwstat rwstat = blkg_rwstat_read((void *)&pd->stats + off);
return __blkg_prfill_rwstat(sf, pd, &rwstat);
}
/* print blkg_stat specified by BLKCG_STAT_PRIV() */
static int blkcg_print_stat(struct cgroup *cgrp, struct cftype *cft,
struct seq_file *sf)
{
struct blkio_cgroup *blkcg = cgroup_to_blkio_cgroup(cgrp);
blkcg_print_blkgs(sf, blkcg, blkg_prfill_stat,
BLKCG_STAT_POL(cft->private),
BLKCG_STAT_OFF(cft->private), false);
return 0;
}
/* print blkg_rwstat specified by BLKCG_STAT_PRIV() */
static int blkcg_print_rwstat(struct cgroup *cgrp, struct cftype *cft,
struct seq_file *sf)
{
struct blkio_cgroup *blkcg = cgroup_to_blkio_cgroup(cgrp);
blkcg_print_blkgs(sf, blkcg, blkg_prfill_rwstat,
BLKCG_STAT_POL(cft->private),
BLKCG_STAT_OFF(cft->private), true);
return 0;
}
static u64 blkg_prfill_cpu_stat(struct seq_file *sf,
struct blkg_policy_data *pd, int off)
{
u64 v = 0;
int cpu;
blkcg: alloc per cpu stats from worker thread in a delayed manner Current per cpu stat allocation assumes GFP_KERNEL allocation flag. But in IO path there are times when we want GFP_NOIO semantics. As there is no way to pass the allocation flags to alloc_percpu(), this patch delays the allocation of stats using a worker thread. v2-> tejun suggested following changes. Changed the patch accordingly. - move alloc_node location in structure - reduce the size of names of some of the fields - Reduce the scope of locking of alloc_list_lock - Simplified stat_alloc_fn() by allocating stats for all policies in one go and then assigning these to a group. v3 -> Andrew suggested to put some comments in the code. Also raised concerns about trying to allocate infinitely in case of allocation failure. I have changed the logic to sleep for 10ms before retrying. That should take care of non-preemptible UP kernels. v4 -> Tejun had more suggestions. - drop list_for_each_entry_all() - instead of msleep() use queue_delayed_work() - Some cleanups realted to more compact coding. v5-> tejun suggested more cleanups leading to more compact code. tj: - Relocated pcpu_stats into blkio_stat_alloc_fn(). - Minor comment update. - This also fixes suspicious RCU usage warning caused by invoking cgroup_path() from blkg_alloc() without holding RCU read lock. Now that blkg_alloc() doesn't require sleepable context, RCU read lock from blkg_lookup_create() is maintained throughout blkg_alloc(). Signed-off-by: Vivek Goyal <vgoyal@redhat.com> Signed-off-by: Tejun Heo <tj@kernel.org> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2012-03-09 02:53:56 +08:00
for_each_possible_cpu(cpu) {
struct blkio_group_stats_cpu *sc =
blkcg: introduce blkg_stat and blkg_rwstat blkcg uses u64_stats_sync to avoid reading wrong u64 statistic values on 32bit archs and some stat counters have subtypes to distinguish read/writes and sync/async IOs. The stat code paths are confusing and involve a lot of going back and forth between blkcg core and specific policy implementations, and synchronization and subtype handling are open coded in blkcg core. This patch introduces struct blkg_stat and blkg_rwstat which, with accompanying operations, encapsulate stat updating and accessing with proper synchronization. blkg_stat is simple u64 counter with 64bit read-access protection. blkg_rwstat is the one with rw and [a]sync subcounters and takes @rw flags to distinguish IO subtypes (%REQ_WRITE and %REQ_SYNC) and replaces stat_sub_type indexed arrays. All counters in blkio_group_stats and blkio_group_stats_cpu are replaced with either blkg_stat or blkg_rwstat along with all users. This does add one u64_stats_sync per counter and increase stats_sync operations but they're empty/noops on 64bit archs and blkcg doesn't have too many counters, especially with DEBUG_BLK_CGROUP off. While the currently resulting code isn't necessarily simpler at the moment, this will enable further clean up of blkcg stats code. - BLKIO_STAT_{READ|WRITE|SYNC|ASYNC|TOTAL} renamed to BLKG_RWSTAT_{READ|WRITE|SYNC|ASYNC|TOTAL}. - blkg_stat_add() replaces blkio_add_stat() and blkio_check_and_dec_stat(). Note that BUG_ON() on underflow in the latter function no longer exists. It's *way* better to have underflowed stat counters than oopsing. - blkio_group_stats->dequeue is now a proper u64 stat counter instead of ulong. - reset_stats() updated to clear each stat counters individually and BLKG_STATS_DEBUG_CLEAR_{START|SIZE} are removed. - Some functions reconstruct rw flags from direction and sync booleans. This will be removed by future patches. Signed-off-by: Tejun Heo <tj@kernel.org>
2012-04-02 05:38:42 +08:00
per_cpu_ptr(pd->stats_cpu, cpu);
v += blkg_stat_read((void *)sc + off);
}
return __blkg_prfill_u64(sf, pd, v);
}
static u64 blkg_prfill_cpu_rwstat(struct seq_file *sf,
struct blkg_policy_data *pd, int off)
{
struct blkg_rwstat rwstat = { }, tmp;
int i, cpu;
for_each_possible_cpu(cpu) {
struct blkio_group_stats_cpu *sc =
per_cpu_ptr(pd->stats_cpu, cpu);
tmp = blkg_rwstat_read((void *)sc + off);
for (i = 0; i < BLKG_RWSTAT_NR; i++)
rwstat.cnt[i] += tmp.cnt[i];
}
return __blkg_prfill_rwstat(sf, pd, &rwstat);
}
/* print per-cpu blkg_stat specified by BLKCG_STAT_PRIV() */
static int blkcg_print_cpu_stat(struct cgroup *cgrp, struct cftype *cft,
struct seq_file *sf)
{
struct blkio_cgroup *blkcg = cgroup_to_blkio_cgroup(cgrp);
blkcg_print_blkgs(sf, blkcg, blkg_prfill_cpu_stat,
BLKCG_STAT_POL(cft->private),
BLKCG_STAT_OFF(cft->private), false);
return 0;
}
/* print per-cpu blkg_rwstat specified by BLKCG_STAT_PRIV() */
static int blkcg_print_cpu_rwstat(struct cgroup *cgrp, struct cftype *cft,
struct seq_file *sf)
{
struct blkio_cgroup *blkcg = cgroup_to_blkio_cgroup(cgrp);
blkcg: introduce blkg_stat and blkg_rwstat blkcg uses u64_stats_sync to avoid reading wrong u64 statistic values on 32bit archs and some stat counters have subtypes to distinguish read/writes and sync/async IOs. The stat code paths are confusing and involve a lot of going back and forth between blkcg core and specific policy implementations, and synchronization and subtype handling are open coded in blkcg core. This patch introduces struct blkg_stat and blkg_rwstat which, with accompanying operations, encapsulate stat updating and accessing with proper synchronization. blkg_stat is simple u64 counter with 64bit read-access protection. blkg_rwstat is the one with rw and [a]sync subcounters and takes @rw flags to distinguish IO subtypes (%REQ_WRITE and %REQ_SYNC) and replaces stat_sub_type indexed arrays. All counters in blkio_group_stats and blkio_group_stats_cpu are replaced with either blkg_stat or blkg_rwstat along with all users. This does add one u64_stats_sync per counter and increase stats_sync operations but they're empty/noops on 64bit archs and blkcg doesn't have too many counters, especially with DEBUG_BLK_CGROUP off. While the currently resulting code isn't necessarily simpler at the moment, this will enable further clean up of blkcg stats code. - BLKIO_STAT_{READ|WRITE|SYNC|ASYNC|TOTAL} renamed to BLKG_RWSTAT_{READ|WRITE|SYNC|ASYNC|TOTAL}. - blkg_stat_add() replaces blkio_add_stat() and blkio_check_and_dec_stat(). Note that BUG_ON() on underflow in the latter function no longer exists. It's *way* better to have underflowed stat counters than oopsing. - blkio_group_stats->dequeue is now a proper u64 stat counter instead of ulong. - reset_stats() updated to clear each stat counters individually and BLKG_STATS_DEBUG_CLEAR_{START|SIZE} are removed. - Some functions reconstruct rw flags from direction and sync booleans. This will be removed by future patches. Signed-off-by: Tejun Heo <tj@kernel.org>
2012-04-02 05:38:42 +08:00
blkcg_print_blkgs(sf, blkcg, blkg_prfill_cpu_rwstat,
BLKCG_STAT_POL(cft->private),
BLKCG_STAT_OFF(cft->private), true);
return 0;
}
#ifdef CONFIG_DEBUG_BLK_CGROUP
static u64 blkg_prfill_avg_queue_size(struct seq_file *sf,
struct blkg_policy_data *pd, int off)
{
u64 samples = blkg_stat_read(&pd->stats.avg_queue_size_samples);
u64 v = 0;
if (samples) {
v = blkg_stat_read(&pd->stats.avg_queue_size_sum);
do_div(v, samples);
blkcg: introduce blkg_stat and blkg_rwstat blkcg uses u64_stats_sync to avoid reading wrong u64 statistic values on 32bit archs and some stat counters have subtypes to distinguish read/writes and sync/async IOs. The stat code paths are confusing and involve a lot of going back and forth between blkcg core and specific policy implementations, and synchronization and subtype handling are open coded in blkcg core. This patch introduces struct blkg_stat and blkg_rwstat which, with accompanying operations, encapsulate stat updating and accessing with proper synchronization. blkg_stat is simple u64 counter with 64bit read-access protection. blkg_rwstat is the one with rw and [a]sync subcounters and takes @rw flags to distinguish IO subtypes (%REQ_WRITE and %REQ_SYNC) and replaces stat_sub_type indexed arrays. All counters in blkio_group_stats and blkio_group_stats_cpu are replaced with either blkg_stat or blkg_rwstat along with all users. This does add one u64_stats_sync per counter and increase stats_sync operations but they're empty/noops on 64bit archs and blkcg doesn't have too many counters, especially with DEBUG_BLK_CGROUP off. While the currently resulting code isn't necessarily simpler at the moment, this will enable further clean up of blkcg stats code. - BLKIO_STAT_{READ|WRITE|SYNC|ASYNC|TOTAL} renamed to BLKG_RWSTAT_{READ|WRITE|SYNC|ASYNC|TOTAL}. - blkg_stat_add() replaces blkio_add_stat() and blkio_check_and_dec_stat(). Note that BUG_ON() on underflow in the latter function no longer exists. It's *way* better to have underflowed stat counters than oopsing. - blkio_group_stats->dequeue is now a proper u64 stat counter instead of ulong. - reset_stats() updated to clear each stat counters individually and BLKG_STATS_DEBUG_CLEAR_{START|SIZE} are removed. - Some functions reconstruct rw flags from direction and sync booleans. This will be removed by future patches. Signed-off-by: Tejun Heo <tj@kernel.org>
2012-04-02 05:38:42 +08:00
}
__blkg_prfill_u64(sf, pd, v);
return 0;
}
/* print avg_queue_size */
static int blkcg_print_avg_queue_size(struct cgroup *cgrp, struct cftype *cft,
struct seq_file *sf)
{
struct blkio_cgroup *blkcg = cgroup_to_blkio_cgroup(cgrp);
blkcg_print_blkgs(sf, blkcg, blkg_prfill_avg_queue_size,
BLKIO_POLICY_PROP, 0, false);
return 0;
}
#endif /* CONFIG_DEBUG_BLK_CGROUP */
struct blkg_conf_ctx {
struct gendisk *disk;
struct blkio_group *blkg;
u64 v;
};
/**
* blkg_conf_prep - parse and prepare for per-blkg config update
* @blkcg: target block cgroup
* @input: input string
* @ctx: blkg_conf_ctx to be filled
*
* Parse per-blkg config update from @input and initialize @ctx with the
* result. @ctx->blkg points to the blkg to be updated and @ctx->v the new
* value. This function returns with RCU read locked and must be paired
* with blkg_conf_finish().
*/
static int blkg_conf_prep(struct blkio_cgroup *blkcg, const char *input,
struct blkg_conf_ctx *ctx)
__acquires(rcu)
{
struct gendisk *disk;
struct blkio_group *blkg;
unsigned int major, minor;
unsigned long long v;
int part, ret;
if (sscanf(input, "%u:%u %llu", &major, &minor, &v) != 3)
return -EINVAL;
disk = get_gendisk(MKDEV(major, minor), &part);
if (!disk || part)
return -EINVAL;
blkcg: don't allow or retain configuration of missing devices blkcg is very peculiar in that it allows setting and remembering configurations for non-existent devices by maintaining separate data structures for configuration. This behavior is completely out of the usual norms and outright confusing; furthermore, it uses dev_t number to match the configuration to devices, which is unpredictable to begin with and becomes completely unuseable if EXT_DEVT is fully used. It is wholely unnecessary - we already have fully functional userland mechanism to program devices being hotplugged which has full access to device identification, connection topology and filesystem information. Add a new struct blkio_group_conf which contains all blkcg configurations to blkio_group and let blkio_group, which can be created iff the associated device exists and is removed when the associated device goes away, carry all configurations. Note that, after this patch, all newly created blkg's will always have the default configuration (unlimited for throttling and blkcg's weight for propio). This patch makes blkio_policy_node meaningless but doesn't remove it. The next patch will. -v2: Updated to retry after short sleep if blkg lookup/creation failed due to the queue being temporarily bypassed as indicated by -EBUSY return. Pointed out by Vivek. Signed-off-by: Tejun Heo <tj@kernel.org> Cc: Vivek Goyal <vgoyal@redhat.com> Cc: Kay Sievers <kay.sievers@vrfy.org> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2012-03-06 05:15:07 +08:00
rcu_read_lock();
spin_lock_irq(disk->queue->queue_lock);
blkg = blkg_lookup_create(blkcg, disk->queue, false);
spin_unlock_irq(disk->queue->queue_lock);
blkcg: don't allow or retain configuration of missing devices blkcg is very peculiar in that it allows setting and remembering configurations for non-existent devices by maintaining separate data structures for configuration. This behavior is completely out of the usual norms and outright confusing; furthermore, it uses dev_t number to match the configuration to devices, which is unpredictable to begin with and becomes completely unuseable if EXT_DEVT is fully used. It is wholely unnecessary - we already have fully functional userland mechanism to program devices being hotplugged which has full access to device identification, connection topology and filesystem information. Add a new struct blkio_group_conf which contains all blkcg configurations to blkio_group and let blkio_group, which can be created iff the associated device exists and is removed when the associated device goes away, carry all configurations. Note that, after this patch, all newly created blkg's will always have the default configuration (unlimited for throttling and blkcg's weight for propio). This patch makes blkio_policy_node meaningless but doesn't remove it. The next patch will. -v2: Updated to retry after short sleep if blkg lookup/creation failed due to the queue being temporarily bypassed as indicated by -EBUSY return. Pointed out by Vivek. Signed-off-by: Tejun Heo <tj@kernel.org> Cc: Vivek Goyal <vgoyal@redhat.com> Cc: Kay Sievers <kay.sievers@vrfy.org> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2012-03-06 05:15:07 +08:00
if (IS_ERR(blkg)) {
ret = PTR_ERR(blkg);
rcu_read_unlock();
put_disk(disk);
/*
* If queue was bypassing, we should retry. Do so after a
* short msleep(). It isn't strictly necessary but queue
* can be bypassing for some time and it's always nice to
* avoid busy looping.
*/
if (ret == -EBUSY) {
msleep(10);
ret = restart_syscall();
}
return ret;
}
ctx->disk = disk;
ctx->blkg = blkg;
ctx->v = v;
return 0;
}
/**
* blkg_conf_finish - finish up per-blkg config update
* @ctx: blkg_conf_ctx intiailized by blkg_conf_prep()
*
* Finish up after per-blkg config update. This function must be paired
* with blkg_conf_prep().
*/
static void blkg_conf_finish(struct blkg_conf_ctx *ctx)
__releases(rcu)
{
rcu_read_unlock();
put_disk(ctx->disk);
}
/* for propio conf */
static u64 blkg_prfill_weight_device(struct seq_file *sf,
struct blkg_policy_data *pd, int off)
{
if (!pd->conf.weight)
return 0;
return __blkg_prfill_u64(sf, pd, pd->conf.weight);
}
static int blkcg_print_weight_device(struct cgroup *cgrp, struct cftype *cft,
struct seq_file *sf)
{
blkcg_print_blkgs(sf, cgroup_to_blkio_cgroup(cgrp),
blkg_prfill_weight_device, BLKIO_POLICY_PROP, 0,
false);
return 0;
}
static int blkcg_print_weight(struct cgroup *cgrp, struct cftype *cft,
struct seq_file *sf)
{
seq_printf(sf, "%u\n", cgroup_to_blkio_cgroup(cgrp)->weight);
return 0;
}
static int blkcg_set_weight_device(struct cgroup *cgrp, struct cftype *cft,
const char *buf)
{
struct blkio_cgroup *blkcg = cgroup_to_blkio_cgroup(cgrp);
struct blkg_policy_data *pd;
struct blkg_conf_ctx ctx;
int ret;
ret = blkg_conf_prep(blkcg, buf, &ctx);
if (ret)
return ret;
ret = -EINVAL;
pd = ctx.blkg->pd[BLKIO_POLICY_PROP];
if (pd && (!ctx.v || (ctx.v >= BLKIO_WEIGHT_MIN &&
ctx.v <= BLKIO_WEIGHT_MAX))) {
pd->conf.weight = ctx.v;
blkio_update_group_weight(ctx.blkg, BLKIO_POLICY_PROP,
ctx.v ?: blkcg->weight);
ret = 0;
}
blkg_conf_finish(&ctx);
return ret;
}
static int blkcg_set_weight(struct cgroup *cgrp, struct cftype *cft, u64 val)
{
struct blkio_cgroup *blkcg = cgroup_to_blkio_cgroup(cgrp);
struct blkio_group *blkg;
struct hlist_node *n;
if (val < BLKIO_WEIGHT_MIN || val > BLKIO_WEIGHT_MAX)
return -EINVAL;
spin_lock(&blkio_list_lock);
spin_lock_irq(&blkcg->lock);
blkcg->weight = (unsigned int)val;
hlist_for_each_entry(blkg, n, &blkcg->blkg_list, blkcg_node) {
struct blkg_policy_data *pd = blkg->pd[BLKIO_POLICY_PROP];
if (pd && !pd->conf.weight)
blkio_update_group_weight(blkg, BLKIO_POLICY_PROP,
blkcg->weight);
}
spin_unlock_irq(&blkcg->lock);
spin_unlock(&blkio_list_lock);
return 0;
}
/* for blk-throttle conf */
#ifdef CONFIG_BLK_DEV_THROTTLING
static u64 blkg_prfill_conf_u64(struct seq_file *sf,
struct blkg_policy_data *pd, int off)
{
u64 v = *(u64 *)((void *)&pd->conf + off);
if (!v)
return 0;
return __blkg_prfill_u64(sf, pd, v);
}
static int blkcg_print_conf_u64(struct cgroup *cgrp, struct cftype *cft,
struct seq_file *sf)
{
blkcg_print_blkgs(sf, cgroup_to_blkio_cgroup(cgrp),
blkg_prfill_conf_u64, BLKIO_POLICY_THROTL,
cft->private, false);
return 0;
}
static int blkcg_set_conf_u64(struct cgroup *cgrp, struct cftype *cft,
const char *buf, int rw,
void (*update)(struct blkio_group *, int, u64, int))
{
struct blkio_cgroup *blkcg = cgroup_to_blkio_cgroup(cgrp);
struct blkg_policy_data *pd;
struct blkg_conf_ctx ctx;
int ret;
ret = blkg_conf_prep(blkcg, buf, &ctx);
if (ret)
return ret;
ret = -EINVAL;
pd = ctx.blkg->pd[BLKIO_POLICY_THROTL];
if (pd) {
*(u64 *)((void *)&pd->conf + cft->private) = ctx.v;
update(ctx.blkg, BLKIO_POLICY_THROTL, ctx.v ?: -1, rw);
ret = 0;
}
blkg_conf_finish(&ctx);
return ret;
}
static int blkcg_set_conf_bps_r(struct cgroup *cgrp, struct cftype *cft,
const char *buf)
{
return blkcg_set_conf_u64(cgrp, cft, buf, READ, blkio_update_group_bps);
}
static int blkcg_set_conf_bps_w(struct cgroup *cgrp, struct cftype *cft,
const char *buf)
{
return blkcg_set_conf_u64(cgrp, cft, buf, WRITE, blkio_update_group_bps);
}
static int blkcg_set_conf_iops_r(struct cgroup *cgrp, struct cftype *cft,
const char *buf)
{
return blkcg_set_conf_u64(cgrp, cft, buf, READ, blkio_update_group_iops);
}
static int blkcg_set_conf_iops_w(struct cgroup *cgrp, struct cftype *cft,
const char *buf)
{
return blkcg_set_conf_u64(cgrp, cft, buf, WRITE, blkio_update_group_iops);
}
#endif
struct cftype blkio_files[] = {
{
.name = "weight_device",
.read_seq_string = blkcg_print_weight_device,
.write_string = blkcg_set_weight_device,
.max_write_len = 256,
},
{
.name = "weight",
.read_seq_string = blkcg_print_weight,
.write_u64 = blkcg_set_weight,
},
{
.name = "time",
.private = BLKCG_STAT_PRIV(BLKIO_POLICY_PROP,
offsetof(struct blkio_group_stats, time)),
.read_seq_string = blkcg_print_stat,
},
{
.name = "sectors",
.private = BLKCG_STAT_PRIV(BLKIO_POLICY_PROP,
offsetof(struct blkio_group_stats_cpu, sectors)),
.read_seq_string = blkcg_print_cpu_stat,
},
{
.name = "io_service_bytes",
.private = BLKCG_STAT_PRIV(BLKIO_POLICY_PROP,
offsetof(struct blkio_group_stats_cpu, service_bytes)),
.read_seq_string = blkcg_print_cpu_rwstat,
},
{
.name = "io_serviced",
.private = BLKCG_STAT_PRIV(BLKIO_POLICY_PROP,
offsetof(struct blkio_group_stats_cpu, serviced)),
.read_seq_string = blkcg_print_cpu_rwstat,
},
{
.name = "io_service_time",
.private = BLKCG_STAT_PRIV(BLKIO_POLICY_PROP,
offsetof(struct blkio_group_stats, service_time)),
.read_seq_string = blkcg_print_rwstat,
},
{
.name = "io_wait_time",
.private = BLKCG_STAT_PRIV(BLKIO_POLICY_PROP,
offsetof(struct blkio_group_stats, wait_time)),
.read_seq_string = blkcg_print_rwstat,
2010-04-09 14:31:19 +08:00
},
{
.name = "io_merged",
.private = BLKCG_STAT_PRIV(BLKIO_POLICY_PROP,
offsetof(struct blkio_group_stats, merged)),
.read_seq_string = blkcg_print_rwstat,
},
{
.name = "io_queued",
.private = BLKCG_STAT_PRIV(BLKIO_POLICY_PROP,
offsetof(struct blkio_group_stats, queued)),
.read_seq_string = blkcg_print_rwstat,
},
2010-04-09 14:31:19 +08:00
{
.name = "reset_stats",
.write_u64 = blkiocg_reset_stats,
},
#ifdef CONFIG_BLK_DEV_THROTTLING
{
.name = "throttle.read_bps_device",
.private = offsetof(struct blkio_group_conf, bps[READ]),
.read_seq_string = blkcg_print_conf_u64,
.write_string = blkcg_set_conf_bps_r,
.max_write_len = 256,
},
{
.name = "throttle.write_bps_device",
.private = offsetof(struct blkio_group_conf, bps[WRITE]),
.read_seq_string = blkcg_print_conf_u64,
.write_string = blkcg_set_conf_bps_w,
.max_write_len = 256,
},
{
.name = "throttle.read_iops_device",
.private = offsetof(struct blkio_group_conf, iops[READ]),
.read_seq_string = blkcg_print_conf_u64,
.write_string = blkcg_set_conf_iops_r,
.max_write_len = 256,
},
{
.name = "throttle.write_iops_device",
.private = offsetof(struct blkio_group_conf, iops[WRITE]),
.read_seq_string = blkcg_print_conf_u64,
.write_string = blkcg_set_conf_iops_w,
.max_write_len = 256,
},
{
.name = "throttle.io_service_bytes",
.private = BLKCG_STAT_PRIV(BLKIO_POLICY_THROTL,
offsetof(struct blkio_group_stats_cpu, service_bytes)),
.read_seq_string = blkcg_print_cpu_rwstat,
},
{
.name = "throttle.io_serviced",
.private = BLKCG_STAT_PRIV(BLKIO_POLICY_THROTL,
offsetof(struct blkio_group_stats_cpu, serviced)),
.read_seq_string = blkcg_print_cpu_rwstat,
},
#endif /* CONFIG_BLK_DEV_THROTTLING */
#ifdef CONFIG_DEBUG_BLK_CGROUP
{
.name = "avg_queue_size",
.read_seq_string = blkcg_print_avg_queue_size,
},
blkio: Add more debug-only per-cgroup stats 1) group_wait_time - This is the amount of time the cgroup had to wait to get a timeslice for one of its queues from when it became busy, i.e., went from 0 to 1 request queued. This is different from the io_wait_time which is the cumulative total of the amount of time spent by each IO in that cgroup waiting in the scheduler queue. This stat is a great way to find out any jobs in the fleet that are being starved or waiting for longer than what is expected (due to an IO controller bug or any other issue). 2) empty_time - This is the amount of time a cgroup spends w/o any pending requests. This stat is useful when a job does not seem to be able to use its assigned disk share by helping check if that is happening due to an IO controller bug or because the job is not submitting enough IOs. 3) idle_time - This is the amount of time spent by the IO scheduler idling for a given cgroup in anticipation of a better request than the exising ones from other queues/cgroups. All these stats are recorded using start and stop events. When reading these stats, we do not add the delta between the current time and the last start time if we're between the start and stop events. We avoid doing this to make sure that these numbers are always monotonically increasing when read. Since we're using sched_clock() which may use the tsc as its source, it may induce some inconsistency (due to tsc resync across cpus) if we included the current delta. Signed-off-by: Divyesh Shah<dpshah@google.com> Signed-off-by: Jens Axboe <jens.axboe@oracle.com>
2010-04-09 12:15:35 +08:00
{
.name = "group_wait_time",
.private = BLKCG_STAT_PRIV(BLKIO_POLICY_PROP,
offsetof(struct blkio_group_stats, group_wait_time)),
.read_seq_string = blkcg_print_stat,
blkio: Add more debug-only per-cgroup stats 1) group_wait_time - This is the amount of time the cgroup had to wait to get a timeslice for one of its queues from when it became busy, i.e., went from 0 to 1 request queued. This is different from the io_wait_time which is the cumulative total of the amount of time spent by each IO in that cgroup waiting in the scheduler queue. This stat is a great way to find out any jobs in the fleet that are being starved or waiting for longer than what is expected (due to an IO controller bug or any other issue). 2) empty_time - This is the amount of time a cgroup spends w/o any pending requests. This stat is useful when a job does not seem to be able to use its assigned disk share by helping check if that is happening due to an IO controller bug or because the job is not submitting enough IOs. 3) idle_time - This is the amount of time spent by the IO scheduler idling for a given cgroup in anticipation of a better request than the exising ones from other queues/cgroups. All these stats are recorded using start and stop events. When reading these stats, we do not add the delta between the current time and the last start time if we're between the start and stop events. We avoid doing this to make sure that these numbers are always monotonically increasing when read. Since we're using sched_clock() which may use the tsc as its source, it may induce some inconsistency (due to tsc resync across cpus) if we included the current delta. Signed-off-by: Divyesh Shah<dpshah@google.com> Signed-off-by: Jens Axboe <jens.axboe@oracle.com>
2010-04-09 12:15:35 +08:00
},
{
.name = "idle_time",
.private = BLKCG_STAT_PRIV(BLKIO_POLICY_PROP,
offsetof(struct blkio_group_stats, idle_time)),
.read_seq_string = blkcg_print_stat,
blkio: Add more debug-only per-cgroup stats 1) group_wait_time - This is the amount of time the cgroup had to wait to get a timeslice for one of its queues from when it became busy, i.e., went from 0 to 1 request queued. This is different from the io_wait_time which is the cumulative total of the amount of time spent by each IO in that cgroup waiting in the scheduler queue. This stat is a great way to find out any jobs in the fleet that are being starved or waiting for longer than what is expected (due to an IO controller bug or any other issue). 2) empty_time - This is the amount of time a cgroup spends w/o any pending requests. This stat is useful when a job does not seem to be able to use its assigned disk share by helping check if that is happening due to an IO controller bug or because the job is not submitting enough IOs. 3) idle_time - This is the amount of time spent by the IO scheduler idling for a given cgroup in anticipation of a better request than the exising ones from other queues/cgroups. All these stats are recorded using start and stop events. When reading these stats, we do not add the delta between the current time and the last start time if we're between the start and stop events. We avoid doing this to make sure that these numbers are always monotonically increasing when read. Since we're using sched_clock() which may use the tsc as its source, it may induce some inconsistency (due to tsc resync across cpus) if we included the current delta. Signed-off-by: Divyesh Shah<dpshah@google.com> Signed-off-by: Jens Axboe <jens.axboe@oracle.com>
2010-04-09 12:15:35 +08:00
},
{
.name = "empty_time",
.private = BLKCG_STAT_PRIV(BLKIO_POLICY_PROP,
offsetof(struct blkio_group_stats, empty_time)),
.read_seq_string = blkcg_print_stat,
blkio: Add more debug-only per-cgroup stats 1) group_wait_time - This is the amount of time the cgroup had to wait to get a timeslice for one of its queues from when it became busy, i.e., went from 0 to 1 request queued. This is different from the io_wait_time which is the cumulative total of the amount of time spent by each IO in that cgroup waiting in the scheduler queue. This stat is a great way to find out any jobs in the fleet that are being starved or waiting for longer than what is expected (due to an IO controller bug or any other issue). 2) empty_time - This is the amount of time a cgroup spends w/o any pending requests. This stat is useful when a job does not seem to be able to use its assigned disk share by helping check if that is happening due to an IO controller bug or because the job is not submitting enough IOs. 3) idle_time - This is the amount of time spent by the IO scheduler idling for a given cgroup in anticipation of a better request than the exising ones from other queues/cgroups. All these stats are recorded using start and stop events. When reading these stats, we do not add the delta between the current time and the last start time if we're between the start and stop events. We avoid doing this to make sure that these numbers are always monotonically increasing when read. Since we're using sched_clock() which may use the tsc as its source, it may induce some inconsistency (due to tsc resync across cpus) if we included the current delta. Signed-off-by: Divyesh Shah<dpshah@google.com> Signed-off-by: Jens Axboe <jens.axboe@oracle.com>
2010-04-09 12:15:35 +08:00
},
{
.name = "dequeue",
.private = BLKCG_STAT_PRIV(BLKIO_POLICY_PROP,
offsetof(struct blkio_group_stats, dequeue)),
.read_seq_string = blkcg_print_stat,
},
{
.name = "unaccounted_time",
.private = BLKCG_STAT_PRIV(BLKIO_POLICY_PROP,
offsetof(struct blkio_group_stats, unaccounted_time)),
.read_seq_string = blkcg_print_stat,
},
#endif
{ } /* terminate */
};
/**
* blkiocg_pre_destroy - cgroup pre_destroy callback
* @cgroup: cgroup of interest
*
* This function is called when @cgroup is about to go away and responsible
* for shooting down all blkgs associated with @cgroup. blkgs should be
* removed while holding both q and blkcg locks. As blkcg lock is nested
* inside q lock, this function performs reverse double lock dancing.
*
* This is the blkcg counterpart of ioc_release_fn().
*/
static int blkiocg_pre_destroy(struct cgroup *cgroup)
{
struct blkio_cgroup *blkcg = cgroup_to_blkio_cgroup(cgroup);
spin_lock_irq(&blkcg->lock);
while (!hlist_empty(&blkcg->blkg_list)) {
struct blkio_group *blkg = hlist_entry(blkcg->blkg_list.first,
struct blkio_group, blkcg_node);
struct request_queue *q = blkg->q;
if (spin_trylock(q->queue_lock)) {
blkg_destroy(blkg);
spin_unlock(q->queue_lock);
} else {
spin_unlock_irq(&blkcg->lock);
cpu_relax();
spin_lock_irq(&blkcg->lock);
}
}
spin_unlock_irq(&blkcg->lock);
return 0;
}
static void blkiocg_destroy(struct cgroup *cgroup)
{
struct blkio_cgroup *blkcg = cgroup_to_blkio_cgroup(cgroup);
if (blkcg != &blkio_root_cgroup)
kfree(blkcg);
}
static struct cgroup_subsys_state *blkiocg_create(struct cgroup *cgroup)
{
static atomic64_t id_seq = ATOMIC64_INIT(0);
struct blkio_cgroup *blkcg;
struct cgroup *parent = cgroup->parent;
if (!parent) {
blkcg = &blkio_root_cgroup;
goto done;
}
blkcg = kzalloc(sizeof(*blkcg), GFP_KERNEL);
if (!blkcg)
return ERR_PTR(-ENOMEM);
blkcg->weight = BLKIO_WEIGHT_DEFAULT;
blkcg->id = atomic64_inc_return(&id_seq); /* root is 0, start from 1 */
done:
spin_lock_init(&blkcg->lock);
INIT_HLIST_HEAD(&blkcg->blkg_list);
return &blkcg->css;
}
/**
* blkcg_init_queue - initialize blkcg part of request queue
* @q: request_queue to initialize
*
* Called from blk_alloc_queue_node(). Responsible for initializing blkcg
* part of new request_queue @q.
*
* RETURNS:
* 0 on success, -errno on failure.
*/
int blkcg_init_queue(struct request_queue *q)
{
int ret;
might_sleep();
ret = blk_throtl_init(q);
if (ret)
return ret;
mutex_lock(&all_q_mutex);
INIT_LIST_HEAD(&q->all_q_node);
list_add_tail(&q->all_q_node, &all_q_list);
mutex_unlock(&all_q_mutex);
return 0;
}
/**
* blkcg_drain_queue - drain blkcg part of request_queue
* @q: request_queue to drain
*
* Called from blk_drain_queue(). Responsible for draining blkcg part.
*/
void blkcg_drain_queue(struct request_queue *q)
{
lockdep_assert_held(q->queue_lock);
blk_throtl_drain(q);
}
/**
* blkcg_exit_queue - exit and release blkcg part of request_queue
* @q: request_queue being released
*
* Called from blk_release_queue(). Responsible for exiting blkcg part.
*/
void blkcg_exit_queue(struct request_queue *q)
{
mutex_lock(&all_q_mutex);
list_del_init(&q->all_q_node);
mutex_unlock(&all_q_mutex);
blkcg: unify blkg's for blkcg policies Currently, blkg is per cgroup-queue-policy combination. This is unnatural and leads to various convolutions in partially used duplicate fields in blkg, config / stat access, and general management of blkgs. This patch make blkg's per cgroup-queue and let them serve all policies. blkgs are now created and destroyed by blkcg core proper. This will allow further consolidation of common management logic into blkcg core and API with better defined semantics and layering. As a transitional step to untangle blkg management, elvswitch and policy [de]registration, all blkgs except the root blkg are being shot down during elvswitch and bypass. This patch adds blkg_root_update() to update root blkg in place on policy change. This is hacky and racy but should be good enough as interim step until we get locking simplified and switch over to proper in-place update for all blkgs. -v2: Root blkgs need to be updated on elvswitch too and blkg_alloc() comment wasn't updated according to the function change. Fixed. Both pointed out by Vivek. -v3: v2 updated blkg_destroy_all() to invoke update_root_blkg_pd() for all policies. This freed root pd during elvswitch before the last queue finished exiting and led to oops. Directly invoke update_root_blkg_pd() only on BLKIO_POLICY_PROP from cfq_exit_queue(). This also is closer to what will be done with proper in-place blkg update. Reported by Vivek. Signed-off-by: Tejun Heo <tj@kernel.org> Cc: Vivek Goyal <vgoyal@redhat.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2012-03-06 05:15:20 +08:00
blkg_destroy_all(q, true);
blk_throtl_exit(q);
}
/*
* We cannot support shared io contexts, as we have no mean to support
* two tasks with the same ioc in two different groups without major rework
* of the main cic data structures. For now we allow a task to change
* its cgroup only if it's the only owner of its ioc.
*/
static int blkiocg_can_attach(struct cgroup *cgrp, struct cgroup_taskset *tset)
{
struct task_struct *task;
struct io_context *ioc;
int ret = 0;
/* task_lock() is needed to avoid races with exit_io_context() */
cgroup_taskset_for_each(task, cgrp, tset) {
task_lock(task);
ioc = task->io_context;
if (ioc && atomic_read(&ioc->nr_tasks) > 1)
ret = -EINVAL;
task_unlock(task);
if (ret)
break;
}
return ret;
}
static void blkcg_bypass_start(void)
__acquires(&all_q_mutex)
{
struct request_queue *q;
mutex_lock(&all_q_mutex);
list_for_each_entry(q, &all_q_list, all_q_node) {
blk_queue_bypass_start(q);
blkcg: unify blkg's for blkcg policies Currently, blkg is per cgroup-queue-policy combination. This is unnatural and leads to various convolutions in partially used duplicate fields in blkg, config / stat access, and general management of blkgs. This patch make blkg's per cgroup-queue and let them serve all policies. blkgs are now created and destroyed by blkcg core proper. This will allow further consolidation of common management logic into blkcg core and API with better defined semantics and layering. As a transitional step to untangle blkg management, elvswitch and policy [de]registration, all blkgs except the root blkg are being shot down during elvswitch and bypass. This patch adds blkg_root_update() to update root blkg in place on policy change. This is hacky and racy but should be good enough as interim step until we get locking simplified and switch over to proper in-place update for all blkgs. -v2: Root blkgs need to be updated on elvswitch too and blkg_alloc() comment wasn't updated according to the function change. Fixed. Both pointed out by Vivek. -v3: v2 updated blkg_destroy_all() to invoke update_root_blkg_pd() for all policies. This freed root pd during elvswitch before the last queue finished exiting and led to oops. Directly invoke update_root_blkg_pd() only on BLKIO_POLICY_PROP from cfq_exit_queue(). This also is closer to what will be done with proper in-place blkg update. Reported by Vivek. Signed-off-by: Tejun Heo <tj@kernel.org> Cc: Vivek Goyal <vgoyal@redhat.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2012-03-06 05:15:20 +08:00
blkg_destroy_all(q, false);
}
}
static void blkcg_bypass_end(void)
__releases(&all_q_mutex)
{
struct request_queue *q;
list_for_each_entry(q, &all_q_list, all_q_node)
blk_queue_bypass_end(q);
mutex_unlock(&all_q_mutex);
}
struct cgroup_subsys blkio_subsys = {
.name = "blkio",
.create = blkiocg_create,
.can_attach = blkiocg_can_attach,
.pre_destroy = blkiocg_pre_destroy,
.destroy = blkiocg_destroy,
.subsys_id = blkio_subsys_id,
.base_cftypes = blkio_files,
.module = THIS_MODULE,
};
EXPORT_SYMBOL_GPL(blkio_subsys);
void blkio_policy_register(struct blkio_policy_type *blkiop)
{
blkcg: unify blkg's for blkcg policies Currently, blkg is per cgroup-queue-policy combination. This is unnatural and leads to various convolutions in partially used duplicate fields in blkg, config / stat access, and general management of blkgs. This patch make blkg's per cgroup-queue and let them serve all policies. blkgs are now created and destroyed by blkcg core proper. This will allow further consolidation of common management logic into blkcg core and API with better defined semantics and layering. As a transitional step to untangle blkg management, elvswitch and policy [de]registration, all blkgs except the root blkg are being shot down during elvswitch and bypass. This patch adds blkg_root_update() to update root blkg in place on policy change. This is hacky and racy but should be good enough as interim step until we get locking simplified and switch over to proper in-place update for all blkgs. -v2: Root blkgs need to be updated on elvswitch too and blkg_alloc() comment wasn't updated according to the function change. Fixed. Both pointed out by Vivek. -v3: v2 updated blkg_destroy_all() to invoke update_root_blkg_pd() for all policies. This freed root pd during elvswitch before the last queue finished exiting and led to oops. Directly invoke update_root_blkg_pd() only on BLKIO_POLICY_PROP from cfq_exit_queue(). This also is closer to what will be done with proper in-place blkg update. Reported by Vivek. Signed-off-by: Tejun Heo <tj@kernel.org> Cc: Vivek Goyal <vgoyal@redhat.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2012-03-06 05:15:20 +08:00
struct request_queue *q;
blkcg_bypass_start();
spin_lock(&blkio_list_lock);
BUG_ON(blkio_policy[blkiop->plid]);
blkio_policy[blkiop->plid] = blkiop;
list_add_tail(&blkiop->list, &blkio_list);
spin_unlock(&blkio_list_lock);
blkcg: unify blkg's for blkcg policies Currently, blkg is per cgroup-queue-policy combination. This is unnatural and leads to various convolutions in partially used duplicate fields in blkg, config / stat access, and general management of blkgs. This patch make blkg's per cgroup-queue and let them serve all policies. blkgs are now created and destroyed by blkcg core proper. This will allow further consolidation of common management logic into blkcg core and API with better defined semantics and layering. As a transitional step to untangle blkg management, elvswitch and policy [de]registration, all blkgs except the root blkg are being shot down during elvswitch and bypass. This patch adds blkg_root_update() to update root blkg in place on policy change. This is hacky and racy but should be good enough as interim step until we get locking simplified and switch over to proper in-place update for all blkgs. -v2: Root blkgs need to be updated on elvswitch too and blkg_alloc() comment wasn't updated according to the function change. Fixed. Both pointed out by Vivek. -v3: v2 updated blkg_destroy_all() to invoke update_root_blkg_pd() for all policies. This freed root pd during elvswitch before the last queue finished exiting and led to oops. Directly invoke update_root_blkg_pd() only on BLKIO_POLICY_PROP from cfq_exit_queue(). This also is closer to what will be done with proper in-place blkg update. Reported by Vivek. Signed-off-by: Tejun Heo <tj@kernel.org> Cc: Vivek Goyal <vgoyal@redhat.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2012-03-06 05:15:20 +08:00
list_for_each_entry(q, &all_q_list, all_q_node)
update_root_blkg_pd(q, blkiop->plid);
blkcg_bypass_end();
}
EXPORT_SYMBOL_GPL(blkio_policy_register);
void blkio_policy_unregister(struct blkio_policy_type *blkiop)
{
blkcg: unify blkg's for blkcg policies Currently, blkg is per cgroup-queue-policy combination. This is unnatural and leads to various convolutions in partially used duplicate fields in blkg, config / stat access, and general management of blkgs. This patch make blkg's per cgroup-queue and let them serve all policies. blkgs are now created and destroyed by blkcg core proper. This will allow further consolidation of common management logic into blkcg core and API with better defined semantics and layering. As a transitional step to untangle blkg management, elvswitch and policy [de]registration, all blkgs except the root blkg are being shot down during elvswitch and bypass. This patch adds blkg_root_update() to update root blkg in place on policy change. This is hacky and racy but should be good enough as interim step until we get locking simplified and switch over to proper in-place update for all blkgs. -v2: Root blkgs need to be updated on elvswitch too and blkg_alloc() comment wasn't updated according to the function change. Fixed. Both pointed out by Vivek. -v3: v2 updated blkg_destroy_all() to invoke update_root_blkg_pd() for all policies. This freed root pd during elvswitch before the last queue finished exiting and led to oops. Directly invoke update_root_blkg_pd() only on BLKIO_POLICY_PROP from cfq_exit_queue(). This also is closer to what will be done with proper in-place blkg update. Reported by Vivek. Signed-off-by: Tejun Heo <tj@kernel.org> Cc: Vivek Goyal <vgoyal@redhat.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2012-03-06 05:15:20 +08:00
struct request_queue *q;
blkcg_bypass_start();
spin_lock(&blkio_list_lock);
BUG_ON(blkio_policy[blkiop->plid] != blkiop);
blkio_policy[blkiop->plid] = NULL;
list_del_init(&blkiop->list);
spin_unlock(&blkio_list_lock);
blkcg: unify blkg's for blkcg policies Currently, blkg is per cgroup-queue-policy combination. This is unnatural and leads to various convolutions in partially used duplicate fields in blkg, config / stat access, and general management of blkgs. This patch make blkg's per cgroup-queue and let them serve all policies. blkgs are now created and destroyed by blkcg core proper. This will allow further consolidation of common management logic into blkcg core and API with better defined semantics and layering. As a transitional step to untangle blkg management, elvswitch and policy [de]registration, all blkgs except the root blkg are being shot down during elvswitch and bypass. This patch adds blkg_root_update() to update root blkg in place on policy change. This is hacky and racy but should be good enough as interim step until we get locking simplified and switch over to proper in-place update for all blkgs. -v2: Root blkgs need to be updated on elvswitch too and blkg_alloc() comment wasn't updated according to the function change. Fixed. Both pointed out by Vivek. -v3: v2 updated blkg_destroy_all() to invoke update_root_blkg_pd() for all policies. This freed root pd during elvswitch before the last queue finished exiting and led to oops. Directly invoke update_root_blkg_pd() only on BLKIO_POLICY_PROP from cfq_exit_queue(). This also is closer to what will be done with proper in-place blkg update. Reported by Vivek. Signed-off-by: Tejun Heo <tj@kernel.org> Cc: Vivek Goyal <vgoyal@redhat.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2012-03-06 05:15:20 +08:00
list_for_each_entry(q, &all_q_list, all_q_node)
update_root_blkg_pd(q, blkiop->plid);
blkcg_bypass_end();
}
EXPORT_SYMBOL_GPL(blkio_policy_unregister);