linux_old1

Commit Graph

Author	SHA1	Message	Date
Tejun Heo	5f46990787	cgroup: update the meaning of cftype->max_write_len cftype->max_write_len is used to extend the maximum size of writes. It's interpreted in such a way that the actual maximum size is one less than the specified value. The default size is defined by CGROUP_LOCAL_BUFFER_SIZE. Its interpretation is quite confusing - its value is decremented by 1 and then compared for equality with max size, which means that the actual default size is CGROUP_LOCAL_BUFFER_SIZE - 2, which is 62 chars. There's no point in having a limit that low. Update its definition so that it means the actual string length sans termination and anything below PAGE_SIZE-1 is treated as PAGE_SIZE-1. .max_write_len for "release_agent" is updated to PATH_MAX-1 and cgroup_release_agent_write() is updated so that the redundant strlen() check is removed and it uses strlcpy() instead of strcpy(). .max_write_len initializations in blk-throttle.c and cfq-iosched.c are no longer necessary and removed. The one in cpuset is kept unchanged as it's an approximated value to begin with. This will also make transition to kernfs smoother. Signed-off-by: Tejun Heo <tj@kernel.org> Acked-by: Li Zefan <lizefan@huawei.com>	2014-02-11 11:52:48 -05:00
Tejun Heo	2da8ca822d	cgroup: replace cftype->read_seq_string() with cftype->seq_show() In preparation of conversion to kernfs, cgroup file handling is updated so that it can be easily mapped to kernfs. This patch replaces cftype->read_seq_string() with cftype->seq_show() which is not limited to single_open() operation and will map directly to kernfs seq_file interface. The conversions are mechanical. As ->seq_show() doesn't have @css and @cft, the functions which make use of them are converted to use seq_css() and seq_cft() respectively. In several occasions, e.g. if it has seq_string in its name, the function name is updated to fit the new method better. This patch does not introduce any behavior changes. Signed-off-by: Tejun Heo <tj@kernel.org> Acked-by: Aristeu Rozanski <arozansk@redhat.com> Acked-by: Vivek Goyal <vgoyal@redhat.com> Acked-by: Michal Hocko <mhocko@suse.cz> Acked-by: Daniel Wagner <daniel.wagner@bmw-carit.de> Acked-by: Li Zefan <lizefan@huawei.com> Cc: Jens Axboe <axboe@kernel.dk> Cc: Ingo Molnar <mingo@redhat.com> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Balbir Singh <bsingharora@gmail.com> Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Cc: Neil Horman <nhorman@tuxdriver.com>	2013-12-05 12:28:04 -05:00
Peter Zijlstra	90d3839b90	block: Use u64_stats_init() to initialize seqcounts Now that seqcounts are lockdep enabled objects, we need to explicitly initialize runtime allocated seqcounts so that lockdep can track them. Without this patch, Fengguang was seeing: [ 4.127282] INFO: trying to register non-static key. [ 4.128027] the code is fine but needs lockdep annotation. [ 4.128027] turning off the locking correctness validator. [ 4.128027] CPU: 0 PID: 96 Comm: kworker/u4:1 Not tainted 3.12.0-next-20131108-10601-gbad570d #2 [ 4.128027] Hardware name: Bochs Bochs, BIOS Bochs 01/01/2011 [ ... ] [ 4.128027] Call Trace: [ 4.128027] [<7908e744>] ? console_unlock+0x353/0x380 [ 4.128027] [<79dc7cf2>] dump_stack+0x48/0x60 [ 4.128027] [<7908953e>] __lock_acquire.isra.26+0x7e3/0xceb [ 4.128027] [<7908a1c5>] lock_acquire+0x71/0x9a [ 4.128027] [<794079aa>] ? blk_throtl_bio+0x1c3/0x485 [ 4.128027] [<7940658b>] throtl_update_dispatch_stats+0x7c/0x153 [ 4.128027] [<794079aa>] ? blk_throtl_bio+0x1c3/0x485 [ 4.128027] [<794079aa>] blk_throtl_bio+0x1c3/0x485 ... Use u64_stats_init() for all affected data structures, which initializes the seqcount. Reported-and-Tested-by: Fengguang Wu <fengguang.wu@intel.com> Cc: Vivek Goyal <vgoyal@redhat.com> Cc: Jens Axboe <axboe@kernel.dk> Signed-off-by: Peter Zijlstra <peterz@infradead.org> [ Folded in another fix from the mailing list as well as a fix to that fix. Tweaked commit message. ] Signed-off-by: John Stultz <john.stultz@linaro.org> Signed-off-by: Peter Zijlstra <peterz@infradead.org> Link: http://lkml.kernel.org/r/1384314134-6895-1-git-send-email-john.stultz@linaro.org [ So I actually think that the two SOBs from PeterZ are the right depiction of the patch route. ] Signed-off-by: Ingo Molnar <mingo@kernel.org>	2013-11-13 13:54:08 +01:00
Linus Torvalds	68cf8d0c72	Merge branch 'for-3.12/core' of git://git.kernel.dk/linux-block Pull block IO fixes from Jens Axboe: "After merge window, no new stuff this time only a collection of neatly confined and simple fixes" * 'for-3.12/core' of git://git.kernel.dk/linux-block: cfq: explicitly use 64bit divide operation for 64bit arguments block: Add nr_bios to block_rq_remap tracepoint If the queue is dying then we only call the rq->end_io callout. This leaves bios setup on the request, because the caller assumes when the blk_execute_rq_nowait/blk_execute_rq call has completed that the rq->bios have been cleaned up. bio-integrity: Fix use of bs->bio_integrity_pool after free blkcg: relocate root_blkg setting and clearing block: Convert kmalloc_node(...GFP_ZERO...) to kzalloc_node(...) block: trace all devices plug operation	2013-09-22 15:00:11 -07:00
Anatol Pomozov	f3cff25f05	cfq: explicitly use 64bit divide operation for 64bit arguments 'samples' is a 64bit operand, but the second parameter of do_div() is 32 bits wide. do_div() silently truncates the high 32 bits and the calculated result is invalid. If the low 32 bits of 'samples' are zero, do_div() crashes the kernel. Signed-off-by: Anatol Pomozov <anatol.pomozov@gmail.com> Acked-by: Tejun Heo <tj@kernel.org> Signed-off-by: Jens Axboe <axboe@kernel.dk>	2013-09-22 12:43:47 -06:00
Joe Perches	c1b511eb21	block: Convert kmalloc_node(...GFP_ZERO...) to kzalloc_node(...) Use the helper function instead of __GFP_ZERO. Signed-off-by: Joe Perches <joe@perches.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>	2013-09-11 13:22:03 -06:00
Tejun Heo	182446d087	cgroup: pass around cgroup_subsys_state instead of cgroup in file methods cgroup is currently in the process of transitioning to using struct cgroup_subsys_state * as the primary handle instead of struct cgroup. Please see the previous commit which converts the subsystem methods for rationale. This patch converts all cftype file operations to take @css instead of @cgroup. cftypes for the cgroup core files don't have their subsystem pointer set. These will automatically use the dummy_css added by the previous patch and can be converted the same way. Most subsystem conversions are straightforward but there are some interesting ones. * freezer: update_if_frozen() is also converted to take @css instead of @cgroup for consistency. This will make the code look simpler too once iterators are converted to use css. * memory/vmpressure: mem_cgroup_from_css() needs to be exported to vmpressure while mem_cgroup_from_cont() can be made static. Updated accordingly. * cpu: cgroup_tg() doesn't have any user left. Removed. * cpuacct: cgroup_ca() doesn't have any user left. Removed. * hugetlb: hugetlb_cgroup_from_cgroup() doesn't have any user left. Removed. * net_cls: cgrp_cls_state() doesn't have any user left. Removed. Signed-off-by: Tejun Heo <tj@kernel.org> Acked-by: Li Zefan <lizefan@huawei.com> Acked-by: Michal Hocko <mhocko@suse.cz> Acked-by: Vivek Goyal <vgoyal@redhat.com> Acked-by: Aristeu Rozanski <aris@redhat.com> Acked-by: Daniel Wagner <daniel.wagner@bmw-carit.de> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Ingo Molnar <mingo@redhat.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Balbir Singh <bsingharora@gmail.com> Cc: Matt Helsley <matthltc@us.ibm.com> Cc: Jens Axboe <axboe@kernel.dk> Cc: Steven Rostedt <rostedt@goodmis.org>	2013-08-08 20:11:24 -04:00
Jianpeng Ma	d50235b7bc	elevator: Fix a race in elevator switching There's a race between elevator switching and normal I/O operation. Because struct elevator_queue and struct elevator_data are not allocated in one atomic operation, there is a chance of using a NULL ->elevator_data. For example: Thread A: blk_queue_bio, spin_lock_irq(q->queue_lock), elv_merge. Thread B: elevator_switch, elevator_alloc, elevator_init_fn. elevator_alloc cannot be called with queue_lock held, and ->elevator_data is still NULL after it returns. If thread A calls elv_merge at that point and needs information from elevator_data, the kernel crashes. Move elevator_alloc into elevator_init_fn so that both allocations happen as one atomic operation. The bug is easy to reproduce with the following method: 1:dd if=/dev/sdb of=/dev/null 2:while true;do echo noop > scheduler;echo deadline > scheduler;done The fix was tested with the same method. Signed-off-by: Jianpeng Ma <majianpeng@gmail.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>	2013-07-03 13:25:24 +02:00
Kent Overstreet	f73a1c7d11	block: Add bio_end_sector() Just a little convenience macro - main reason to add it now is preparing for immutable bio vecs, it'll reduce the size of the patch that puts bi_sector/bi_size/bi_idx into a struct bvec_iter. Signed-off-by: Kent Overstreet <koverstreet@google.com> CC: Jens Axboe <axboe@kernel.dk> CC: Lars Ellenberg <drbd-dev@lists.linbit.com> CC: Jiri Kosina <jkosina@suse.cz> CC: Alasdair Kergon <agk@redhat.com> CC: dm-devel@redhat.com CC: Neil Brown <neilb@suse.de> CC: Martin Schwidefsky <schwidefsky@de.ibm.com> CC: Heiko Carstens <heiko.carstens@de.ibm.com> CC: linux-s390@vger.kernel.org CC: Chris Mason <chris.mason@fusionio.com> CC: Steven Whitehouse <swhiteho@redhat.com> Acked-by: Steven Whitehouse <swhiteho@redhat.com>	2013-03-23 14:15:29 -07:00
Linus Torvalds	ee89f81252	Merge branch 'for-3.9/core' of git://git.kernel.dk/linux-block Pull block IO core bits from Jens Axboe: "Below are the core block IO bits for 3.9. It was delayed a few days since my workstation kept crashing every 2-8h after pulling it into current -git, but turns out it is a bug in the new pstate code (divide by zero, will report separately). In any case, it contains: - The big cfq/blkcg update from Tejun and Vivek. - Additional block and writeback tracepoints from Tejun. - Improvement of the should sort (based on queues) logic in the plug flushing. - _io() variants of the wait_for_completion() interface, using io_schedule() instead of schedule() to contribute to io wait properly. - Various little fixes. You'll get two trivial merge conflicts, which should be easy enough to fix up" Fix up the trivial conflicts due to hlist traversal cleanups (commit b67bfe0d42ca: "hlist: drop the node parameter from iterators"). * 'for-3.9/core' of git://git.kernel.dk/linux-block: (39 commits) block: remove redundant check to bd_openers() block: use i_size_write() in bd_set_size() cfq: fix lock imbalance with failed allocations drivers/block/swim3.c: fix null pointer dereference block: don't select PERCPU_RWSEM block: account iowait time when waiting for completion of IO request sched: add wait_for_completion_io[_timeout] writeback: add more tracepoints block: add block_{touch|dirty}_buffer tracepoint buffer: make touch_buffer() an exported function block: add @req to bio_{front|back}_merge tracepoints block: add missing block_bio_complete() tracepoint block: Remove should_sort judgement when flush blk_plug block,elevator: use new hashtable implementation cfq-iosched: add hierarchical cfq_group statistics cfq-iosched: collect stats from dead cfqgs cfq-iosched: separate out cfqg_stats_reset() from cfq_pd_reset_stats() blkcg: make blkcg_print_blkgs() grab q locks instead of blkcg lock block: RCU free request_queue blkcg: implement blkg_[rw]stat_recursive_sum() and blkg_[rw]stat_merge() ...	2013-02-28 12:52:24 -08:00
Sasha Levin	b67bfe0d42	hlist: drop the node parameter from iterators I'm not sure why, but the hlist for each entry iterators were conceived list_for_each_entry(pos, head, member) The hlist ones were greedy and wanted an extra parameter: hlist_for_each_entry(tpos, pos, head, member) Why did they need an extra pos parameter? I'm not quite sure. Not only they don't really need it, it also prevents the iterator from looking exactly like the list iterator, which is unfortunate. Besides the semantic patch, there was some manual work required: - Fix up the actual hlist iterators in linux/list.h - Fix up the declaration of other iterators based on the hlist ones. - A very small amount of places were using the 'node' parameter, this was modified to use 'obj->member' instead. - Coccinelle didn't handle the hlist_for_each_entry_safe iterator properly, so those had to be fixed up manually. The semantic patch which is mostly the work of Peter Senna Tschudin is here: @@ iterator name hlist_for_each_entry, hlist_for_each_entry_continue, hlist_for_each_entry_from, hlist_for_each_entry_rcu, hlist_for_each_entry_rcu_bh, hlist_for_each_entry_continue_rcu_bh, for_each_busy_worker, ax25_uid_for_each, ax25_for_each, inet_bind_bucket_for_each, sctp_for_each_hentry, sk_for_each, sk_for_each_rcu, sk_for_each_from, sk_for_each_safe, sk_for_each_bound, hlist_for_each_entry_safe, hlist_for_each_entry_continue_rcu, nr_neigh_for_each, nr_neigh_for_each_safe, nr_node_for_each, nr_node_for_each_safe, for_each_gfn_indirect_valid_sp, for_each_gfn_sp, for_each_host; type T; expression a,c,d,e; identifier b; statement S; @@ -T b; <+... when != b ( hlist_for_each_entry(a, - b, c, d) S | hlist_for_each_entry_continue(a, - b, c) S | hlist_for_each_entry_from(a, - b, c) S | hlist_for_each_entry_rcu(a, - b, c, d) S | hlist_for_each_entry_rcu_bh(a, - b, c, d) S | hlist_for_each_entry_continue_rcu_bh(a, - b, c) S | for_each_busy_worker(a, c, - b, d) S | ax25_uid_for_each(a, - b, c) S | ax25_for_each(a, - b, c) S | inet_bind_bucket_for_each(a, - b, c) S | sctp_for_each_hentry(a, - b, c) S | sk_for_each(a, - b, c) S | sk_for_each_rcu(a, - b, c) S | sk_for_each_from -(a, b) +(a) S + sk_for_each_from(a) S | sk_for_each_safe(a, - b, c, d) S | sk_for_each_bound(a, - b, c) S | hlist_for_each_entry_safe(a, - b, c, d, e) S | hlist_for_each_entry_continue_rcu(a, - b, c) S | nr_neigh_for_each(a, - b, c) S | nr_neigh_for_each_safe(a, - b, c, d) S | nr_node_for_each(a, - b, c) S | nr_node_for_each_safe(a, - b, c, d) S | - for_each_gfn_sp(a, c, d, b) S + for_each_gfn_sp(a, c, d) S | - for_each_gfn_indirect_valid_sp(a, c, d, b) S + for_each_gfn_indirect_valid_sp(a, c, d) S | for_each_host(a, - b, c) S | for_each_host_safe(a, - b, c, d) S | for_each_mesh_entry(a, - b, c, d) S ) ...+> [akpm@linux-foundation.org: drop bogus change from net/ipv4/raw.c] [akpm@linux-foundation.org: drop bogus hunk from net/ipv6/raw.c] [akpm@linux-foundation.org: checkpatch fixes] [akpm@linux-foundation.org: fix warnings] [akpm@linux-foundation.org: redo intrusive kvm changes] Tested-by: Peter Senna Tschudin <peter.senna@gmail.com> Acked-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com> Signed-off-by: Sasha Levin <sasha.levin@oracle.com> Cc: Wu Fengguang <fengguang.wu@intel.com> Cc: Marcelo Tosatti <mtosatti@redhat.com> Cc: Gleb Natapov <gleb@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2013-02-27 19:10:24 -08:00
Glauber Costa	a3cc86c2f0	cfq: fix lock imbalance with failed allocations While stress-running very-small container scenarios with the Kernel Memory Controller, I've run into a lockdep-detected lock imbalance in cfq-iosched.c. I'll apologize beforehand for not posting a backlog: I didn't anticipate it would be so hard to reproduce, so I didn't save my serial output and went directly on debugging. Turns out that it did not happen again in more than 20 runs, making it a quite rare pattern. But here is my analysis: When we are in very low-memory situations, we will arrive at cfq_find_alloc_queue and may not find a queue, having to resort to the oom queue, in an rcu-locked condition: if (!cfqq || cfqq == &cfqd->oom_cfqq) [ ... ] Next, we will release the rcu lock, and try to allocate a queue, retrying if we succeed: rcu_read_unlock(); spin_unlock_irq(cfqd->queue->queue_lock); new_cfqq = kmem_cache_alloc_node(cfq_pool, gfp_mask | __GFP_ZERO, cfqd->queue->node); spin_lock_irq(cfqd->queue->queue_lock); if (new_cfqq) goto retry; We are unlocked at this point, but it should be fine, since we will reacquire the rcu_read_lock when we retry. Except of course, that we may not retry: the allocation may very well fail and we'll keep on going through the flow: The next branch is: if (cfqq) { [ ... ] } else cfqq = &cfqd->oom_cfqq; And right before exiting, we'll issue rcu_read_unlock(). Being already unlocked, this is the likely source of our imbalance. Since cfqq is either already NULL or made NULL in the first statement of the outer branch, the only viable alternative here seems to be to return the oom queue right away in case of allocation failure. Please review the following patch and apply if you agree with my analysis. Signed-off-by: Glauber Costa <glommer@parallels.com> Cc: Jens Axboe <axboe@kernel.dk> Cc: Tejun Heo <tj@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Jens Axboe <axboe@kernel.dk>	2013-02-22 10:42:46 +01:00
Tejun Heo	43114018cb	cfq-iosched: add hierarchical cfq_group statistics Unfortunately, at this point, there's no way to make the existing statistics hierarchical without creating nasty surprises for the existing users. Just create recursive counterpart of the existing stats. Signed-off-by: Tejun Heo <tj@kernel.org> Acked-by: Vivek Goyal <vgoyal@redhat.com>	2013-01-09 08:05:13 -08:00
Tejun Heo	0b39920b5f	cfq-iosched: collect stats from dead cfqgs To support hierarchical stats, it's necessary to remember stats from dead children. Add cfqg->dead_stats and make a dying cfqg transfer its stats to the parent's dead-stats. The transfer happens from ->pd_offline_fn() and it is possible that there are some residual IOs completing afterwards. Currently, we lose these stats. Given that cgroup removal isn't a very high frequency operation and the amount of residual IOs on offline are likely to be nil or small, this shouldn't be a big deal and the complexity needed to handle residual IOs - another callback and rather elaborate synchronization to reach and lock the matching q - doesn't seem justified. Signed-off-by: Tejun Heo <tj@kernel.org> Acked-by: Vivek Goyal <vgoyal@redhat.com>	2013-01-09 08:05:13 -08:00
Tejun Heo	689665af44	cfq-iosched: separate out cfqg_stats_reset() from cfq_pd_reset_stats() Separate out cfqg_stats_reset() which takes struct cfqg_stats * from cfq_pd_reset_stats() and move the latter to where other pd methods are defined. cfqg_stats_reset() will be used to implement hierarchical stats. Signed-off-by: Tejun Heo <tj@kernel.org> Acked-by: Vivek Goyal <vgoyal@redhat.com>	2013-01-09 08:05:13 -08:00
Tejun Heo	4d5e80a760	blkcg: s/blkg_rwstat_sum()/blkg_rwstat_total()/ Rename blkg_rwstat_sum() to blkg_rwstat_total(). sum will be used for summing up stats from multiple blkgs. Signed-off-by: Tejun Heo <tj@kernel.org> Acked-by: Vivek Goyal <vgoyal@redhat.com>	2013-01-09 08:05:12 -08:00
Tejun Heo	d02f7aa8dc	cfq-iosched: enable full blkcg hierarchy support With the previous two patches, all cfqg scheduling decisions are based on vfraction and ready for hierarchy support. The only thing which keeps the behavior flat is cfqg_flat_parent() which makes vfraction calculation consider all non-root cfqgs children of the root cfqg. Replace it with cfqg_parent() which returns the real parent. This enables full blkcg hierarchy support for cfq-iosched. For example, consider the following hierarchy. root / \ A:500 B:250 / \ AA:500 AB:1000 For simplicity, let's say all the leaf nodes have active tasks and are on service tree. For each leaf node, vfraction would be AA: (500 / 1500) * (500 / 750) =~ 0.2222 AB: (1000 / 1500) * (500 / 750) =~ 0.4444 B: (250 / 750) =~ 0.3333 and vdisktime will be distributed accordingly. For more detail, please refer to Documentation/block/cfq-iosched.txt. v2: cfq-iosched.txt updated to describe group scheduling as suggested by Vivek. v3: blkio-controller.txt updated. Signed-off-by: Tejun Heo <tj@kernel.org> Acked-by: Vivek Goyal <vgoyal@redhat.com>	2013-01-09 08:05:11 -08:00
Tejun Heo	41cad6ab2c	cfq-iosched: convert cfq_group_slice() to use cfqg->vfraction cfq_group_slice() calculates slice by taking a fraction of cfq_target_latency according to the ratio of cfqg->weight against service_tree->total_weight. This currently works only because all cfqgs are treated to be at the same level. To prepare for proper hierarchy support, convert cfq_group_slice() to base the calculation on cfqg->vfraction. As cfqg->vfraction is always a fraction of 1 and represents the fraction allocated to the cfqg with hierarchy considered, the slice can be simply calculated by multiplying cfqg->vfraction to cfq_target_latency (with fixed point shift factored in). As vfraction calculation currently treats all non-root cfqgs as children of the root cfqg, this patch doesn't introduce noticeable behavior difference. Signed-off-by: Tejun Heo <tj@kernel.org> Acked-by: Vivek Goyal <vgoyal@redhat.com>	2013-01-09 08:05:11 -08:00
Tejun Heo	1d3650f713	cfq-iosched: implement hierarchy-ready cfq_group charge scaling Currently, cfqg charges are scaled directly according to cfqg->weight. Regardless of the number of active cfqgs or the amount of active weights, a given weight value always scales charge the same way. This works fine as long as all cfqgs are treated equally regardless of their positions in the hierarchy, which is what cfq currently implements. It can't work in hierarchical settings because the interpretation of a given weight value depends on where the weight is located in the hierarchy. This patch reimplements cfqg charge scaling so that it can be used to support hierarchy properly. The scheme is fairly simple and light-weight. * When a cfqg is added to the service tree, v(disktime)weight is calculated. It walks up the tree to root calculating the fraction it has in the hierarchy. At each level, the fraction can be calculated as cfqg->weight / parent->level_weight By compounding these, the global fraction of vdisktime the cfqg has claim to - vfraction - can be determined. * When the cfqg needs to be charged, the charge is scaled inversely proportionally to the vfraction. The new scaling scheme uses the same CFQ_SERVICE_SHIFT for fixed point representation as before; however, the smallest scaling factor is now 1 (ie. 1 << CFQ_SERVICE_SHIFT). This is different from before where 1 was for CFQ_WEIGHT_DEFAULT and higher weight would result in smaller scaling factor. While this shifts the global scale of vdisktime a bit, it doesn't change the relative relationships among cfqgs and the scheduling result isn't different. cfq_group_notify_queue_add uses fixed CFQ_IDLE_DELAY when appending new cfqg to the service tree. The specific value of CFQ_IDLE_DELAY didn't have any relevance to vdisktime before and is unlikely to cause any visible behavior difference now especially as the scale shift isn't that large. As the new scheme now makes proper distinction between cfqg->weight and ->leaf_weight, reverse the weight aliasing for root cfqgs. For root, both weights are now mapped to ->leaf_weight instead of the other way around. Because we're still using cfqg_flat_parent(), this patch shouldn't change the scheduling behavior in any noticeable way. v2: Beefed up comments on vfraction as requested by Vivek. Signed-off-by: Tejun Heo <tj@kernel.org> Acked-by: Vivek Goyal <vgoyal@redhat.com>	2013-01-09 08:05:11 -08:00
Tejun Heo	7918ffb5b8	cfq-iosched: implement cfq_group->nr_active and ->children_weight To prepare for blkcg hierarchy support, add cfqg->nr_active and ->children_weight. cfqg->nr_active counts the number of active cfqgs at the cfqg's level and ->children_weight is sum of weights of those cfqgs. The level covers itself (cfqg->leaf_weight) and immediate children. The two values are updated when a cfqg enters and leaves the group service tree. Unless the hierarchy is very deep, the added overhead should be negligible. Currently, the parent is determined using cfqg_flat_parent() which makes the root cfqg the parent of all other cfqgs. This is to make the transition to hierarchy-aware scheduling gradual. Scheduling logic will be converted to use cfqg->children_weight without actually changing the behavior. When everything is ready, blkcg_weight_parent() will be replaced with proper parent function. This patch doesn't introduce any behavior change. v2: s/cfqg->level_weight/cfqg->children_weight/ as per Vivek. Signed-off-by: Tejun Heo <tj@kernel.org> Acked-by: Vivek Goyal <vgoyal@redhat.com>	2013-01-09 08:05:11 -08:00
Tejun Heo	e71357e118	cfq-iosched: add leaf_weight cfq blkcg is about to grow proper hierarchy handling, where a child blkg's weight would nest inside the parent's. This makes tasks in a blkg compete against both tasks in the sibling blkgs and the tasks of child blkgs. We're gonna use the existing weight as the group weight which decides the blkg's weight against its siblings. This patch introduces a new weight - leaf_weight - which decides the weight of a blkg against the child blkgs. It's named leaf_weight because another way to look at it is that each internal blkg node has a hidden child leaf node which contains all its tasks and leaf_weight is the weight of the leaf node and handled the same as the weight of the child blkgs. This patch only adds leaf_weight fields and exposes it to userland. The new weight isn't actually used anywhere yet. Note that cfq-iosched currently officially supports only single level hierarchy and root blkgs compete with the first level blkgs - ie. root weight is basically being used as leaf_weight. For root blkgs, the two weights are kept in sync for backward compatibility. v2: cfqd->root_group->leaf_weight initialization was missing from cfq_init_queue() causing divide by zero when !CONFIG_CFQ_GROUP_SCHED. Fix it. Reported by Fengguang. Signed-off-by: Tejun Heo <tj@kernel.org> Cc: Fengguang Wu <fengguang.wu@intel.com>	2013-01-09 08:05:10 -08:00
Vivek Goyal	b226e5c411	cfq-iosched: Print sync-noidle information in blktrace messages Currently we attach a character "S" or "A" to the cfqq<pid>, to represent whether the queue is sync or async. Add one more character "N" to represent whether it is sync-noidle queue or sync queue. So now three different types of queues will look as follows. cfq1234S --> sync queue cfq1234SN --> sync noidle queue cfq1234A --> Async queue Previously S/A classification was being printed only if group scheduling was enabled. This patch also makes sure that this classification is displayed even if group idling is disabled. Signed-off-by: Vivek Goyal <vgoyal@redhat.com> Acked-by: Jeff Moyer <jmoyer@redhat.com> Signed-off-by: Tejun Heo <tj@kernel.org>	2013-01-09 08:05:09 -08:00
Vivek Goyal	1f23f12151	cfq-iosched: Get rid of unnecessary local variable Use of local variable "n" seems to be unnecessary. Remove it. This brings it in line with function __cfq_group_st_add(), which is also doing the similar operation of adding a group to a rb tree. No functionality change here. Signed-off-by: Vivek Goyal <vgoyal@redhat.com> Acked-by: Jeff Moyer <jmoyer@redhat.com> Signed-off-by: Tejun Heo <tj@kernel.org>	2013-01-09 08:05:09 -08:00
Vivek Goyal	6d816ec7c8	cfq-iosched: Rename few functions related to selecting workload choose_service_tree() selects/sets both wl_class and wl_type. Rename it to choose_wl_class_and_type() to make it very clear. cfq_choose_wl() only selects and sets wl_type. It is easy to confuse it with choose_st(). So rename it to cfq_choose_wl_type() to make it clear what does it do. Just renaming. No functionality change. Signed-off-by: Vivek Goyal <vgoyal@redhat.com> Acked-by: Jeff Moyer <jmoyer@redhat.com> Signed-off-by: Tejun Heo <tj@kernel.org>	2013-01-09 08:05:09 -08:00
Vivek Goyal	34b98d03bd	cfq-iosched: Rename "service_tree" to "st" at some places At quite a few places we use the keyword "service_tree". At some places, especially local variables, I have abbreviated it to "st". Also at couple of places moved binary operator "+" from beginning of line to end of previous line, as per Tejun's feedback. v2: Reverted most of the service tree name change based on Jeff Moyer's feedback. Signed-off-by: Vivek Goyal <vgoyal@redhat.com> Signed-off-by: Tejun Heo <tj@kernel.org>	2013-01-09 08:05:09 -08:00
Vivek Goyal	4d2ceea4cb	cfq-iosched: More renaming to better represent wl_class and wl_type Some more renaming. Again making the code uniform w.r.t use of wl_class/class to represent IO class (RT, BE, IDLE) and using wl_type/type to represent subclass (SYNC, SYNC-IDLE, ASYNC). At places this patch shortens the string "workload" to "wl". Renamed "saved_workload" to "saved_wl_type". Renamed "saved_serving_class" to "saved_wl_class". For uniformity with "saved_wl_*" variables, renamed "serving_class" to "serving_wl_class" and renamed "serving_type" to "serving_wl_type". Again, just trying to improve upon code uniformity and improve readability. No functional change. v2: - Restored the usage of keyword "service" based on Jeff Moyer's feedback. Signed-off-by: Vivek Goyal <vgoyal@redhat.com> Signed-off-by: Tejun Heo <tj@kernel.org>	2013-01-09 08:05:09 -08:00
Vivek Goyal	3bf10fea3b	cfq-iosched: Properly name all references to IO class Currently CFQ has three IO classes, RT, BE and IDLE. At many a places we are calling workloads belonging to these classes as "prio". This gets very confusing as one starts to associate it with ioprio. So this patch just does bunch of renaming so that reading code becomes easier. All reference to RT, BE and IDLE workload are done using keyword "class" and all references to subclass, SYNC, SYNC-IDLE, ASYNC are made using keyword "type". This makes me feel much better while I am reading the code. There is no functionality change due to this patch. Signed-off-by: Vivek Goyal <vgoyal@redhat.com> Acked-by: Jeff Moyer <jmoyer@redhat.com> Acked-by: Tejun Heo <tj@kernel.org> Signed-off-by: Tejun Heo <tj@kernel.org>	2013-01-09 08:05:08 -08:00
Shaohua Li	3d106fba2e	block CFQ: avoid moving request to different queue request is queued in cfqq->fifo list. It looks possible that we move a request from one cfqq to another in the request merge case. In such case, adjusting the fifo list order doesn't make sense and is impossible if we don't iterate the whole fifo list. My test does hit one case where the two cfqq are different, but it didn't cause a kernel crash, maybe because the fifo list isn't used frequently. Anyway, from the code logic, this is buggy. I thought we can re-enable the recursive merge logic after this is fixed. Signed-off-by: Shaohua Li <shli@fusionio.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>	2012-11-06 12:39:51 +01:00
Tejun Heo	ffea73fc72	block: blkcg_policy_cfq shouldn't be used if !CONFIG_CFQ_GROUP_IOSCHED cfq may be built w/ or w/o blkcg support depending on CONFIG_CFQ_CGROUP_IOSCHED. If blkcg support is disabled, most of related code is ifdef'd out but some part is left dangling - blkcg_policy_cfq is left zero-filled and blkcg_policy_[un]register() calls are made on it. Feeding zero filled policy to blkcg_policy_register() is incorrect and triggers the following WARN_ON() if CONFIG_BLK_CGROUP && !CONFIG_CFQ_GROUP_IOSCHED. ------------[ cut here ]------------ WARNING: at block/blk-cgroup.c:867 Modules linked in: Modules linked in: CPU: 3 Not tainted 3.4.0-09547-gfb21aff #1 Process swapper/0 (pid: 1, task: 000000003ff80000, ksp: 000000003ff7f8b8) Krnl PSW : 0704100180000000 00000000003d76ca (blkcg_policy_register+0xca/0xe0) R:0 T:1 IO:1 EX:1 Key:0 M:1 W:0 P:0 AS:0 CC:1 PM:0 EA:3 Krnl GPRS: 0000000000000000 00000000014b85ec 00000000014b85b0 0000000000000000 000000000096fb60 0000000000000000 00000000009a8e78 0000000000000048 000000000099c070 0000000000b6f000 0000000000000000 000000000099c0b8 00000000014b85b0 0000000000667580 000000003ff7fd98 000000003ff7fd70 Krnl Code: 00000000003d76be: a7280001 lhi %r2,1 00000000003d76c2: a7f4ffdf brc 15,3d7680 #00000000003d76c6: a7f40001 brc 15,3d76c8 >00000000003d76ca: a7c8ffea lhi %r12,-22 00000000003d76ce: a7f4ffce brc 15,3d766a 00000000003d76d2: a7f40001 brc 15,3d76d4 00000000003d76d6: a7c80000 lhi %r12,0 00000000003d76da: a7f4ffc2 brc 15,3d765e Call Trace: ([<0000000000b6f000>] initcall_debug+0x0/0x4) [<0000000000989e8a>] cfq_init+0x62/0xd4 [<00000000001000ba>] do_one_initcall+0x3a/0x170 [<000000000096fb60>] kernel_init+0x214/0x2bc [<0000000000623202>] kernel_thread_starter+0x6/0xc [<00000000006231fc>] kernel_thread_starter+0x0/0xc no locks held by swapper/0/1. 
Last Breaking-Event-Address: [<00000000003d76c6>] blkcg_policy_register+0xc6/0xe0 ---[ end trace b8ef4903fcbf9dd3 ]--- This patch fixes the problem by ensuring all blkcg support code is inside CONFIG_CFQ_GROUP_IOSCHED. * blkcg_policy_cfq declaration and blkg_to_cfqg() definition are moved inside the first CONFIG_CFQ_GROUP_IOSCHED block. __maybe_unused is dropped from blkcg_policy_cfq decl. * blkcg_deactivate_policy() invocation is moved inside ifdef. This also makes the activation logic match cfq_init_queue(). * All blkcg_policy_[un]register() invocations are moved inside ifdef. Signed-off-by: Tejun Heo <tj@kernel.org> Reported-by: Heiko Carstens <heiko.carstens@de.ibm.com> LKML-Reference: <20120601112954.GC3535@osiris.boeblingen.de.ibm.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>	2012-06-04 10:02:29 +02:00
Tejun Heo	fd7949564c	block: fix return value on cfq_init() failure cfq_init() would return zero after kmem cache creation failure. Fix so that it returns -ENOMEM. Signed-off-by: Tejun Heo <tj@kernel.org> Signed-off-by: Jens Axboe <axboe@kernel.dk>	2012-06-04 10:01:38 +02:00
Jens Axboe	0b7877d4ee	Linux 3.4-rc5 -----BEGIN PGP SIGNATURE----- Version: GnuPG v2.0.18 (GNU/Linux) iQEcBAABAgAGBQJPnb50AAoJEHm+PkMAQRiGAE0H/A4zFZIUGmF3miKPDYmejmrZ oVDYxVAu6JHjHWhu8E3VsinvyVscowjV8dr15eSaQzmDmRkSHAnUQ+dB7Di7jLC2 MNopxsWjwyZ8zvvr3rFR76kjbWKk/1GYytnf7GPZLbJQzd51om2V/TY/6qkwiDSX U8Tt7ihSgHAezefqEmWp2X/1pxDCEt+VFyn9vWpkhgdfM1iuzF39MbxSZAgqDQ/9 JJrBHFXhArqJguhENwL7OdDzkYqkdzlGtS0xgeY7qio2CzSXxZXK4svT6FFGA8Za xlAaIvzslDniv3vR2ZKd6wzUwFHuynX222hNim3QMaYdXm012M+Nn1ufKYGFxI0= =4d4w -----END PGP SIGNATURE----- Merge tag 'v3.4-rc5' into for-3.5/core The core branch is behind driver commits that we want to build on for 3.5, hence I'm pulling in a later -rc. Linux 3.4-rc5 Conflicts: Documentation/feature-removal-schedule.txt Signed-off-by: Jens Axboe <axboe@kernel.dk>	2012-05-01 14:29:55 +02:00
Tejun Heo	f9fcc2d391	blkcg: collapse blkcg_policy_ops into blkcg_policy There's no reason to keep blkcg_policy_ops separate. Collapse it into blkcg_policy. This patch doesn't introduce any functional change. Signed-off-by: Tejun Heo <tj@kernel.org> Cc: Vivek Goyal <vgoyal@redhat.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>	2012-04-20 10:06:17 +02:00
Tejun Heo	f95a04afa8	blkcg: embed struct blkg_policy_data in policy specific data Currently blkg_policy_data carries policy specific data as char flex array instead of being embedded in policy specific data. This was forced by oddities around blkg allocation which are all gone now. This patch makes blkg_policy_data embedded in policy specific data - throtl_grp and cfq_group so that it's more conventional and consistent with how io_cq is handled. * blkcg_policy->pdata_size is renamed to ->pd_size. * Functions which used to take void *pdata now take struct blkg_policy_data *pd. * blkg_to_pdata/pdata_to_blkg() updated to blkg_to_pd/pd_to_blkg(). * Dummy struct blkg_policy_data definition added. Dummy pdata_to_blkg() definition was unused and inconsistent with the non-dummy version - correct dummy pd_to_blkg() added. * throtl and cfq updated accordingly. * As dummy blkg_to_pd/pd_to_blkg() are provided, blkg_to_cfqg/cfqg_to_blkg() don't need to be ifdef'd. Moved outside ifdef block. This patch doesn't introduce any functional change. Signed-off-by: Tejun Heo <tj@kernel.org> Cc: Vivek Goyal <vgoyal@redhat.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>	2012-04-20 10:06:17 +02:00
Tejun Heo	3c798398e3	blkcg: mass rename of blkcg API During the recent blkcg cleanup, most of blkcg API has changed to such extent that mass renaming wouldn't cause any noticeable pain. Take the chance and cleanup the naming. * Rename blkio_cgroup to blkcg. * Drop blkio / blkiocg prefixes and consistently use blkcg. * Rename blkio_group to blkcg_gq, which is consistent with io_cq but keep the blkg prefix / variable name. * Rename policy method type and field names to signify they're dealing with policy data. * Rename blkio_policy_type to blkcg_policy. This patch doesn't cause any functional change. Signed-off-by: Tejun Heo <tj@kernel.org> Cc: Vivek Goyal <vgoyal@redhat.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>	2012-04-20 10:06:17 +02:00
Tejun Heo	54e7ed12ba	blkcg: remove blkio_group->path[] blkio_group->path[] stores the path of the associated cgroup and is used only for debug messages. Just format the path from blkg->cgroup when printing debug messages. Signed-off-by: Tejun Heo <tj@kernel.org> Cc: Vivek Goyal <vgoyal@redhat.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>	2012-04-20 10:06:16 +02:00
Tejun Heo	3c96cb32d3	blkcg: drop stuff unused after per-queue policy activation update * All_q_list is unused. Drop all_q_{mutex\|list}. * @for_root of blkg_lookup_create() is always %false when called from outside blk-cgroup.c proper. Factor out __blkg_lookup_create() so that it doesn't check whether @q is bypassing and use the underscored version for the @for_root callsite. * blkg_destroy_all() is used only from blkcg proper and @destroy_root is always %true. Make it static and drop @destroy_root. Signed-off-by: Tejun Heo <tj@kernel.org> Cc: Vivek Goyal <vgoyal@redhat.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>	2012-04-20 10:06:06 +02:00
Tejun Heo	a2b1693bac	blkcg: implement per-queue policy activation All blkcg policies were assumed to be enabled on all request_queues. Due to various implementation obstacles, during the recent blkcg core updates, this was temporarily implemented as shooting down all !root blkgs on elevator switch and policy [de]registration combined with half-broken in-place root blkg updates. In addition to being buggy and racy, this meant losing all blkcg configurations across those events. Now that blkcg is cleaned up enough, this patch replaces the temporary implementation with proper per-queue policy activation. Each blkcg policy should call the new blkcg_[de]activate_policy() to enable and disable the policy on a specific queue. blkcg_activate_policy() allocates and installs policy data for the policy for all existing blkgs. blkcg_deactivate_policy() does the reverse. If a policy is not enabled for a given queue, blkg printing / config functions skip the respective blkg for the queue. blkcg_activate_policy() also takes care of root blkg creation, and cfq_init_queue() and blk_throtl_init() are updated accordingly. This makes blkcg_bypass_{start\|end}() and update_root_blkg_pd() unnecessary. Dropped. v2: cfq_init_queue() was returning uninitialized @ret on root_group alloc failure if !CONFIG_CFQ_GROUP_IOSCHED. Fixed. Signed-off-by: Tejun Heo <tj@kernel.org> Cc: Vivek Goyal <vgoyal@redhat.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>	2012-04-20 10:06:06 +02:00
Tejun Heo	03d8e11142	blkcg: add request_queue->root_blkg With per-queue policy activation, root blkg creation will be moved to blkcg core. Add q->root_blkg in preparation. For blk-throtl, this replaces throtl_data->root_tg; however, cfq needs to keep cfqd->root_group for !CONFIG_CFQ_GROUP_IOSCHED. This is to prepare for per-queue policy activation and doesn't cause any functional difference. Signed-off-by: Tejun Heo <tj@kernel.org> Cc: Vivek Goyal <vgoyal@redhat.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>	2012-04-20 10:06:06 +02:00
Tejun Heo	da8b066262	blkcg: make blkg_conf_prep() take @pol and return with queue lock held Add @pol to blkg_conf_prep() and let it return with queue lock held (to be released by blkg_conf_finish()). Note that @pol isn't used yet. This is to prepare for per-queue policy activation and doesn't cause any visible difference. Signed-off-by: Tejun Heo <tj@kernel.org> Cc: Vivek Goyal <vgoyal@redhat.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>	2012-04-20 10:06:06 +02:00
Tejun Heo	8bd435b30e	blkcg: remove static policy ID enums Remove BLKIO_POLICY_* enums and let blkio_policy_register() allocate @pol->plid dynamically on registration. The maximum number of blkcg policies which can be registered at the same time is defined by BLKCG_MAX_POLS constant added to include/linux/blkdev.h. Note that blkio_policy_register() now may fail. Policy init functions updated accordingly and unnecessary ifdefs removed from cfq_init(). Signed-off-by: Tejun Heo <tj@kernel.org> Cc: Vivek Goyal <vgoyal@redhat.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>	2012-04-20 10:06:06 +02:00
Tejun Heo	ec399347d3	blkcg: use @pol instead of @plid in update_root_blkg_pd() and blkcg_print_blkgs() The two functions were taking "enum blkio_policy_id plid". Make them take "const struct blkio_policy_type *pol" instead. This is to prepare for per-queue policy activation and doesn't cause any functional difference. Signed-off-by: Tejun Heo <tj@kernel.org> Cc: Vivek Goyal <vgoyal@redhat.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>	2012-04-20 10:06:06 +02:00
Tejun Heo	f48ec1d788	cfq: fix build breakage & warnings * CFQ_WEIGHT_* defined inside CONFIG_BLK_CGROUP causes cfq-iosched.c compile failure when the config is disabled. Move it outside the ifdef block. * Dummy cfqg_stats_*() definitions were lacking inline modifiers causing unused functions warning if !CONFIG_CFQ_GROUP_IOSCHED. Add them. Signed-off-by: Tejun Heo <tj@kernel.org> Cc: Vivek Goyal <vgoyal@redhat.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>	2012-04-20 10:06:06 +02:00
Tejun Heo	5bc4afb1ec	blkcg: drop BLKCG_STAT_{PRIV\|POL\|OFF} macros Now that all stat handling code lives in policy implementations, there's no need to encode policy ID in cft->private. * Export blkcg_prfill_[rw]stat() from blkcg, remove blkcg_print_[rw]stat(), and implement cfqg_print_[rw]stat() which use hard-coded BLKIO_POLICY_PROP. * Use cft->private for offset of the target field directly and drop BLKCG_STAT_{PRIV\|POL\|OFF}(). Signed-off-by: Tejun Heo <tj@kernel.org>	2012-04-01 14:38:45 -07:00
Tejun Heo	d366e7ec41	blkcg: pass around pd->pdata instead of pd itself in prfill functions Now that all conf and stat fields are moved into policy specific blkio_policy_data->pdata areas, there's no reason to use blkio_policy_data itself in prfill functions. Pass around @pd->pdata instead of @pd. Signed-off-by: Tejun Heo <tj@kernel.org>	2012-04-01 14:38:44 -07:00
Tejun Heo	3381cb8d2e	blkcg: move blkio_group_conf->weight to cfq blkio_group_conf->weight is owned by cfq and has no reason to be defined in blkcg core. Replace it with cfq_group->dev_weight and let conf setting functions directly set it. If dev_weight is zero, the cfqg doesn't have device specific weight configured. Also, rename BLKIO_WEIGHT_* constants to CFQ_WEIGHT_* and rename blkio_cgroup->weight to blkio_cgroup->cfq_weight. We eventually want per-policy storage in blkio_cgroup but just mark the ownership of the field for now. Signed-off-by: Tejun Heo <tj@kernel.org>	2012-04-01 14:38:44 -07:00
Tejun Heo	155fead9b6	blkcg: move blkio_group_stats to cfq-iosched.c blkio_group_stats contains only fields used by cfq and has no reason to be defined in blkcg core. * Move blkio_group_stats to cfq-iosched.c and rename it to cfqg_stats. * blkg_policy_data->stats is replaced with cfq_group->stats. blkg_prfill_[rw]stat() are updated to use offset against pd->pdata instead. * All related macros / functions are renamed so that they have cfqg_ prefix and the unnecessary @pol arguments are dropped. * All stat functions now take cfq_group * instead of blkio_group *. lockdep assertion on queue lock dropped. Elevator runs under queue lock by default. There isn't much to be gained by adding lockdep assertions at stat function level. * cfqg_stats_reset() implemented for blkio_reset_group_stats_fn method so that cfqg->stats can be reset. Signed-off-by: Tejun Heo <tj@kernel.org>	2012-04-01 14:38:44 -07:00
Tejun Heo	41b38b6d54	blkcg: cfq doesn't need per-cpu dispatch stats blkio_group_stats_cpu is used to count dispatch stats using per-cpu counters. This is used by both blk-throtl and cfq-iosched but the sharing is rather silly. * cfq-iosched doesn't need per-cpu dispatch stats. cfq always updates those stats while holding queue_lock. * blk-throtl needs per-cpu dispatch stats but only service_bytes and serviced. It doesn't make use of sectors. This patch makes cfq add and use global stats for service_bytes, serviced and sectors, removes per-cpu sectors counter and moves per-cpu stat printing code to blk-throttle.c. Signed-off-by: Tejun Heo <tj@kernel.org>	2012-04-01 14:38:44 -07:00
Tejun Heo	629ed0b102	blkcg: move statistics update code to policies As with conf/stats file handling code, there's no reason for stat update code to live in blkcg core with policies calling in to update them. The current organization is both inflexible and complex. This patch moves stat update code to specific policies. All blkiocg_update_*_stats() functions which deal with BLKIO_POLICY_PROP stats are collapsed into their cfq_blkiocg_update_*_stats() counterparts. blkiocg_update_dispatch_stats() is used by both policies and duplicated as throtl_update_dispatch_stats() and cfq_blkiocg_update_dispatch_stats(). This will be cleaned up later. Signed-off-by: Tejun Heo <tj@kernel.org>	2012-04-01 14:38:44 -07:00
Tejun Heo	2ce4d50f9c	cfq: collapse cfq.h into cfq-iosched.c block/cfq.h contains some functions which interact with blkcg; however, this is only part of it and cfq-iosched.c already has quite some #ifdef CONFIG_CFQ_GROUP_IOSCHED. With conf/stat handling being moved to specific policies, having these relay functions isolated in cfq.h doesn't make much sense. Collapse cfq.h into cfq-iosched.c for now. Let's split blkcg support properly later if necessary. Signed-off-by: Tejun Heo <tj@kernel.org>	2012-04-01 14:38:43 -07:00
Tejun Heo	60c2bc2d5a	blkcg: move conf/stat file handling code to policies blkcg conf/stat handling is convoluted in that details which belong to specific policy implementations are all out in blkcg core and then policies hook into core layer to access and manipulate confs and stats. This sadly achieves both inflexibility (confs/stats can't be modified without messing with blkcg core) and complexity (all the call-ins and call-backs). The previous patches restructured conf and stat handling code such that they can be separated out. This patch relocates the file handling part. All conf/stat file handling code which belongs to BLKIO_POLICY_PROP is moved to cfq-iosched.c and all BLKIO_POLICY_THROTL code to blk-throttle.c. The move is verbatim except for blkio_update_group_{weight\|bps\|iops}() callbacks which relay conf changes to policies. The configuration settings are handled in policies themselves so the relaying isn't necessary. Conf setting functions are modified to directly call per-policy update functions and the relaying mechanism is dropped. Signed-off-by: Tejun Heo <tj@kernel.org>	2012-04-01 14:38:43 -07:00
Tejun Heo	aaec55a002	blkcg: remove unused @pol and @plid parameters @pol to blkg_to_pdata() and @plid to blkg_lookup_create() are no longer necessary. Drop them. Signed-off-by: Tejun Heo <tj@kernel.org>	2012-04-01 14:38:42 -07:00
Tao Ma	5bf14c0727	block: Make cfq_target_latency tunable through sysfs. In cfq, when we calculate a time slice for a process (or a cfqq, to be precise), we have to consider the cfq_target_latency so that all the sync requests have an estimated latency (300ms), and it is controlled by cfq_target_latency. But in some hadoop tests, we have found that if there are many processes doing sequential reads (24 for example), the throughput is bad because every process can only work for about 25ms before the cfqq is switched. That leads to more disk seeks. We can achieve good throughput by setting low_latency=0, but then some reads' latency is too high for the application. So this patch makes cfq_target_latency tunable through sysfs so that we can tune it and find some magic number which is not bad for both the throughput and the read latency. Cc: Jens Axboe <axboe@kernel.dk> Signed-off-by: Tao Ma <boyu.mt@taobao.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>	2012-04-01 14:33:39 -07:00
Tejun Heo	eb7d8c07f9	cfq: fix cfqg ref handling when BLK_CGROUP && !CFQ_GROUP_IOSCHED When BLK_CGROUP is enabled but CFQ_GROUP_IOSCHED isn't, cfq ends up calling blkg_get/put() on dummy cfqg leading to the following crash. BUG: unable to handle kernel NULL pointer dereference at 00000000000000b0 IP: [<ffffffff813d44d8>] cfq_init_queue+0x258/0x430 PGD 0 Oops: 0000 [#1] PREEMPT SMP DEBUG_PAGEALLOC CPU 0 Modules linked in: Pid: 1, comm: swapper/0 Not tainted 3.3.0-rc6-work+ #125 Bochs Bochs RIP: 0010:[<ffffffff813d44d8>] [<ffffffff813d44d8>] cfq_init_queue+0x258/0x430 RSP: 0018:ffff88001f9dfd80 EFLAGS: 00010046 RAX: ffff88001aefbbf0 RBX: ffff88001aeedbf0 RCX: 0000000000000100 RDX: 0000000000000000 RSI: 0000000000000000 RDI: ffffffff820ffd40 RBP: ffff88001f9dfdd0 R08: 0000000000000000 R09: 0000000000000001 R10: 0000000000000001 R11: 0000000000000000 R12: 0000000000000000 R13: 0000000000000009 R14: ffff88001aefbc30 R15: 0000000000000003 FS: 0000000000000000(0000) GS:ffff88001fc00000(0000) knlGS:0000000000000000 CS: 0010 DS: 0000 ES: 0000 CR0: 000000008005003b CR2: 00000000000000b0 CR3: 000000000206f000 CR4: 00000000000006f0 DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000 DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400 Process swapper/0 (pid: 1, threadinfo ffff88001f9de000, task ffff88001f9dc040) Stack: ffff88001aeedbf0 ffff88001aefbdb0 ffff88001aef1548 ffff88001aefbbf0 ffff88001f9dfdd0 ffff88001aef1548 ffffffff820d6320 ffffffff8165ce30 ffffffff82c555e0 ffff88001aeebbf0 ffff88001f9dfe00 ffffffff813b0507 Call Trace: [<ffffffff813b0507>] elevator_init+0xd7/0x140 [<ffffffff813b83d5>] blk_init_allocated_queue+0x125/0x150 [<ffffffff813b94d3>] blk_init_queue_node+0x43/0x80 [<ffffffff813b9523>] blk_init_queue+0x13/0x20 [<ffffffff821aec00>] floppy_init+0x82/0xec7 [<ffffffff810001d2>] do_one_initcall+0x42/0x170 [<ffffffff821835fc>] kernel_init+0xcb/0x14f [<ffffffff81b40b24>] kernel_thread_helper+0x4/0x10 Code: 00 e8 1d 9e 76 00 48 8b 43 48 48 85 
c0 48 89 83 28 03 00 00 74 07 4c 8b a0 10 ff ff ff 8b 15 b0 2e d0 00 85 d2 0f 85 49 01 00 00 <41> 8b 84 24 b0 00 00 00 85 c0 0f 8e 8c 01 00 00 83 e8 01 85 c0 RIP [<ffffffff813d44d8>] cfq_init_queue+0x258/0x430 Because cfq's blkcg support has a on/off switch, CFQ_GROUP_IOSCHED, separate from BLK_CGROUP, blkg access through cfqg needs to be conditioned on it. * Make blkg_to_cfqg() and cfqg_to_blkg() conditioned on CFQ_GROUP_IOSCHED. If disabled, they always return %NULL. * Introduce cfqg_get() and cfqg_put() conditioned on CFQ_GROUP_IOSCHED. If disabled, they are noops. Reported-by: Fengguang Wu <fengguang.wu@intel.com> Signed-off-by: Tejun Heo <tj@kernel.org> Signed-off-by: Jens Axboe <axboe@kernel.dk>	2012-03-23 14:02:53 +01:00
Tejun Heo	598971bfbd	cfq: don't use icq_get_changed() cfq caches the associated cfqq's for a given cic. The cache needs to be flushed if the cic's ioprio or blkcg has changed. It is currently done by requiring the changing action to set the respective ICQ_*_CHANGED bit in the icq and testing it from cfq_set_request(), which involves iterating through all the affected icqs. All cfq wants to know is whether ioprio and/or blkcg have changed since the last flush, which can be easily achieved by just remembering the current ioprio and blkcg ID in cic. This patch adds cic->{ioprio\|blkcg_id}, updates all ioprio users to use the remembered value instead, and updates cfq_set_request() path such that, instead of using icq_get_changed(), the current values are compared against the remembered ones and trigger appropriate flush action if not. Condition tests are moved inside both _changed functions which are now named check_ioprio_changed() and check_blkcg_changed(). ioprio.h::task_ioprio() can't be used anymore and is replaced with open-coded IOPRIO_CLASS_NONE case in cfq_async_queue_prio(). Signed-off-by: Tejun Heo <tj@kernel.org> Cc: Vivek Goyal <vgoyal@redhat.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>	2012-03-20 12:47:47 +01:00
Tejun Heo	abede6da27	cfq: pass around cfq_io_cq instead of io_context Now that io_cq is managed by block core and guaranteed to exist for any in-flight request, it is easier and carries more information to pass around cfq_io_cq than io_context. This patch updates cfq_init_prio_data(), cfq_find_alloc_queue() and cfq_get_queue() to take @cic instead of @ioc. This change removes a duplicate cfq_cic_lookup() from cfq_find_alloc_queue(). This change enables the use of cic-cached ioprio in the next patch. Signed-off-by: Tejun Heo <tj@kernel.org> Signed-off-by: Jens Axboe <axboe@kernel.dk>	2012-03-20 12:47:47 +01:00
Tejun Heo	4f85cb96d9	block: make block cgroup policies follow bio task association Implement bio_blkio_cgroup() which returns the blkcg associated with the bio if exists or %current's blkcg, and use it in blk-throttle and cfq-iosched propio. This makes both cgroup policies honor task association for the bio instead of always assuming %current. As nobody is using bio_set_task() yet, this doesn't introduce any behavior change. Signed-off-by: Tejun Heo <tj@kernel.org> Cc: Vivek Goyal <vgoyal@redhat.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>	2012-03-06 21:27:24 +01:00
Tejun Heo	852c788f83	block: implement bio_associate_current() IO scheduling and cgroup are tied to the issuing task via io_context and cgroup of %current. Unfortunately, there are cases where IOs need to be routed via a different task which makes scheduling and cgroup limit enforcement applied completely incorrectly. For example, all bios delayed by blk-throttle end up being issued by a delayed work item and get assigned the io_context of the worker task which happens to serve the work item and dumped to the default block cgroup. This is doubly confusing as bios which aren't delayed end up in the correct cgroup and makes using blk-throttle and cfq propio together impossible. Any code which punts IO issuing to another task is affected, which is getting more and more common (e.g. btrfs). As both io_context and cgroup are firmly tied to task including userland visible APIs to manipulate them, it makes a lot of sense to match up tasks to bios. This patch implements bio_associate_current() which associates the specified bio with %current. The bio will record the associated ioc and blkcg at that point and block layer will use the recorded ones regardless of which task actually ends up issuing the bio. bio release puts the associated ioc and blkcg. It grabs and remembers ioc and blkcg instead of the task itself because task may already be dead by the time the bio is issued making ioc and blkcg inaccessible and those are all block layer cares about. elevator_set_req_fn() is updated such that the bio elvdata is being allocated for is available to the elevator. This doesn't update block cgroup policies yet. Further patches will implement the support. -v2: #ifdef CONFIG_BLK_CGROUP added around bio->bi_ioc dereference in rq_ioc() to fix build breakage. Signed-off-by: Tejun Heo <tj@kernel.org> Cc: Vivek Goyal <vgoyal@redhat.com> Cc: Kent Overstreet <koverstreet@google.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>	2012-03-06 21:27:24 +01:00
Tejun Heo	f6e8d01bee	block: add io_context->active_ref Currently ioc->nr_tasks is used to decide two things - whether an ioc is done issuing IOs and whether it's shared by multiple tasks. This patch separates out the first into ioc->active_ref, which is acquired and released using {get\|put}_io_context_active() respectively. This will be used to associate bio's with a given task. This patch doesn't introduce any visible behavior change. Signed-off-by: Tejun Heo <tj@kernel.org> Cc: Vivek Goyal <vgoyal@redhat.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>	2012-03-06 21:27:24 +01:00
Tejun Heo	c875f4d025	blkcg: drop unnecessary RCU locking Now that blkg additions / removals are always done under both q and blkcg locks, the only places RCU locking is necessary are blkg_lookup[_create]() for lookup w/o blkcg lock. This patch drops unnecessary RCU locking, replacing it with plain blkcg locking as necessary. * blkiocg_pre_destroy() already performs proper locking and doesn't need RCU. Dropped. * blkio_read_blkg_stats() now uses blkcg->lock instead of RCU read lock. This isn't a hot path. * Now unnecessary synchronize_rcu() from queue exit paths removed. This makes q->nr_blkgs unnecessary. Dropped. * RCU annotation on blkg->q removed. -v2: Vivek pointed out that blkg_lookup_create() still needs to be called under rcu_read_lock(). Updated. -v3: After the update, stats_lock locking in blkio_read_blkg_stats() shouldn't be using _irq variant as it otherwise ends up enabling irq while blkcg->lock is locked. Fixed. Signed-off-by: Tejun Heo <tj@kernel.org> Cc: Vivek Goyal <vgoyal@redhat.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>	2012-03-06 21:27:24 +01:00
Tejun Heo	e8989fae38	blkcg: unify blkg's for blkcg policies Currently, blkg is per cgroup-queue-policy combination. This is unnatural and leads to various convolutions in partially used duplicate fields in blkg, config / stat access, and general management of blkgs. This patch makes blkg's per cgroup-queue and lets them serve all policies. blkgs are now created and destroyed by blkcg core proper. This will allow further consolidation of common management logic into blkcg core and API with better defined semantics and layering. As a transitional step to untangle blkg management, elvswitch and policy [de]registration, all blkgs except the root blkg are being shot down during elvswitch and bypass. This patch adds blkg_root_update() to update root blkg in place on policy change. This is hacky and racy but should be good enough as interim step until we get locking simplified and switch over to proper in-place update for all blkgs. -v2: Root blkgs need to be updated on elvswitch too and blkg_alloc() comment wasn't updated according to the function change. Fixed. Both pointed out by Vivek. -v3: v2 updated blkg_destroy_all() to invoke update_root_blkg_pd() for all policies. This freed root pd during elvswitch before the last queue finished exiting and led to oops. Directly invoke update_root_blkg_pd() only on BLKIO_POLICY_PROP from cfq_exit_queue(). This also is closer to what will be done with proper in-place blkg update. Reported by Vivek. Signed-off-by: Tejun Heo <tj@kernel.org> Cc: Vivek Goyal <vgoyal@redhat.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>	2012-03-06 21:27:23 +01:00
Tejun Heo	03aa264ac1	blkcg: let blkcg core manage per-queue blkg list and counter With the previous patch to move blkg list heads and counters to request_queue and blkg, logic to manage them in both policies are almost identical and can be moved to blkcg core. This patch moves blkg link logic into blkg_lookup_create(), implements common blkg unlink code in blkg_destroy(), and updates blkg_destroy_all() so that it's policy specific and can skip root group. The updated blkg_destroy_all() is now used to both clear queue for bypassing and elv switching, and release all blkgs on q exit. This patch introduces a race window where policy [de]registration may race against queue blkg clearing. This can only be a problem on cfq unload and shouldn't be a real problem in practice (and we have many other places where this race already exists). Future patches will remove these unlikely races. Signed-off-by: Tejun Heo <tj@kernel.org> Cc: Vivek Goyal <vgoyal@redhat.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>	2012-03-06 21:27:23 +01:00
Tejun Heo	4eef304998	blkcg: move per-queue blkg list heads and counters to queue and blkg Currently, specific policy implementations are responsible for maintaining list and number of blkgs. This duplicates code unnecessarily, and hinders factoring common code and providing blkcg API with better defined semantics. After this patch, request_queue hosts list heads and counters and blkg has list nodes for both policies. This patch only relocates the necessary fields and the next patch will actually move management code into blkcg core. Note that request_queue->blkg_list[] and ->nr_blkgs[] are hardcoded to have 2 elements. This is to avoid include dependency and will be removed by the next patch. This patch doesn't introduce any behavior change. -v2: Now unnecessary conditional on CONFIG_BLK_CGROUP_MODULE removed as pointed out by Vivek. Signed-off-by: Tejun Heo <tj@kernel.org> Cc: Vivek Goyal <vgoyal@redhat.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>	2012-03-06 21:27:23 +01:00
Tejun Heo	c1768268f9	blkcg: don't use blkg->plid in stat related functions blkg is scheduled to be unified for all policies and thus there won't be one-to-one mapping from blkg to policy. Update stat related functions to take explicit @pol or @plid arguments and not use blkg->plid. This is painful for now but most of specific stat interface functions will be replaced with a handful of generic helpers. Signed-off-by: Tejun Heo <tj@kernel.org> Cc: Vivek Goyal <vgoyal@redhat.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>	2012-03-06 21:27:23 +01:00
Tejun Heo	1adaf3dde3	blkcg: move refcnt to blkcg core Currently, blkcg policy implementations manage blkg refcnt duplicating mostly identical code in both policies. This patch moves refcnt to blkg and lets blkcg core handle refcnt and freeing of blkgs. * cfq blkgs now also get freed via RCU. * cfq blkgs lose RB_EMPTY_ROOT() sanity check on blkg free. If necessary, we can add blkio_exit_group_fn() to resurrect this. Signed-off-by: Tejun Heo <tj@kernel.org> Cc: Vivek Goyal <vgoyal@redhat.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>	2012-03-06 21:27:23 +01:00
Tejun Heo	0381411e4b	blkcg: let blkcg core handle policy private data allocation Currently, blkg's are embedded in the blkcg policy private data structure and thus allocated and freed by policies. This leads to duplicate code in policies, hinders implementing common part in blkcg core with strong semantics, and forces duplicate blkg's for the same cgroup-q association. This patch introduces struct blkg_policy_data which is a separate data structure chained from blkg. Each policy specifies the amount of private data it needs in its blkio_policy_type->pdata_size and blkcg core takes care of allocating it along with blkg, which can be accessed using blkg_to_pdata(). blkg can be determined from pdata using pdata_to_blkg(). blkio_alloc_group_fn() method is accordingly updated to blkio_init_group_fn(). For consistency, tg_of_blkg() and cfqg_of_blkg() are replaced with blkg_to_tg() and blkg_to_cfqg() respectively, and functions to map in the reverse direction are added. Except that policy specific data now lives in a separate data structure from blkg, this patch doesn't introduce any functional difference. This will be used to unify blkg's for different policies. Signed-off-by: Tejun Heo <tj@kernel.org> Cc: Vivek Goyal <vgoyal@redhat.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>	2012-03-06 21:27:23 +01:00
Tejun Heo	7ee9c56205	blkcg: let blkio_group point to blkio_cgroup directly Currently, blkg points to the associated blkcg via its css_id. This unnecessarily complicates dereferencing blkcg. Let blkg hold a reference to the associated blkcg and point directly to it and disable css_id on blkio_subsys. This change requires splitting blkiocg_destroy() into blkiocg_pre_destroy() and blkiocg_destroy() so that all blkg's can be destroyed and all the blkcg references held by them dropped during cgroup removal. Signed-off-by: Tejun Heo <tj@kernel.org> Cc: Vivek Goyal <vgoyal@redhat.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>	2012-03-06 21:27:23 +01:00
Tejun Heo	7a4dd281ec	blkcg: kill the mind-bending blkg->dev blkg->dev is dev_t recording the device number of the block device for the associated request_queue. It is used to identify the associated block device when printing out configuration or stats. This is redundant to begin with. A blkg is an association between a cgroup and a request_queue and it of course is possible to reach request_queue from blkg and synchronization conventions are in place for safe q dereferencing, so this shouldn't be necessary from the beginning. Furthermore, it's initialized by sscanf()ing the device name of backing_dev_info. The mind boggles. Anyways, if blkg is visible under rcu lock, we know that the associated request_queue hasn't gone away yet and its bdi is registered and alive - blkg can't be created for request_queue which hasn't been fully initialized and it can't go away before blkg is removed. Let stat and conf read functions get device name from blkg->q->backing_dev_info.dev and pass it down to printing functions and remove blkg->dev. Signed-off-by: Tejun Heo <tj@kernel.org> Cc: Vivek Goyal <vgoyal@redhat.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>	2012-03-06 21:27:22 +01:00
Tejun Heo	e56da7e287	blkcg: don't allow or retain configuration of missing devices blkcg is very peculiar in that it allows setting and remembering configurations for non-existent devices by maintaining separate data structures for configuration. This behavior is completely out of the usual norms and outright confusing; furthermore, it uses dev_t number to match the configuration to devices, which is unpredictable to begin with and becomes completely unusable if EXT_DEVT is fully used. It is wholly unnecessary - we already have fully functional userland mechanism to program devices being hotplugged which has full access to device identification, connection topology and filesystem information. Add a new struct blkio_group_conf which contains all blkcg configurations to blkio_group and let blkio_group, which can be created iff the associated device exists and is removed when the associated device goes away, carry all configurations. Note that, after this patch, all newly created blkg's will always have the default configuration (unlimited for throttling and blkcg's weight for propio). This patch makes blkio_policy_node meaningless but doesn't remove it. The next patch will. -v2: Updated to retry after short sleep if blkg lookup/creation failed due to the queue being temporarily bypassed as indicated by -EBUSY return. Pointed out by Vivek. Signed-off-by: Tejun Heo <tj@kernel.org> Cc: Vivek Goyal <vgoyal@redhat.com> Cc: Kay Sievers <kay.sievers@vrfy.org> Signed-off-by: Jens Axboe <axboe@kernel.dk>	2012-03-06 21:27:22 +01:00
Tejun Heo	cd1604fab4	blkcg: factor out blkio_group creation Currently both blk-throttle and cfq-iosched implement their own blkio_group creation code in throtl_get_tg() and cfq_get_cfqg(). This patch factors out the common code into blkg_lookup_create(), which returns ERR_PTR value so that transitional failures due to queue bypass can be distinguished from other failures. * New blkio_policy_ops methods blkio_alloc_group_fn() and blkio_link_group_fn() added. Both are transitional and will be removed once the blkg management code is fully moved into blk-cgroup.c. * blkio_alloc_group_fn() allocates policy-specific blkg which is usually a larger data structure with blkg as the first entry and initializes it. Note that initialization of blkg proper, including percpu stats, is responsibility of blk-cgroup proper. Note that default config (weight, bps...) initialization is done from this method; otherwise, we end up violating locking order between blkcg and q locks via blkcg_get_CONF() functions. * blkio_link_group_fn() is called under queue_lock and responsible for linking the blkg to the queue. blkcg side is handled by blk-cgroup proper. * The common blkg creation function is named blkg_lookup_create() and blkiocg_lookup_group() is renamed to blkg_lookup() for consistency. Also, throtl / cfq related functions are similarly [re]named for consistency. This simplifies blkcg policy implementations and enables further cleanup. -v2: Vivek noticed that blkg_lookup_create() incorrectly tested blk_queue_dead() instead of blk_queue_bypass() leading to a user of the function ending up creating a new blkg on bypassing queue. This is a bug introduced while relocating bypass patches before this one. Fixed. -v3: ERR_PTR patch folded into this one. @for_root added to blkg_lookup_create() to allow creating root group on a bypassed queue during elevator switch.
Signed-off-by: Tejun Heo <tj@kernel.org> Cc: Vivek Goyal <vgoyal@redhat.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>	2012-03-06 21:27:22 +01:00
Tejun Heo	f51b802c17	blkcg: use the usual get blkg path for root blkio_group For root blkg, blk_throtl_init() was using throtl_alloc_tg() explicitly and cfq_init_queue() was manually initializing embedded cfqd->root_group, adding unnecessarily different code paths to blkg handling. Make both use the usual blkio_group get functions - throtl_get_tg() and cfq_get_cfqg() - for the root blkio_group too. Note that blk_throtl_init() callsite is pushed downwards in blk_alloc_queue_node() so that @q is sufficiently initialized for throtl_get_tg(). This simplifies root blkg handling noticeably for cfq and will allow further modularization of blkcg API. -v2: Vivek pointed out that using cfq_get_cfqg() won't work if CONFIG_CFQ_GROUP_IOSCHED is disabled. Fix it by factoring out initialization of base part of cfqg into cfq_init_cfqg_base() and alloc/init/free explicitly if !CONFIG_CFQ_GROUP_IOSCHED. Signed-off-by: Tejun Heo <tj@kernel.org> Cc: Vivek Goyal <vgoyal@redhat.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>	2012-03-06 21:27:22 +01:00
Tejun Heo	ca32aefc7f	blkcg: use q and plid instead of opaque void * for blkio_group association blkio_group is an association between a block cgroup and a queue for a given policy. Using opaque void * for association makes things confusing and hinders factoring of common code. Use request_queue * and, if necessary, policy id instead. This will help block cgroup API cleanup. Signed-off-by: Tejun Heo <tj@kernel.org> Cc: Vivek Goyal <vgoyal@redhat.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>	2012-03-06 21:27:22 +01:00
Tejun Heo	0a5a7d0e32	blkcg: update blkg get functions take blkio_cgroup as parameter In both blkg get functions - throtl_get_tg() and cfq_get_cfqg(), instead of obtaining blkcg of %current explicitly, let the caller specify the blkcg to use as parameter and make both functions hold on to the blkcg. This is part of block cgroup interface cleanup and will help making blkcg API more modular. Signed-off-by: Tejun Heo <tj@kernel.org> Cc: Vivek Goyal <vgoyal@redhat.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>	2012-03-06 21:27:22 +01:00
Tejun Heo	2a7f124414	blkcg: move rcu_read_lock() outside of blkio_group get functions rcu_read_lock() in throtl_get_tg() and cfq_get_cfqg() holds onto @blkcg while looking up blkg. For API cleanup, the next patch will make the caller responsible for determining @blkcg to look blkg from and let them specify it as a parameter. Move rcu read locking out to the callers to prepare for the change. -v2: Originally this patch was described as a fix for RCU read locking bug around @blkg, which Vivek pointed out to be incorrect. It was from misunderstanding the role of rcu locking as protecting @blkg not @blkcg. Patch description updated. Signed-off-by: Tejun Heo <tj@kernel.org> Cc: Vivek Goyal <vgoyal@redhat.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>	2012-03-06 21:27:22 +01:00
Tejun Heo	72e06c2551	blkcg: shoot down blkio_groups on elevator switch Elevator switch may involve changes to blkcg policies. Implement shoot down of blkio_groups. Combined with the previous bypass updates, the end goal is updating blkcg core such that it can ensure that blkcg's being affected become quiescent and don't have any per-blkg data hanging around before commencing any policy updates. Until queues are made aware of the policies that apply to them, as an interim step, all per-policy blkg data will be shot down. * blk-throtl doesn't need this change as it can't be disabled for a live queue; however, update it anyway as the scheduled blkg unification requires this behavior change. This means that blk-throtl configuration will be unnecessarily lost over elevator switch. This oddity will be removed after blkcg learns to associate individual policies with request_queues. * blk-throtl doesn't shoot down root_tg. This is to ease transition. Unified blkg will always have persistent root group and not shooting down root_tg for now eases transition to that point by avoiding having to update td->root_tg and is safe as blk-throtl can never be disabled. -v2: Vivek pointed out that group list is not guaranteed to be empty on return from clear function if it raced cgroup removal and lost. Fix it by waiting a bit and retrying. This kludge will soon be removed once locking is updated such that blkg is never in limbo state between blkcg and request_queue locks. blk-throtl no longer shoots down root_tg to avoid breaking td->root_tg. Also, nest queue_lock inside blkio_list_lock, not the other way around, to avoid introducing a possible deadlock via blkcg lock. -v3: blkcg_clear_queue() repositioned and renamed to blkg_destroy_all() to increase consistency with later changes. cfq_clear_queue() updated to check q->elevator before dereferencing it to avoid NULL dereference on not fully initialized queues (used by later change).
Signed-off-by: Tejun Heo <tj@kernel.org> Cc: Vivek Goyal <vgoyal@redhat.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>	2012-03-06 21:27:22 +01:00
Tejun Heo	b2fab5acd2	elevator: make elevator_init_fn() return 0/-errno elevator_ops->elevator_init_fn() has a weird return value. It returns a void * which the caller should assign to q->elevator->elevator_data and %NULL return denotes init failure. Update such that it returns integer 0/-errno and sets elevator_data directly as necessary. This makes the interface more conventional and eases further cleanup. Signed-off-by: Tejun Heo <tj@kernel.org> Cc: Vivek Goyal <vgoyal@redhat.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>	2012-03-06 21:27:21 +01:00
Tejun Heo	b95ada558c	cfq: don't register propio policy if !CONFIG_CFQ_GROUP_IOSCHED cfq has been registering zeroed blkio_policy_cfq if CFQ_GROUP_IOSCHED is disabled. This fortunately doesn't collide with blk-throtl as BLKIO_POLICY_PROP is zero but is unnecessary and risky. Just don't register it if not enabled. Signed-off-by: Tejun Heo <tj@kernel.org> Acked-by: Vivek Goyal <vgoyal@redhat.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>	2012-03-06 21:27:21 +01:00
Tejun Heo	d705ae6b13	block: replace icq->changed with icq->flags icq->changed was used for ICQ_*_CHANGED bits. Rename it to flags and access it under ioc->lock instead of using atomic bitops. ioc_get_changed() is added so that the changed part can be fetched and cleared as before. icq->flags will be used to carry other flags. Signed-off-by: Tejun Heo <tj@kernel.org> Tested-by: Shaohua Li <shaohua.li@intel.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>	2012-02-15 09:45:49 +01:00
Tejun Heo	07c2bd3735	block: don't call elevator callbacks for plug merges Plug merge calls two elevator callbacks outside queue lock - elevator_allow_merge_fn() and elevator_bio_merged_fn(). Although attempt_plug_merge() suggests that elevator is guaranteed to be there through the existing request on the plug list, nothing prevents plug merge from calling into dying or initializing elevator. For regular merges, bypass ensures elvpriv count to reach zero, which in turn prevents merges as all !ELVPRIV requests get REQ_SOFTBARRIER from forced back insertion. Plug merge doesn't check ELVPRIV, and, as the requests haven't gone through elevator insertion yet, it doesn't have SOFTBARRIER set allowing merges on a bypassed queue. This, for example, leads to the following crash during elevator switch. BUG: unable to handle kernel NULL pointer dereference at 0000000000000008 IP: [<ffffffff813b34e9>] cfq_allow_merge+0x49/0xa0 PGD 112cbc067 PUD 115d5c067 PMD 0 Oops: 0000 [#1] PREEMPT SMP CPU 1 Modules linked in: deadline_iosched Pid: 819, comm: dd Not tainted 3.3.0-rc2-work+ #76 Bochs Bochs RIP: 0010:[<ffffffff813b34e9>] [<ffffffff813b34e9>] cfq_allow_merge+0x49/0xa0 RSP: 0018:ffff8801143a38f8 EFLAGS: 00010297 RAX: 0000000000000000 RBX: ffff88011817ce28 RCX: ffff880116eb6cc0 RDX: 0000000000000000 RSI: ffff880118056e20 RDI: ffff8801199512f8 RBP: ffff8801143a3908 R08: 0000000000000000 R09: 0000000000000000 R10: 0000000000000001 R11: 0000000000000000 R12: ffff880118195708 R13: ffff880118052aa0 R14: ffff8801143a3d50 R15: ffff880118195708 FS: 00007f19f82cb700(0000) GS:ffff88011fc80000(0000) knlGS:0000000000000000 CS: 0010 DS: 0000 ES: 0000 CR0: 000000008005003b CR2: 0000000000000008 CR3: 0000000112c6a000 CR4: 00000000000006e0 DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000 DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400 Process dd (pid: 819, threadinfo ffff8801143a2000, task ffff880116eb6cc0) Stack: ffff88011817ce28 ffff880118195708 
ffff8801143a3928 ffffffff81391bba ffff88011817ce28 ffff880118195708 ffff8801143a3948 ffffffff81391bf1 ffff88011817ce28 0000000000000000 ffff8801143a39a8 ffffffff81398e3e Call Trace: [<ffffffff81391bba>] elv_rq_merge_ok+0x4a/0x60 [<ffffffff81391bf1>] elv_try_merge+0x21/0x40 [<ffffffff81398e3e>] blk_queue_bio+0x8e/0x390 [<ffffffff81396a5a>] generic_make_request+0xca/0x100 [<ffffffff81396b04>] submit_bio+0x74/0x100 [<ffffffff811d45c2>] __blockdev_direct_IO+0x1ce2/0x3450 [<ffffffff811d0dc7>] blkdev_direct_IO+0x57/0x60 [<ffffffff811460b5>] generic_file_aio_read+0x6d5/0x760 [<ffffffff811986b2>] do_sync_read+0xe2/0x120 [<ffffffff81199345>] vfs_read+0xc5/0x180 [<ffffffff81199501>] sys_read+0x51/0x90 [<ffffffff81aeac12>] system_call_fastpath+0x16/0x1b There are multiple ways to fix this including making plug merge check ELVPRIV; however, * Calling into elevator outside queue lock is confusing and error-prone. * Requests on plug list aren't known to the elevator. They aren't on the elevator yet, so there's no elevator specific state to update. * Given the nature of plug merges - collecting bio's for the same purpose from the same issuer - elevator specific restrictions aren't applicable. So, simply don't call into elevator methods from plug merge by moving elv_bio_merged() from bio_attempt_*_merge() to blk_queue_bio(), and using blk_try_merge() in attempt_plug_merge(). This is based on Jens' patch to skip elevator_allow_merge_fn() from plug merge. Note that this makes per-cgroup merged stats skip plug merging. Signed-off-by: Tejun Heo <tj@kernel.org> LKML-Reference: <4F16F3CA.90904@kernel.dk> Original-patch-by: Jens Axboe <axboe@kernel.dk> Signed-off-by: Jens Axboe <axboe@kernel.dk>	2012-02-08 09:19:42 +01:00
Tejun Heo	11a3122f6c	block: strip out locking optimization in put_io_context() put_io_context() performed a complex trylock dancing to avoid deferring ioc release to workqueue. It was also broken on UP because trylock was always assumed to succeed which resulted in unbalanced preemption count. While there are ways to fix the UP breakage, even the most pathological microbench (forced ioc allocation and tight fork/exit loop) fails to show any appreciable performance benefit of the optimization. Strip it out. If there turns out to be workloads which are affected by this change, simpler optimization from the discussion thread can be applied later. Signed-off-by: Tejun Heo <tj@kernel.org> LKML-Reference: <1328514611.21268.66.camel@sli10-conroe> Signed-off-by: Jens Axboe <axboe@kernel.dk>	2012-02-07 07:51:30 +01:00
Shaohua Li	df0793abb9	block,cfq: change code order cfq_slice_expired will change saved_workload_slice. It should be called first so saved_workload_slice is correctly set to 0 after workload type is changed. This fixes the code order changed by `54b466e44b`. Tested-by: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp> Signed-off-by: Shaohua Li <shaohua.li@intel.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>	2012-01-19 09:20:09 +01:00
Jens Axboe	54b466e44b	cfq-iosched: fix use-after-free of cfqq With the changes in life time management between the cfq IO contexts and the cfq queues, we now risk having cfqd->active_queue being freed when cfq_slice_expired() is being called. cfq_preempt_queue() caches this queue and uses it after calling said function, causing a use-after-free condition. This triggers the following oops, when cfqq_type() attempts to dereference it: BUG: unable to handle kernel paging request at ffff8800746c4f0c IP: [<ffffffff81266d59>] cfqq_type+0xb/0x20 PGD 18d4063 PUD 1fe15067 PMD 1ffb9067 PTE 80000000746c4160 Oops: 0000 [#1] SMP DEBUG_PAGEALLOC CPU 3 Modules linked in: Pid: 1, comm: init Not tainted 3.2.0-josef+ #367 Bochs Bochs RIP: 0010:[<ffffffff81266d59>] [<ffffffff81266d59>] cfqq_type+0xb/0x20 RSP: 0018:ffff880079c11778 EFLAGS: 00010046 RAX: 0000000000000000 RBX: ffff880076f3df08 RCX: 0000000000000000 RDX: 0000000000000006 RSI: ffff880074271888 RDI: ffff8800746c4f08 RBP: ffff880079c11778 R08: 0000000000000078 R09: 0000000000000001 R10: 09f911029d74e35b R11: 09f911029d74e35b R12: ffff880076f337f0 R13: ffff8800746c4f08 R14: ffff8800746c4f08 R15: 0000000000000002 FS: 00007f62fd44f700(0000) GS:ffff88007cd80000(0000) knlGS:0000000000000000 CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 CR2: ffff8800746c4f0c CR3: 0000000076c21000 CR4: 00000000000006e0 DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000 DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400 Process init (pid: 1, threadinfo ffff880079c10000, task ffff880079c0a040) Stack: ffff880079c117c8 ffffffff812683d8 ffff880079c117a8 ffffffff8125de43 ffff8800744fcf48 ffff880074b43e98 ffff8800770c8828 ffff880074b43e98 0000000000000003 0000000000000000 ffff880079c117f8 ffffffff81254149 Call Trace: [<ffffffff812683d8>] cfq_insert_request+0x3f5/0x47c [<ffffffff8125de43>] ? 
blk_recount_segments+0x20/0x31 [<ffffffff81254149>] __elv_add_request+0x1ca/0x200 [<ffffffff8125aa99>] blk_queue_bio+0x2ef/0x312 [<ffffffff81258f7b>] generic_make_request+0x9f/0xe0 [<ffffffff8125907b>] submit_bio+0xbf/0xca [<ffffffff81136ec7>] submit_bh+0xdf/0xfe [<ffffffff81176d04>] ext3_bread+0x50/0x99 [<ffffffff811785b3>] dx_probe+0x38/0x291 [<ffffffff81178864>] ext3_dx_find_entry+0x58/0x219 [<ffffffff81178ad5>] ext3_find_entry+0xb0/0x406 [<ffffffff8110c4d5>] ? cache_alloc_debugcheck_after.isra.46+0x14d/0x1a0 [<ffffffff8110cfbd>] ? kmem_cache_alloc+0xef/0x191 [<ffffffff8117a330>] ext3_lookup+0x39/0xe1 [<ffffffff81119461>] d_alloc_and_lookup+0x45/0x6c [<ffffffff8111ac41>] do_lookup+0x1e4/0x2f5 [<ffffffff8111aef6>] link_path_walk+0x1a4/0x6ef [<ffffffff8111b557>] path_lookupat+0x59/0x5ea [<ffffffff8127406c>] ? __strncpy_from_user+0x30/0x5a [<ffffffff8111bce0>] do_path_lookup+0x23/0x59 [<ffffffff8111cfd6>] user_path_at_empty+0x53/0x99 [<ffffffff8107b37b>] ? remove_wait_queue+0x51/0x56 [<ffffffff8111d02d>] user_path_at+0x11/0x13 [<ffffffff811141f5>] vfs_fstatat+0x3a/0x64 [<ffffffff8111425a>] vfs_stat+0x1b/0x1d [<ffffffff81114359>] sys_newstat+0x1a/0x33 [<ffffffff81060e12>] ? task_stopped_code+0x42/0x42 [<ffffffff815d6712>] system_call_fastpath+0x16/0x1b Code: 89 e6 48 89 c7 e8 fa ca fe ff 85 c0 74 06 4c 89 2b 41 b6 01 5b 44 89 f0 41 5c 41 5d 41 5e 5d c3 55 48 89 e5 66 66 66 66 90 31 c0 <8b> 57 04 f6 c6 01 74 0b 83 e2 20 83 fa 01 19 c0 83 c0 02 5d c3 RIP [<ffffffff81266d59>] cfqq_type+0xb/0x20 RSP <ffff880079c11778> CR2: ffff8800746c4f0c Get rid of the caching of cfqd->active_queue, and reorder the check so that it happens before we expire the active queue. Thanks to Tejun for pin pointing the error location. Reported-by: Chris Mason <chris.mason@oracle.com> Tested-by: Chris Mason <chris.mason@oracle.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>	2012-01-17 21:26:11 +01:00
Shaohua Li	4a0b75c7d0	block, cfq: fix empty queue crash caused by request merge All requests of a queue could be merged into requests of another queue. Such a queue will have no requests in it, but it is still in the service tree. This will cause a kernel oops. I encountered a BUG_ON() in cfq_dispatch_request() with the next patch, but the issue should exist without the patch. Signed-off-by: Shaohua Li <shaohua.li@intel.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>	2011-12-16 14:00:22 +01:00
Tejun Heo	f1f8cc9465	block, cfq: move icq creation and rq->elv.icq association to block core Now block layer knows everything necessary to create and associate icq's with requests. Move ioc_create_icq() to blk-ioc.c and update get_request() such that, if elevator_type->icq_size is set, requests are automatically associated with their matching icq's before elv_set_request(). io_context reference is also managed by block core on request alloc/free. * Only ioprio/cgroup changed handling remains from cfq_get_cic(). Collapsed into cfq_set_request(). * This removes queue kicking on icq allocation failure (for now). As icq allocation failure is rare and the only effect of queue kicking achieved was possibly accelerating queue processing, this change shouldn't be noticeable. There is a larger underlying problem. Unlike request allocation, icq allocation is not guaranteed to succeed eventually after retries. The number of icq is unbound and thus mempool can't be the solution either. This effectively adds allocation dependency on memory free path and thus possibility of deadlock. This usually wouldn't happen because icq allocation is not a hot path and, even when the condition triggers, it's highly unlikely that none of the writeback workers already has icq. However, this is still possible especially if elevator is being switched under high memory pressure, so we better get it fixed. Probably the only solution is just bypassing elevator and appending to dispatch queue on any elevator allocation failure. * Comment added to explain how icq's are managed and synchronized. This completes cleanup of io_context interface. Signed-off-by: Tejun Heo <tj@kernel.org> Signed-off-by: Jens Axboe <axboe@kernel.dk>	2011-12-14 00:33:42 +01:00
Tejun Heo	9b84cacd01	block, cfq: restructure io_cq creation path for io_context interface cleanup Add elevator_ops->elevator_init_icq_fn() and restructure cfq_create_cic() and rename it to ioc_create_icq(). The new function expects its caller to pass in io_context, uses elevator_type->icq_cache, handles generic init, calls the new elevator operation for elevator specific initialization, and returns pointer to created or looked up icq. This leaves cfq_icq_pool variable without any user. Removed. This prepares for io_context interface cleanup and doesn't introduce any functional difference. Signed-off-by: Tejun Heo <tj@kernel.org> Signed-off-by: Jens Axboe <axboe@kernel.dk>	2011-12-14 00:33:42 +01:00
Tejun Heo	7e5a879449	block, cfq: move io_cq exit/release to blk-ioc.c With kmem_cache managed by blk-ioc, io_cq exit/release can be moved to blk-ioc too. The odd ->io_cq->exit/release() callbacks are replaced with elevator_ops->elevator_exit_icq_fn() with unlinking from both ioc and q, and freeing automatically handled by blk-ioc. The elevator operation only need to perform exit operation specific to the elevator - in cfq's case, exiting the cfqq's. Also, clearing of io_cq's on q detach is moved to block core and automatically performed on elevator switch and q release. Because the q io_cq points to might be freed before RCU callback for the io_cq runs, blk-ioc code should remember to which cache the io_cq needs to be freed when the io_cq is released. New field io_cq->__rcu_icq_cache is added for this purpose. As both the new field and rcu_head are used only after io_cq is released and the q/ioc_node fields aren't, they are put into unions. Signed-off-by: Tejun Heo <tj@kernel.org> Signed-off-by: Jens Axboe <axboe@kernel.dk>	2011-12-14 00:33:42 +01:00
Tejun Heo	3d3c2379fe	block, cfq: move icq cache management to block core Let elevators set ->icq_size and ->icq_align in elevator_type and elv_register() and elv_unregister() respectively create and destroy kmem_cache for icq. * elv_register() now can return failure. All callers updated. * icq caches are automatically named "ELVNAME_io_cq". * cfq_slab_setup/kill() are collapsed into cfq_init/exit(). * While at it, minor indentation change for iosched_cfq.elevator_name for consistency. This will help moving icq management to block core. This doesn't introduce any functional change. Signed-off-by: Tejun Heo <tj@kernel.org> Signed-off-by: Jens Axboe <axboe@kernel.dk>	2011-12-14 00:33:42 +01:00
Tejun Heo	47fdd4ca96	block, cfq: move io_cq lookup to blk-ioc.c Now that all io_cq related data structures are in block core layer, io_cq lookup can be moved from cfq-iosched.c to blk-ioc.c. Lookup logic from cfq_cic_lookup() is moved to ioc_lookup_icq() with parameter and return type changes (cfqd -> request_queue, cfq_io_cq -> io_cq) and cfq_cic_lookup() becomes a thin wrapper around ioc_lookup_icq(). Signed-off-by: Tejun Heo <tj@kernel.org> Signed-off-by: Jens Axboe <axboe@kernel.dk>	2011-12-14 00:33:42 +01:00
Tejun Heo	a612fddf0d	block, cfq: move cfqd->icq_list to request_queue and add request->elv.icq Most of icq management is about to be moved out of cfq into blk-ioc. This patch prepares for it. * Move cfqd->icq_list to request_queue->icq_list * Make request explicitly point to icq instead of through elevator private data. ->elevator_private[3] is replaced with sub struct elv which contains icq pointer and priv[2]. cfq is updated accordingly. * Meaningless clearing of ->elevator_private[0] removed from elv_set_request(). At that point in code, the field was guaranteed to be %NULL anyway. This patch doesn't introduce any functional change. Signed-off-by: Tejun Heo <tj@kernel.org> Signed-off-by: Jens Axboe <axboe@kernel.dk>	2011-12-14 00:33:41 +01:00
Tejun Heo	c586980732	block, cfq: reorganize cfq_io_context into generic and cfq specific parts Currently io_context and cfq logics are mixed without clear boundary. Most of io_context is independent from cfq but cfq_io_context handling logic is dispersed between generic ioc code and cfq. cfq_io_context represents association between an io_context and a request_queue, which is a concept useful outside of cfq, but it also contains fields which are useful only to cfq. This patch takes out generic part and put it into io_cq (io context-queue) and the rest into cfq_io_cq (cic moniker remains the same) which contains io_cq. The following changes are made together. * cfq_ttime and cfq_io_cq now live in cfq-iosched.c. * All related fields, functions and constants are renamed accordingly. * ioc->ioc_data is now "struct io_cq *" instead of "void *" and renamed to icq_hint. This prepares for io_context API cleanup. Documentation is currently sparse. It will be added later. Changes in this patch are mechanical and don't cause functional change. Signed-off-by: Tejun Heo <tj@kernel.org> Signed-off-by: Jens Axboe <axboe@kernel.dk>	2011-12-14 00:33:41 +01:00
Tejun Heo	f2dbd76a0a	block, cfq: replace current_io_context() with create_io_context() When called under queue_lock, current_io_context() triggers lockdep warning if it hits allocation path. This is because io_context installation is protected by task_lock which is not IRQ safe, so it triggers irq-unsafe-lock -> irq -> irq-safe-lock -> irq-unsafe-lock deadlock warning. Given the restriction, accessor + creator rolled into one doesn't work too well. Drop current_io_context() and let the users access task->io_context directly inside queue_lock combined with explicit creation using create_io_context(). Future ioc updates will further consolidate ioc access and the create interface will be unexported. While at it, relocate ioc internal interface declarations in blk.h and add section comments before and after. This patch does not introduce functional change. Signed-off-by: Tejun Heo <tj@kernel.org> Signed-off-by: Jens Axboe <axboe@kernel.dk>	2011-12-14 00:33:40 +01:00
Tejun Heo	1238033c79	block, cfq: kill cic->key Now that lazy paths are removed, cfqd_dead_key() is meaningless and cic->q can be used wherever cic->key is used. Kill cic->key. Signed-off-by: Tejun Heo <tj@kernel.org> Signed-off-by: Jens Axboe <axboe@kernel.dk>	2011-12-14 00:33:40 +01:00
Tejun Heo	b50b636bce	block, cfq: kill ioc_gone Now that cic's are immediately unlinked under both locks, there's no need to count and drain cic's before module unload. RCU callback completion is waited with rcu_barrier(). While at it, remove residual RCU operations on cic_list. Signed-off-by: Tejun Heo <tj@kernel.org> Signed-off-by: Jens Axboe <axboe@kernel.dk>	2011-12-14 00:33:39 +01:00
Tejun Heo	b9a1920837	block, cfq: remove delayed unlink Now that all cic's are immediately unlinked from both ioc and queue, lazy dropping from lookup path and trimming on elevator unregister are unnecessary. Kill them and remove now unused elevator_ops->trim(). This also leaves call_for_each_cic() without any user. Removed. Signed-off-by: Tejun Heo <tj@kernel.org> Signed-off-by: Jens Axboe <axboe@kernel.dk>	2011-12-14 00:33:39 +01:00
Tejun Heo	b2efa05265	block, cfq: unlink cfq_io_context's immediately cic is association between io_context and request_queue. A cic is linked from both ioc and q and should be destroyed when either one goes away. As ioc and q both have their own locks, locking becomes a bit complex - both orders work for removal from one but not from the other. Currently, cfq tries to circumvent this locking order issue with RCU. ioc->lock nests inside queue_lock but the radix tree and cic's are also protected by RCU allowing either side to walk their lists without grabbing lock. This rather unconventional use of RCU quickly devolves into extremely fragile convolution. e.g. The following is from cfqd going away too soon after ioc and q exits raced. general protection fault: 0000 [#1] PREEMPT SMP CPU 2 Modules linked in: [ 88.503444] Pid: 599, comm: hexdump Not tainted 3.1.0-rc10-work+ #158 Bochs Bochs RIP: 0010:[<ffffffff81397628>] [<ffffffff81397628>] cfq_exit_single_io_context+0x58/0xf0 ... Call Trace: [<ffffffff81395a4a>] call_for_each_cic+0x5a/0x90 [<ffffffff81395ab5>] cfq_exit_io_context+0x15/0x20 [<ffffffff81389130>] exit_io_context+0x100/0x140 [<ffffffff81098a29>] do_exit+0x579/0x850 [<ffffffff81098d5b>] do_group_exit+0x5b/0xd0 [<ffffffff81098de7>] sys_exit_group+0x17/0x20 [<ffffffff81b02f2b>] system_call_fastpath+0x16/0x1b The only real hot path here is cic lookup during request initialization and avoiding extra locking requires very confined use of RCU. This patch makes cic removal from both ioc and request_queue perform double-locking and unlink immediately. * From q side, the change is almost trivial as ioc->lock nests inside queue_lock. It just needs to grab each ioc->lock as it walks cic_list and unlink it. * From ioc side, it's a bit more difficult because of inversed lock order. ioc needs its lock to walk its cic_list but can't grab the matching queue_lock and needs to perform unlock-relock dancing. 
Unlinking is now wholly done from put_io_context() and fast path is optimized by using the queue_lock the caller already holds, which is by far the most common case. If the ioc accessed multiple devices, it tries with trylock. In unlikely cases of fast path failure, it falls back to full double-locking dance from workqueue. Double-locking isn't the prettiest thing in the world but it's far simpler and more understandable than RCU trick without adding any meaningful overhead. This still leaves a lot of now unnecessary RCU logics. Future patches will trim them. -v2: Vivek pointed out that cic->q was being dereferenced after cic->release() was called. Updated to use local variable @this_q instead. Signed-off-by: Tejun Heo <tj@kernel.org> Cc: Vivek Goyal <vgoyal@redhat.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>	2011-12-14 00:33:39 +01:00
Tejun Heo	f1a4f4d35f	block, cfq: fix cic lookup locking * cfq_cic_lookup() may be called without queue_lock and multiple tasks can execute it simultaneously for the same shared ioc. Nothing prevents them racing each other and trying to drop the same dead cic entry multiple times. * smp_wmb() in cfq_exit_cic() doesn't really do anything and nothing prevents cfq_cic_lookup() seeing stale cic->key. This usually doesn't blow up because by the time cic is exited, all requests have been drained and new requests are terminated before going through elevator. However, it can still be triggered by plug merge path which doesn't grab queue_lock and thus can't check DEAD state reliably. This patch updates lookup locking such that, * Lookup is always performed under queue_lock. This doesn't add any more locking. The only issue is cfq_allow_merge() which can be called from plug merge path without holding any lock. For now, this is worked around by using cic of the request to merge into, which is guaranteed to have the same ioc. For longer term, I think it would be best to separate out plug merge method from regular one. * Spurious ioc->lock locking around cic lookup hint assignment dropped. Signed-off-by: Tejun Heo <tj@kernel.org> Signed-off-by: Jens Axboe <axboe@kernel.dk>	2011-12-14 00:33:39 +01:00
Tejun Heo	216284c352	block, cfq: fix race condition in cic creation path and tighten locking cfq_get_io_context() would fail if multiple tasks race to insert cic's for the same association. This patch restructures cfq_get_io_context() such that slow path insertion race is handled properly. Note that the restructuring also makes cfq_get_io_context() called under queue_lock and performs both ioc and cfqd insertions while holding both ioc and queue locks. This is part of on-going locking tightening and will be used to simplify synchronization rules. Signed-off-by: Tejun Heo <tj@kernel.org> Signed-off-by: Jens Axboe <axboe@kernel.dk>	2011-12-14 00:33:38 +01:00
Tejun Heo	dc86900e0a	block, cfq: move ioc ioprio/cgroup changed handling to cic ioprio/cgroup change was handled by marking the changed state in ioc and, on the following access to the ioc, performing RCU-protected iteration through all cic's grabbing the matching queue_lock. This patch moves the changed state to each cic. When ioprio or cgroup changes, the respective bit is set on all cic's of the ioc and when each of those cic (not ioc) is accessed, change is applied for that specific ioc-queue pair. This also fixes the following two race conditions between setting and clearing of changed states. * Missing barrier between assign/load of ioprio and ioprio_changed allowed applying old ioprio. * Change requests could happen between application of change and clearing of changed variables. Signed-off-by: Tejun Heo <tj@kernel.org> Signed-off-by: Jens Axboe <axboe@kernel.dk>	2011-12-14 00:33:38 +01:00
Tejun Heo	283287a52e	block, cfq: misc updates to cfq_io_context Make the following changes to prepare for ioc/cic management cleanup. * Add cic->q so that ioc can determine the associated queue without querying cfq. This will eventually replace ->key. * Factor out cfq_release_cic() from cic_free_func(). This function assumes that the caller handled locking. * Rename __cfq_exit_single_io_context() to cfq_exit_cic() and make it take only @cic. * Restructure cfq_cic_link() for future updates. This patch doesn't introduce any functional changes. Signed-off-by: Tejun Heo <tj@kernel.org> Signed-off-by: Jens Axboe <axboe@kernel.dk>	2011-12-14 00:33:38 +01:00
Tejun Heo	6e736be7f2	block: make ioc get/put interface more conventional and fix race on allocation Ignoring copy_io() during fork, io_context can be allocated from two places - current_io_context() and set_task_ioprio(). The former is always called from local task while the latter can be called from different task. The synchronization between them is peculiar and dubious. * current_io_context() doesn't grab task_lock() and assumes that if it saw %NULL ->io_context, it would stay that way until allocation and assignment is complete. It has smp_wmb() between alloc/init and assignment. * set_task_ioprio() grabs task_lock() for assignment and does smp_read_barrier_depends() between "ioc = task->io_context" and "if (ioc)". Unfortunately, this doesn't achieve anything - the latter is not a dependent load of the former. ie, if ioc itself were being dereferenced "ioc->xxx", it would mean something (not sure what tho) but as the code currently stands, the dependent read barrier is noop. As only one of the two test-assignment sequences is task_lock() protected, the task_lock() can't do much about race between the two. Nothing prevents current_io_context() and set_task_ioprio() allocating its own ioc for the same task and overwriting the other's. Also, set_task_ioprio() can race with exiting task and create a new ioc after exit_io_context() is finished. ioc get/put doesn't have any reason to be complex. The only hot path is accessing the existing ioc of %current, which is simple to achieve given that ->io_context is never destroyed as long as the task is alive. All other paths can happily go through task_lock() like all other task sub structures without impacting anything. This patch updates ioc get/put so that it becomes more conventional. * alloc_io_context() is replaced with get_task_io_context(). This is the only interface which can acquire access to ioc of another task.
On return, the caller has an explicit reference to the object which should be put using put_io_context() afterwards. * The functionality of current_io_context() remains the same but when creating a new ioc, it shares the code path with get_task_io_context() and always goes through task_lock(). * get_io_context() now means incrementing ref on an ioc which the caller already has access to (be that an explicit refcnt or implicit %current one). * PF_EXITING inhibits creation of new io_context and once exit_io_context() is finished, it's guaranteed that both ioc acquisition functions return %NULL. * All users are updated. Most are trivial but smp_read_barrier_depends() removal from cfq_get_io_context() needs a bit of explanation. I suppose the original intention was to ensure ioc->ioprio is visible when set_task_ioprio() allocates new io_context and installs it; however, this wouldn't have worked because set_task_ioprio() doesn't have wmb between init and install. There are other problems with this which will be fixed in another patch. * While at it, use NUMA_NO_NODE instead of -1 for wildcard node specification. -v2: Vivek spotted contamination from debug patch. Removed. Signed-off-by: Tejun Heo <tj@kernel.org> Cc: Vivek Goyal <vgoyal@redhat.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>	2011-12-14 00:33:38 +01:00
Tejun Heo	a73f730d01	block, cfq: move cfqd->cic_index to q->id cfq allocates per-queue id using ida and uses it to index cic radix tree from io_context. Move it to q->id and allocate on queue init and free on queue release. This simplifies cfq a bit and will allow for further improvements of io context life-cycle management. This patch doesn't introduce any functional difference. Signed-off-by: Tejun Heo <tj@kernel.org> Signed-off-by: Jens Axboe <axboe@kernel.dk>	2011-12-14 00:33:37 +01:00
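
The get/put convention described in commit 6e736be7f2 above can be illustrated with a small userspace sketch. This is not the kernel code: struct io_ctx, struct task, ioc_get(), ioc_put(), task_ioc_get() and task_ioc_exit() are hypothetical stand-ins, a pthread mutex plays the role of task_lock(), and a plain flag plays the role of PF_EXITING. It only shows the shape of the rules: creation and lookup go through the task lock, the caller gets back its own reference, and nothing is created once the task is exiting.

```c
#include <assert.h>
#include <pthread.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stdlib.h>

/* Hypothetical userspace stand-ins; these are NOT the kernel's definitions. */
struct io_ctx {
	atomic_long refcount;
};

struct task {
	pthread_mutex_t lock;	/* plays the role of task_lock() */
	bool exiting;		/* plays the role of PF_EXITING */
	struct io_ctx *ioc;
};

/* Take an extra reference on a context the caller can already reach. */
static void ioc_get(struct io_ctx *ioc)
{
	assert(ioc != NULL);
	atomic_fetch_add(&ioc->refcount, 1);
}

/* Drop one reference; the last put frees the context. */
static void ioc_put(struct io_ctx *ioc)
{
	if (ioc && atomic_fetch_sub(&ioc->refcount, 1) == 1)
		free(ioc);
}

/*
 * Return the task's context with a reference held for the caller,
 * creating it under the lock if it does not exist yet. Returns NULL
 * if the task is exiting or allocation fails.
 */
static struct io_ctx *task_ioc_get(struct task *t)
{
	struct io_ctx *ioc;

	pthread_mutex_lock(&t->lock);
	ioc = t->ioc;
	if (!ioc && !t->exiting) {
		ioc = malloc(sizeof(*ioc));
		if (ioc) {
			atomic_init(&ioc->refcount, 1);	/* the task's own reference */
			t->ioc = ioc;
		}
	}
	if (ioc)
		ioc_get(ioc);	/* the caller's explicit reference */
	pthread_mutex_unlock(&t->lock);
	return ioc;
}

/* Mark the task as exiting, detach its context and drop the task's reference. */
static void task_ioc_exit(struct task *t)
{
	struct io_ctx *ioc;

	pthread_mutex_lock(&t->lock);
	t->exiting = true;
	ioc = t->ioc;
	t->ioc = NULL;
	pthread_mutex_unlock(&t->lock);
	ioc_put(ioc);
}
```

Because test, allocation and assignment all happen under one lock, two racing callers cannot each install their own context for the same task, which is the race the commit message describes. The sketch allocates while holding the mutex for brevity; that is a simplification chosen here, not a claim about how the kernel orders allocation and locking.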

527 Commits