linux

Commit Graph

Author	SHA1	Message	Date
Mike Galbraith	46151122e0	sched: fix weight calculations The conversion between virtual and real time is as follows: dvt = rw/w * dt <=> dt = w/rw * dvt Since we want the fair sleeper granularity to be in real time, we actually need to do: dvt = - rw/w * l This bug could be related to the regression reported by Yanmin Zhang: \| Comparing with kernel 2.6.25, sysbench+mysql(oltp, readonly) has lots \| of regressions with 2.6.26-rc1: \| \| 1) 8-core stoakley: 28%; \| 2) 16-core tigerton: 20%; \| 3) Itanium Montvale: 50%. Reported-by: "Zhang, Yanmin" <yanmin_zhang@linux.intel.com> Signed-off-by: Mike Galbraith <efault@gmx.de> Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl> Signed-off-by: Ingo Molnar <mingo@elte.hu>	2008-05-08 17:00:42 +02:00
Peter Zijlstra	3e51f33fcc	sched: add optional support for CONFIG_HAVE_UNSTABLE_SCHED_CLOCK this replaces the rq->clock stuff (and possibly cpu_clock()). - architectures that have an 'imperfect' hardware clock can set CONFIG_HAVE_UNSTABLE_SCHED_CLOCK - the 'jiffie' window might be superfulous when we update tick_gtod before the __update_sched_clock() call in sched_clock_tick() - cpu_clock() might be implemented as: sched_clock_cpu(smp_processor_id()) if the accuracy proves good enough - how far can TSC drift in a single jiffie when considering the filtering and idle hooks? [ mingo@elte.hu: various fixes and cleanups ] Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl> Signed-off-by: Ingo Molnar <mingo@elte.hu>	2008-05-05 23:56:18 +02:00
Mike Galbraith	d7dcdc11cf	sched: fix debugging Revert debugging commit `7ba2e74ab5`. print_cfs_rq_tasks() can induce live-lock if a task is dequeued during list traversal. Signed-off-by: Mike Galbraith <efault@gmx.de> Signed-off-by: Ingo Molnar <mingo@elte.hu>	2008-05-05 23:56:18 +02:00
Gregory Haskins	104f64549c	sched: fix SCHED_FAIR wake-idle logic error We currently use an optimization to skip the overhead of wake-idle processing if more than one task is assigned to a run-queue. The assumption is that the system must already be load-balanced or we wouldnt be overloaded to begin with. The problem is that we are looking at rq->nr_running, which may include RT tasks in addition to CFS tasks. Since the presence of RT tasks really has no bearing on the balance status of CFS tasks, this throws the calculation off. This patch changes the logic to only consider the number of CFS tasks when making the decision to optimze the wake-idle. Signed-off-by: Gregory Haskins <ghaskins@novell.com> CC: Peter Zijlstra <peterz@infradead.org> Signed-off-by: Ingo Molnar <mingo@elte.hu>	2008-05-05 23:56:18 +02:00
Harvey Harrison	983ed7a66b	sched: add statics, don't return void expressions Noticed by sparse: kernel/sched.c:760:20: warning: symbol 'sched_feat_names' was not declared. Should it be static? kernel/sched.c:767:5: warning: symbol 'sched_feat_open' was not declared. Should it be static? kernel/sched_fair.c:845:3: warning: returning void-valued expression kernel/sched.c:4386:3: warning: returning void-valued expression Signed-off-by: Harvey Harrison <harvey.harrison@gmail.com> Signed-off-by: Ingo Molnar <mingo@elte.hu>	2008-05-05 23:56:17 +02:00
Peter Zijlstra	a992241de6	sched: fix normalized sleeper Normalized sleeper uses calc_delta*() which requires that the rq load is already updated, so move account_entity_enqueue() before place_entity() Tested-by: Frans Pop <elendil@planet.nl> Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl> Signed-off-by: Ingo Molnar <mingo@elte.hu>	2008-05-05 23:56:17 +02:00
Peter Zijlstra	7ba2e74ab5	sched: debug: show a weight tree Print a tree of weights. Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl> Signed-off-by: Ingo Molnar <mingo@elte.hu>	2008-04-19 19:45:00 +02:00
Peter Zijlstra	8f1bc385cf	sched: fair: weight calculations In order to level the hierarchy, we need to calculate load based on the root view. That is, each task's load is in the same unit. A / \ B 1 / \ 2 3 To compute 1's load we do: weight(1) -------------- rq_weight(A) To compute 2's load we do: weight(2) weight(B) ------------ * ----------- rq_weight(B) rw_weight(A) This yields load fractions in comparable units. The consequence is that it changes virtual time. We used to have: time_{i} vtime_{i} = ------------ weight_{i} vtime = \Sum vtime_{i} = time / rq_weight. But with the new way of load calculation we get that vtime equals time. Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl> Signed-off-by: Ingo Molnar <mingo@elte.hu>	2008-04-19 19:45:00 +02:00
Peter Zijlstra	4a55bd5e97	sched: fair-group: de-couple load-balancing from the rb-trees De-couple load-balancing from the rb-trees, so that I can change their organization. Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl> Signed-off-by: Ingo Molnar <mingo@elte.hu>	2008-04-19 19:45:00 +02:00
Peter Zijlstra	ac884dec6d	sched: fair-group scheduling vs latency Currently FAIR_GROUP sched grows the scheduler latency outside of sysctl_sched_latency, invert this so it stays within. Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl> Signed-off-by: Ingo Molnar <mingo@elte.hu>	2008-04-19 19:45:00 +02:00
Peter Zijlstra	18d95a2832	sched: fair-group: SMP-nice for group scheduling Implement SMP nice support for the full group hierarchy. On each load-balance action, compile a sched_domain wide view of the full task_group tree. We compute the domain wide view when walking down the hierarchy, and readjust the weights when walking back up. After collecting and readjusting the domain wide view, we try to balance the tasks within the task_groups. The current approach is a naively balance each task group until we've moved the targeted amount of load. Inspired by Srivatsa Vaddsgiri's previous code and Abhishek Chandra's H-SMP paper. XXX: there will be some numerical issues due to the limited nature of SCHED_LOAD_SCALE wrt to representing a task_groups influence on the total weight. When the tree is deep enough, or the task weight small enough, we'll run out of bits. Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl> CC: Abhishek Chandra <chandra@cs.umn.edu> CC: Srivatsa Vaddagiri <vatsa@linux.vnet.ibm.com> Signed-off-by: Ingo Molnar <mingo@elte.hu>	2008-04-19 19:45:00 +02:00
Hidetoshi Seto	1d3504fcf5	sched, cpuset: customize sched domains, core [rebased for sched-devel/latest] - Add a new cpuset file, having levels: sched_relax_domain_level - Modify partition_sched_domains() and build_sched_domains() to take attributes parameter passed from cpuset. - Fill newidle_idx for node domains which currently unused but might be required if sched_relax_domain_level become higher. - We can change the default level by boot option 'relax_domain_level='. Signed-off-by: Hidetoshi Seto <seto.hidetoshi@jp.fujitsu.com> Signed-off-by: Ingo Molnar <mingo@elte.hu>	2008-04-19 19:45:00 +02:00
Peter Zijlstra	b758149c02	sched: prepatory code movement Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl> Signed-off-by: Ingo Molnar <mingo@elte.hu>	2008-04-19 19:45:00 +02:00
Dhaval Giani	354d60c2ff	sched: mix tasks and groups This patch allows tasks and groups to exist in the same cfs_rq. With this change the CFS group scheduling follows a 1/(M+N) model from a 1/(1+N) fairness model where M tasks and N groups exist at the cfs_rq level. [a.p.zijlstra@chello.nl: rt bits and assorted fixes] Signed-off-by: Dhaval Giani <dhaval@linux.vnet.ibm.com> Signed-off-by: Srivatsa Vaddagiri <vatsa@linux.vnet.ibm.com> Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl> Signed-off-by: Ingo Molnar <mingo@elte.hu>	2008-04-19 19:44:59 +02:00
Peter Zijlstra	112f53f5d7	sched: old sleeper bonus Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl> Signed-off-by: Ingo Molnar <mingo@elte.hu>	2008-04-19 19:44:59 +02:00
Peter Zijlstra	79b3feffb1	sched: fix regression with sched yield Balbir Singh reported: > 1:mon> t > [c0000000e7677da0] c000000000067de0 .sys_sched_yield+0x6c/0xbc > [c0000000e7677e30] c000000000008748 syscall_exit+0x0/0x40 > --- Exception: c01 (System Call) at 00000400001d09e4 > SP (4000664cb10) is in userspace > 1:mon> r > cpu 0x1: Vector: 300 (Data Access) at [c0000000e7677aa0] > pc: c000000000068e50: .yield_task_fair+0x94/0xc4 > lr: c000000000067de0: .sys_sched_yield+0x6c/0xbc the check that should have avoided that is: /* * Are we the only task in the tree? */ if (unlikely(rq->load.weight == curr->se.load.weight)) return; But I guess that overlooks rt tasks, they also increase the load. So I guess something like this ought to fix it.. Signed-off-by: Ingo Molnar <mingo@elte.hu>	2008-04-19 19:44:57 +02:00
Ingo Molnar	50df5d6aea	sched: remove sysctl_sched_batch_wakeup_granularity it's unused. Signed-off-by: Ingo Molnar <mingo@elte.hu>	2008-04-19 19:44:57 +02:00
Peter Zijlstra	0bbd3336ee	sched: fix wakeup granularity for buddies The wakeup buddy logic didn't use the same wakeup granularity logic as the wakeup preemption did, this might cause the ->next buddy to be selected past the point where we would have preempted had the task been a single running instance. Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl> Signed-off-by: Ingo Molnar <mingo@elte.hu>	2008-04-19 19:44:57 +02:00
Ingo Molnar	018d6db4cb	sched: re-do "sched: fix fair sleepers" re-apply: \| commit `e22ecef1d2` \| Author: Ingo Molnar <mingo@elte.hu> \| Date: Fri Mar 14 22:16:08 2008 +0100 \| \| sched: fix fair sleepers \| \| Fair sleepers need to scale their latency target down by runqueue \| weight. Otherwise busy systems will gain ever larger sleep bonus. Signed-off-by: Ingo Molnar <mingo@elte.hu>	2008-04-19 19:44:57 +02:00
Ingo Molnar	e2df9e0905	revert "sched: fix fair sleepers" revert "sched: fix fair sleepers" (`e22ecef1d2`), because it is causing audio skipping, see: http://bugzilla.kernel.org/show_bug.cgi?id=10428 the patch is correct and the real cause of the skipping is not understood (tracing makes it go away), but time has run out so we'll revert it and re-try in 2.6.26. Signed-off-by: Ingo Molnar <mingo@elte.hu>	2008-04-14 14:26:23 +02:00
Peter Zijlstra	2070ee01d3	sched: cleanup old and rarely used 'debug' features. TREE_AVG and APPROX_AVG are initial task placement policies that have been disabled for a long while.. time to remove them. Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl> CC: Srivatsa Vaddagiri <vatsa@linux.vnet.ibm.com> Signed-off-by: Ingo Molnar <mingo@elte.hu>	2008-03-21 16:43:47 +01:00
Ingo Molnar	74e3cd7f48	sched: retune wake granularity reduce wake-up granularity for better interactivity. Signed-off-by: Ingo Molnar <mingo@elte.hu>	2008-03-19 04:27:53 +01:00
Ingo Molnar	4ae7d5cefd	sched: improve affine wakeups improve affine wakeups. Maintain the 'overlap' metric based on CFS's sum_exec_runtime - which means the amount of time a task executes after it wakes up some other task. Use the 'overlap' for the wakeup decisions: if the 'overlap' is short, it means there's strong workload coupling between this task and the woken up task. If the 'overlap' is large then the workload is decoupled and the scheduler will move them to separate CPUs more easily. ( Also slightly move the preempt_check within try_to_wake_up() - this has no effect on functionality but allows 'early wakeups' (for still-on-rq tasks) to be correctly accounted as well.) Signed-off-by: Ingo Molnar <mingo@elte.hu>	2008-03-19 04:27:53 +01:00
Ingo Molnar	f48273860e	sched: clean up wakeup balancing, code flow Clean up the code flow. No code changed: kernel/sched.o: text data bss dec hex filename 42521 2858 232 45611 b22b sched.o.before 42521 2858 232 45611 b22b sched.o.after md5: 09b31c44e9aff8666f72773dc433e2df sched.o.before.asm 09b31c44e9aff8666f72773dc433e2df sched.o.after.asm Signed-off-by: Ingo Molnar <mingo@elte.hu>	2008-03-19 04:27:53 +01:00
Ingo Molnar	ac192d3921	sched: clean up wakeup balancing, rename variables rename 'cpu' to 'prev_cpu'. No code changed: kernel/sched.o: text data bss dec hex filename 42521 2858 232 45611 b22b sched.o.before 42521 2858 232 45611 b22b sched.o.after md5: 09b31c44e9aff8666f72773dc433e2df sched.o.before.asm 09b31c44e9aff8666f72773dc433e2df sched.o.after.asm Signed-off-by: Ingo Molnar <mingo@elte.hu>	2008-03-19 04:27:52 +01:00
Ingo Molnar	098fb9db2c	sched: clean up wakeup balancing, move wake_affine() split out the affine-wakeup bits. No code changed: kernel/sched.o: text data bss dec hex filename 42521 2858 232 45611 b22b sched.o.before 42521 2858 232 45611 b22b sched.o.after md5: 9d76738f1272aa82f0b7affd2f51df6b sched.o.before.asm 09b31c44e9aff8666f72773dc433e2df sched.o.after.asm (the md5's changed because stack slots changed and some registers get scheduled by gcc in a different order - but otherwise the before and after assembly is instruction for instruction equivalent.) Signed-off-by: Ingo Molnar <mingo@elte.hu>	2008-03-19 04:27:52 +01:00
Ingo Molnar	6a6029b8ce	sched: simplify sched_slice() Use the existing calc_delta_mine() calculation for sched_slice(). This saves a divide and simplifies the code because we share it with the other /cfs_rq->load users. It also improves code size: text data bss dec hex filename 42659 2740 144 45543 b1e7 sched.o.before 42093 2740 144 44977 afb1 sched.o.after Signed-off-by: Ingo Molnar <mingo@elte.hu> Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>	2008-03-15 03:02:50 +01:00
Ingo Molnar	e22ecef1d2	sched: fix fair sleepers Fair sleepers need to scale their latency target down by runqueue weight. Otherwise busy systems will gain ever larger sleep bonus. Signed-off-by: Ingo Molnar <mingo@elte.hu> Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>	2008-03-15 03:02:50 +01:00
Peter Zijlstra	aa2ac25229	sched: fix overload performance: buddy wakeups Currently we schedule to the leftmost task in the runqueue. When the runtimes are very short because of some server/client ping-pong, especially in over-saturated workloads, this will cycle through all tasks trashing the cache. Reduce cache trashing by keeping dependent tasks together by running newly woken tasks first. However, by not running the leftmost task first we could starve tasks because the wakee can gain unlimited runtime. Therefore we only run the wakee if its within a small (wakeup_granularity) window of the leftmost task. This preserves fairness, but does alternate server/client task groups. Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl> Signed-off-by: Ingo Molnar <mingo@elte.hu>	2008-03-15 03:02:50 +01:00
Peter Zijlstra	3fe69747da	sched: min_vruntime fix Current min_vruntime tracking is incorrect and will cause serious problems when we don't run the leftmost task for some reason. min_vruntime does two things; 1) it's used to determine a forward direction when the u64 vruntime wraps, 2) it's used to track the leftmost vruntime to position newly enqueued tasks from. The current logic advances min_vruntime whenever the current task's vruntime advance. Because the current task may pass the leftmost task still waiting we're failing the second goal. This causes new tasks to be placed too far ahead and thus penalizes their runtime. Fix this by making min_vruntime the min_vruntime of the waiting tasks by tracking it in enqueue/dequeue, and compare against current's vruntime to obtain the absolute minimum when placing new tasks. Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl> Signed-off-by: Ingo Molnar <mingo@elte.hu>	2008-03-15 03:02:49 +01:00
Peter Zijlstra	810b38179e	sched: retain vruntime Kei Tokunaga reported an interactivity problem when moving tasks between control groups. Tasks would retain their old vruntime when moved between groups, this can cause funny lags. Re-set the vruntime on group move to fit within the new tree. Reported-by: Kei Tokunaga <tokunaga.keiich@jp.fujitsu.com> Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl> Signed-off-by: Ingo Molnar <mingo@elte.hu>	2008-03-07 16:42:59 +01:00
Peter Zijlstra	62fb185130	sched: revert load_balance_monitor() changes The following commits cause a number of regressions: commit `58e2d4ca58` Author: Srivatsa Vaddagiri <vatsa@linux.vnet.ibm.com> Date: Fri Jan 25 21:08:00 2008 +0100 sched: group scheduling, change how cpu load is calculated commit `6b2d770026` Author: Srivatsa Vaddagiri <vatsa@linux.vnet.ibm.com> Date: Fri Jan 25 21:08:00 2008 +0100 sched: group scheduler, fix fairness of cpu bandwidth allocation for task groups Namely: - very frequent wakeups on SMP, reported by PowerTop users. - cacheline trashing on (large) SMP - some latencies larger than 500ms While there is a mergeable patch to fix the latter, the former issues are not fixable in a manner suitable for .25 (we're at -rc3 now). Hence we revert them and try again in v2.6.26. Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl> CC: Srivatsa Vaddagiri <vatsa@linux.vnet.ibm.com> Tested-by: Alexey Zaytsev <alexey.zaytsev@gmail.com> Signed-off-by: Ingo Molnar <mingo@elte.hu>	2008-03-04 17:54:06 +01:00
Ingo Molnar	7eee3e677d	sched: clean up __pick_last_entity() a bit Signed-off-by: Ingo Molnar <mingo@elte.hu>	2008-02-25 16:34:17 +01:00
Balbir Singh	70eee74b70	sched: remove duplicate code from sched_fair.c pick_task_entity() duplicates existing code. This functionality can be easily obtained using rb_last(). Avoid code duplication by using rb_last(). Signed-off-by: Balbir Singh <balbir@linux.vnet.ibm.com> Signed-off-by: Ingo Molnar <mingo@elte.hu>	2008-02-25 16:34:17 +01:00
Peter Zijlstra	ef9884e6f2	sched: let +nice tasks have smaller impact Michel Dänzr has bisected an interactivity problem with plus-reniced tasks back to this commit: `810e95ccd5` is first bad commit commit `810e95ccd5` Author: Peter Zijlstra <a.p.zijlstra@chello.nl> Date: Mon Oct 15 17:00:14 2007 +0200 sched: another wakeup_granularity fix unit mis-match: wakeup_gran was used against a vruntime fix this by assymetrically scaling the vtime of positive reniced tasks. Bisected-by: Michel Dänzer <michel@tungstengraphics.com> Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl> Signed-off-by: Ingo Molnar <mingo@elte.hu>	2008-01-31 22:45:22 +01:00
Srivatsa Vaddagiri	296825cbe1	sched: fix high wake up latencies with FAIR_USER_SCHED The reason why we are getting better wakeup latencies for !FAIR_USER_SCHED is because of this snippet of code in place_entity(): if (!initial) { /* sleeps upto a single latency don't count. / if (sched_feat(NEW_FAIR_SLEEPERS) && entity_is_task(se)) ^^^^^^^^^^^^^^^^^^ vruntime -= sysctl_sched_latency; / ensure we never gain time by being placed backwards. */ vruntime = max_vruntime(se->vruntime, vruntime); } NEW_FAIR_SLEEPERS feature gives credit for sleeping only to tasks and not group-level entities. With the patch attached, I could see that wakeup latencies with FAIR_USER_SCHED are restored to the same level as !FAIR_USER_SCHED. Signed-off-by: Ingo Molnar <mingo@elte.hu>	2008-01-31 22:45:22 +01:00
Arjan van de Ven	6d082592b6	sched: keep total / count stats in addition to the max for Right now, the linux kernel (with scheduler statistics enabled) keeps track of the maximum time a process is waiting to be scheduled. While the maximum is a very useful metric, tracking average and total is equally useful (at least for latencytop) to figure out the accumulated effect of scheduler delays. The accumulated effect is important to judge the performance impact of scheduler tuning/behavior. Signed-off-by: Arjan van de Ven <arjan@linux.intel.com> Signed-off-by: Ingo Molnar <mingo@elte.hu>	2008-01-25 21:08:35 +01:00
Peter Zijlstra	5973e5b954	sched: fix: don't take a mutex from interrupt context print_cfs_stats is callable from interrupt context (sysrq), hence it should not take mutexes. Change it to use RCU since the task group data is RCU freed anyway. Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl> Signed-off-by: Ingo Molnar <mingo@elte.hu>	2008-01-25 21:08:34 +01:00
Arjan van de Ven	9745512ce7	sched: latencytop support LatencyTOP kernel infrastructure; it measures latencies in the scheduler and tracks it system wide and per process. Signed-off-by: Arjan van de Ven <arjan@linux.intel.com> Signed-off-by: Ingo Molnar <mingo@elte.hu>	2008-01-25 21:08:34 +01:00
Peter Zijlstra	8f4d37ec07	sched: high-res preemption tick Use HR-timers (when available) to deliver an accurate preemption tick. The regular scheduler tick that runs at 1/HZ can be too coarse when nice level are used. The fairness system will still keep the cpu utilisation 'fair' by then delaying the task that got an excessive amount of CPU time but try to minimize this by delivering preemption points spot-on. The average frequency of this extra interrupt is sched_latency / nr_latency. Which need not be higher than 1/HZ, its just that the distribution within the sched_latency period is important. Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl> Signed-off-by: Ingo Molnar <mingo@elte.hu>	2008-01-25 21:08:29 +01:00
Steven Rostedt	cb46984504	sched: RT-balance, add new methods to sched_class Dmitry Adamushko found that the current implementation of the RT balancing code left out changes to the sched_setscheduler and rt_mutex_setprio. This patch addresses this issue by adding methods to the schedule classes to handle being switched out of (switched_from) and being switched into (switched_to) a sched_class. Also a method for changing of priorities is also added (prio_changed). This patch also removes some duplicate logic between rt_mutex_setprio and sched_setscheduler. Signed-off-by: Steven Rostedt <srostedt@redhat.com> Signed-off-by: Ingo Molnar <mingo@elte.hu>	2008-01-25 21:08:22 +01:00
Peter Zijlstra	4bf0b77158	sched: remove do_div() from __sched_slice() Yanmin Zhang noticed a nice optimization: p = l * nr / nl, nl = l/g -> p = g * nr which eliminates a do_div() from __sched_period(). Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl> Signed-off-by: Ingo Molnar <mingo@elte.hu>	2008-01-25 21:08:21 +01:00
Dmitry Adamushko	9ec3b77e11	sched: no need for 'affine wakeup' balancing No need to do a check for 'affine wakeup and passive balancing possibilities' in select_task_rq_fair() when task_cpu(p) == this_cpu. I guess, this part got missed upon introduction of per-sched_class select_task_rq() in try_to_wake_up(). Signed-off-by: Dmitry Adamushko <dmitry.adamushko@gmail.com> Signed-off-by: Ingo Molnar <mingo@elte.hu>	2008-01-25 21:08:21 +01:00
Gregory Haskins	e7693a362e	sched: de-SCHED_OTHER-ize the RT path The current wake-up code path tries to determine if it can optimize the wake-up to "this_cpu" by computing load calculations. The problem is that these calculations are only relevant to SCHED_OTHER tasks where load is king. For RT tasks, priority is king. So the load calculation is completely wasted bandwidth. Therefore, we create a new sched_class interface to help with pre-wakeup routing decisions and move the load calculation as a function of CFS task's class. Signed-off-by: Gregory Haskins <ghaskins@novell.com> Signed-off-by: Steven Rostedt <srostedt@redhat.com> Signed-off-by: Ingo Molnar <mingo@elte.hu>	2008-01-25 21:08:09 +01:00
Srivatsa Vaddagiri	6b2d770026	sched: group scheduler, fix fairness of cpu bandwidth allocation for task groups The current load balancing scheme isn't good enough for precise group fairness. For example: on a 8-cpu system, I created 3 groups as under: a = 8 tasks (cpu.shares = 1024) b = 4 tasks (cpu.shares = 1024) c = 3 tasks (cpu.shares = 1024) a, b and c are task groups that have equal weight. We would expect each of the groups to receive 33.33% of cpu bandwidth under a fair scheduler. This is what I get with the latest scheduler git tree: Signed-off-by: Ingo Molnar <mingo@elte.hu> -------------------------------------------------------------------------------- Col1 \| Col2 \| Col3 \| Col4 ------\|---------\|-------\|------------------------------------------------------- a \| 277.676 \| 57.8% \| 54.1% 54.1% 54.1% 54.2% 56.7% 62.2% 62.8% 64.5% b \| 116.108 \| 24.2% \| 47.4% 48.1% 48.7% 49.3% c \| 86.326 \| 18.0% \| 47.5% 47.9% 48.5% -------------------------------------------------------------------------------- Explanation of o/p: Col1 -> Group name Col2 -> Cumulative execution time (in seconds) received by all tasks of that group in a 60sec window across 8 cpus Col3 -> CPU bandwidth received by the group in the 60sec window, expressed in percentage. Col3 data is derived as: Col3 = 100 * Col2 / (NR_CPUS * 60) Col4 -> CPU bandwidth received by each individual task of the group. Col4 = 100 * cpu_time_recd_by_task / 60 [I can share the test case that produces a similar o/p if reqd] The deviation from desired group fairness is as below: a = +24.47% b = -9.13% c = -15.33% which is quite high. After the patch below is applied, here are the results: -------------------------------------------------------------------------------- Col1 \| Col2 \| Col3 \| Col4 ------\|---------\|-------\|------------------------------------------------------- a \| 163.112 \| 34.0% \| 33.2% 33.4% 33.5% 33.5% 33.7% 34.4% 34.8% 35.3% b \| 156.220 \| 32.5% \| 63.3% 64.5% 66.1% 66.5% c \| 160.653 \| 33.5% \| 85.8% 90.6% 91.4% -------------------------------------------------------------------------------- Deviation from desired group fairness is as below: a = +0.67% b = -0.83% c = +0.17% which is far better IMO. Most of other runs have yielded a deviation within +-2% at the most, which is good. Why do we see bad (group) fairness with current scheuler? ========================================================= Currently cpu's weight is just the summation of individual task weights. This can yield incorrect results. For ex: consider three groups as below on a 2-cpu system: CPU0 CPU1 --------------------------- A (10) B(5) C(5) --------------------------- Group A has 10 tasks, all on CPU0, Group B and C have 5 tasks each all of which are on CPU1. Each task has the same weight (NICE_0_LOAD = 1024). The current scheme would yield a cpu weight of 10240 (101024) for each cpu and the load balancer will think both CPUs are perfectly balanced and won't move around any tasks. This, however, would yield this bandwidth: A = 50% B = 25% C = 25% which is not the desired result. What's changing in the patch? ============================= - How cpu weights are calculated when CONFIF_FAIR_GROUP_SCHED is defined (see below) - API Change - Two tunables introduced in sysfs (under SCHED_DEBUG) to control the frequency at which the load balance monitor thread runs. The basic change made in this patch is how cpu weight (rq->load.weight) is calculated. Its now calculated as the summation of group weights on a cpu, rather than summation of task weights. Weight exerted by a group on a cpu is dependent on the shares allocated to it and also the number of tasks the group has on that cpu compared to the total number of (runnable) tasks the group has in the system. Let, W(K,i) = Weight of group K on cpu i T(K,i) = Task load present in group K's cfs_rq on cpu i T(K) = Total task load of group K across various cpus S(K) = Shares allocated to group K NRCPUS = Number of online cpus in the scheduler domain to which group K is assigned. Then, W(K,i) = S(K) NRCPUS * T(K,i) / T(K) A load balance monitor thread is created at bootup, which periodically runs and adjusts group's weight on each cpu. To avoid its overhead, two min/max tunables are introduced (under SCHED_DEBUG) to control the rate at which it runs. Fixes from: Peter Zijlstra <a.p.zijlstra@chello.nl> - don't start the load_balance_monitor when there is only a single cpu. - rename the kthread because its currently longer than TASK_COMM_LEN Signed-off-by: Srivatsa Vaddagiri <vatsa@linux.vnet.ibm.com> Signed-off-by: Ingo Molnar <mingo@elte.hu>	2008-01-25 21:08:00 +01:00
Srivatsa Vaddagiri	58e2d4ca58	sched: group scheduling, change how cpu load is calculated This patch changes how the cpu load exerted by fair_sched_class tasks is calculated. Load exerted by fair_sched_class tasks on a cpu is now a summation of the group weights, rather than summation of task weights. Weight exerted by a group on a cpu is dependent on the shares allocated to it. This version of patch has a minor impact on code size, but should have no runtime/functional impact for !CONFIG_FAIR_GROUP_SCHED. Signed-off-by: Srivatsa Vaddagiri <vatsa@linux.vnet.ibm.com> Signed-off-by: Ingo Molnar <mingo@elte.hu>	2008-01-25 21:08:00 +01:00
Srivatsa Vaddagiri	ec2c507fe8	sched: group scheduling, minor fixes Minor bug fixes for the group scheduler: - Use a mutex to serialize add/remove of task groups and also when changing shares of a task group. Use the same mutex when printing cfs_rq debugging stats for various task groups. - Use list_for_each_entry_rcu in for_each_leaf_cfs_rq macro (when walking task group list) Signed-off-by: Srivatsa Vaddagiri <vatsa@linux.vnet.ibm.com> Signed-off-by: Ingo Molnar <mingo@elte.hu>	2008-01-25 21:07:59 +01:00
Ingo Molnar	6cbf1c126c	sched: do not hurt SCHED_BATCH on wakeup measurements by Yanmin Zhang have shown that SCHED_BATCH tasks benefit if they run the same place_entity() logic as SCHED_OTHER tasks - so uniformize behavior in this area. Signed-off-by: Ingo Molnar <mingo@elte.hu>	2007-12-18 15:21:13 +01:00
Ingo Molnar	db292ca302	sched: default to more agressive yield for SCHED_BATCH tasks do more agressive yield for SCHED_BATCH tuned tasks: they are all about throughput anyway. This allows a gentler migration path for any apps that relied on stronger yield. Signed-off-by: Ingo Molnar <mingo@elte.hu>	2007-12-04 17:04:39 +01:00
Srivatsa Vaddagiri	d842de871c	sched: cpu accounting controller (V2) Commit `cfb5285660` removed a useful feature for us, which provided a cpu accounting resource controller. This feature would be useful if someone wants to group tasks only for accounting purpose and doesnt really want to exercise any control over their cpu consumption. The patch below reintroduces the feature. It is based on Paul Menage's original patch (Commit `62d0df6406`), with these differences: - Removed load average information. I felt it needs more thought (esp to deal with SMP and virtualized platforms) and can be added for 2.6.25 after more discussions. - Convert group cpu usage to be nanosecond accurate (as rest of the cfs stats are) and invoke cpuacct_charge() from the respective scheduler classes - Make accounting scalable on SMP systems by splitting the usage counter to be per-cpu - Move the code from kernel/cpu_acct.c to kernel/sched.c (since the code is not big enough to warrant a new file and also this rightly needs to live inside the scheduler. Also things like accessing rq->lock while reading cpu usage becomes easier if the code lived in kernel/sched.c) The patch also modifies the cpu controller not to provide the same accounting information. Tested-by: Balbir Singh <balbir@linux.vnet.ibm.com> Tested the patches on top of 2.6.24-rc3. The patches work fine. Ran some simple tests like cpuspin (spin on the cpu), ran several tasks in the same group and timed them. Compared their time stamps with cpuacct.usage. Signed-off-by: Srivatsa Vaddagiri <vatsa@linux.vnet.ibm.com> Signed-off-by: Balbir Singh <balbir@linux.vnet.ibm.com> Signed-off-by: Ingo Molnar <mingo@elte.hu>	2007-12-02 20:04:49 +01:00

1 2 3 4

200 Commits