linux_old1

Commit Graph

Author	SHA1	Message	Date
Peter Zijlstra	62fb185130	sched: revert load_balance_monitor() changes The following commits cause a number of regressions: commit `58e2d4ca58` Author: Srivatsa Vaddagiri <vatsa@linux.vnet.ibm.com> Date: Fri Jan 25 21:08:00 2008 +0100 sched: group scheduling, change how cpu load is calculated commit `6b2d770026` Author: Srivatsa Vaddagiri <vatsa@linux.vnet.ibm.com> Date: Fri Jan 25 21:08:00 2008 +0100 sched: group scheduler, fix fairness of cpu bandwidth allocation for task groups Namely: - very frequent wakeups on SMP, reported by PowerTop users. - cacheline trashing on (large) SMP - some latencies larger than 500ms While there is a mergeable patch to fix the latter, the former issues are not fixable in a manner suitable for .25 (we're at -rc3 now). Hence we revert them and try again in v2.6.26. Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl> CC: Srivatsa Vaddagiri <vatsa@linux.vnet.ibm.com> Tested-by: Alexey Zaytsev <alexey.zaytsev@gmail.com> Signed-off-by: Ingo Molnar <mingo@elte.hu>	2008-03-04 17:54:06 +01:00
Harvey Harrison	67ca7bde2e	sched: fix signedness warnings in sched.c Unsigned long values are always assigned to switch_count, make it unsigned long. kernel/sched.c:3897:15: warning: incorrect type in assignment (different signedness) kernel/sched.c:3897:15: expected long switch_count kernel/sched.c:3897:15: got unsigned long <noident> kernel/sched.c:3921:16: warning: incorrect type in assignment (different signedness) kernel/sched.c:3921:16: expected long switch_count kernel/sched.c:3921:16: got unsigned long <noident> Signed-off-by: Harvey Harrison <harvey.harrison@gmail.com> Signed-off-by: Ingo Molnar <mingo@elte.hu>	2008-02-25 16:34:17 +01:00
Ingo Molnar	6892b75e60	sched: make early bootup sched_clock() use safer do not call sched_clock() too early. Not only might rq->idle not be set up - but pure per-cpu data might not be accessible either. this solves an ia64 early bootup hang with CONFIG_PRINTK_TIME=y. Tested-by: Tony Luck <tony.luck@gmail.com> Acked-by: Tony Luck <tony.luck@gmail.com> Acked-by: David S. Miller <davem@davemloft.net> Signed-off-by: Ingo Molnar <mingo@elte.hu>	2008-02-25 16:34:16 +01:00
Linus Torvalds	04e2f1741d	Add memory barrier semantics to wake_up() & co Oleg Nesterov and others have pointed out that on some architectures, the traditional sequence of set_current_state(TASK_INTERRUPTIBLE); if (CONDITION) return; schedule(); is racy wrt another CPU doing CONDITION = 1; wake_up_process(p); because while set_current_state() has a memory barrier separating setting of the TASK_INTERRUPTIBLE state from reading of the CONDITION variable, there is no such memory barrier on the wakeup side. Now, wake_up_process() does actually take a spinlock before it reads and sets the task state on the waking side, and on x86 (and many other architectures) that spinlock is in fact equivalent to a memory barrier, but that is not generally guaranteed. The write that sets CONDITION could move into the critical region protected by the runqueue spinlock. However, adding a smp_wmb() to before the spinlock should now order the writing of CONDITION wrt the lock itself, which in turn is ordered wrt the accesses within the spinlock (which includes the reading of the old state). This should thus close the race (which probably has never been seen in practice, but since smp_wmb() is a no-op on x86, it's not like this will make anything worse either on the most common architecture where the spinlock already gave the required protection). Acked-by: Oleg Nesterov <oleg@tv-sign.ru> Acked-by: Dmitry Adamushko <dmitry.adamushko@gmail.com> Cc: Andrew Morton <akpm@linux-foundation.org> Cc: Nick Piggin <nickpiggin@yahoo.com.au> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2008-02-23 18:05:03 -08:00
Srinivasa Ds	4362758279	kprobes: refuse kprobe insertion on add/sub_preempt_counter() Kprobes makes use of preempt_disable(),preempt_enable_noresched() and these functions inturn call add/sub_preempt_count(). So we need to refuse user from inserting probe in to these functions. This patch disallows user from probing add/sub_preempt_count(). Signed-off-by: Srinivasa DS <srinivasa@in.ibm.com> Acked-by: Ananth N Mavinakayanahalli <ananth@in.ibm.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2008-02-23 17:13:24 -08:00
Peter Zijlstra	b68aa2300c	sched: rt-group: refure unrunnable tasks Refuse to accept or create RT tasks in groups that can't run them. Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl> Signed-off-by: Ingo Molnar <mingo@elte.hu>	2008-02-13 15:45:40 +01:00
Peter Zijlstra	bccbe08a60	sched: rt-group: clean up the ifdeffery Clean up some of the excessive ifdeffery introduces in the last patch. Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl> Signed-off-by: Ingo Molnar <mingo@elte.hu>	2008-02-13 15:45:40 +01:00
Peter Zijlstra	052f1dc7eb	sched: rt-group: make rt groups scheduling configurable Make the rt group scheduler compile time configurable. Keep it experimental for now. Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl> Signed-off-by: Ingo Molnar <mingo@elte.hu>	2008-02-13 15:45:40 +01:00
Peter Zijlstra	9f0c1e560c	sched: rt-group: interface Change the rt_ratio interface to rt_runtime_us, to match rt_period_us. This avoids picking a granularity for the ratio. Extend the /sys/kernel/uids/<uid>/ interface to allow setting the group's rt_runtime. Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl> Signed-off-by: Ingo Molnar <mingo@elte.hu>	2008-02-13 15:45:39 +01:00
Peter Zijlstra	23b0fdfc92	sched: rt-group: deal with PI Steven mentioned the fun case where a lock holding task will be throttled. Simple fix: allow groups that have boosted tasks to run anyway. If a runnable task in a throttled group gets boosted the dequeue/enqueue done by rt_mutex_setprio() is enough to unthrottle the group. This is ofcourse not quite correct. Two possible ways forward are: - second prio array for boosted tasks - boost to a prio ceiling (this would also work for deadline scheduling) Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl> Signed-off-by: Ingo Molnar <mingo@elte.hu>	2008-02-13 15:45:39 +01:00
Peter Zijlstra	4cf5d77a6e	sched: fix incorrect irq lock usage in normalize_rt_tasks() lockdep spotted this bogus irq locking. normalize_rt_tasks() can be called from hardirq context through sysrq-n Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl> Signed-off-by: Ingo Molnar <mingo@elte.hu>	2008-02-13 15:45:39 +01:00
Peter Zijlstra	8ed3699682	sched: fair-group: separate tg->shares from task_group_lock On Mon, 2008-02-11 at 15:09 +0300, Denis V. Lunev wrote: > BUG: sleeping function called from invalid context > at /home/den/src/linux-netns26/kernel/mutex.c:209 > in_atomic():1, irqs_disabled():0 > no locks held by swapper/0. > Pid: 0, comm: swapper Not tainted 2.6.24 #304 > > Call Trace: > <IRQ> [<ffffffff80252d1e>] ? __debug_show_held_locks+0x15/0x27 > [<ffffffff8022c2a8>] __might_sleep+0xc0/0xdf > [<ffffffff8049f1df>] mutex_lock_nested+0x28/0x2a9 > [<ffffffff80231294>] sched_destroy_group+0x18/0xea > [<ffffffff8023e835>] sched_destroy_user+0xd/0xf > [<ffffffff8023e8c1>] free_uid+0x8a/0xab > [<ffffffff80233e24>] __put_task_struct+0x3f/0xd3 > [<ffffffff80236708>] delayed_put_task_struct+0x23/0x25 > [<ffffffff8026fda7>] __rcu_process_callbacks+0x8d/0x215 > [<ffffffff8026ff52>] rcu_process_callbacks+0x23/0x44 > [<ffffffff8023a2ae>] __do_softirq+0x79/0xf8 > [<ffffffff8020f8c3>] ? profile_pc+0x2a/0x67 > [<ffffffff8020d38c>] call_softirq+0x1c/0x30 > [<ffffffff8020f689>] do_softirq+0x61/0x9c > [<ffffffff8023a233>] irq_exit+0x51/0x53 > [<ffffffff8021bd1a>] smp_apic_timer_interrupt+0x77/0xad > [<ffffffff8020ce3b>] apic_timer_interrupt+0x6b/0x70 > <EOI> [<ffffffff8020b0dd>] ? default_idle+0x43/0x76 > [<ffffffff8020b0db>] ? default_idle+0x41/0x76 > [<ffffffff8020b09a>] ? default_idle+0x0/0x76 > [<ffffffff8020b186>] ? cpu_idle+0x76/0x98 separate the tg->shares protection from the task_group lock. Reported-by: Denis V. Lunev <den@openvz.org> Tested-by: Denis V. Lunev <den@openvz.org> Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl> Signed-off-by: Ingo Molnar <mingo@elte.hu>	2008-02-13 15:45:39 +01:00
Harvey Harrison	7ad5b3a505	kernel: remove fastcall in kernel/* [akpm@linux-foundation.org: coding-style fixes] Signed-off-by: Harvey Harrison <harvey.harrison@gmail.com> Acked-by: Ingo Molnar <mingo@elte.hu> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2008-02-08 09:22:31 -08:00
Linus Torvalds	75659ca0c1	Merge branch 'task_killable' of git://git.kernel.org/pub/scm/linux/kernel/git/willy/misc * 'task_killable' of git://git.kernel.org/pub/scm/linux/kernel/git/willy/misc: (22 commits) Remove commented-out code copied from NFS NFS: Switch from intr mount option to TASK_KILLABLE Add wait_for_completion_killable Add wait_event_killable Add schedule_timeout_killable Use mutex_lock_killable in vfs_readdir Add mutex_lock_killable Use lock_page_killable Add lock_page_killable Add fatal_signal_pending Add TASK_WAKEKILL exit: Use task_is_* signal: Use task_is_* sched: Use task_contributes_to_load, TASK_ALL and TASK_NORMAL ptrace: Use task_is_* power: Use task_is_* wait: Use TASK_NORMAL proc/base.c: Use task_is_* proc/array.c: Use TASK_REPORT perfmon: Use task_is_* ... Fixed up conflicts in NFS/sunrpc manually..	2008-02-01 11:45:47 +11:00
Gerald Stralko	5aff0531ee	sched: remove unused params This removes the extra struct task_struct *p parameter in inc_nr_running and dec_nr_running functions. Signed-off by: Jerry Stralko <gerb.stralko@gmail.com> Signed-off-by: Ingo Molnar <mingo@elte.hu>	2008-01-31 22:45:23 +01:00
Nick Piggin	95c354fe9f	spinlock: lockbreak cleanup The break_lock data structure and code for spinlocks is quite nasty. Not only does it double the size of a spinlock but it changes locking to a potentially less optimal trylock. Put all of that under CONFIG_GENERIC_LOCKBREAK, and introduce a __raw_spin_is_contended that uses the lock data itself to determine whether there are waiters on the lock, to be used if CONFIG_GENERIC_LOCKBREAK is not set. Rename need_lockbreak to spin_needbreak, make it use spin_is_contended to decouple it from the spinlock implementation, and make it typesafe (rwlocks do not have any need_lockbreak sites -- why do they even get bloated up with that break_lock then?). Signed-off-by: Nick Piggin <npiggin@suse.de> Signed-off-by: Ingo Molnar <mingo@elte.hu> Signed-off-by: Thomas Gleixner <tglx@linutronix.de>	2008-01-30 13:31:20 +01:00
Nick Piggin	5fb5e6de55	sched: print backtrace of running tasks too The attached patch is something really simple that can sometimes help in getting more info out of a hung system. Signed-off-by: Ingo Molnar <mingo@elte.hu>	2008-01-25 21:08:34 +01:00
Guillaume Chazarain	cc203d2422	sched: monitor clock underflows in /proc/sched_debug We monitor clock overflows, let's also monitor clock underflows. Signed-off-by: Guillaume Chazarain <guichaz@yahoo.fr> Signed-off-by: Ingo Molnar <mingo@elte.hu>	2008-01-25 21:08:34 +01:00
Guillaume Chazarain	782daeee3d	sched: fix rq->clock warps on frequency changes sched: fix rq->clock warps on frequency changes Fix `2bacec8c31` (sched: touch softlockup watchdog after idling) that reintroduced warps on frequency changes. touch_softlockup_watchdog() calls __update_rq_clock that checks rq->clock for warps, so call it after adjusting rq->clock. Signed-off-by: Guillaume Chazarain <guichaz@yahoo.fr> Signed-off-by: Ingo Molnar <mingo@elte.hu>	2008-01-25 21:08:33 +01:00
Ingo Molnar	6478d8800b	sched: remove the !PREEMPT_BKL code remove the !PREEMPT_BKL code. this removes 160 lines of legacy code. Signed-off-by: Ingo Molnar <mingo@elte.hu>	2008-01-25 21:08:33 +01:00
Peter Zijlstra	48d5e25821	sched: rt throttling vs no_hz We need to teach no_hz about the rt throttling because its tick driven. Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl> Signed-off-by: Ingo Molnar <mingo@elte.hu>	2008-01-25 21:08:31 +01:00
Peter Zijlstra	6f505b1642	sched: rt group scheduling Extend group scheduling to also cover the realtime classes. It uses the time limiting introduced by the previous patch to allow multiple realtime groups. The hard time limit is required to keep behaviour deterministic. The algorithms used make the realtime scheduler O(tg), linear scaling wrt the number of task groups. This is the worst case behaviour I can't seem to get out of, the avg. case of the algorithms can be improved, I focused on correctness and worst case. [ akpm@linux-foundation.org: move side-effects out of BUG_ON(). ] Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl> Signed-off-by: Ingo Molnar <mingo@elte.hu>	2008-01-25 21:08:30 +01:00
Peter Zijlstra	fa85ae2418	sched: rt time limit Very simple time limit on the realtime scheduling classes. Allow the rq's realtime class to consume sched_rt_ratio of every sched_rt_period slice. If the class exceeds this quota the fair class will preempt the realtime class. Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl> Signed-off-by: Ingo Molnar <mingo@elte.hu>	2008-01-25 21:08:29 +01:00
Peter Zijlstra	8f4d37ec07	sched: high-res preemption tick Use HR-timers (when available) to deliver an accurate preemption tick. The regular scheduler tick that runs at 1/HZ can be too coarse when nice level are used. The fairness system will still keep the cpu utilisation 'fair' by then delaying the task that got an excessive amount of CPU time but try to minimize this by delivering preemption points spot-on. The average frequency of this extra interrupt is sched_latency / nr_latency. Which need not be higher than 1/HZ, its just that the distribution within the sched_latency period is important. Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl> Signed-off-by: Ingo Molnar <mingo@elte.hu>	2008-01-25 21:08:29 +01:00
Herbert Xu	02b67cc3ba	sched: do not do cond_resched() when CONFIG_PREEMPT Why do we even have cond_resched when real preemption is on? It seems to be a waste of space and time. remove cond_resched with CONFIG_PREEMPT on. Signed-off-by: Ingo Molnar <mingo@elte.hu>	2008-01-25 21:08:28 +01:00
Ingo Molnar	03319ec8b0	sched: documentation, whitespace fixes whitespace fixes. Signed-off-by: Ingo Molnar <mingo@elte.hu>	2008-01-25 21:08:28 +01:00
Peter Zijlstra	fa717060f1	sched: sched_rt_entity Move the task_struct members specific to rt scheduling together. A future optimization could be to put sched_entity and sched_rt_entity into a union. Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl> CC: Srivatsa Vaddagiri <vatsa@linux.vnet.ibm.com> Signed-off-by: Ingo Molnar <mingo@elte.hu>	2008-01-25 21:08:27 +01:00
Gregory Haskins	dc938520d2	sched: dynamically update the root-domain span/online maps The baseline code statically builds the span maps when the domain is formed. Previous attempts at dynamically updating the maps caused a suspend-to-ram regression, which should now be fixed. Signed-off-by: Gregory Haskins <ghaskins@novell.com> CC: Gautham R Shenoy <ego@in.ibm.com> Signed-off-by: Ingo Molnar <mingo@elte.hu>	2008-01-25 21:08:26 +01:00
Steven Rostedt	cb46984504	sched: RT-balance, add new methods to sched_class Dmitry Adamushko found that the current implementation of the RT balancing code left out changes to the sched_setscheduler and rt_mutex_setprio. This patch addresses this issue by adding methods to the schedule classes to handle being switched out of (switched_from) and being switched into (switched_to) a sched_class. Also a method for changing of priorities is also added (prio_changed). This patch also removes some duplicate logic between rt_mutex_setprio and sched_setscheduler. Signed-off-by: Steven Rostedt <srostedt@redhat.com> Signed-off-by: Ingo Molnar <mingo@elte.hu>	2008-01-25 21:08:22 +01:00
Steven Rostedt	9a897c5a67	sched: RT-balance, replace hooks with pre/post schedule and wakeup methods To make the main sched.c code more agnostic to the schedule classes. Instead of having specific hooks in the schedule code for the RT class balancing. They are replaced with a pre_schedule, post_schedule and task_wake_up methods. These methods may be used by any of the classes but currently, only the sched_rt class implements them. Signed-off-by: Steven Rostedt <srostedt@redhat.com> Signed-off-by: Ingo Molnar <mingo@elte.hu>	2008-01-25 21:08:22 +01:00
Dmitry Adamushko	5d2f5a616d	sched: get rid of 'new_cpu' in try_to_wake_up() Clean-up try_to_wake_up(). Get rid of the 'new_cpu' variable in try_to_wake_up() [ that's, one #ifdef section less ]. Also remove a few redundant blank lines. Signed-off-by: Dmitry Adamushko <dmitry.adamushko@gmail.com> Signed-off-by: Ingo Molnar <mingo@elte.hu>	2008-01-25 21:08:21 +01:00
Ingo Molnar	b913176917	sched: add credits for RT balancing improvements add credits for RT balancing improvements. Signed-off-by: Ingo Molnar <mingo@elte.hu>	2008-01-25 21:08:19 +01:00
Ingo Molnar	0eab914657	sched: style cleanup, #2 style cleanup of various changes that were done recently. no code changed: text data bss dec hex filename 26399 2578 48 29025 7161 sched.o.before 26399 2578 48 29025 7161 sched.o.after Signed-off-by: Ingo Molnar <mingo@elte.hu>	2008-01-25 21:08:19 +01:00
Ingo Molnar	d7876a08db	sched: remove unused JIFFIES_TO_NS() macro remove unused JIFFIES_TO_NS() macro. Signed-off-by: Ingo Molnar <mingo@elte.hu>	2008-01-25 21:08:19 +01:00
Gregory Haskins	637f50851b	sched: only balance our RT tasks within our domain We move the rt-overload data as the first global to per-domain reclassification. This limits the scope of overload related cache-line bouncing to stay with a specified partition instead of affecting all cpus in the system. Finally, we limit the scope of find_lowest_cpu searches to the domain instead of the entire system. Note that we would always respect domain boundaries even without this patch, but we first would scan potentially all cpus before whittling the list down. Now we can avoid looking at RQs that are out of scope, again reducing cache-line hits. Note: In some cases, task->cpus_allowed will effectively reduce our search to within our domain. However, I believe there are cases where the cpus_allowed mask may be all ones and therefore we err on the side of caution. If it can be optimized later, so be it. Signed-off-by: Gregory Haskins <ghaskins@novell.com> CC: Christoph Lameter <clameter@sgi.com> Signed-off-by: Ingo Molnar <mingo@elte.hu>	2008-01-25 21:08:18 +01:00
Gregory Haskins	57d885fea0	sched: add sched-domain roots We add the notion of a root-domain which will be used later to rescope global variables to per-domain variables. Each exclusive cpuset essentially defines an island domain by fully partitioning the member cpus from any other cpuset. However, we currently still maintain some policy/state as global variables which transcend all cpusets. Consider, for instance, rt-overload state. Whenever a new exclusive cpuset is created, we also create a new root-domain object and move each cpu member to the root-domain's span. By default the system creates a single root-domain with all cpus as members (mimicking the global state we have today). We add some plumbing for storing class specific data in our root-domain. Whenever a RQ is switching root-domains (because of repartitioning) we give each sched_class the opportunity to remove any state from its old domain and add state to the new one. This logic doesn't have any clients yet but it will later in the series. Signed-off-by: Gregory Haskins <ghaskins@novell.com> CC: Christoph Lameter <clameter@sgi.com> CC: Paul Jackson <pj@sgi.com> CC: Simon Derr <simon.derr@bull.net> Signed-off-by: Ingo Molnar <mingo@elte.hu>	2008-01-25 21:08:18 +01:00
Steven Rostedt	0d1311a536	sched: RT-balance on new task rt-balance when creating new tasks. Signed-off-by: Ingo Molnar <mingo@elte.hu>	2008-01-25 21:08:14 +01:00
Gregory Haskins	a22d7fc187	sched: wake-balance fixes We have logic to detect whether the system has migratable tasks, but we are not using it when deciding whether to push tasks away. So we add support for considering this new information. Signed-off-by: Gregory Haskins <ghaskins@novell.com> Signed-off-by: Steven Rostedt <srostedt@redhat.com> Signed-off-by: Ingo Molnar <mingo@elte.hu>	2008-01-25 21:08:12 +01:00
Gregory Haskins	e7693a362e	sched: de-SCHED_OTHER-ize the RT path The current wake-up code path tries to determine if it can optimize the wake-up to "this_cpu" by computing load calculations. The problem is that these calculations are only relevant to SCHED_OTHER tasks where load is king. For RT tasks, priority is king. So the load calculation is completely wasted bandwidth. Therefore, we create a new sched_class interface to help with pre-wakeup routing decisions and move the load calculation as a function of CFS task's class. Signed-off-by: Gregory Haskins <ghaskins@novell.com> Signed-off-by: Steven Rostedt <srostedt@redhat.com> Signed-off-by: Ingo Molnar <mingo@elte.hu>	2008-01-25 21:08:09 +01:00
Gregory Haskins	73fe6aae84	sched: add RT-balance cpu-weight Some RT tasks (particularly kthreads) are bound to one specific CPU. It is fairly common for two or more bound tasks to get queued up at the same time. Consider, for instance, softirq_timer and softirq_sched. A timer goes off in an ISR which schedules softirq_thread to run at RT50. Then the timer handler determines that it's time to smp-rebalance the system so it schedules softirq_sched to run. So we are in a situation where we have two RT50 tasks queued, and the system will go into rt-overload condition to request other CPUs for help. This causes two problems in the current code: 1) If a high-priority bound task and a low-priority unbounded task queue up behind the running task, we will fail to ever relocate the unbounded task because we terminate the search on the first unmovable task. 2) We spend precious futile cycles in the fast-path trying to pull overloaded tasks over. It is therefore optimial to strive to avoid the overhead all together if we can cheaply detect the condition before overload even occurs. This patch tries to achieve this optimization by utilizing the hamming weight of the task->cpus_allowed mask. A weight of 1 indicates that the task cannot be migrated. We will then utilize this information to skip non-migratable tasks and to eliminate uncessary rebalance attempts. We introduce a per-rq variable to count the number of migratable tasks that are currently running. We only go into overload if we have more than one rt task, AND at least one of them is migratable. In addition, we introduce a per-task variable to cache the cpus_allowed weight, since the hamming calculation is probably relatively expensive. We only update the cached value when the mask is updated which should be relatively infrequent, especially compared to scheduling frequency in the fast path. Signed-off-by: Gregory Haskins <ghaskins@novell.com> Signed-off-by: Steven Rostedt <srostedt@redhat.com> Signed-off-by: Ingo Molnar <mingo@elte.hu>	2008-01-25 21:08:07 +01:00
Steven Rostedt	4642dafdf9	sched: push RT tasks from overloaded CPUs This patch adds pushing of overloaded RT tasks from a runqueue that is having tasks (most likely RT tasks) added to the run queue. TODO: We don't cover the case of waking of new RT tasks (yet). Signed-off-by: Steven Rostedt <srostedt@redhat.com> Signed-off-by: Ingo Molnar <mingo@elte.hu>	2008-01-25 21:08:07 +01:00
Steven Rostedt	f65eda4f78	sched: pull RT tasks from overloaded runqueues This patch adds the algorithm to pull tasks from RT overloaded runqueues. When a pull RT is initiated, all overloaded runqueues are examined for a RT task that is higher in prio than the highest prio task queued on the target runqueue. If another runqueue holds a RT task that is of higher prio than the highest prio task on the target runqueue is found it is pulled to the target runqueue. Signed-off-by: Steven Rostedt <srostedt@redhat.com> Signed-off-by: Ingo Molnar <mingo@elte.hu>	2008-01-25 21:08:07 +01:00
Steven Rostedt	e8fa136262	sched: add RT task pushing This patch adds an algorithm to push extra RT tasks off a run queue to other CPU runqueues. When more than one RT task is added to a run queue, this algorithm takes an assertive approach to push the RT tasks that are not running onto other run queues that have lower priority. The way this works is that the highest RT task that is not running is looked at and we examine the runqueues on the CPUS for that tasks affinity mask. We find the runqueue with the lowest prio in the CPU affinity of the picked task, and if it is lower in prio than the picked task, we push the task onto that CPU runqueue. We continue pushing RT tasks off the current runqueue until we don't push any more. The algorithm stops when the next highest RT task can't preempt any other processes on other CPUS. TODO: The algorithm may stop when there are still RT tasks that can be migrated. Specifically, if the highest non running RT task CPU affinity is restricted to CPUs that are running higher priority tasks, there may be a lower priority task queued that has an affinity with a CPU that is running a lower priority task that it could be migrated to. This patch set does not address this issue. Note: checkpatch reveals two over 80 character instances. I'm not sure that breaking them up will help visually, so I left them as is. Signed-off-by: Steven Rostedt <srostedt@redhat.com> Signed-off-by: Ingo Molnar <mingo@elte.hu>	2008-01-25 21:08:05 +01:00
Steven Rostedt	764a9d6fe4	sched: track highest prio task queued This patch adds accounting to each runqueue to keep track of the highest prio task queued on the run queue. We only care about RT tasks, so if the run queue does not contain any active RT tasks its priority will be considered MAX_RT_PRIO. This information will be used for later patches. Signed-off-by: Steven Rostedt <srostedt@redhat.com> Signed-off-by: Ingo Molnar <mingo@elte.hu>	2008-01-25 21:08:04 +01:00
Steven Rostedt	63489e45e2	sched: count # of queued RT tasks This patch adds accounting to keep track of the number of RT tasks running on a runqueue. This information will be used in later patches. Signed-off-by: Steven Rostedt <srostedt@redhat.com> Signed-off-by: Ingo Molnar <mingo@elte.hu>	2008-01-25 21:08:03 +01:00
Ingo Molnar	82a1fcb902	softlockup: automatically detect hung TASK_UNINTERRUPTIBLE tasks this patch extends the soft-lockup detector to automatically detect hung TASK_UNINTERRUPTIBLE tasks. Such hung tasks are printed the following way: ------------------> INFO: task prctl:3042 blocked for more than 120 seconds. "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message prctl D fd5e3793 0 3042 2997 f6050f38 00000046 00000001 fd5e3793 00000009 c06d8264 c06dae80 00000286 f6050f40 f6050f00 f7d34d90 f7d34fc8 c1e1be80 00000001 f6050000 00000000 f7e92d00 00000286 f6050f18 c0489d1a f6050f40 00006605 00000000 c0133a5b Call Trace: [<c04883a5>] schedule_timeout+0x6d/0x8b [<c04883d8>] schedule_timeout_uninterruptible+0x15/0x17 [<c0133a76>] msleep+0x10/0x16 [<c0138974>] sys_prctl+0x30/0x1e2 [<c0104c52>] sysenter_past_esp+0x5f/0xa5 ======================= 2 locks held by prctl/3042: #0: (&sb->s_type->i_mutex_key#5){--..}, at: [<c0197d11>] do_fsync+0x38/0x7a #1: (jbd_handle){--..}, at: [<c01ca3d2>] journal_start+0xc7/0xe9 <------------------ the current default timeout is 120 seconds. Such messages are printed up to 10 times per bootup. If the system has crashed already then the messages are not printed. if lockdep is enabled then all held locks are printed as well. this feature is a natural extension to the softlockup-detector (kernel locked up without scheduling) and to the NMI watchdog (kernel locked up with IRQs disabled). [ Gautham R Shenoy <ego@in.ibm.com>: CPU hotplug fixes. ] [ Andrew Morton <akpm@linux-foundation.org>: build warning fix. ] Signed-off-by: Ingo Molnar <mingo@elte.hu> Signed-off-by: Arjan van de Ven <arjan@linux.intel.com>	2008-01-25 21:08:02 +01:00
Gautham R Shenoy	95402b3829	cpu-hotplug: replace per-subsystem mutexes with get_online_cpus() This patch converts the known per-subsystem mutexes to get_online_cpus put_online_cpus. It also eliminates the CPU_LOCK_ACQUIRE and CPU_LOCK_RELEASE hotplug notification events. Signed-off-by: Gautham R Shenoy <ego@in.ibm.com> Signed-off-by: Ingo Molnar <mingo@elte.hu>	2008-01-25 21:08:02 +01:00
Gautham R Shenoy	86ef5c9a8e	cpu-hotplug: replace lock_cpu_hotplug() with get_online_cpus() Replace all lock_cpu_hotplug/unlock_cpu_hotplug from the kernel and use get_online_cpus and put_online_cpus instead as it highlights the refcount semantics in these operations. The new API guarantees protection against the cpu-hotplug operation, but it doesn't guarantee serialized access to any of the local data structures. Hence the changes needs to be reviewed. In case of pseries_add_processor/pseries_remove_processor, use cpu_maps_update_begin()/cpu_maps_update_done() as we're modifying the cpu_present_map there. Signed-off-by: Gautham R Shenoy <ego@in.ibm.com> Signed-off-by: Ingo Molnar <mingo@elte.hu>	2008-01-25 21:08:02 +01:00
Srivatsa Vaddagiri	6b2d770026	sched: group scheduler, fix fairness of cpu bandwidth allocation for task groups The current load balancing scheme isn't good enough for precise group fairness. For example: on a 8-cpu system, I created 3 groups as under: a = 8 tasks (cpu.shares = 1024) b = 4 tasks (cpu.shares = 1024) c = 3 tasks (cpu.shares = 1024) a, b and c are task groups that have equal weight. We would expect each of the groups to receive 33.33% of cpu bandwidth under a fair scheduler. This is what I get with the latest scheduler git tree: Signed-off-by: Ingo Molnar <mingo@elte.hu> -------------------------------------------------------------------------------- Col1 \| Col2 \| Col3 \| Col4 ------\|---------\|-------\|------------------------------------------------------- a \| 277.676 \| 57.8% \| 54.1% 54.1% 54.1% 54.2% 56.7% 62.2% 62.8% 64.5% b \| 116.108 \| 24.2% \| 47.4% 48.1% 48.7% 49.3% c \| 86.326 \| 18.0% \| 47.5% 47.9% 48.5% -------------------------------------------------------------------------------- Explanation of o/p: Col1 -> Group name Col2 -> Cumulative execution time (in seconds) received by all tasks of that group in a 60sec window across 8 cpus Col3 -> CPU bandwidth received by the group in the 60sec window, expressed in percentage. Col3 data is derived as: Col3 = 100 * Col2 / (NR_CPUS * 60) Col4 -> CPU bandwidth received by each individual task of the group. Col4 = 100 * cpu_time_recd_by_task / 60 [I can share the test case that produces a similar o/p if reqd] The deviation from desired group fairness is as below: a = +24.47% b = -9.13% c = -15.33% which is quite high. After the patch below is applied, here are the results: -------------------------------------------------------------------------------- Col1 \| Col2 \| Col3 \| Col4 ------\|---------\|-------\|------------------------------------------------------- a \| 163.112 \| 34.0% \| 33.2% 33.4% 33.5% 33.5% 33.7% 34.4% 34.8% 35.3% b \| 156.220 \| 32.5% \| 63.3% 64.5% 66.1% 66.5% c \| 160.653 \| 33.5% \| 85.8% 90.6% 91.4% -------------------------------------------------------------------------------- Deviation from desired group fairness is as below: a = +0.67% b = -0.83% c = +0.17% which is far better IMO. Most of other runs have yielded a deviation within +-2% at the most, which is good. Why do we see bad (group) fairness with current scheuler? ========================================================= Currently cpu's weight is just the summation of individual task weights. This can yield incorrect results. For ex: consider three groups as below on a 2-cpu system: CPU0 CPU1 --------------------------- A (10) B(5) C(5) --------------------------- Group A has 10 tasks, all on CPU0, Group B and C have 5 tasks each all of which are on CPU1. Each task has the same weight (NICE_0_LOAD = 1024). The current scheme would yield a cpu weight of 10240 (101024) for each cpu and the load balancer will think both CPUs are perfectly balanced and won't move around any tasks. This, however, would yield this bandwidth: A = 50% B = 25% C = 25% which is not the desired result. What's changing in the patch? ============================= - How cpu weights are calculated when CONFIF_FAIR_GROUP_SCHED is defined (see below) - API Change - Two tunables introduced in sysfs (under SCHED_DEBUG) to control the frequency at which the load balance monitor thread runs. The basic change made in this patch is how cpu weight (rq->load.weight) is calculated. Its now calculated as the summation of group weights on a cpu, rather than summation of task weights. Weight exerted by a group on a cpu is dependent on the shares allocated to it and also the number of tasks the group has on that cpu compared to the total number of (runnable) tasks the group has in the system. Let, W(K,i) = Weight of group K on cpu i T(K,i) = Task load present in group K's cfs_rq on cpu i T(K) = Total task load of group K across various cpus S(K) = Shares allocated to group K NRCPUS = Number of online cpus in the scheduler domain to which group K is assigned. Then, W(K,i) = S(K) NRCPUS * T(K,i) / T(K) A load balance monitor thread is created at bootup, which periodically runs and adjusts group's weight on each cpu. To avoid its overhead, two min/max tunables are introduced (under SCHED_DEBUG) to control the rate at which it runs. Fixes from: Peter Zijlstra <a.p.zijlstra@chello.nl> - don't start the load_balance_monitor when there is only a single cpu. - rename the kthread because its currently longer than TASK_COMM_LEN Signed-off-by: Srivatsa Vaddagiri <vatsa@linux.vnet.ibm.com> Signed-off-by: Ingo Molnar <mingo@elte.hu>	2008-01-25 21:08:00 +01:00
Srivatsa Vaddagiri	a183561567	sched: introduce a mutex and corresponding API to serialize access to doms_curarray doms_cur[] array represents various scheduling domains which are mutually exclusive. Currently cpusets code can modify this array (by calling partition_sched_domains()) as a result of user modifying sched_load_balance flag for various cpusets. This patch introduces a mutex and corresponding API (only when CONFIG_FAIR_GROUP_SCHED is defined) which allows a reader to safely read the doms_cur[] array w/o worrying abt concurrent modifications to the array. The fair group scheduler code (introduced in next patch of this series) makes use of this mutex to walk thr' doms_cur[] array while rebalancing shares of task groups across cpus. Signed-off-by: Srivatsa Vaddagiri <vatsa@linux.vnet.ibm.com> Signed-off-by: Ingo Molnar <mingo@elte.hu>	2008-01-25 21:08:00 +01:00
Srivatsa Vaddagiri	58e2d4ca58	sched: group scheduling, change how cpu load is calculated This patch changes how the cpu load exerted by fair_sched_class tasks is calculated. Load exerted by fair_sched_class tasks on a cpu is now a summation of the group weights, rather than summation of task weights. Weight exerted by a group on a cpu is dependent on the shares allocated to it. This version of patch has a minor impact on code size, but should have no runtime/functional impact for !CONFIG_FAIR_GROUP_SCHED. Signed-off-by: Srivatsa Vaddagiri <vatsa@linux.vnet.ibm.com> Signed-off-by: Ingo Molnar <mingo@elte.hu>	2008-01-25 21:08:00 +01:00
Srivatsa Vaddagiri	ec2c507fe8	sched: group scheduling, minor fixes Minor bug fixes for the group scheduler: - Use a mutex to serialize add/remove of task groups and also when changing shares of a task group. Use the same mutex when printing cfs_rq debugging stats for various task groups. - Use list_for_each_entry_rcu in for_each_leaf_cfs_rq macro (when walking task group list) Signed-off-by: Srivatsa Vaddagiri <vatsa@linux.vnet.ibm.com> Signed-off-by: Ingo Molnar <mingo@elte.hu>	2008-01-25 21:07:59 +01:00
Srivatsa Vaddagiri	93f992ccc0	sched: group scheduling code cleanup Minor cleanups: - Fix coding style - remove obsolete comment Signed-off-by: Srivatsa Vaddagiri <vatsa@linux.vnet.ibm.com> Signed-off-by: Ingo Molnar <mingo@elte.hu>	2008-01-25 21:07:59 +01:00
Ingo Molnar	c61935fd0e	sched: group scheduler, set uid share fix setting cpu share to 1 causes hangs, as reported in: http://bugzilla.kernel.org/show_bug.cgi?id=9779 as the default share is 1024, the values of 0 and 1 can indeed cause problems. Limit it to 2 or higher values. These values can only be set by the root user - but still it makes sense to protect against nonsensical values. Signed-off-by: Ingo Molnar <mingo@elte.hu>	2008-01-22 11:24:58 +01:00
Roland McGrath	fcfd50afb6	show_task: real_parent The show_task function invoked by sysrq-t et al displays the pid and parent's pid of each task. It seems more useful to show the actual process hierarchy here than who is using ptrace on each process. Signed-off-by: Roland McGrath <roland@redhat.com> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2008-01-09 08:03:58 -08:00
Ingo Molnar	2bacec8c31	sched: touch softlockup watchdog after idling touch softlockup watchdog after idling. Signed-off-by: Ingo Molnar <mingo@elte.hu>	2007-12-18 15:21:13 +01:00
Dmitry Adamushko	051a1d1afa	sched: fix crash on ia64, introduce task_current() Some services (e.g. sched_setscheduler(), rt_mutex_setprio() and sched_move_task()) must handle a given task differently in case it's the 'rq->curr' task on its run-queue. The task_running() interface is not suitable for determining such tasks for platforms with one of the following options: #define __ARCH_WANT_UNLOCKED_CTXSW #define __ARCH_WANT_INTERRUPTS_ON_CTXSW Due to the fact that it makes use of 'p->oncpu == 1' as a criterion but such a task is not necessarily 'rq->curr'. The detailed explanation is available here: https://lists.linux-foundation.org/pipermail/containers/2007-December/009262.html Signed-off-by: Dmitry Adamushko <dmitry.adamushko@gmail.com> Signed-off-by: Ingo Molnar <mingo@elte.hu> Tested-by: Dhaval Giani <dhaval@linux.vnet.ibm.com> Tested-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>	2007-12-18 15:21:13 +01:00
Ingo Molnar	8ced5f69e4	sched: enable early use of sched_clock() some platforms have sched_clock() implementations that cannot be called very early during wakeup. If it's called it might hang or crash in hard to debug ways. So only call update_rq_clock() [which calls sched_clock()] if sched_init() has already been called. (rq->idle is NULL before the scheduler is initialized.) Signed-off-by: Ingo Molnar <mingo@elte.hu>	2007-12-07 19:02:47 +01:00
Matthew Wilcox	009e577e07	Add wait_for_completion_killable Signed-off-by: Matthew Wilcox <willy@linux.intel.com>	2007-12-06 17:40:19 -05:00
Matthew Wilcox	d9514f6c6b	sched: Use task_contributes_to_load, TASK_ALL and TASK_NORMAL Signed-off-by: Matthew Wilcox <willy@linux.intel.com>	2007-12-06 17:34:59 -05:00
Ingo Molnar	41a2d6cfa3	sched: style cleanups style cleanup of various changes that were done recently. no code changed: text data bss dec hex filename 23680 2542 28 26250 668a sched.o.before 23680 2542 28 26250 668a sched.o.after Signed-off-by: Ingo Molnar <mingo@elte.hu>	2007-12-05 15:46:09 +01:00
Ingo Molnar	77034937dc	sched: fix crash in sys_sched_rr_get_interval() Luiz Fernando N. Capitulino reported that sched_rr_get_interval() crashes for SCHED_OTHER tasks that are on an idle runqueue. The fix is to return a 0 timeslice for tasks that are on an idle runqueue. (and which are not running, obviously) this also shrinks the code a bit: text data bss dec hex filename 47903 3934 336 52173 cbcd sched.o.before 47885 3934 336 52155 cbbb sched.o.after Reported-by: Luiz Fernando N. Capitulino <lcapitulino@mandriva.com.br> Signed-off-by: Ingo Molnar <mingo@elte.hu>	2007-12-04 17:04:39 +01:00
Srivatsa Vaddagiri	d842de871c	sched: cpu accounting controller (V2) Commit `cfb5285660` removed a useful feature for us, which provided a cpu accounting resource controller. This feature would be useful if someone wants to group tasks only for accounting purpose and doesnt really want to exercise any control over their cpu consumption. The patch below reintroduces the feature. It is based on Paul Menage's original patch (Commit `62d0df6406`), with these differences: - Removed load average information. I felt it needs more thought (esp to deal with SMP and virtualized platforms) and can be added for 2.6.25 after more discussions. - Convert group cpu usage to be nanosecond accurate (as rest of the cfs stats are) and invoke cpuacct_charge() from the respective scheduler classes - Make accounting scalable on SMP systems by splitting the usage counter to be per-cpu - Move the code from kernel/cpu_acct.c to kernel/sched.c (since the code is not big enough to warrant a new file and also this rightly needs to live inside the scheduler. Also things like accessing rq->lock while reading cpu usage becomes easier if the code lived in kernel/sched.c) The patch also modifies the cpu controller not to provide the same accounting information. Tested-by: Balbir Singh <balbir@linux.vnet.ibm.com> Tested the patches on top of 2.6.24-rc3. The patches work fine. Ran some simple tests like cpuspin (spin on the cpu), ran several tasks in the same group and timed them. Compared their time stamps with cpuacct.usage. Signed-off-by: Srivatsa Vaddagiri <vatsa@linux.vnet.ibm.com> Signed-off-by: Balbir Singh <balbir@linux.vnet.ibm.com> Signed-off-by: Ingo Molnar <mingo@elte.hu>	2007-12-02 20:04:49 +01:00
Ingo Molnar	deaf2227dd	sched: clean up, move __sched_text_start/end to sched.h move __sched_text_start/end to sched.h. No code changed: text data bss dec hex filename 26582 2310 28 28920 70f8 sched.o.before 26582 2310 28 28920 70f8 sched.o.after Signed-off-by: Ingo Molnar <mingo@elte.hu>	2007-11-28 15:52:56 +01:00
Ingo Molnar	9a4e715914	sched: clean up sd_alloc_ctl_cpu_table() definition clean up sd_alloc_ctl_cpu_table() definition. Signed-off-by: Ingo Molnar <mingo@elte.hu>	2007-11-28 15:52:56 +01:00
Ingo Molnar	9612633a21	sched: reorder SCHED_FEAT_ bits reorder SCHED_FEAT_ bits so that the used ones come first. Makes tuning instructions easier. Signed-off-by: Ingo Molnar <mingo@elte.hu>	2007-11-15 20:57:40 +01:00
Dmitry Adamushko	94bc9a7bd9	sched: remove activate_idle_task() cpu_down() code is ok wrt sched_idle_next() placing the 'idle' task not at the beginning of the queue. So get rid of activate_idle_task() and make use of activate_task() instead. It is the same as activate_task(), except for the update_rq_clock(rq) call that is redundant. Code size goes down: text data bss dec hex filename 47853 3934 336 52123 cb9b sched.o.before 47828 3934 336 52098 cb82 sched.o.after Signed-off-by: Dmitry Adamushko <dmitry.adamushko@gmail.com> Signed-off-by: Ingo Molnar <mingo@elte.hu>	2007-11-15 20:57:40 +01:00
Dmitry Adamushko	ce96b5ac74	sched: fix __set_task_cpu() SMP race Grant Wilson has reported rare SCHED_FAIR_USER crashes on his quad-core system, which crashes can only be explained via runqueue corruption. there is a narrow SMP race in __set_task_cpu(): after ->cpu is set up to a new value, task_rq_lock(p, ...) can be successfuly executed on another CPU. We must ensure that updates of per-task data have been completed by this moment. this bug has been hiding in the Linux scheduler for an eternity (we never had any explicit barrier for task->cpu in set_task_cpu() - so the bug was introduced in 2.5.1), but only became visible via set_task_cfs_rq() being accidentally put after the task->cpu update. It also probably needs a sufficiently out-of-order CPU to trigger. Reported-by: Grant Wilson <grant.wilson@zen.co.uk> Signed-off-by: Dmitry Adamushko <dmitry.adamushko@gmail.com> Signed-off-by: Ingo Molnar <mingo@elte.hu>	2007-11-15 20:57:40 +01:00
Oleg Nesterov	dae51f5620	sched: fix SCHED_FIFO tasks & FAIR_GROUP_SCHED Suppose that the SCHED_FIFO task does switch_uid(new_user); Now, p->se.cfs_rq and p->se.parent both point into the old user_struct->tg because sched_move_task() doesn't call set_task_cfs_rq() for !fair_sched_class case. Suppose that old user_struct/task_group is freed/reused, and the task does sched_setscheduler(SCHED_NORMAL); __setscheduler() sets fair_sched_class, but doesn't update ->se.cfs_rq/parent which point to the freed memory. This means that check_preempt_wakeup() doing while (!is_same_group(se, pse)) { se = parent_entity(se); pse = parent_entity(pse); } may OOPS in a similar way if rq->curr or p did something like above. Perhaps we need something like the patch below, note that __setscheduler() can't do set_task_cfs_rq(). Signed-off-by: Oleg Nesterov <oleg@tv-sign.ru> Signed-off-by: Ingo Molnar <mingo@elte.hu>	2007-11-15 20:57:40 +01:00
Christian Borntraeger	9778385db3	sched: fix accounting of interrupts during guest execution on s390 Currently the scheduler checks for PF_VCPU to decide if this timeslice has to be accounted as guest time. On s390 host interrupts are not disabled during guest execution. This causes theses interrupts to be accounted as guest time if CONFIG_VIRT_CPU_ACCOUNTING is set. Solution is to check if an interrupt triggered account_system_time. As the tick is timer interrupt based, we have to subtract hardirq_offset. I tested the patch on s390 with CONFIG_VIRT_CPU_ACCOUNTING and on x86_64. Seems to work. CC: Avi Kivity <avi@qumranet.com> CC: Laurent Vivier <Laurent.Vivier@bull.net> Signed-off-by: Christian Borntraeger <borntraeger@de.ibm.com> Signed-off-by: Ingo Molnar <mingo@elte.hu>	2007-11-15 20:57:39 +01:00
Andrew Morton	cfb5285660	revert "Task Control Groups: example CPU accounting subsystem" Revert `62d0df6406`. This was originally intended as a simple initial example of how to create a control groups subsystem; it wasn't intended for mainline, but I didn't make this clear enough to Andrew. The CFS cgroup subsystem now has better functionality for the per-cgroup usage accounting (based directly on CFS stats) than the "usage" status file in this patch, and the "load" status file is rather simplistic - although having a per-cgroup load average report would be a useful feature, I don't believe this patch actually provides it. If it gets into the final 2.6.24 we'd probably have to support this interface for ever. Cc: Paul Menage <menage@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2007-11-14 18:45:40 -08:00
Adrian Bunk	e6fe6649b4	sched: proper prototype for kernel/sched.c:migration_init() This patch adds a proper prototype for migration_init() in include/linux/sched.h Since there's no point in always returning 0 to a caller that doesn't check the return value it also changes the function to return void. Signed-off-by: Adrian Bunk <bunk@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Ingo Molnar <mingo@elte.hu>	2007-11-09 22:39:39 +01:00
Peter Zijlstra	b82d9fdd84	sched: avoid large irq-latencies in smp-balancing SMP balancing is done with IRQs disabled and can iterate the full rq. When rqs are large this can cause large irq-latencies. Limit the nr of iterations on each run. This fixes a scheduling latency regression reported by the -rt folks. Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl> Acked-by: Steven Rostedt <rostedt@goodmis.org> Tested-by: Gregory Haskins <ghaskins@novell.com> Signed-off-by: Ingo Molnar <mingo@elte.hu>	2007-11-09 22:39:39 +01:00
Ingo Molnar	3e3e13f399	sched: remove PREEMPT_RESTRICT remove PREEMPT_RESTRICT. (this is a separate commit so that any regression related to the removal itself is bisectable) Signed-off-by: Ingo Molnar <mingo@elte.hu>	2007-11-09 22:39:39 +01:00
Ingo Molnar	52d3da1ad4	sched: turn off PREEMPT_RESTRICT PREEMPT_RESTRICT was a method aimed at reducing the amount of wakeup related preemption. It has a disadvantage though, it can prevent legitimate wakeups if a task is 'unlucky' to be hit too early by a tick that clears peer_preempt. Now that the wakeup preemption has been cleaned up we dont seem to have excessive preemptions anymore, so this feature can be turned off. (and removed in the next patch) Signed-off-by: Ingo Molnar <mingo@elte.hu>	2007-11-09 22:39:39 +01:00
Eric Dumazet	d6322faf29	sched: cleanup, use NSEC_PER_MSEC and NSEC_PER_SEC 1) hardcoded 1000000000 value is used five times in places where NSEC_PER_SEC might be more readable. 2) A conversion from nsec to msec uses the hardcoded 1000000 value, which is a candidate for NSEC_PER_MSEC. no code changed: text data bss dec hex filename 44359 3326 36 47721 ba69 sched.o.before 44359 3326 36 47721 ba69 sched.o.after Signed-off-by: Eric Dumazet <dada1@cosmosbay.com> Signed-off-by: Ingo Molnar <mingo@elte.hu>	2007-11-09 22:39:38 +01:00
Ingo Molnar	19978ca610	sched: reintroduce SMP tunings again Yanmin Zhang reported an aim7 regression and bisected it down to: \| commit `38ad464d41` \| Author: Ingo Molnar <mingo@elte.hu> \| Date: Mon Oct 15 17:00:02 2007 +0200 \| \| sched: uniform tunings \| \| use the same defaults on both UP and SMP. fix this by reintroducing similar SMP tunings again. This resolves the regression. (also update the comments to match the ilog2(nr_cpus) tuning effect) Signed-off-by: Ingo Molnar <mingo@elte.hu>	2007-11-09 22:39:38 +01:00
Ingo Molnar	38605cae99	sched: fix style in kernel/sched.c fallout of recent commits: small coding style fixes. Signed-off-by: Ingo Molnar <mingo@elte.hu>	2007-10-29 21:18:11 +01:00
Paul Menage	fe5c7cc228	sched: report CPU usage in CFS cgroup directories Adds a cpu.usage file to the CFS cgroup that reports CPU usage in milliseconds for that cgroup's tasks [ mingo@elte.hu: style cleanups. ] Signed-off-by: Paul Menage <menage@google.com> Signed-off-by: Ingo Molnar <mingo@elte.hu>	2007-10-29 21:18:11 +01:00
Srivatsa Vaddagiri	ae8393e508	sched: move rcu_head to task_group struct Peter Zijlstra noticed that the rcu_head object need not be present in every cfs_rq of a group. Move it to the task_group structure instead. Signed-off-by: Srivatsa Vaddagiri <vatsa@linux.vnet.ibm.com> Signed-off-by: Ingo Molnar <mingo@elte.hu>	2007-10-29 21:18:11 +01:00
James Bottomley	7bae49d498	sched: fix incorrect assumption that cpu 0 exists This patch: commit `9b5b77512d` Author: Srivatsa Vaddagiri <vatsa@linux.vnet.ibm.com> Date: Mon Oct 15 17:00:09 2007 +0200 sched: clean up code under CONFIG_FAIR_GROUP_SCHED Introduced an assumption of the existence of CPU0 via this line cfs_rq = tg->cfs_rq[0]; If you have no CPU0, that will be NULL. The fix seems to be just to take whatever cfs_rq queue comes out of the for_each_possible_cpu() loop, since they're all equally good for the destruction operation. Signed-off-by: James Bottomley <James.Bottomley@SteelEye.com> Signed-off-by: Ingo Molnar <mingo@elte.hu>	2007-10-29 21:18:11 +01:00
Adrian Bunk	f7402e0361	sched: make kernel/sched.c:account_guest_time() static account_guest_time() can become static. Signed-off-by: Adrian Bunk <bunk@kernel.org> Signed-off-by: Ingo Molnar <mingo@elte.hu>	2007-10-29 21:18:10 +01:00
Peter Williams	681f3e6854	sched: isolate SMP balancing code a bit more At the moment, a lot of load balancing code that is irrelevant to non SMP systems gets included during non SMP builds. This patch addresses this issue and reduces the binary size on non SMP systems: text data bss dec hex filename 10983 28 1192 12203 2fab sched.o.before 10739 28 1192 11959 2eb7 sched.o.after Signed-off-by: Peter Williams <pwil3058@bigpond.net.au> Signed-off-by: Ingo Molnar <mingo@elte.hu>	2007-10-24 18:23:51 +02:00
Peter Williams	e1d1484f72	sched: reduce balance-tasks overhead At the moment, balance_tasks() provides low level functionality for both move_tasks() and move_one_task() (indirectly) via the load_balance() function (in the sched_class interface) which also provides dual functionality. This dual functionality complicates the interfaces and internal mechanisms and makes the run time overhead of operations that are called with two run queue locks held. This patch addresses this issue and reduces the overhead of these operations. Signed-off-by: Peter Williams <pwil3058@bigpond.net.au> Signed-off-by: Ingo Molnar <mingo@elte.hu>	2007-10-24 18:23:51 +02:00
Paul Menage	2b01dfe372	sched: clean up some control group code - replace "cont" with "cgrp" in a few places in the CFS cgroup code, - use write_uint rather than write for cpu.shares write function Signed-off-by: Paul Menage <menage@google.com> Acked-by : Srivatsa Vaddagiri <vatsa@linux.vnet.ibm.com> Signed-off-by: Ingo Molnar <mingo@elte.hu>	2007-10-24 18:23:50 +02:00
Satyam Sharma	838225b48e	sched: use show_regs() to improve __schedule_bug() output A full register dump along with stack backtrace would make the "scheduling while atomic" message more helpful. Use show_regs() instead of dump_stack() for this. We already know we're atomic in here (that is why this function was called) so show_regs()'s atomicity expectations are guaranteed. Also, modify the output of the "BUG: scheduling while atomic:" header a bit to keep task->comm and task->pid together and preempt_count() after them. Signed-off-by: Satyam Sharma <satyam@infradead.org> Signed-off-by: Ingo Molnar <mingo@elte.hu>	2007-10-24 18:23:50 +02:00
Ingo Molnar	4dcf6aff02	sched: clean up sched_domain_debug() clean up sched_domain_debug(). this also shrinks the code a bit: text data bss dec hex filename 50474 4306 480 55260 d7dc sched.o.before 50404 4306 480 55190 d796 sched.o.after Signed-off-by: Ingo Molnar <mingo@elte.hu>	2007-10-24 18:23:48 +02:00
Ingo Molnar	b15136e949	sched: fix fastcall mismatch in completion APIs Jeff Dike noticed that wait_for_completion_interruptible()'s prototype had a mismatched fastcall. Fix this by removing the fastcall attributes from all the completion APIs. Found-by: Jeff Dike <jdike@linux.intel.com> Signed-off-by: Ingo Molnar <mingo@elte.hu>	2007-10-24 18:23:48 +02:00
Milton Miller	7378547f2c	sched: fix sched_domain sysctl registration again commit `029190c515` (cpuset sched_load_balance flag) was not tested SCHED_DEBUG enabled as committed as it dereferences NULL when used and it reordered the sysctl registration to cause it to never show any domains or their tunables. Fixes: 1) restore arch_init_sched_domains ordering we can't walk the domains before we build them presently we register cpus with empty directories (no domain directories or files). 2) make unregister_sched_domain_sysctl do nothing when already unregistered detach_destroy_domains is now called one set of cpus at a time unregister_syctl dereferences NULL if called with a null. While the the function would always dereference null if called twice, in the previous code it was always called once and then was followed a register. So only the hidden bug of the sysctl_root_table not being allocated followed by an attempt to free it would have shown the error. 3) always call unregister and register in partition_sched_domains The code is "smart" about unregistering only needed domains. Since we aren't guaranteed any calls to unregister, always unregister. Without calling register on the way out we will not have a table or any sysctl tree. 4) warn if register is called without unregistering The previous table memory is lost, leaving pointers to the later freed memory in sysctl and leaking the memory of the tables. Before this patch on a 2-core 4-thread box compiled for SMT and NUMA, the domains appear empty (there are actually 3 levels per cpu). And as soon as two domains a null pointer is dereferenced (unreliable in this case is stack garbage): bu19a:~# ls -R /proc/sys/kernel/sched_domain/ /proc/sys/kernel/sched_domain/: cpu0 cpu1 cpu2 cpu3 /proc/sys/kernel/sched_domain/cpu0: /proc/sys/kernel/sched_domain/cpu1: /proc/sys/kernel/sched_domain/cpu2: /proc/sys/kernel/sched_domain/cpu3: bu19a:~# mkdir /dev/cpuset bu19a:~# mount -tcpuset cpuset /dev/cpuset/ bu19a:~# cd /dev/cpuset/ bu19a:/dev/cpuset# echo 0 > sched_load_balance bu19a:/dev/cpuset# mkdir one bu19a:/dev/cpuset# echo 1 > one/cpus bu19a:/dev/cpuset# echo 0 > one/sched_load_balance Unable to handle kernel paging request for data at address 0x00000018 Faulting instruction address: 0xc00000000006b608 NIP: c00000000006b608 LR: c00000000006b604 CTR: 0000000000000000 REGS: c000000018d973f0 TRAP: 0300 Not tainted (2.6.23-bml) MSR: 9000000000009032 <EE,ME,IR,DR> CR: 28242442 XER: 00000000 DAR: 0000000000000018, DSISR: 0000000040000000 TASK = c00000001912e340[1987] 'bash' THREAD: c000000018d94000 CPU: 2 .. NIP [c00000000006b608] .unregister_sysctl_table+0x38/0x110 LR [c00000000006b604] .unregister_sysctl_table+0x34/0x110 Call Trace: [c000000018d97670] [c000000007017270] 0xc000000007017270 (unreliable) [c000000018d97720] [c000000000058710] .detach_destroy_domains+0x30/0xb0 [c000000018d977b0] [c00000000005cf1c] .partition_sched_domains+0x1bc/0x230 [c000000018d97870] [c00000000009fdc4] .rebuild_sched_domains+0xb4/0x4c0 [c000000018d97970] [c0000000000a02e8] .update_flag+0x118/0x170 [c000000018d97a80] [c0000000000a1768] .cpuset_common_file_write+0x568/0x820 [c000000018d97c00] [c00000000009d95c] .cgroup_file_write+0x7c/0x180 [c000000018d97cf0] [c0000000000e76b8] .vfs_write+0xe8/0x1b0 [c000000018d97d90] [c0000000000e810c] .sys_write+0x4c/0x90 [c000000018d97e30] [c00000000000852c] syscall_exit+0x0/0x40 Signed-off-by: Milton Miller <miltonm@bga.com> Signed-off-by: Ingo Molnar <mingo@elte.hu>	2007-10-24 18:23:48 +02:00
Laurent Vivier	83d87d1673	sched: don't clear PF_VCPU in scheduler KVM clears it by itself now, and for s390 this is plain wrong. Signed-off-by: Laurent Vivier <Laurent.Vivier@bull.net> Acked-by: Ingo Molnar <mingo@elte.hu> Signed-off-by: Avi Kivity <avi@qumranet.com>	2007-10-22 12:03:29 +02:00
Michael Neuling	6888c1ecd6	kernel/sched.c: remove bogus comment from account_user_time hardirq_offset is no longer needed. Signed-off-by: Michael Neuling <mikey@neuling.org> Signed-off-by: Adrian Bunk <bunk@kernel.org>	2007-10-20 01:41:05 +02:00
Robert P. J. Day	3a4fa0a25d	Fix misspellings of "system", "controller", "interrupt" and "necessary". Fix the various misspellings of "system", controller", "interrupt" and "[un]necessary". Signed-off-by: Robert P. J. Day <rpjday@mindspring.com> Signed-off-by: Adrian Bunk <bunk@kernel.org>	2007-10-19 23:10:43 +02:00
Srivatsa Vaddagiri	68318b8e0b	Hook up group scheduler with control groups Enable "cgroup" (formerly containers) based fair group scheduling. This will let administrator create arbitrary groups of tasks (using "cgroup" pseudo filesystem) and control their cpu bandwidth usage. [akpm@linux-foundation.org: fix cpp condition] Signed-off-by: Srivatsa Vaddagiri <vatsa@linux.vnet.ibm.com> Signed-off-by: Dhaval Giani <dhaval@linux.vnet.ibm.com> Cc: Randy Dunlap <randy.dunlap@oracle.com> Cc: Balbir Singh <balbir@in.ibm.com> Cc: Paul Menage <menage@google.com> Cc: Ingo Molnar <mingo@elte.hu> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2007-10-19 11:53:51 -07:00
Cliff Wickman	470fd64644	hotplug cpu: migrate a task within its cpuset When a cpu is disabled, move_task_off_dead_cpu() is called for tasks that have been running on that cpu. Currently, such a task is migrated: 1) to any cpu on the same node as the disabled cpu, which is both online and among that task's cpus_allowed 2) to any cpu which is both online and among that task's cpus_allowed It is typical of a multithreaded application running on a large NUMA system to have its tasks confined to a cpuset so as to cluster them near the memory that they share. Furthermore, it is typical to explicitly place such a task on a specific cpu in that cpuset. And in that case the task's cpus_allowed includes only a single cpu. This patch would insert a preference to migrate such a task to some cpu within its cpuset (and set its cpus_allowed to its entire cpuset). With this patch, migrate the task to: 1) to any cpu on the same node as the disabled cpu, which is both online and among that task's cpus_allowed 2) to any online cpu within the task's cpuset 3) to any cpu which is both online and among that task's cpus_allowed In order to do this, move_task_off_dead_cpu() must make a call to cpuset_cpus_allowed_locked(), a new subset of cpuset_cpus_allowed(), that will not block. (name change - per Oleg's suggestion) Calls are made to cpuset_lock() and cpuset_unlock() in migration_call() to set the cpuset mutex during the whole migrate_live_tasks() and migrate_dead_tasks() procedure. [akpm@linux-foundation.org: build fix] [pj@sgi.com: Fix indentation and spacing] Signed-off-by: Cliff Wickman <cpw@sgi.com> Cc: Oleg Nesterov <oleg@tv-sign.ru> Cc: Christoph Lameter <clameter@sgi.com> Cc: Paul Jackson <pj@sgi.com> Cc: Ingo Molnar <mingo@elte.hu> Signed-off-by: Paul Jackson <pj@sgi.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2007-10-19 11:53:44 -07:00
Pavel Emelyanov	ba25f9dcc4	Use helpers to obtain task pid in printks The task_struct->pid member is going to be deprecated, so start using the helpers (task_pid_nr/task_pid_vnr/task_pid_nr_ns) in the kernel. The first thing to start with is the pid, printed to dmesg - in this case we may safely use task_pid_nr(). Besides, printks produce more (much more) than a half of all the explicit pid usage. [akpm@linux-foundation.org: git-drm went and changed lots of stuff] Signed-off-by: Pavel Emelyanov <xemul@openvz.org> Cc: Dave Airlie <airlied@linux.ie> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2007-10-19 11:53:43 -07:00
Eugene Teo	270f722d4d	Fix tsk->exit_state usage tsk->exit_state can only be 0, EXIT_ZOMBIE, or EXIT_DEAD. A non-zero test is the same as tsk->exit_state & (EXIT_ZOMBIE \| EXIT_DEAD), so just testing tsk->exit_state is sufficient. Signed-off-by: Eugene Teo <eugeneteo@kernel.sg> Cc: Roland McGrath <roland@redhat.com> Cc: Oleg Nesterov <oleg@tv-sign.ru> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2007-10-19 11:53:42 -07:00
Paul Menage	8707d8b8c0	Fix cpusets update_cpumask Cause writes to cpuset "cpus" file to update cpus_allowed for member tasks: - collect batches of tasks under tasklist_lock and then call set_cpus_allowed() on them outside the lock (since this can sleep). - add a simple generic priority heap type to allow efficient collection of batches of tasks to be processed without duplicating or missing any tasks in subsequent batches. - make "cpus" file update a no-op if the mask hasn't changed - fix race between update_cpumask() and sched_setaffinity() by making sched_setaffinity() post-check that it's not running on any cpus outside cpuset_cpus_allowed(). [akpm@linux-foundation.org: coding-style fixes] Signed-off-by: Paul Menage <menage@google.com> Cc: Paul Jackson <pj@sgi.com> Cc: David Rientjes <rientjes@google.com> Cc: Nick Piggin <nickpiggin@yahoo.com.au> Cc: Peter Zijlstra <a.p.zijlstra@chello.nl> Cc: Balbir Singh <balbir@in.ibm.com> Cc: Cedric Le Goater <clg@fr.ibm.com> Cc: "Eric W. Biederman" <ebiederm@xmission.com> Cc: Serge Hallyn <serue@us.ibm.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2007-10-19 11:53:41 -07:00
Paul Jackson	029190c515	cpuset sched_load_balance flag Add a new per-cpuset flag called 'sched_load_balance'. When enabled in a cpuset (the default value) it tells the kernel scheduler that the scheduler should provide the normal load balancing on the CPUs in that cpuset, sometimes moving tasks from one CPU to a second CPU if the second CPU is less loaded and if that task is allowed to run there. When disabled (write "0" to the file) then it tells the kernel scheduler that load balancing is not required for the CPUs in that cpuset. Now even if this flag is disabled for some cpuset, the kernel may still have to load balance some or all the CPUs in that cpuset, if some overlapping cpuset has its sched_load_balance flag enabled. If there are some CPUs that are not in any cpuset whose sched_load_balance flag is enabled, the kernel scheduler will not load balance tasks to those CPUs. Moreover the kernel will partition the 'sched domains' (non-overlapping sets of CPUs over which load balancing is attempted) into the finest granularity partition that it can find, while still keeping any two CPUs that are in the same shed_load_balance enabled cpuset in the same element of the partition. This serves two purposes: 1) It provides a mechanism for real time isolation of some CPUs, and 2) it can be used to improve performance on systems with many CPUs by supporting configurations in which load balancing is not done across all CPUs at once, but rather only done in several smaller disjoint sets of CPUs. This mechanism replaces the earlier overloading of the per-cpuset flag 'cpu_exclusive', which overloading was removed in an earlier patch: cpuset-remove-sched-domain-hooks-from-cpusets See further the Documentation and comments in the code itself. [akpm@linux-foundation.org: don't be weird] Signed-off-by: Paul Jackson <pj@sgi.com> Acked-by: Ingo Molnar <mingo@elte.hu> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2007-10-19 11:53:41 -07:00
Pavel Emelyanov	228ebcbe63	Uninline find_task_by_xxx set of functions The find_task_by_something is a set of macros are used to find task by pid depending on what kind of pid is proposed - global or virtual one. All of them are wrappers above the most generic one - find_task_by_pid_type_ns() - and just substitute some args for it. It turned out, that dereferencing the current->nsproxy->pid_ns construction and pushing one more argument on the stack inline cause kernel text size to grow. This patch moves all this stuff out-of-line into kernel/pid.c. Together with the next patch it saves a bit less than 400 bytes from the .text section. Signed-off-by: Pavel Emelyanov <xemul@openvz.org> Cc: Sukadev Bhattiprolu <sukadev@us.ibm.com> Cc: Oleg Nesterov <oleg@tv-sign.ru> Cc: Paul Menage <menage@google.com> Cc: "Eric W. Biederman" <ebiederm@xmission.com> Acked-by: Ingo Molnar <mingo@elte.hu> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2007-10-19 11:53:40 -07:00
Pavel Emelyanov	b488893a39	pid namespaces: changes to show virtual ids to user This is the largest patch in the set. Make all (I hope) the places where the pid is shown to or get from user operate on the virtual pids. The idea is: - all in-kernel data structures must store either struct pid itself or the pid's global nr, obtained with pid_nr() call; - when seeking the task from kernel code with the stored id one should use find_task_by_pid() call that works with global pids; - when showing pid's numerical value to the user the virtual one should be used, but however when one shows task's pid outside this task's namespace the global one is to be used; - when getting the pid from userspace one need to consider this as the virtual one and use appropriate task/pid-searching functions. [akpm@linux-foundation.org: build fix] [akpm@linux-foundation.org: nuther build fix] [akpm@linux-foundation.org: yet nuther build fix] [akpm@linux-foundation.org: remove unneeded casts] Signed-off-by: Pavel Emelyanov <xemul@openvz.org> Signed-off-by: Alexey Dobriyan <adobriyan@openvz.org> Cc: Sukadev Bhattiprolu <sukadev@us.ibm.com> Cc: Oleg Nesterov <oleg@tv-sign.ru> Cc: Paul Menage <menage@google.com> Cc: "Eric W. Biederman" <ebiederm@xmission.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2007-10-19 11:53:40 -07:00
Paul Menage	62d0df6406	Task Control Groups: example CPU accounting subsystem This example demonstrates how to use the generic cgroup subsystem for a simple resource tracker that counts, for the processes in a cgroup, the total CPU time used and the %CPU used in the last complete 10 second interval. Portions contributed by Balbir Singh <balbir@in.ibm.com> Signed-off-by: Paul Menage <menage@google.com> Cc: Serge E. Hallyn <serue@us.ibm.com> Cc: "Eric W. Biederman" <ebiederm@xmission.com> Cc: Dave Hansen <haveblue@us.ibm.com> Cc: Balbir Singh <balbir@in.ibm.com> Cc: Paul Jackson <pj@sgi.com> Cc: Kirill Korotaev <dev@openvz.org> Cc: Herbert Poetzl <herbert@13thfloor.at> Cc: Srivatsa Vaddagiri <vatsa@in.ibm.com> Cc: Cedric Le Goater <clg@fr.ibm.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2007-10-19 11:53:36 -07:00
Linus Torvalds	54e840dd50	Merge git://git.kernel.org/pub/scm/linux/kernel/git/mingo/linux-2.6-sched * git://git.kernel.org/pub/scm/linux/kernel/git/mingo/linux-2.6-sched: sched: reduce schedstat variable overhead a bit sched: add KERN_CONT annotation sched: cleanup, make struct rq comments more consistent sched: cleanup, fix spacing sched: fix return value of wait_for_completion_interruptible()	2007-10-18 14:54:03 -07:00
Michael Neuling	c66f08be7e	Add scaled time to taskstats based process accounting This adds items to the taststats struct to account for user and system time based on scaling the CPU frequency and instruction issue rates. Adds account_(user\|system)_time_scaled callbacks which architectures can use to account for time using this mechanism. Signed-off-by: Michael Neuling <mikey@neuling.org> Cc: Balbir Singh <balbir@in.ibm.com> Cc: Jay Lan <jlan@engr.sgi.com> Cc: Paul Mackerras <paulus@samba.org> Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2007-10-18 14:37:28 -07:00
Ken Chen	480b9434c5	sched: reduce schedstat variable overhead a bit schedstat is useful in investigating CPU scheduler behavior. Ideally, I think it is beneficial to have it on all the time. However, the cost of turning it on in production system is quite high, largely due to number of events it collects and also due to its large memory footprint. Most of the fields probably don't need to be full 64-bit on 64-bit arch. Rolling over 4 billion events will most like take a long time and user space tool can be made to accommodate that. I'm proposing kernel to cut back most of variable width on 64-bit system. (note, the following patch doesn't affect 32-bit system). Signed-off-by: Ken Chen <kenchen@google.com> Signed-off-by: Ingo Molnar <mingo@elte.hu>	2007-10-18 21:32:56 +02:00
Ingo Molnar	cc4ea79588	sched: add KERN_CONT annotation printk: add the KERN_CONT annotation (which is empty string but via which checkpatch.pl can notice that the lacking KERN_ level is fine). Signed-off-by: Ingo Molnar <mingo@elte.hu>	2007-10-18 21:32:56 +02:00
Ingo Molnar	d801649162	sched: cleanup, make struct rq comments more consistent cleanup, make struct rq comments more consistent. found via scripts/checkpatch.pl. Signed-off-by: Ingo Molnar <mingo@elte.hu>	2007-10-18 21:32:55 +02:00
Ingo Molnar	8401f77505	sched: cleanup, fix spacing cleanup: fix sysctl_sched_features initialization spacing, and fix sd_alloc_ctl_cpu_table() prototype spacing. found via scripts/checkpatch.pl. Signed-off-by: Ingo Molnar <mingo@elte.hu>	2007-10-18 21:32:55 +02:00
Andi Kleen	51e9799023	sched: fix return value of wait_for_completion_interruptible() The recent wait_for_completion() cleanups: commit `8cbbe86dfc` Author: Andi Kleen <ak@suse.de> Date: Mon Oct 15 17:00:14 2007 +0200 sched: cleanup: refactor common code of sleep_on / wait_for_completion Refactor common code of sleep_on / wait_for_completion broke the return value of wait_for_completion_interruptible(). Previously it returned 0 on success, now -1. Fix that. Problem found by Geert Uytterhoeven. [ mingo: fixed whitespace damage ] Reported-by: Geert Uytterhoeven <Geert.Uytterhoeven@sonycom.com> Signed-off-by: Andi Kleen <ak@suse.de> Signed-off-by: Ingo Molnar <mingo@elte.hu>	2007-10-18 21:32:55 +02:00
Linus Torvalds	e6d5a11dad	Merge git://git.kernel.org/pub/scm/linux/kernel/git/mingo/linux-2.6-sched * git://git.kernel.org/pub/scm/linux/kernel/git/mingo/linux-2.6-sched: sched: fix new task startup crash sched: fix !SYSFS build breakage sched: fix improper load balance across sched domain sched: more robust sd-sysctl entry freeing	2007-10-17 09:11:18 -07:00
Oleg Nesterov	d2da272a4e	migration_call(CPU_DEAD): use spin_lock_irq() instead of task_rq_lock() Change migration_call(CPU_DEAD) to use direct spin_lock_irq() instead of task_rq_lock(rq->idle), rq->idle can't change its task_rq(). This makes the code a bit more symmetrical with migrate_dead_tasks()'s path which uses spin_lock_irq/spin_unlock_irq. Signed-off-by: Oleg Nesterov <oleg@tv-sign.ru> Cc: Cliff Wickman <cpw@sgi.com> Cc: Gautham R Shenoy <ego@in.ibm.com> Cc: Ingo Molnar <mingo@elte.hu> Cc: Srivatsa Vaddagiri <vatsa@in.ibm.com> Cc: Akinobu Mita <akinobu.mita@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2007-10-17 08:43:03 -07:00
Oleg Nesterov	f7b4cddcc5	do CPU_DEAD migrating under read_lock(tasklist) instead of write_lock_irq(tasklist) Currently move_task_off_dead_cpu() is called under write_lock_irq(tasklist). This means it can't use task_lock() which is needed to improve migrating to take task's ->cpuset into account. Change the code to call move_task_off_dead_cpu() with irqs enabled, and change migrate_live_tasks() to use read_lock(tasklist). This all is a preparation for the futher changes proposed by Cliff Wickman, see http://marc.info/?t=117327786100003 Signed-off-by: Oleg Nesterov <oleg@tv-sign.ru> Cc: Cliff Wickman <cpw@sgi.com> Cc: Gautham R Shenoy <ego@in.ibm.com> Cc: Ingo Molnar <mingo@elte.hu> Cc: Srivatsa Vaddagiri <vatsa@in.ibm.com> Cc: Akinobu Mita <akinobu.mita@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2007-10-17 08:43:03 -07:00
Srivatsa Vaddagiri	b9dca1e0fc	sched: fix new task startup crash Child task may be added on a different cpu that the one on which parent is running. In which case, task_new_fair() should check whether the new born task's parent entity should be added as well on the cfs_rq. Patch below fixes the problem in task_new_fair. This could fix the put_prev_task_fair() crashes reported. Reported-by: Kamalesh Babulal <kamalesh@linux.vnet.ibm.com> Reported-by: Andy Whitcroft <apw@shadowen.org> Signed-off-by: Srivatsa Vaddagiri <vatsa@linux.vnet.ibm.com> Signed-off-by: Ingo Molnar <mingo@elte.hu>	2007-10-17 16:55:11 +02:00
Ken Chen	908a7c1b9b	sched: fix improper load balance across sched domain We recently discovered a nasty performance bug in the kernel CPU load balancer where we were hit by 50% performance regression. When tasks are assigned to a subset of CPUs that span across sched_domains (either ccNUMA node or the new multi-core domain) via cpu affinity, kernel fails to perform proper load balance at these domains, due to several logic in find_busiest_group() miss identified busiest sched group within a given domain. This leads to inadequate load balance and causes 50% performance hit. To give you a concrete example, on a dual-core, 2 socket numa system, there are 4 logical cpu, organized as: CPU0 attaching sched-domain: domain 0: span 0003 groups: 0001 0002 domain 1: span 000f groups: 0003 000c CPU1 attaching sched-domain: domain 0: span 0003 groups: 0002 0001 domain 1: span 000f groups: 0003 000c CPU2 attaching sched-domain: domain 0: span 000c groups: 0004 0008 domain 1: span 000f groups: 000c 0003 CPU3 attaching sched-domain: domain 0: span 000c groups: 0008 0004 domain 1: span 000f groups: 000c 0003 If I run 2 tasks with CPU affinity set to 0x5. There are situation where cpu0 has run queue length of 2, and cpu2 will be idle. The kernel load balancer is unable to balance out these two tasks over cpu0 and cpu2 due to at least three logics in find_busiest_group() that heavily bias load balance towards power saving mode. e.g. while determining "busiest" variable, kernel only set it when "sum_nr_running > group_capacity". This test is flawed that "sum_nr_running" is not necessary same as sum-tasks-allowed-to-run-within-the sched-group. The end result is that kernel "think" everything is balanced, but in reality we have an imbalance and thus causing one CPU to be over-subscribed and leaving other idle. There are two other logic in the same function will also causing similar effect. The nastiness of this bug is that kernel not be able to get unstuck in this unfortunate broken state. From what we've seen in our environment, kernel will stuck in imbalanced state for extended period of time and it is also very easy for the kernel to stuck into that state (it's pretty much 100% reproducible for us). So proposing the following fix: add addition logic in find_busiest_group to detect intrinsic imbalance within the busiest group. When such condition is detected, load balance goes into spread mode instead of default grouping mode. Signed-off-by: Ken Chen <kenchen@google.com> Signed-off-by: Ingo Molnar <mingo@elte.hu>	2007-10-17 16:55:11 +02:00
Milton Miller	cd79007634	sched: more robust sd-sysctl entry freeing It occurred to me this morning that the procname field was dynamically allocated and needed to be freed. I started to put in break statements when allocation failed but it was approaching 50% error handling code. I came up with this alternative of looping while entry->mode is set and checking proc_handler instead of ->table. Alternatively, the string version of the domain name and cpu number could be stored the structs. I verified by compiling CONFIG_DEBUG_SLAB and checking the allocation counts after taking a cpuset exclusive and back. Signed-off-by: Ingo Molnar <mingo@elte.hu>	2007-10-17 16:55:11 +02:00
Paul Jackson	607717a65d	cpuset: remove sched domain hooks from cpusets Remove the cpuset hooks that defined sched domains depending on the setting of the 'cpu_exclusive' flag. The cpu_exclusive flag can only be set on a child if it is set on the parent. This made that flag painfully unsuitable for use as a flag defining a partitioning of a system. It was entirely unobvious to a cpuset user what partitioning of sched domains they would be causing when they set that one cpu_exclusive bit on one cpuset, because it depended on what CPUs were in the remainder of that cpusets siblings and child cpusets, after subtracting out other cpu_exclusive cpusets. Furthermore, there was no way on production systems to query the result. Using the cpu_exclusive flag for this was simply wrong from the get go. Fortunately, it was sufficiently borked that so far as I know, almost no successful use has been made of this. One real time group did use it to affectively isolate CPUs from any load balancing efforts. They are willing to adapt to alternative mechanisms for this, such as someway to manipulate the list of isolated CPUs on a running system. They can do without this present cpu_exclusive based mechanism while we develop an alternative. There is a real risk, to the best of my understanding, of users accidentally setting up a partitioned scheduler domains, inhibiting desired load balancing across all their CPUs, due to the nonobvious (from the cpuset perspective) side affects of the cpu_exclusive flag. Furthermore, since there was no way on a running system to see what one was doing with sched domains, this change will be invisible to any using code. Unless they have real insight to the scheduler load balancing choices, they will be unable to detect that this change has been made in the kernel's behaviour. Initial discussion on lkml of this patch has generated much comment. My (probably controversial) take on that discussion is that it has reached a rough concensus that the current cpuset cpu_exclusive mechanism for defining sched domains is borked. There is no concensus on the replacement. But since we can remove this mechanism, and since its continued presence risks causing unwanted partitioning of the schedulers load balancing, we should remove it while we can, as we proceed to work the replacement scheduler domain mechanisms. Signed-off-by: Paul Jackson <pj@sgi.com> Cc: Ingo Molnar <mingo@elte.hu> Cc: Nick Piggin <nickpiggin@yahoo.com.au> Cc: Christoph Lameter <clameter@engr.sgi.com> Cc: Dinakar Guniguntala <dino@in.ibm.com> Cc: Thomas Gleixner <tglx@linutronix.de> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2007-10-16 09:43:09 -07:00
Mike Travis	d5a7430ddc	Convert cpu_sibling_map to be a per cpu variable Convert cpu_sibling_map from a static array sized by NR_CPUS to a per_cpu variable. This saves sizeof(cpumask_t) * NR unused cpus. Access is mostly from startup and CPU HOTPLUG functions. Signed-off-by: Mike Travis <travis@sgi.com> Cc: Andi Kleen <ak@suse.de> Cc: Christoph Lameter <clameter@sgi.com> Cc: "Siddha, Suresh B" <suresh.b.siddha@intel.com> Cc: "David S. Miller" <davem@davemloft.net> Cc: Paul Mackerras <paulus@samba.org> Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org> Cc: "Luck, Tony" <tony.luck@intel.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2007-10-16 09:42:50 -07:00
Ingo Molnar	9c63d9c021	sched: sync wakeups preempt too make sure sync wakeups preempt too - the scheduler will not overschedule as we've got various throttles against that. As a result, sync wakeups can be used more widely in the kernel (to signal wakeup affinity between tasks), and no arbitrary latencies will be introduced either. Signed-off-by: Ingo Molnar <mingo@elte.hu>	2007-10-15 17:00:20 +02:00
Ingo Molnar	71e20f1873	sched: affine sync wakeups make sync wakeups affine for cache-cold tasks: if a cache-cold task is woken up by a sync wakeup then use the opportunity to migrate it straight away. (the two tasks are 'related' because they communicate) Signed-off-by: Ingo Molnar <mingo@elte.hu>	2007-10-15 17:00:19 +02:00
Laurent Vivier	94886b84b1	sched: guest CPU accounting: maintain stats in account_system_time() modify account_system_time() to add cputime to cpustat->guest if we are running a VCPU. We add this cputime to cpustat->user instead of cpustat->system because this part of KVM code is in fact user code although it is executed in the kernel. We duplicate VCPU time between guest and user to allow an unmodified "top(1)" to display correct value. A modified "top(1)" is able to display good cpu user time and cpu guest time by subtracting cpu guest time from cpu user time. Update "gtime" in task_struct accordingly. Signed-off-by: Laurent Vivier <Laurent.Vivier@bull.net> Acked-by: Avi Kivity <avi@qumranet.com> Signed-off-by: Ingo Molnar <mingo@elte.hu>	2007-10-15 17:00:19 +02:00
Milton Miller	6323469f9b	sched: domain sysctl fixes: add terminator comment we had an incorrect-terminator bug in sd_alloc_ctl_domain_table() before, so add a comment that documents it. Signed-off-by: Milton Miller <miltonm@bga.com> Signed-off-by: Ingo Molnar <mingo@elte.hu>	2007-10-15 17:00:19 +02:00
Milton Miller	ad1cdc1d78	sched: domain sysctl fixes: do not crash on allocation failure Now that we are calling this at runtime, a more relaxed error path is suggested. If an allocation fails, we just register the partial table, which will show empty directories. Signed-off-by: Milton Miller <miltonm@bga.com> Signed-off-by: Ingo Molnar <mingo@elte.hu>	2007-10-15 17:00:19 +02:00
Milton Miller	6382bc90f5	sched: domain sysctl fixes: unregister the sysctl table before domains Unregister and free the sysctl table before destroying domains, then rebuild and register after creating the new domains. This prevents the sysctl table from pointing to freed memory for root to write. Signed-off-by: Milton Miller <miltonm@bga.com> Signed-off-by: Ingo Molnar <mingo@elte.hu>	2007-10-15 17:00:19 +02:00
Milton Miller	97b6ea7b63	sched: domain sysctl fixes: use for_each_online_cpu() init_sched_domain_sysctl was walking cpus 0-n and referencing per_cpu variables. If the cpus_possible mask is not contigious this will result in a crash referencing unallocated data. If the online mask is not contigious then we would show offline cpus and miss online ones. Signed-off-by: Milton Miller <miltonm@bga.com> Signed-off-by: Ingo Molnar <mingo@elte.hu>	2007-10-15 17:00:19 +02:00
Milton Miller	5cf9f062c8	sched: domain sysctl fixes: use kcalloc() kcalloc checks for n * sizeof(element) overflows and it zeros. Signed-off-by: Milton Miller <miltonm@bga.com> Signed-off-by: Ingo Molnar <mingo@elte.hu>	2007-10-15 17:00:19 +02:00
Ingo Molnar	6bc1665ba7	sched: allow the immediate migration of cache-cold tasks allow the immediate migration of cache-cold tasks. Signed-off-by: Ingo Molnar <mingo@elte.hu>	2007-10-15 17:00:18 +02:00
Ingo Molnar	cc367732ff	sched: debug, improve migration statistics add new migration statistics when SCHED_DEBUG and SCHEDSTATS is enabled. Available in /proc/<PID>/sched. Signed-off-by: Ingo Molnar <mingo@elte.hu>	2007-10-15 17:00:18 +02:00
Peter Zijlstra	ff56b2f015	sched: activate task_hot() only on fair-scheduled tasks activate task_hot() only for fair-scheduled tasks (i.e. disable it for RT tasks). Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl> Signed-off-by: Ingo Molnar <mingo@elte.hu>	2007-10-15 17:00:18 +02:00
Ingo Molnar	da84d96176	sched: reintroduce cache-hot affinity reintroduce a simplified version of cache-hot/cold scheduling affinity. This improves performance with certain SMP workloads, such as sysbench. Signed-off-by: Ingo Molnar <mingo@elte.hu>	2007-10-15 17:00:18 +02:00
Ingo Molnar	178be79348	sched: do not normalize kernel threads via SysRq-N do not normalize kernel threads via SysRq-N: the migration threads, softlockup threads, etc. might be essential for the system to function properly. So only zap user tasks. pointed out by Andi Kleen. Signed-off-by: Ingo Molnar <mingo@elte.hu>	2007-10-15 17:00:18 +02:00
Andi Kleen	1666703af9	sched: remove stale comment from sched_group_set_shares() remove stale comment from sched_group_set_shares(). Function never returns -EINVAL. Signed-off-by: Andi Kleen <ak@suse.de> Signed-off-by: Ingo Molnar <mingo@elte.hu>	2007-10-15 17:00:18 +02:00
Ingo Molnar	d5036e89dc	sched: clean up is_migration_thread() clean up is_migration_thread() and turn it into an inline function. Signed-off-by: Ingo Molnar <mingo@elte.hu>	2007-10-15 17:00:15 +02:00
Andi Kleen	3a5e4dc12f	sched: cleanup: refactor normalize_rt_tasks Replace a particularly ugly ifdef with an inline and a new macro. Also split up the function to be easier to read. Signed-off-by: Andi Kleen <ak@suse.de> Signed-off-by: Ingo Molnar <mingo@elte.hu>	2007-10-15 17:00:15 +02:00
Andi Kleen	8cbbe86dfc	sched: cleanup: refactor common code of sleep_on / wait_for_completion Refactor common code of sleep_on / wait_for_completion These functions were largely cut'n'pasted. This moves the common code into single helpers instead. Advantage is about 1k less code on x86-64 and 91 lines of code removed. It adds one function call to the non timeout version of the functions; i don't expect this to be measurable. Signed-off-by: Andi Kleen <ak@suse.de> Signed-off-by: Ingo Molnar <mingo@elte.hu>	2007-10-15 17:00:14 +02:00
Andi Kleen	3a5c359a58	sched: cleanup: remove unnecessary gotos Replace loops implemented with gotos with real loops. Replace err = ...; goto x; x: return err; with return ...; No functional changes. Signed-off-by: Andi Kleen <ak@suse.de> Signed-off-by: Ingo Molnar <mingo@elte.hu>	2007-10-15 17:00:14 +02:00
Mike Galbraith	95938a35c5	sched: prevent wakeup over-scheduling Prevent wakeup over-scheduling. Once a task has been preempted by a task of the same or lower priority, it becomes ineligible for repeated preemption by same until it has been ticked, or slept. Instead, the task is marked for preemption at the next tick. Tasks of higher priority still preempt immediately. Signed-off-by: Mike Galbraith <efault@gmx.de> Signed-off-by: Ingo Molnar <mingo@elte.hu>	2007-10-15 17:00:14 +02:00
Peter Zijlstra	ce6c131131	sched: disable forced preemption by default Implement feature bit to disable forced preemption. This way it can be checked whether a workload is overscheduling or not. Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl> Signed-off-by: Ingo Molnar <mingo@elte.hu>	2007-10-15 17:00:14 +02:00
Zou Nan hai	ace8b3d633	sched: some proc entries are missed in sched_domain sys_ctl debug code cache_nice_tries and flags entry do not appear in proc fs sched_domain directory, because ctl_table entry is skipped. This patch fixes the issue. Signed-off-by: Zou Nan hai <nanhai.zou@intel.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Ingo Molnar <mingo@elte.hu>	2007-10-15 17:00:14 +02:00
Gautham R Shenoy	638e13ac37	sched: fix rt ptracer monopolizing CPU yield() in wait_task_inactive(), can cause a high priority thread to be scheduled back in, and there by loop forever while it is waiting for some lower priority thread which is unfortunately still on the runqueue. Use schedule_timeout_uninterruptible(1) instead. Signed-off-by: Gautham R Shenoy <ego@in.ibm.com> Credit: Oleg Nesterov <oleg@tv-sign.ru> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Ingo Molnar <mingo@elte.hu>	2007-10-15 17:00:14 +02:00
Dhaval Giani	5cb350baf5	sched: group scheduling, sysfs tunables Add tunables in sysfs to modify a user's cpu share. A directory is created in sysfs for each new user in the system. /sys/kernel/uids/<uid>/cpu_share Reading this file returns the cpu shares granted for the user. Writing into this file modifies the cpu share for the user. Only an administrator is allowed to modify a user's cpu share. Ex: # cd /sys/kernel/uids/ # cat 512/cpu_share 1024 # echo 2048 > 512/cpu_share # cat 512/cpu_share 2048 # Signed-off-by: Srivatsa Vaddagiri <vatsa@linux.vnet.ibm.com> Signed-off-by: Dhaval Giani <dhaval@linux.vnet.ibm.com> Signed-off-by: Ingo Molnar <mingo@elte.hu>	2007-10-15 17:00:14 +02:00
Paul E. McKenney	a58f6f253d	sched: export cpu_clock() export cpu_clock() - the preferred API instead of sched_clock(). Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Ingo Molnar <mingo@elte.hu>	2007-10-15 17:00:14 +02:00
Ingo Molnar	00bf7bfc2e	sched: fix: move the CPU check into ->task_new_fair() noticed by Peter Zijlstra: fix: move the CPU check into ->task_new_fair(), this way we can call place_entity() and get child ->vruntime right at initial wakeup time. (without this there can be large latencies) Signed-off-by: Ingo Molnar <mingo@elte.hu> Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>	2007-10-15 17:00:14 +02:00
Ingo Molnar	4cf86d77f5	sched: cleanup: rename task_grp to task_group cleanup: rename task_grp to task_group. No need to save two characters and 'grp' is annoying to read. Signed-off-by: Ingo Molnar <mingo@elte.hu>	2007-10-15 17:00:14 +02:00
Ingo Molnar	06877c33fe	sched: cleanup: rename SCHED_FEAT_USE_TREE_AVG to SCHED_FEAT_TREE_AVG cleanup: rename SCHED_FEAT_USE_TREE_AVG to SCHED_FEAT_TREE_AVG, to make SCHED_FEAT_ names more consistent. Signed-off-by: Ingo Molnar <mingo@elte.hu>	2007-10-15 17:00:13 +02:00
Ingo Molnar	a65914b365	sched: kfree(NULL) is valid kfree(NULL) is valid. pointed out by checkpatch.pl. the fix shrinks the code a bit: text data bss dec hex filename 40024 3842 100 43966 abbe sched.o.before 40002 3842 100 43944 aba8 sched.o.after Signed-off-by: Ingo Molnar <mingo@elte.hu>	2007-10-15 17:00:13 +02:00
Ingo Molnar	8927f49479	sched: style cleanup fix up __setup() style bug - noticed via checkpatch.pl. Signed-off-by: Ingo Molnar <mingo@elte.hu>	2007-10-15 17:00:13 +02:00
Ingo Molnar	26797a34a2	sched: break out if printing a warning in sched_domain_debug() checkpatch.pl and Andy Whitcroft noticed the following bug: we did not break out after printing an error. Signed-off-by: Ingo Molnar <mingo@elte.hu>	2007-10-15 17:00:13 +02:00
Ingo Molnar	3e9830dcab	sched: run sched_domain_debug() if CONFIG_SCHED_DEBUG=y run sched_domain_debug() if CONFIG_SCHED_DEBUG=y, instead of relying on the hand-crafted SCHED_DOMAIN_DEBUG switch. Signed-off-by: Ingo Molnar <mingo@elte.hu>	2007-10-15 17:00:13 +02:00
Dmitry Adamushko	a4ec24b48d	sched: tidy up SCHED_RR - make timeslices of SCHED_RR tasks constant and not dependent on task's static_prio [1] ; - remove obsolete code (timeslice related bits); - make sched_rr_get_interval() return something more meaningful [2] for SCHED_OTHER tasks. [1] according to the following link, it's not compliant with SUSv3 (not sure though, what is the reference for us :-) http://lkml.org/lkml/2007/3/7/656 [2] the interval is dynamic and can be depicted as follows "should a task be one of the runnable tasks at this particular moment, it would expect to run for this interval of time before being re-scheduled by the scheduler tick". (i.e. it's more precise if a task is runnable at the moment) yeah, this seems to require task_rq_lock/unlock() but this is not a hot path. results: (SCHED_FIFO) dimm@earth:~/storage/prog$ sudo chrt -f 10 ./rr_interval time_slice: 0 : 0 (SCHED_RR) dimm@earth:~/storage/prog$ sudo chrt 10 ./rr_interval time_slice: 0 : 99984800 (SCHED_NORMAL) dimm@earth:~/storage/prog$ ./rr_interval time_slice: 0 : 19996960 (SCHED_NORMAL + a cpu_hog of similar 'weight' on the same CPU --- so should be a half of the previous result) dimm@earth:~/storage/prog$ taskset 1 ./rr_interval time_slice: 0 : 9998480 Signed-off-by: Dmitry Adamushko <dmitry.adamushko@gmail.com> Signed-off-by: Ingo Molnar <mingo@elte.hu>	2007-10-15 17:00:13 +02:00
Alexey Dobriyan	a9957449b0	sched: uninline scheduler * save ~300 bytes * activate_idle_task() was moved to avoid a warning bloat-o-meter output: add/remove: 6/0 grow/shrink: 0/16 up/down: 438/-733 (-295) <=== function old new delta __enqueue_entity - 165 +165 finish_task_switch - 110 +110 update_curr_rt - 79 +79 __load_balance_iterator - 32 +32 __task_rq_unlock - 28 +28 find_process_by_pid - 24 +24 do_sched_setscheduler 133 123 -10 sys_sched_rr_get_interval 176 165 -11 sys_sched_getparam 156 145 -11 normalize_rt_tasks 482 470 -12 sched_getaffinity 112 99 -13 sys_sched_getscheduler 86 72 -14 sched_setaffinity 226 212 -14 sched_setscheduler 666 642 -24 load_balance_start_fair 33 9 -24 load_balance_next_fair 33 9 -24 dequeue_task_rt 133 67 -66 put_prev_task_rt 97 28 -69 schedule_tail 133 50 -83 schedule 682 594 -88 enqueue_entity 499 366 -133 task_new_fair 317 180 -137 Signed-off-by: Alexey Dobriyan <adobriyan@sw.ru> Signed-off-by: Ingo Molnar <mingo@elte.hu>	2007-10-15 17:00:13 +02:00
Ingo Molnar	1e81995066	sched: optimize schedule() a bit on SMP optimize schedule() a bit on SMP, by moving the rq-clock update outside the rq lock. code size is the same: text data bss dec hex filename 25725 2666 96 28487 6f47 sched.o.before 25725 2666 96 28487 6f47 sched.o.after Signed-off-by: Ingo Molnar <mingo@elte.hu> Reviewed-by: Thomas Gleixner <tglx@linutronix.de>	2007-10-15 17:00:13 +02:00
Ingo Molnar	3a25201572	sched: whitespace cleanups more whitespace cleanups. No code changed: text data bss dec hex filename 26553 2790 288 29631 73bf sched.o.before 26553 2790 288 29631 73bf sched.o.after Signed-off-by: Ingo Molnar <mingo@elte.hu> Reviewed-by: Thomas Gleixner <tglx@linutronix.de>	2007-10-15 17:00:12 +02:00
Ingo Molnar	5522d5d5f7	sched: mark scheduling classes as const mark scheduling classes as const. The speeds up the code a bit and shrinks it: text data bss dec hex filename 40027 4018 292 44337 ad31 sched.o.before 40190 3842 292 44324 ad24 sched.o.after Signed-off-by: Ingo Molnar <mingo@elte.hu> Reviewed-by: Thomas Gleixner <tglx@linutronix.de>	2007-10-15 17:00:12 +02:00
Srivatsa Vaddagiri	2830cf8c90	sched: group scheduler SMP migration fix group scheduler SMP migration fix: use task_cfs_rq(p) to get to the relevant fair-scheduling runqueue of a task, rq->cfs is not the right one. Signed-off-by: Ingo Molnar <mingo@elte.hu>	2007-10-15 17:00:12 +02:00
Ingo Molnar	2d72376b3a	sched: clean up schedstats, cnt -> count rename all 'cnt' fields and variables to the less yucky 'count' name. yuckage noticed by Andrew Morton. no change in code, other than the /proc/sched_debug bkl_count string got a bit larger: text data bss dec hex filename 38236 3506 24 41766 a326 sched.o.before 38240 3506 24 41770 a32a sched.o.after Signed-off-by: Ingo Molnar <mingo@elte.hu> Reviewed-by: Thomas Gleixner <tglx@linutronix.de>	2007-10-15 17:00:12 +02:00
Hiroshi Shimamoto	2ddbf95250	sched: clean up sched_fork() The adjusting sched_class is a missing part of the already existing "do not leak PI boosting priority to the child" at the sched_fork(). This patch moves the adjusting sched_class from wake_up_new_task() to sched_fork(). this also shrinks the code a bit: text data bss dec hex filename 40111 4018 292 44421 ad85 sched.o.before 40102 4018 292 44412 ad7c sched.o.after Signed-off-by: Hiroshi Shimamoto <h-shimamoto@ct.jp.nec.com> Signed-off-by: Dmitry Adamushko <dmitry.adamushko@gmail.com> Signed-off-by: Ingo Molnar <mingo@elte.hu> Reviewed-by: Thomas Gleixner <tglx@linutronix.de>	2007-10-15 17:00:11 +02:00
Ingo Molnar	02e4bac2a5	sched: fix sched_fork() fix sched_fork(): large latencies at new task creation time because the ->vruntime was not fixed up cross-CPU, if the parent got migrated after the child's CPU got set up. Signed-off-by: Ingo Molnar <mingo@elte.hu> Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl> Reviewed-by: Thomas Gleixner <tglx@linutronix.de>	2007-10-15 17:00:11 +02:00
Ingo Molnar	94359f05cb	sched: undo some of the recent changes undo some of the recent changes that are not needed after all, such as last_min_vruntime. Signed-off-by: Ingo Molnar <mingo@elte.hu> Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>	2007-10-15 17:00:11 +02:00
Ingo Molnar	785c29ef95	sched: remove condition from set_task_cpu() remove condition from set_task_cpu(). Now that ->vruntime is not global anymore, it should (and does) work fine without it too. Signed-off-by: Ingo Molnar <mingo@elte.hu> Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>	2007-10-15 17:00:11 +02:00
Peter Zijlstra	ddc9729750	sched debug: check spread debug feature: check how well we schedule within a reasonable vruntime 'spread' range. (note that CPU overload can increase the spread, so this is not a hard condition, but normal loads should be within the spread.) Signed-off-by: Ingo Molnar <mingo@elte.hu> Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>	2007-10-15 17:00:10 +02:00
Peter Zijlstra	67e9fb2a39	sched: add vslice add vslice: the load-dependent "virtual slice" a task should run ideally, so that the observed latency stays within the sched_latency window. Signed-off-by: Ingo Molnar <mingo@elte.hu> Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl> Reviewed-by: Thomas Gleixner <tglx@linutronix.de>	2007-10-15 17:00:10 +02:00
Ingo Molnar	b8efb56172	sched debug: BKL usage statistics add per task and per rq BKL usage statistics. Signed-off-by: Ingo Molnar <mingo@elte.hu> Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl> Reviewed-by: Thomas Gleixner <tglx@linutronix.de>	2007-10-15 17:00:10 +02:00
Srivatsa Vaddagiri	24e377a832	sched: add fair-user scheduler Enable user-id based fair group scheduling. This is useful for anyone who wants to test the group scheduler w/o having to enable CONFIG_CGROUPS. A separate scheduling group (i.e struct task_grp) is automatically created for every new user added to the system. Upon uid change for a task, it is made to move to the corresponding scheduling group. A /proc tunable (/proc/root_user_share) is also provided to tune root user's quota of cpu bandwidth. Signed-off-by: Srivatsa Vaddagiri <vatsa@linux.vnet.ibm.com> Signed-off-by: Dhaval Giani <dhaval@linux.vnet.ibm.com> Signed-off-by: Ingo Molnar <mingo@elte.hu> Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl> Reviewed-by: Thomas Gleixner <tglx@linutronix.de>	2007-10-15 17:00:09 +02:00
Srivatsa Vaddagiri	9b5b77512d	sched: clean up code under CONFIG_FAIR_GROUP_SCHED With the view of supporting user-id based fair scheduling (and not just container-based fair scheduling), this patch renames several functions and makes them independent of whether they are being used for container or user-id based fair scheduling. Also fix a problem reported by KAMEZAWA Hiroyuki (wrt allocating less-sized array for tg->cfs_rq[] and tf->se[]). Signed-off-by: Srivatsa Vaddagiri <vatsa@linux.vnet.ibm.com> Signed-off-by: Dhaval Giani <dhaval@linux.vnet.ibm.com> Signed-off-by: Ingo Molnar <mingo@elte.hu> Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl> Reviewed-by: Thomas Gleixner <tglx@linutronix.de>	2007-10-15 17:00:09 +02:00
Srivatsa Vaddagiri	83b699ed20	sched: revert recent removal of set_curr_task() Revert removal of set_curr_task. Use put_prev_task/set_curr_task when changing groups/policies Signed-off-by: Srivatsa Vaddagiri < vatsa@linux.vnet.ibm.com> Signed-off-by: Dhaval Giani <dhaval@linux.vnet.ibm.com> Signed-off-by: Ingo Molnar <mingo@elte.hu> Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>	2007-10-15 17:00:08 +02:00
Dmitry Adamushko	f6b53205e1	sched: rework enqueue/dequeue_entity() to get rid of set_curr_task() rework enqueue/dequeue_entity() to get rid of sched_class::set_curr_task(). This simplifies sched_setscheduler(), rt_mutex_setprio() and sched_move_tasks(). text data bss dec hex filename 24330 2734 20 27084 69cc sched.o.before 24233 2730 20 26983 6967 sched.o.after Signed-off-by: Dmitry Adamushko <dmitry.adamushko@gmail.com> Signed-off-by: Srivatsa Vaddagiri <vatsa@linux.vnet.ibm.com> Signed-off-by: Ingo Molnar <mingo@elte.hu> Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl> Reviewed-by: Thomas Gleixner <tglx@linutronix.de>	2007-10-15 17:00:08 +02:00
Dmitry Adamushko	4530d7ab0f	sched: simplify sched_class::yield_task() the 'p' (task_struct) parameter in the sched_class :: yield_task() is redundant as the caller is always the 'current'. Get rid of it. text data bss dec hex filename 24341 2734 20 27095 69d7 sched.o.before 24330 2734 20 27084 69cc sched.o.after Signed-off-by: Dmitry Adamushko <dmitry.adamushko@gmail.com> Signed-off-by: Ingo Molnar <mingo@elte.hu> Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl> Reviewed-by: Thomas Gleixner <tglx@linutronix.de>	2007-10-15 17:00:08 +02:00
Dmitry Adamushko	30cfdcfc5f	sched: do not keep current in the tree and get rid of sched_entity::fair_key Get rid of 'sched_entity::fair_key'. As a side effect, 'current' is not kept withing the tree for SCHED_NORMAL/BATCH tasks anymore. This simplifies some parts of code (e.g. entity_tick() and yield_task_fair()) and also somewhat optimizes them (e.g. a single update_curr() now vs. dequeue/enqueue() before in entity_tick()). Signed-off-by: Dmitry Adamushko <dmitry.adamushko@gmail.com> Signed-off-by: Ingo Molnar <mingo@elte.hu> Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl> Reviewed-by: Thomas Gleixner <tglx@linutronix.de>	2007-10-15 17:00:07 +02:00
Dmitry Adamushko	7074badbcb	sched: add set_curr_task() calls p->sched_class->set_curr_task() has to be called before activate_task()/enqueue_task() in rt_mutex_setprio(), sched_setschedule() and sched_move_task() in order to set up 'cfs_rq->curr'. The logic of enqueueing depends on whether a task to be inserted is 'current' or not. Signed-off-by: Dmitry Adamushko <dmitry.adamushko@gmail.com> Signed-off-by: Ingo Molnar <mingo@elte.hu> Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl> Reviewed-by: Thomas Gleixner <tglx@linutronix.de>	2007-10-15 17:00:07 +02:00
Srivatsa Vaddagiri	29f59db3a7	sched: group-scheduler core Add interface to control cpu bandwidth allocation to task-groups. (not yet configurable, due to missing CONFIG_CONTAINERS) Signed-off-by: Srivatsa Vaddagiri <vatsa@linux.vnet.ibm.com> Signed-off-by: Dhaval Giani <dhaval@linux.vnet.ibm.com> Signed-off-by: Ingo Molnar <mingo@elte.hu> Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>	2007-10-15 17:00:07 +02:00
Mike Galbraith	119fe5e068	sched: fix SMP migration latencies fix SMP migration latencies: the vruntimes of different CPUs are at incompatible offsets so they have to be fixed up when migrating a task across CPUs. Signed-off-by: Mike Galbraith <efault@gmx.de> Signed-off-by: Ingo Molnar <mingo@elte.hu> Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl> Reviewed-by: Thomas Gleixner <tglx@linutronix.de>	2007-10-15 17:00:07 +02:00
Ingo Molnar	bbdba7c0e1	sched: remove wait_runtime fields and features remove wait_runtime based fields and features, now that the CFS math has been changed over to the vruntime metric. Signed-off-by: Ingo Molnar <mingo@elte.hu> Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl> Signed-off-by: Mike Galbraith <efault@gmx.de> Reviewed-by: Thomas Gleixner <tglx@linutronix.de>	2007-10-15 17:00:06 +02:00
Ingo Molnar	e22f5bbf86	sched: remove wait_runtime limit remove the wait_runtime-limit fields and the code depending on it, now that the math has been changed over to rely on the vruntime metric. Signed-off-by: Ingo Molnar <mingo@elte.hu> Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl> Signed-off-by: Mike Galbraith <efault@gmx.de> Reviewed-by: Thomas Gleixner <tglx@linutronix.de>	2007-10-15 17:00:06 +02:00
Dmitry Adamushko	495eca494a	sched: clean up struct load_stat 'struct load_stat' is redundant now so let's get rid of it. Signed-off-by: Dmitry Adamushko <dmitry.adamushko@gmail.com> Signed-off-by: Ingo Molnar <mingo@elte.hu> Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl> Signed-off-by: Mike Galbraith <efault@gmx.de> Reviewed-by: Thomas Gleixner <tglx@linutronix.de>	2007-10-15 17:00:06 +02:00
Peter Zijlstra	94dfb5e75e	sched: add tree based averages add support for tree based vruntime averages. Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl> Signed-off-by: Ingo Molnar <mingo@elte.hu> Signed-off-by: Mike Galbraith <efault@gmx.de> Reviewed-by: Thomas Gleixner <tglx@linutronix.de>	2007-10-15 17:00:05 +02:00
Ingo Molnar	28a1f6fa2f	sched: remove SCHED_FEAT_SKIP_INITIAL remove SCHED_FEAT_SKIP_INITIAL - it was off by default and even when enabled it never made any real difference. Signed-off-by: Ingo Molnar <mingo@elte.hu> Reviewed-by: Thomas Gleixner <tglx@linutronix.de>	2007-10-15 17:00:05 +02:00
Ingo Molnar	6cb5819514	sched: optimize vruntime based scheduling optimize vruntime based scheduling. Signed-off-by: Ingo Molnar <mingo@elte.hu> Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl> Signed-off-by: Mike Galbraith <efault@gmx.de> Reviewed-by: Thomas Gleixner <tglx@linutronix.de>	2007-10-15 17:00:04 +02:00
Ingo Molnar	bf5c91ba8c	sched: move sched_feat() definitions move sched_feat() definitions so that it can be used sooner by generic code too. Signed-off-by: Ingo Molnar <mingo@elte.hu> Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl> Signed-off-by: Mike Galbraith <efault@gmx.de> Reviewed-by: Thomas Gleixner <tglx@linutronix.de>	2007-10-15 17:00:04 +02:00
Ingo Molnar	e9acbff648	sched: introduce se->vruntime introduce se->vruntime as a sum of weighted delta-exec's, and use that as the key into the tree. the idea to use absolute virtual time as the basic metric of scheduling has been first raised by William Lee Irwin, advanced by Tong Li and first prototyped by Roman Zippel in the "Really Fair Scheduler" (RFS) patchset. also see: http://lkml.org/lkml/2007/9/2/76 for a simpler variant of this patch. Signed-off-by: Ingo Molnar <mingo@elte.hu> Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl> Signed-off-by: Mike Galbraith <efault@gmx.de> Reviewed-by: Thomas Gleixner <tglx@linutronix.de>	2007-10-15 17:00:04 +02:00
Ingo Molnar	1091985b48	sched: speed up update_load_add/_sub() speed up update_load_add/_sub() by not delaying the division - this reduces CPU pipeline dependencies. Signed-off-by: Ingo Molnar <mingo@elte.hu> Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl> Signed-off-by: Mike Galbraith <efault@gmx.de> Reviewed-by: Thomas Gleixner <tglx@linutronix.de>	2007-10-15 17:00:04 +02:00
Ingo Molnar	62160e3f4a	sched: track cfs_rq->curr on !group-scheduling too Noticed by Roman Zippel: use cfs_rq->curr in the !group-scheduling case too. Small micro-optimization and cleanup effect: text data bss dec hex filename 36269 3482 24 39775 9b5f sched.o.before 36177 3486 24 39687 9b07 sched.o.after Signed-off-by: Ingo Molnar <mingo@elte.hu> Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl> Signed-off-by: Mike Galbraith <efault@gmx.de> Reviewed-by: Thomas Gleixner <tglx@linutronix.de>	2007-10-15 17:00:03 +02:00
Ingo Molnar	53df556e06	sched: remove precise CPU load calculations #2 continued removal of precise CPU load calculations. Signed-off-by: Ingo Molnar <mingo@elte.hu> Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl> Signed-off-by: Mike Galbraith <efault@gmx.de> Reviewed-by: Thomas Gleixner <tglx@linutronix.de>	2007-10-15 17:00:03 +02:00
Ingo Molnar	a25707f3ae	sched: remove precise CPU load CPU load calculations are statistical anyway, and there's little benefit from having it calculated on every scheduling event. So remove this code, it gets rid of a divide from the scheduler wakeup and context-switch fastpath. Signed-off-by: Ingo Molnar <mingo@elte.hu> Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl> Signed-off-by: Mike Galbraith <efault@gmx.de> Reviewed-by: Thomas Gleixner <tglx@linutronix.de>	2007-10-15 17:00:03 +02:00
Ingo Molnar	8ebc91d936	sched: remove stat_gran remove the stat_gran code - it was disabled by default and it causes unnecessary overhead. Signed-off-by: Ingo Molnar <mingo@elte.hu> Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl> Signed-off-by: Mike Galbraith <efault@gmx.de> Reviewed-by: Thomas Gleixner <tglx@linutronix.de>	2007-10-15 17:00:03 +02:00
Ingo Molnar	2bd8e6d422	sched: use constants if !CONFIG_SCHED_DEBUG use constants if !CONFIG_SCHED_DEBUG. this speeds up the code and reduces code-size: text data bss dec hex filename 27464 3014 16 30494 771e sched.o.before 26929 3010 20 29959 7507 sched.o.after Signed-off-by: Ingo Molnar <mingo@elte.hu> Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl> Signed-off-by: Mike Galbraith <efault@gmx.de> Reviewed-by: Thomas Gleixner <tglx@linutronix.de>	2007-10-15 17:00:02 +02:00
Ingo Molnar	38ad464d41	sched: uniform tunings use the same defaults on both UP and SMP. Signed-off-by: Ingo Molnar <mingo@elte.hu> Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl> Signed-off-by: Mike Galbraith <efault@gmx.de> Reviewed-by: Thomas Gleixner <tglx@linutronix.de>	2007-10-15 17:00:02 +02:00
Ingo Molnar	eba1ed4b7e	sched: debug: track maximum 'slice' track the maximum amount of time a task has executed while the CPU load was at least 2x. (i.e. at least two nice-0 tasks were runnable) Signed-off-by: Ingo Molnar <mingo@elte.hu> Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl> Signed-off-by: Mike Galbraith <efault@gmx.de> Reviewed-by: Thomas Gleixner <tglx@linutronix.de>	2007-10-15 17:00:02 +02:00
Matthias Kaehlcke	2e45874c5a	sched: use list_for_each_entry_safe() in __wake_up_common() Use list_for_each_entry_safe() instead of list_for_each_safe() in __wake_up_common() Signed-off-by: Matthias Kaehlcke <matthias.kaehlcke@gmail.com> Signed-off-by: Ingo Molnar <mingo@elte.hu> Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl> Signed-off-by: Mike Galbraith <efault@gmx.de> Reviewed-by: Thomas Gleixner <tglx@linutronix.de>	2007-10-15 17:00:02 +02:00
Ingo Molnar	44142fac34	sched: fix sysctl_sched_child_runs_first flag fix the sched_child_runs_first flag: always call into ->task_new() if we are on the same CPU, as SCHED_OTHER tasks depend on it for correct initial setup. Signed-off-by: Ingo Molnar <mingo@elte.hu> Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl> Signed-off-by: Mike Galbraith <efault@gmx.de> Reviewed-by: Thomas Gleixner <tglx@linutronix.de>	2007-10-15 17:00:01 +02:00
Jens Axboe	f5ff8422bb	Fix warnings with !CONFIG_BLOCK Hide everything in blkdev.h with CONFIG_BLOCK isn't set, and fixup the (few) files that fail to build because they were relying on blkdev.h pulling in extra includes for them. Signed-off-by: Jens Axboe <jens.axboe@oracle.com>	2007-10-10 09:25:57 +02:00
Hiroshi Shimamoto	9c95e7319b	sched: fix invalid sched_class use When using rt_mutex, a NULL pointer dereference is occurred at enqueue_task_rt. Here is a scenario; 1) there are two threads, the thread A is fair_sched_class and thread B is rt_sched_class. 2) Thread A is boosted up to rt_sched_class, because the thread A has a rt_mutex lock and the thread B is waiting the lock. 3) At this time, when thread A create a new thread C, the thread C has a rt_sched_class. 4) When doing wake_up_new_task() for the thread C, the priority of the thread C is out of the RT priority range, because the normal priority of thread A is not the RT priority. It makes data corruption by overflowing the rt_prio_array. The new thread C should be fair_sched_class. The new thread should be valid scheduler class before queuing. This patch fixes to set the suitable scheduler class. Signed-off-by: Hiroshi Shimamoto <h-shimamoto@ct.jp.nec.com> Signed-off-by: Ingo Molnar <mingo@elte.hu> Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>	2007-09-19 23:34:46 +02:00
Ingo Molnar	1799e35d5b	sched: add /proc/sys/kernel/sched_compat_yield add /proc/sys/kernel/sched_compat_yield to make sys_sched_yield() more agressive, by moving the yielding task to the last position in the rbtree. with sched_compat_yield=0: PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND 2539 mingo 20 0 1576 252 204 R 50 0.0 0:02.03 loop_yield 2541 mingo 20 0 1576 244 196 R 50 0.0 0:02.05 loop with sched_compat_yield=1: PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND 2584 mingo 20 0 1576 248 196 R 99 0.0 0:52.45 loop 2582 mingo 20 0 1576 256 204 R 0 0.0 0:00.00 loop_yield Signed-off-by: Ingo Molnar <mingo@elte.hu> Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>	2007-09-19 23:34:46 +02:00
Ingo Molnar	cf2ab4696e	sched: fix xtensa build warning rename RSR to SRR - 'RSR' is already defined on xtensa. found by Adrian Bunk. Signed-off-by: Ingo Molnar <mingo@elte.hu>	2007-09-05 14:32:49 +02:00
Ingo Molnar	a206c07213	sched: debug: fix cfs_rq->wait_runtime accounting the cfs_rq->wait_runtime debug/statistics counter was not maintained properly - fix this. this also removes some code: text data bss dec hex filename 13420 228 1204 14852 3a04 sched.o.before 13404 228 1204 14836 39f4 sched.o.after Signed-off-by: Ingo Molnar <mingo@elte.hu> Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>	2007-09-05 14:32:49 +02:00
Suresh Siddha	7fd0d2dde9	sched: fix MC/HT scheduler optimization, without breaking the FUZZ logic. First fix the check if (imbalance + SCHED_LOAD_SCALE_FUZZ < busiest_load_per_task) with this if (imbalance < busiest_load_per_task) As the current check is always false for nice 0 tasks (as SCHED_LOAD_SCALE_FUZZ is same as busiest_load_per_task for nice 0 tasks). With the above change, imbalance was getting reset to 0 in the corner case condition, making the FUZZ logic fail. Fix it by not corrupting the imbalance and change the imbalance, only when it finds that the HT/MC optimization is needed. Signed-off-by: Suresh Siddha <suresh.b.siddha@intel.com> Signed-off-by: Ingo Molnar <mingo@elte.hu>	2007-09-05 14:32:48 +02:00
Ingo Molnar	f6cf891c4d	sched: make the scheduler converge to the ideal latency de-HZ-ification of the granularity defaults unearthed a pre-existing property of CFS: while it correctly converges to the granularity goal, it does not prevent run-time fluctuations in the range of [-gran ... 0 ... +gran]. With the increase of the granularity due to the removal of HZ dependencies, this becomes visible in chew-max output (with 5 tasks running): out: 28 . 27. 32 \| flu: 0 . 0 \| ran: 9 . 13 \| per: 37 . 40 out: 27 . 27. 32 \| flu: 0 . 0 \| ran: 17 . 13 \| per: 44 . 40 out: 27 . 27. 32 \| flu: 0 . 0 \| ran: 9 . 13 \| per: 36 . 40 out: 29 . 27. 32 \| flu: 2 . 0 \| ran: 17 . 13 \| per: 46 . 40 out: 28 . 27. 32 \| flu: 0 . 0 \| ran: 9 . 13 \| per: 37 . 40 out: 29 . 27. 32 \| flu: 0 . 0 \| ran: 18 . 13 \| per: 47 . 40 out: 28 . 27. 32 \| flu: 0 . 0 \| ran: 9 . 13 \| per: 37 . 40 average slice is the ideal 13 msecs and the period is picture-perfect 40 msecs. But the 'ran' field fluctuates around 13.33 msecs and there's no mechanism in CFS to keep that from happening: it's a perfectly valid solution that CFS finds. to fix this we add a granularity/preemption rule that knows about the "target latency", which makes tasks that run longer than the ideal latency run a bit less. The simplest approach is to simply decrease the preemption granularity when a task overruns its ideal latency. For this we have to track how much the task executed since its last preemption. ( this adds a new field to task_struct, but we can eliminate that overhead in 2.6.24 by putting all the scheduler timestamps into an anonymous union. ) with this change in place, chew-max output is fluctuation-less all around: out: 28 . 27. 39 \| flu: 0 . 2 \| ran: 13 . 13 \| per: 41 . 40 out: 28 . 27. 39 \| flu: 0 . 2 \| ran: 13 . 13 \| per: 41 . 40 out: 28 . 27. 39 \| flu: 0 . 2 \| ran: 13 . 13 \| per: 41 . 40 out: 28 . 27. 39 \| flu: 0 . 2 \| ran: 13 . 13 \| per: 41 . 40 out: 28 . 27. 39 \| flu: 0 . 1 \| ran: 13 . 13 \| per: 41 . 40 out: 28 . 27. 39 \| flu: 0 . 1 \| ran: 13 . 13 \| per: 41 . 40 this patch has no impact on any fastpath or on any globally observable scheduling property. (unless you have sharp enough eyes to see millisecond-level ruckles in glxgears smoothness :-) Signed-off-by: Ingo Molnar <mingo@elte.hu> Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl> Signed-off-by: Mike Galbraith <efault@gmx.de>	2007-08-28 12:53:24 +02:00
Ingo Molnar	50c46637aa	sched: s/sched_latency/sched_min_granularity runtime limit and wakeup granularity used to be a function of granularity and that was incorrect changed to sched_latency. Fix this to make wakeup granularity a function of min-granularity, and the runtime limit equal to latency. Signed-off-by: Ingo Molnar <mingo@elte.hu>	2007-08-25 22:17:19 +02:00
Ingo Molnar	172ac3dbb7	sched: cleanup, sched_granularity -> sched_min_granularity due to adaptive granularity scheduling the role of sched_granularity has changed to "minimum granularity", so rename the variable (and the tunable) accordingly. Signed-off-by: Ingo Molnar <mingo@elte.hu> Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>	2007-08-25 18:41:53 +02:00
Peter Zijlstra	218050855e	sched: adaptive scheduler granularity Instead of specifying the preemption granularity, specify the wanted latency. By fixing the granlarity to a constany the wakeup latency it a function of the number of running tasks on the rq. Invert this relation. sysctl_sched_granularity becomes a minimum for the dynamic granularity computed from the new sysctl_sched_latency. Then use this latency to do more intelligent granularity decisions: if there are fewer tasks running then we can schedule coarser. This helps performance while still always keeping the latency target. Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl> Signed-off-by: Ingo Molnar <mingo@elte.hu>	2007-08-25 18:41:53 +02:00
Sven-Thorsten Dietrich	deac4ee65a	sched: simplify can_migrate_task() Remove trivial conditional branch in Linux scheduler's can_migrate_task() function. text data bss dec hex filename 34770 2998 24 37792 93a0 sched.o.before 34757 2998 24 37779 9393 sched.o.after Signed-off-by: Sven-Thorsten Dietrich <sven@thebigcorporation.com> Signed-off-by: Ingo Molnar <mingo@elte.hu>	2007-08-24 20:39:10 +02:00
Ingo Molnar	71fd371463	sched: remove HZ dependency from the granularity default remove HZ dependency from the granularity default. Use 10 msec for the base granularity, 1 msec for wakeup granularity and 25 msec for batch wakeup granularity. (These defaults are close to the values that the default HZ=250 setting got previously, and thus it's the most common setting.) Signed-off-by: Ingo Molnar <mingo@elte.hu>	2007-08-24 20:39:10 +02:00
Ingo Molnar	505c0efd58	sched: tweak the sched_runtime_limit tunable Michael Gerdau reported reniced task CPU usage weirdnesses. Such symptoms can be caused by limit underruns so double the sched_runtime_limit. Signed-off-by: Ingo Molnar <mingo@elte.hu>	2007-08-23 15:18:02 +02:00
Suresh Siddha	f549da848e	sched: skip updating rq's next_balance under null SD Was playing with sched_smt_power_savings/sched_mc_power_savings and found out that while the scheduler domains are reconstructed when sysfs settings change, rebalance_domains() can get triggered with null domain on other cpus, which is setting next_balance to jiffies + 60*HZ. Resulting in no idle/busy balancing for 60 seconds. Fix this. Signed-off-by: Suresh Siddha <suresh.b.siddha@intel.com> Signed-off-by: Ingo Molnar <mingo@elte.hu>	2007-08-23 15:18:02 +02:00
Suresh Siddha	f8700df7c4	sched: fix broken SMT/MC optimizations On a four package system with HT - HT load balancing optimizations were broken. For example, if two tasks end up running on two logical threads of one of the packages, scheduler is not able to pull one of the tasks to a completely idle package. In this scenario, for nice-0 tasks, imbalance calculated by scheduler will be 512 and find_busiest_queue() will return 0 (as each cpu's load is 1024 > imbalance and has only one task running). Similarly MC scheduler optimizations also get fixed with this patch. [ mingo@elte.hu: restored fair balancing by increasing the fuzz and adding it back to the power decision, without the /2 factor. ] Signed-off-by: Suresh Siddha <suresh.b.siddha@intel.com> Signed-off-by: Ingo Molnar <mingo@elte.hu>	2007-08-23 15:18:02 +02:00
Eric W. Biederman	c57baf1e1e	sched: fix sysctl directory permissions There are two remaining gotchas: - The directories have impossible permissions (writeable). - The ctl_name for the kernel directory is inconsistent with everything else. It should be CTL_KERN. Signed-off-by: Eric W. Biederman <ebiederm@xmission.com> Signed-off-by: Ingo Molnar <mingo@elte.hu>	2007-08-23 15:18:02 +02:00
Ingo Molnar	2aa44d0567	sched: sched_clock_idle_[sleep\|wakeup]_event() construct a more or less wall-clock time out of sched_clock(), by using ACPI-idle's existing knowledge about how much time we spent idling. This allows the rq clock to work around TSC-stops-in-C2, TSC-gets-corrupted-in-C3 type of problems. ( Besides the scheduler's statistics this also benefits blktrace and printk-timestamps as well. ) Furthermore, the precise before-C2/C3-sleep and after-C2/C3-wakeup callbacks allow the scheduler to get out the most of the period where the CPU has a reliable TSC. This results in slightly more precise task statistics. the ACPI bits were acked by Len. Signed-off-by: Ingo Molnar <mingo@elte.hu> Acked-by: Len Brown <len.brown@intel.com>	2007-08-23 15:18:02 +02:00
Oleg Nesterov	de0cf899bb	sched: run_rebalance_domains: s/SCHED_IDLE/CPU_IDLE/ rebalance_domains(SCHED_IDLE) looks strange (typo), change it to CPU_IDLE. the effect of this bug was slightly more agressive idle-balancing on SMP than intended. Signed-off-by: Oleg Nesterov <oleg@tv-sign.ru> Signed-off-by: Ingo Molnar <mingo@elte.hu>	2007-08-12 18:08:19 +02:00
Adrian Bunk	6707de00fd	sched: make global code static This patch makes the following needlessly global code static: - arch_reinit_sched_domains() - struct attr_sched_mc_power_savings - struct attr_sched_smt_power_savings Signed-off-by: Adrian Bunk <bunk@stusta.de> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Ingo Molnar <mingo@elte.hu>	2007-08-12 18:08:19 +02:00
Ingo Molnar	529c77261b	sched: improve rq-clock overflow logic improve the rq-clock overflow logic: limit the absolute rq->clock delta since the last scheduler tick, instead of limiting the delta itself. tested by Arjan van de Ven - whole laptop was misbehaving due to an incorrectly calibrated cpu_khz confusing sched_clock(). Signed-off-by: Ingo Molnar <mingo@elte.hu> Signed-off-by: Arjan van de Ven <arjan@linux.intel.com>	2007-08-10 23:05:11 +02:00
Ingo Molnar	194081ebfa	sched: round a bit better round a tiny bit better in high-frequency rescheduling scenarios, by rounding around zero instead of rounding down. (this is pretty theoretical though) Signed-off-by: Ingo Molnar <mingo@elte.hu>	2007-08-09 11:16:51 +02:00
Ingo Molnar	254753dc32	sched: make the multiplication table more accurate do small deltas in the weight and multiplication constant table so that the worst-case numeric error is better than 1:100000000. (8 digits) the current error table is: nice mult * inv_mult error ------------------------------------------ -20: 88761 * 48388 -0.0000000065 -19: 71755 * 59856 -0.0000000037 -18: 56483 * 76040 0.0000000056 -17: 46273 * 92818 0.0000000042 -16: 36291 * 118348 -0.0000000065 -15: 29154 * 147320 -0.0000000037 -14: 23254 * 184698 -0.0000000009 -13: 18705 * 229616 -0.0000000037 -12: 14949 * 287308 -0.0000000009 -11: 11916 * 360437 -0.0000000009 -10: 9548 * 449829 -0.0000000009 -9: 7620 * 563644 -0.0000000037 -8: 6100 * 704093 0.0000000009 -7: 4904 * 875809 0.0000000093 -6: 3906 * 1099582 -0.0000000009 -5: 3121 * 1376151 -0.0000000058 -4: 2501 * 1717300 0.0000000009 -3: 1991 * 2157191 -0.0000000035 -2: 1586 * 2708050 0.0000000009 -1: 1277 * 3363326 0.0000000014 0: 1024 * 4194304 0.0000000000 1: 820 * 5237765 0.0000000009 2: 655 * 6557202 0.0000000033 3: 526 * 8165337 -0.0000000079 4: 423 * 10153587 0.0000000012 5: 335 * 12820798 0.0000000079 6: 272 * 15790321 0.0000000037 7: 215 * 19976592 -0.0000000037 8: 172 * 24970740 -0.0000000037 9: 137 * 31350126 -0.0000000079 10: 110 * 39045157 -0.0000000061 11: 87 * 49367440 -0.0000000037 12: 70 * 61356676 0.0000000056 13: 56 * 76695844 -0.0000000075 14: 45 * 95443717 -0.0000000072 15: 36 * 119304647 -0.0000000009 16: 29 * 148102320 -0.0000000037 17: 23 * 186737708 -0.0000000028 18: 18 * 238609294 -0.0000000009 19: 15 * 286331153 -0.0000000002 Signed-off-by: Ingo Molnar <mingo@elte.hu>	2007-08-09 11:16:51 +02:00
Ingo Molnar	6e82a3befe	sched: optimize update_rq_clock() calls in the load-balancer optimize update_rq_clock() calls in the load-balancer: update them right after locking the runqueue(s) so that the pull functions do not have to call it. Signed-off-by: Ingo Molnar <mingo@elte.hu>	2007-08-09 11:16:51 +02:00
Ingo Molnar	2daa357705	sched: optimize activate_task() optimize activate_task() by removing update_rq_clock() from it. (and add update_rq_clock() to all callsites of activate_task() that did not have it before.) Signed-off-by: Ingo Molnar <mingo@elte.hu>	2007-08-09 11:16:51 +02:00
Ingo Molnar	546fe3c909	sched: move the __update_rq_clock() call to scheduler_tick() move the __update_rq_clock() call from update_cpu_load() to scheduler_tick(). ( identity transformation that causes no change in functionality. ) this allows the direct use of rq->clock in ->task_tick() functions. Signed-off-by: Ingo Molnar <mingo@elte.hu>	2007-08-09 11:16:51 +02:00
Ingo Molnar	bdd4dfa89c	sched: remove the 'u64 now' local variables final step: remove all (now superfluous) 'u64 now' variables. ( identity transformation that causes no change in functionality. ) Signed-off-by: Ingo Molnar <mingo@elte.hu>	2007-08-09 11:16:51 +02:00
Ingo Molnar	2e1cb74a50	sched: remove the 'u64 now' parameter from deactivate_task() remove the 'u64 now' parameter from deactivate_task(). ( identity transformation that causes no change in functionality. ) Signed-off-by: Ingo Molnar <mingo@elte.hu>	2007-08-09 11:16:49 +02:00
Ingo Molnar	69be72c13d	sched: remove the 'u64 now' parameter from dequeue_task() remove the 'u64 now' parameter from dequeue_task(). ( identity transformation that causes no change in functionality. ) Signed-off-by: Ingo Molnar <mingo@elte.hu>	2007-08-09 11:16:49 +02:00
Ingo Molnar	8159f87e2b	sched: remove the 'u64 now' parameter from enqueue_task() remove the 'u64 now' parameter from enqueue_task(). ( identity transformation that causes no change in functionality. ) Signed-off-by: Ingo Molnar <mingo@elte.hu>	2007-08-09 11:16:49 +02:00
Ingo Molnar	db53181e41	sched: remove the 'u64 now' parameter from dec_nr_running() remove the 'u64 now' parameter from dec_nr_running(). ( identity transformation that causes no change in functionality. ) Signed-off-by: Ingo Molnar <mingo@elte.hu>	2007-08-09 11:16:49 +02:00
Ingo Molnar	e5fa2237b5	sched: remove the 'u64 now' parameter from inc_nr_running() remove the 'u64 now' parameter from inc_nr_running(). ( identity transformation that causes no change in functionality. ) Signed-off-by: Ingo Molnar <mingo@elte.hu>	2007-08-09 11:16:49 +02:00
Ingo Molnar	79b5dddf83	sched: remove the 'u64 now' parameter from dec_load() remove the 'u64 now' parameter from dec_load(). ( identity transformation that causes no change in functionality. ) Signed-off-by: Ingo Molnar <mingo@elte.hu>	2007-08-09 11:16:49 +02:00
Ingo Molnar	29b4b623fe	sched: remove the 'u64 now' parameter from inc_load() remove the 'u64 now' parameter from inc_load(). ( identity transformation that causes no change in functionality. ) Signed-off-by: Ingo Molnar <mingo@elte.hu>	2007-08-09 11:16:49 +02:00
Ingo Molnar	84a1d7a2f9	sched: remove the 'u64 now' parameter from update_curr_load() remove the 'u64 now' parameter from update_curr_load(). ( identity transformation that causes no change in functionality. ) Signed-off-by: Ingo Molnar <mingo@elte.hu>	2007-08-09 11:16:49 +02:00
Ingo Molnar	ee0827d8b5	sched: remove the 'u64 now' parameter from ->task_new() remove the 'u64 now' parameter from ->task_new(). ( identity transformation that causes no change in functionality. ) Signed-off-by: Ingo Molnar <mingo@elte.hu>	2007-08-09 11:16:49 +02:00
Ingo Molnar	31ee529cc2	sched: remove the 'u64 now' parameter from ->put_prev_task() remove the 'u64 now' parameter from ->put_prev_task(). ( identity transformation that causes no change in functionality. ) Signed-off-by: Ingo Molnar <mingo@elte.hu>	2007-08-09 11:16:49 +02:00
Ingo Molnar	ff95f3df54	sched: remove the 'u64 now' parameter from pick_next_task() remove the 'u64 now' parameter from pick_next_task(). ( identity transformation that causes no change in functionality. ) Signed-off-by: Ingo Molnar <mingo@elte.hu>	2007-08-09 11:16:49 +02:00
Ingo Molnar	fb8d472402	sched: remove the 'u64 now' parameter from ->pick_next_task() remove the 'u64 now' parameter from ->pick_next_task(). ( identity transformation that causes no change in functionality. ) Signed-off-by: Ingo Molnar <mingo@elte.hu>	2007-08-09 11:16:48 +02:00
Ingo Molnar	f02231e51a	sched: remove the 'u64 now' parameter from ->dequeue_task() remove the 'u64 now' parameter from ->dequeue_task(). ( identity transformation that causes no change in functionality. ) Signed-off-by: Ingo Molnar <mingo@elte.hu>	2007-08-09 11:16:48 +02:00
Ingo Molnar	fd390f6a04	sched: remove the 'u64 now' parameter from ->enqueue_task() remove the 'u64 now' parameter from ->enqueue_task(). ( identity transformation that causes no change in functionality. ) Signed-off-by: Ingo Molnar <mingo@elte.hu>	2007-08-09 11:16:48 +02:00
Ingo Molnar	d281918d7c	sched: remove 'now' use from assignments change all 'now' timestamp uses in assignments to rq->clock. ( this is an identity transformation that causes no functionality change: all such new rq->clock is necessarily preceded by an update_rq_clock() call. ) Signed-off-by: Ingo Molnar <mingo@elte.hu>	2007-08-09 11:16:47 +02:00
Ingo Molnar	eb59449400	sched: remove __rq_clock() remove the (now unused) __rq_clock() function. Signed-off-by: Ingo Molnar <mingo@elte.hu>	2007-08-09 11:16:47 +02:00
Ingo Molnar	c1b3da3ecd	sched: eliminate __rq_clock() use eliminate __rq_clock() use by changing it to: __update_rq_clock(rq) now = rq->clock; identity transformation - no change in behavior. Signed-off-by: Ingo Molnar <mingo@elte.hu>	2007-08-09 11:16:47 +02:00
Ingo Molnar	2ab81159fa	sched: remove rq_clock() remove the now unused rq_clock() function. Signed-off-by: Ingo Molnar <mingo@elte.hu>	2007-08-09 11:16:47 +02:00
Ingo Molnar	a8e504d2a5	sched: eliminate rq_clock() use eliminate rq_clock() use by changing it to: update_rq_clock(rq) now = rq->clock; identity transformation - no change in behavior. Signed-off-by: Ingo Molnar <mingo@elte.hu>	2007-08-09 11:16:47 +02:00
Ingo Molnar	b04a0f4c16	sched: add [__]update_rq_clock(rq) add the [__]update_rq_clock(rq) functions. (No change in functionality, just reorganization to prepare for elimination of the heavy 64-bit timestamp-passing in the scheduler.) Signed-off-by: Ingo Molnar <mingo@elte.hu>	2007-08-09 11:16:46 +02:00
Peter Williams	a4ac01c36e	sched: fix bug in balance_tasks() There are two problems with balance_tasks() and how it used: 1. The variables best_prio and best_prio_seen (inherited from the old move_tasks()) were only required to handle problems caused by the active/expired arrays, the order in which they were processed and the possibility that the task with the highest priority could be on either. These issues are no longer present and the extra overhead associated with their use is unnecessary (and possibly wrong). 2. In the absence of CONFIG_FAIR_GROUP_SCHED being set, the same this_best_prio variable needs to be used by all scheduling classes or there is a risk of moving too much load. E.g. if the highest priority task on this at the beginning is a fairly low priority task and the rt class migrates a task (during its turn) then that moved task becomes the new highest priority task on this_rq but when the sched_fair class initializes its copy of this_best_prio it will get the priority of the original highest priority task as, due to the run queue locks being held, the reschedule triggered by pull_task() will not have taken place. This could result in inappropriate overriding of skip_for_load and excessive load being moved. The attached patch addresses these problems by deleting all reference to best_prio and best_prio_seen and making this_best_prio a reference parameter to the various functions involved. load_balance_fair() has also been modified so that this_best_prio is only reset (in the loop) if CONFIG_FAIR_GROUP_SCHED is set. This should preserve the effect of helping spread groups' higher priority tasks around the available CPUs while improving system performance when CONFIG_FAIR_GROUP_SCHED isn't set. Signed-off-by: Peter Williams <pwil3058@bigpond.net.au> Signed-off-by: Ingo Molnar <mingo@elte.hu>	2007-08-09 11:16:46 +02:00
Alexey Dobriyan	e0361851e5	sched: remove binary sysctls from kernel.sched_domain kernel.sched_domain hierarchy is under CTL_UNNUMBERED and thus unreachable to sysctl(2). Generating .ctl_number's in such situation is not useful. Signed-off-by: Alexey Dobriyan <adobriyan@sw.ru> Signed-off-by: Ingo Molnar <mingo@elte.hu>	2007-08-09 11:16:46 +02:00
Ingo Molnar	8e717b194c	sched: schedule() speedup speed up schedule(): share the 'now' parameter that deactivate_task() was calculating internally. ( this also fixes the small accounting window between the deactivate call and the pick_next_task() call. ) Signed-off-by: Ingo Molnar <mingo@elte.hu>	2007-08-09 11:16:46 +02:00
Ingo Molnar	7bfd048587	sched: uninline rq_clock() uninline rq_clock() to save 263 bytes of code: text data bss dec hex filename 39561 3642 24 43227 a8db sched.o.before 39298 3642 24 42964 a7d4 sched.o.after Signed-off-by: Ingo Molnar <mingo@elte.hu>	2007-08-09 11:16:46 +02:00
Ulrich Drepper	9531b62f5e	sched: clean up sched_getaffinity() here's another tiny cleanup. The generated code is not affected (gcc is smart enough) but for people looking over the code it is just irritating to have the extra conditional. Signed-off-by: Ulrich Drepper <drepper@redhat.com> Signed-off-by: Ingo Molnar <mingo@elte.hu>	2007-08-09 11:16:46 +02:00
Peter Williams	4301065920	sched: simplify move_tasks() The move_tasks() function is currently multiplexed with two distinct capabilities: 1. attempt to move a specified amount of weighted load from one run queue to another; and 2. attempt to move a specified number of tasks from one run queue to another. The first of these capabilities is used in two places, load_balance() and load_balance_idle(), and in both of these cases the return value of move_tasks() is used purely to decide if tasks/load were moved and no notice of the actual number of tasks moved is taken. The second capability is used in exactly one place, active_load_balance(), to attempt to move exactly one task and, as before, the return value is only used as an indicator of success or failure. This multiplexing of sched_task() was introduced, by me, as part of the smpnice patches and was motivated by the fact that the alternative, one function to move specified load and one to move a single task, would have led to two functions of roughly the same complexity as the old move_tasks() (or the new balance_tasks()). However, the new modular design of the new CFS scheduler allows a simpler solution to be adopted and this patch addresses that solution by: 1. adding a new function, move_one_task(), to be used by active_load_balance(); and 2. making move_tasks() a single purpose function that tries to move a specified weighted load and returns 1 for success and 0 for failure. One of the consequences of these changes is that neither move_one_task() or the new move_tasks() care how many tasks sched_class.load_balance() moves and this enables its interface to be simplified by returning the amount of load moved as its result and removing the load_moved pointer from the argument list. This helps simplify the new move_tasks() and slightly reduces the amount of work done in each of sched_class.load_balance()'s implementations. Further simplification, e.g. changes to balance_tasks(), are possible but (slightly) complicated by the special needs of load_balance_fair() so I've left them to a later patch (if this one gets accepted). NB Since move_tasks() gets called with two run queue locks held even small reductions in overhead are worthwhile. [ mingo@elte.hu ] this change also reduces code size nicely: text data bss dec hex filename 39216 3618 24 42858 a76a sched.o.before 39173 3618 24 42815 a73f sched.o.after Signed-off-by: Peter Williams <pwil3058@bigpond.net.au> Signed-off-by: Ingo Molnar <mingo@elte.hu>	2007-08-09 11:16:46 +02:00
Ingo Molnar	f1a438d813	sched: reorder update_cpu_load(rq) with the ->task_tick() call Peter Williams suggested to flip the order of update_cpu_load(rq) with the ->task_tick() call. This is a NOP for the current scheduler (the two functions are independent of each other), ->task_tick() might create some state for update_cpu_load() in the future (or in PlugSched). Signed-off-by: Ingo Molnar <mingo@elte.hu>	2007-08-09 11:16:45 +02:00
Ingo Molnar	6cfb0d5d06	[PATCH] sched: reduce debug code move the rest of the debugging/instrumentation code to under CONFIG_SCHEDSTATS too. This reduces code size and speeds code up: text data bss dec hex filename 33044 4122 28 37194 914a sched.o.before 32708 4122 28 36858 8ffa sched.o.after Signed-off-by: Ingo Molnar <mingo@elte.hu>	2007-08-02 17:41:40 +02:00
Ingo Molnar	9c2172459a	[PATCH] sched: move load-calculation functions move load-calculation functions so that they can use the per-policy declarations and methods. Signed-off-by: Ingo Molnar <mingo@elte.hu>	2007-08-02 17:41:40 +02:00
Ingo Molnar	cad60d93e1	[PATCH] sched: ->task_new cleanup make sched_class.task_new == NULL a 'default method', this allows the removal of task_rt_new. Signed-off-by: Ingo Molnar <mingo@elte.hu>	2007-08-02 17:41:40 +02:00
Ingo Molnar	4e6f96f313	[PATCH] sched: uninline inc/dec_nr_running() uninline inc_nr_running() and dec_nr_running(): text data bss dec hex filename 29039 4162 24 33225 81c9 sched.o.before 29027 4162 24 33213 81bd sched.o.after Signed-off-by: Ingo Molnar <mingo@elte.hu>	2007-08-02 17:41:40 +02:00
Ingo Molnar	cb1c4fc924	[PATCH] sched: uninline calc_delta_mine() uninline calc_delta_mine(): text data bss dec hex filename 29162 4162 24 33348 8244 sched.o.before 29039 4162 24 33225 81c9 sched.o.after Signed-off-by: Ingo Molnar <mingo@elte.hu>	2007-08-02 17:41:40 +02:00
Ingo Molnar	ecf691daf7	[PATCH] sched: calc_delta_mine(): use fixed limit use fixed limit in calc_delta_mine() - this saves an instruction :) Signed-off-by: Ingo Molnar <mingo@elte.hu>	2007-08-02 17:41:40 +02:00
Peter Williams	5a4f3ea77e	[PATCH] sched: tidy up left over smpnice code 1. The only place that RTPRIO_TO_LOAD_WEIGHT() is used is in the call to move_tasks() in the function active_load_balance() and its purpose here is just to make sure that the load to be moved is big enough to ensure that exactly one task is moved (if there's one available). This can be accomplished by using ULONG_MAX instead and this allows RTPRIO_TO_LOAD_WEIGHT() to be deleted. 2. This, in turn, allows PRIO_TO_LOAD_WEIGHT() to be deleted. 3. This allows load_weight() to be deleted which allows TIME_SLICE_NICE_ZERO to be deleted along with the comment above it. Signed-off-by: Peter Williams <pwil3058@bigpond.net.au> Signed-off-by: Ingo Molnar <mingo@elte.hu>	2007-08-02 17:41:40 +02:00
Ingo Molnar	362a701663	[PATCH] sched: remove cache_hot_time remove the last unused remains of cache_hot_time. Signed-off-by: Ingo Molnar <mingo@elte.hu>	2007-08-02 17:41:40 +02:00
Randy Dunlap	421cee2935	sched: fix kernel-doc warnings Fix kernel-doc warnings in sched.c: Warning(linux-2623-rc1g4//kernel/sched.c:1685): No description found for parameter 'notifier' Warning(linux-2623-rc1g4//kernel/sched.c:1696): No description found for parameter 'notifier' Warning(linux-2623-rc1g4//kernel/sched.c:1750): No description found for parameter 'prev' Signed-off-by: Randy Dunlap <randy.dunlap@oracle.com> Cc: Ingo Molnar <mingo@elte.hu> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2007-07-31 15:39:38 -07:00
Nick Piggin	e692ab5347	[PATCH] sched: debug feature - make the sched-domains tree runtime-tweakable debugging feature: make the sched-domains tree runtime-tweakable. Signed-off-by: Andrew Morton <akpm@linux-foundation.org> [ mingo@elte.hu: made it depend on CONFIG_SCHED_DEBUG & small updates ] Signed-off-by: Ingo Molnar <mingo@elte.hu>	2007-07-26 13:40:43 +02:00
Ingo Molnar	2cd4d0ea19	[PATCH] sched: make cpu_clock() not use the rq clock it is enough to disable interrupts to get the precise rq-clock of the local CPU. this also solves an NMI watchdog regression: the NMI watchdog calls touch_softlockup_watchdog(), which might deadlock on rq->lock if the NMI hits an rq-locked critical section. Signed-off-by: Ingo Molnar <mingo@elte.hu>	2007-07-26 13:40:43 +02:00
Satoru Takeuchi	018a221295	[PATCH] sched: remove unused rq->load_balance_class Remove unused rq->load_balance_class. Signed-off-by: Satoru Takeuchi <takeuchi_satoru@jp.fujitsu.com> Signed-off-by: Ingo Molnar <mingo@elte.hu>	2007-07-26 13:40:43 +02:00
Avi Kivity	e107be36ef	[PATCH] sched: arch preempt notifier mechanism This adds a general mechanism whereby a task can request the scheduler to notify it whenever it is preempted or scheduled back in. This allows the task to swap any special-purpose registers like the fpu or Intel's VT registers. Signed-off-by: Avi Kivity <avi@qumranet.com> [ mingo@elte.hu: fixes, cleanups ] Signed-off-by: Ingo Molnar <mingo@elte.hu>	2007-07-26 13:40:43 +02:00
Ingo Molnar	e436d80085	[PATCH] sched: implement cpu_clock(cpu) high-speed time source Implement the cpu_clock(cpu) interface for kernel-internal use: high-speed (but slightly incorrect) per-cpu clock constructed from sched_clock(). This API, unused at the moment, will be used in the future by blktrace, by the softlockup-watchdog, by printk and by lockstat. Signed-off-by: Ingo Molnar <mingo@elte.hu>	2007-07-19 21:28:35 +02:00
Suresh Siddha	969bb4e403	[PATCH] sched: fix the all pinned logic in load_balance_newidle() nr_moved is not the correct check for triggering all pinned logic. Fix the all pinned logic in the case of load_balance_newidle(). Signed-off-by: Suresh Siddha <suresh.b.siddha@intel.com> Signed-off-by: Ingo Molnar <mingo@elte.hu>	2007-07-19 21:28:35 +02:00
Suresh Siddha	9439aab8db	[PATCH] sched: fix newly idle load balance in case of SMT In the presence of SMT, newly idle balance was never happening for multi-core and SMP domains (even when both the logical siblings are idle). If thread 0 is already idle and when thread 1 is about to go to idle, newly idle load balance always think that one of the threads is not idle and skips doing the newly idle load balance for multi-core and SMP domains. This is because of the idle_cpu() macro, which checks if the current process on a cpu is an idle process. But this is not the case for the thread doing the load_balance_newidle(). Fix this by using runqueue's nr_running field instead of idle_cpu(). And also skip the logic of 'only one idle cpu in the group will be doing load balancing' during newly idle case. Signed-off-by: Suresh Siddha <suresh.b.siddha@intel.com> Signed-off-by: Ingo Molnar <mingo@elte.hu>	2007-07-19 21:28:35 +02:00
Fenghua Yu	f34e3b61f2	use the new percpu interface for shared data Currently most of the per cpu data, which is accessed by different cpus, has a ____cacheline_aligned_in_smp attribute. Move all this data to the new per cpu shared data section: .data.percpu.shared_aligned. This will seperate the percpu data which is referenced frequently by other cpus from the local only percpu data. Signed-off-by: Fenghua Yu <fenghua.yu@intel.com> Acked-by: Suresh Siddha <suresh.b.siddha@intel.com> Cc: Rusty Russell <rusty@rustcorp.com.au> Cc: Christoph Lameter <clameter@sgi.com> Cc: "Luck, Tony" <tony.luck@intel.com> Cc: Andi Kleen <ak@suse.de> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2007-07-19 10:04:45 -07:00
Rafael J. Wysocki	8314418629	Freezer: make kernel threads nonfreezable by default Currently, the freezer treats all tasks as freezable, except for the kernel threads that explicitly set the PF_NOFREEZE flag for themselves. This approach is problematic, since it requires every kernel thread to either set PF_NOFREEZE explicitly, or call try_to_freeze(), even if it doesn't care for the freezing of tasks at all. It seems better to only require the kernel threads that want to or need to be frozen to use some freezer-related code and to remove any freezer-related code from the other (nonfreezable) kernel threads, which is done in this patch. The patch causes all kernel threads to be nonfreezable by default (ie. to have PF_NOFREEZE set by default) and introduces the set_freezable() function that should be called by the freezable kernel threads in order to unset PF_NOFREEZE. It also makes all of the currently freezable kernel threads call set_freezable(), so it shouldn't cause any (intentional) change of behaviour to appear. Additionally, it updates documentation to describe the freezing of tasks more accurately. [akpm@linux-foundation.org: build fixes] Signed-off-by: Rafael J. Wysocki <rjw@sisk.pl> Acked-by: Nigel Cunningham <nigel@nigel.suspend2.net> Cc: Pavel Machek <pavel@ucw.cz> Cc: Oleg Nesterov <oleg@tv-sign.ru> Cc: Gautham R Shenoy <ego@in.ibm.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2007-07-17 10:23:02 -07:00
Ingo Molnar	e4af30be8f	[PATCH] sched: prettify prio_to_wmult[] prettify the prio_to_wmult[] array. (this could have saved us from the typos) Signed-off-by: Ingo Molnar <mingo@elte.hu>	2007-07-16 09:46:31 +02:00
Ingo Molnar	5714d2de93	[PATCH] sched: document prio_to_wmult[] document prio_to_wmult[]. Signed-off-by: Ingo Molnar <mingo@elte.hu>	2007-07-16 09:46:31 +02:00
Ingo Molnar	f9153ee6c7	[PATCH] sched: improve weight-array comments improve the comments around the wmult array (which controls the weight of niced tasks). Clarify that to achieve a 10% difference in CPU utilization, a weight multiplier of 1.25 has to be used. Signed-off-by: Ingo Molnar <mingo@elte.hu>	2007-07-16 09:46:30 +02:00
Thomas Gleixner	4fd885170b	CFS: Fix missing digit off in wmult table Roman Zippel noticed another inconsistency of the wmult table. wmult[16] has a missing digit. Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2007-07-13 16:45:43 -07:00
Ingo Molnar	4bd77321a8	[PATCH] sched: fix show_task()/show_tasks() output fix show_task()/show_tasks() output: - there's no sibling info anymore - the fields were not aligned properly with the description - get rid of the lazy-TLB output: it's been quite some time since we last had a bug there, and when we had a bug it wasnt helped a bit by this debug output. Signed-off-by: Ingo Molnar <mingo@elte.hu> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2007-07-13 10:11:17 -07:00
Ingo Molnar	a5968df873	[PATCH] sched: allow larger granularity Allow granularity up to 100 msecs, instead of 10 msecs. (needed on larger boxes) Signed-off-by: Ingo Molnar <mingo@elte.hu> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2007-07-13 10:10:08 -07:00
Mike Galbraith	e127031f4f	[PATCH] sched: fix prio_to_wmult[] for nice 1 There's a typo in the values in prio_to_wmult[] for nice level 1. While it did not cause bad CPU distribution, but caused more rescheduling between nice-0 and nice-1 tasks than necessary. Signed-off-by: Ingo Molnar <mingo@elte.hu> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2007-07-13 10:09:02 -07:00
Ingo Molnar	c31f2e8a42	sched: add CFS credits add credits for recent major scheduler contributions: Con Kolivas, for pioneering the fair-scheduling approach Peter Williams, for smpnice Mike Galbraith, for interactivity tuning of CFS Srivatsa Vaddagiri, for group scheduling enhancements Signed-off-by: Ingo Molnar <mingo@elte.hu>	2007-07-09 18:52:01 +02:00
Ingo Molnar	0fec171cdb	sched: clean up sleep_on() APIs clean up the sleep_on() APIs: - do not use fastcall - replace fragile macro magic with proper inline functions Signed-off-by: Ingo Molnar <mingo@elte.hu>	2007-07-09 18:52:01 +02:00
Ingo Molnar	9761eea851	sched: style cleanups 4 small style cleanups to sched.c: checkpatch.pl is now happy about the totality of sched.c [ignoring false positives] - yay! ;-) Signed-off-by: Ingo Molnar <mingo@elte.hu>	2007-07-09 18:52:00 +02:00
Ingo Molnar	b2cfba19f6	sched: remove unused rq types from sched.c remove unused rq types from sched.c, now that we switched over to CFS. Signed-off-by: Ingo Molnar <mingo@elte.hu>	2007-07-09 18:52:00 +02:00
Ingo Molnar	634fa8c97c	sched: remove interactivity types remove now unused interactivity-heuristics related defined and types of the old scheduler. Signed-off-by: Ingo Molnar <mingo@elte.hu>	2007-07-09 18:52:00 +02:00
Ingo Molnar	dff06c157b	sched: clean up include files in sched.c clean up include files in sched.c, they were still old-style <asm/>. Signed-off-by: Ingo Molnar <mingo@elte.hu>	2007-07-09 18:52:00 +02:00
Ingo Molnar	1b9f19c212	sched: turn on the use of unstable events make use of sched-clock-unstable events. Signed-off-by: Ingo Molnar <mingo@elte.hu>	2007-07-09 18:51:59 +02:00
Ingo Molnar	bb29ab2686	sched: x86, track TSC-unstable events track TSC-unstable events and propagate it to the scheduler code. Also allow sched_clock() to be used when the TSC is unstable, the rq_clock() wrapper creates a reliable clock out of it. Signed-off-by: Ingo Molnar <mingo@elte.hu>	2007-07-09 18:51:59 +02:00
Ingo Molnar	dd41f596cd	sched: cfs core code apply the CFS core code. this change switches over the scheduler core to CFS's modular design and makes use of kernel/sched_fair/rt/idletask.c to implement Linux's scheduling policies. thanks to Andrew Morton and Thomas Gleixner for lots of detailed review feedback and for fixlets. Signed-off-by: Ingo Molnar <mingo@elte.hu> Signed-off-by: Mike Galbraith <efault@gmx.de> Signed-off-by: Dmitry Adamushko <dmitry.adamushko@gmail.com> Signed-off-by: Srivatsa Vaddagiri <vatsa@linux.vnet.ibm.com>	2007-07-09 18:51:59 +02:00
Ingo Molnar	f3479f10c5	sched: remove the sleep-bonus interactivity code remove the sleep-bonus interactivity code from the core scheduler. scheduling policy is implemented in the policy modules, and CFS does not need such type of heuristics. Signed-off-by: Ingo Molnar <mingo@elte.hu>	2007-07-09 18:51:59 +02:00
Ingo Molnar	c18a17329b	sched: remove expired_starving() remove the expired_starving() heuristics from the core scheduler. CFS does not need it, and this did not really work well in practice anyway, due to the rq->nr_running multiplier to STARVATION_LIMIT. Signed-off-by: Ingo Molnar <mingo@elte.hu>	2007-07-09 18:51:59 +02:00
Ingo Molnar	f2ac58ee61	sched: remove sleep_type remove the sleep_type heuristics from the core scheduler - scheduling policy is implemented in the scheduling-policy modules. (and CFS does not use this type of sleep-type heuristics) Signed-off-by: Ingo Molnar <mingo@elte.hu>	2007-07-09 18:51:59 +02:00
Ingo Molnar	45bf76df48	sched: cfs, add load-calculation methods add the new load-calculation methods of CFS. Signed-off-by: Ingo Molnar <mingo@elte.hu>	2007-07-09 18:51:59 +02:00
Ingo Molnar	14531189f0	sched: clean up __normal_prio() position clean up: move __normal_prio() in head of normal_prio(). no code changed. Signed-off-by: Ingo Molnar <mingo@elte.hu>	2007-07-09 18:51:59 +02:00
Ingo Molnar	71f8bd4600	sched: cleanup: move dequeue/enqueue_task() cleanup: move dequeue/enqueue_task() to a more logical place, to not split up __normal_prio()/normal_prio(). Signed-off-by: Ingo Molnar <mingo@elte.hu>	2007-07-09 18:51:59 +02:00
Ingo Molnar	c24d20dbef	sched: move around resched_task() move resched_task()/resched_cpu() into the 'public interfaces' section of sched.c, for use by kernel/sched_fair/rt/idletask.c Signed-off-by: Ingo Molnar <mingo@elte.hu>	2007-07-09 18:51:59 +02:00
Ingo Molnar	e05606d330	sched: clean up the rt priority macros clean up the rt priority macros, pointed out by Andrew Morton. Signed-off-by: Ingo Molnar <mingo@elte.hu>	2007-07-09 18:51:59 +02:00
Ingo Molnar	138a8aeb5b	sched: add cfs_rq ops add the set_task_cfs_rq() abstraction needed by CONFIG_FAIR_GROUP_SCHED. (not activated yet) Signed-off-by: Ingo Molnar <mingo@elte.hu>	2007-07-09 18:51:58 +02:00
Ingo Molnar	41b86e9c51	sched: make posix-cpu-timers use CFS's accounting information update the posix-cpu-timers code to use CFS's CPU accounting information. Signed-off-by: Ingo Molnar <mingo@elte.hu>	2007-07-09 18:51:58 +02:00
Ingo Molnar	20d315d42a	sched: add rq_clock()/__rq_clock() add rq_clock()/__rq_clock(), a robust wrapper around sched_clock(), used by CFS. It protects against common type of sched_clock() problems (caused by hardware): time warps forwards and backwards. Signed-off-by: Ingo Molnar <mingo@elte.hu>	2007-07-09 18:51:58 +02:00
Ingo Molnar	6aa645ea5f	sched: cfs rq data types add the CFS rq data types to sched.c. (the old scheduler fields are still intact, they are removed by a later patch) Signed-off-by: Ingo Molnar <mingo@elte.hu>	2007-07-09 18:51:58 +02:00
Ingo Molnar	425e0968a2	sched: move code into kernel/sched_stats.h create sched_stats.h and move sched.c schedstats code into it. This cleans up sched.c a bit. no code changes are caused by this patch. Signed-off-by: Ingo Molnar <mingo@elte.hu>	2007-07-09 18:51:58 +02:00
Ingo Molnar	1df21055e3	sched: add init_idle_bootup_task() add the init_idle_bootup_task() callback to the bootup thread, unused at the moment. (CFS will use it to switch the scheduling class of the boot thread to the idle class) Signed-off-by: Ingo Molnar <mingo@elte.hu>	2007-07-09 18:51:58 +02:00
Ingo Molnar	f64f61145a	sched: remove sched_exit() remove sched_exit(): the elaborate dance of us trying to recover timeslices given to child tasks never really worked. CFS does not need it either. Signed-off-by: Ingo Molnar <mingo@elte.hu>	2007-07-09 18:51:58 +02:00
Ingo Molnar	c65cc87052	sched: uninline set_task_cpu() uninline set_task_cpu(): CFS will add more code to it. Signed-off-by: Ingo Molnar <mingo@elte.hu>	2007-07-09 18:51:58 +02:00
Ingo Molnar	0437e109e1	sched: zap the migration init / cache-hot balancing code the SMP load-balancer uses the boot-time migration-cost estimation code to attempt to improve the quality of balancing. The reason for this code is that the discrete priority queues do not preserve the order of scheduling accurately, so the load-balancer skips tasks that were running on a CPU 'recently'. this code is fundamental fragile: the boot-time migration cost detector doesnt really work on systems that had large L3 caches, it caused boot delays on large systems and the whole cache-hot concept made the balancing code pretty undeterministic as well. (and hey, i wrote most of it, so i can say it out loud that it sucks ;-) under CFS the same purpose of cache affinity can be achieved without any special cache-hot special-case: tasks are sorted in the 'timeline' tree and the SMP balancer picks tasks from the left side of the tree, thus the most cache-cold task is balanced automatically. Signed-off-by: Ingo Molnar <mingo@elte.hu>	2007-07-09 18:51:57 +02:00
Ingo Molnar	d15bcfdbe1	sched: rename idle_type/SCHED_IDLE enum idle_type (used by the load-balancer) clashes with the SCHED_IDLE name that we want to introduce. 'CPU_IDLE' instead of 'SCHED_IDLE' is more descriptive as well. Signed-off-by: Ingo Molnar <mingo@elte.hu>	2007-07-09 18:51:57 +02:00
Christoph Lameter	92c4ca5c3a	sched: fix next_interval determination in idle_balance() The intervals of domains that do not have SD_BALANCE_NEWIDLE must be considered for the calculation of the time of the next balance. Otherwise we may defer rebalancing forever. Siddha also spotted that the conversion of the balance interval to jiffies is missing. Fix that to. From: Srivatsa Vaddagiri <vatsa@linux.vnet.ibm.com> also continue the loop if !(sd->flags & SD_LOAD_BALANCE). Tested-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com> It did in fact trigger under all three of mainline, CFS, and -rt including CFS -- see below for a couple of emails from last Friday giving results for these three on the AMD box (where it happened) and on a single-quad NUMA-Q system (where it did not, at least not with such severity). Signed-off-by: Christoph Lameter <clameter@sgi.com> Signed-off-by: Ingo Molnar <mingo@elte.hu> Cc: <stable@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2007-06-24 08:59:11 -07:00
Linus Torvalds	fa490cfd15	Fix possible runqueue lock starvation in wait_task_inactive() Miklos Szeredi reported very long pauses (several seconds, sometimes more) on his T60 (with a Core2Duo) which he managed to track down to wait_task_inactive()'s open-coded busy-loop. He observed that an interrupt on one core tries to acquire the runqueue-lock but does not succeed in doing so for a very long time - while wait_task_inactive() on the other core loops waiting for the first core to deschedule a task (which it wont do while spinning in an interrupt handler). This rewrites wait_task_inactive() to do all its waiting optimistically without any locks taken at all, and then just double-check the end result with the proper runqueue lock held over just a very short section. If there were races in the optimistic wait, of a preemption event scheduled the process away, we simply re-synchronize, and start over. So the code now looks like this: repeat: /* Unlocked, optimistic looping! / rq = task_rq(p); while (task_running(rq, p)) cpu_relax(); / Get the real values / rq = task_rq_lock(p, &flags); running = task_running(rq, p); array = p->array; task_rq_unlock(rq, &flags); / Check them.. / if (unlikely(running)) { cpu_relax(); goto repeat; } / Preempted away? Yield if so.. */ if (unlikely(array)) { yield(); goto repeat; } Basically, that first "while()" loop is done entirely without any locking at all (and doesn't check for the case where the target process might have been preempted away), and so it's possibly "incorrect", but we don't really care. Both the runqueue used, and the "task_running()" check might be the wrong tests, but they won't oops - they just mean that we could possibly get the wrong results due to lack of locking and exit the loop early in the case of a race condition. So once we've exited the loop, we then get the proper (and careful) rq lock, and check the running/runnable state _safely_. And if it turns out that our quick-and-dirty and unsafe loop was wrong after all, we just go back and try it all again. (The patch also adds a lot of comments, which is the actual bulk of it all, to make it more obvious why we can do these things without holding the locks). Thanks to Miklos for all the testing and tracking it down. Tested-by: Miklos Szeredi <miklos@szeredi.hu> Acked-by: Ingo Molnar <mingo@elte.hu> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2007-06-18 11:52:55 -07:00
Ingo Molnar	a0f98a1cb7	sched: fix SysRq-N (normalize RT tasks) Gene Heskett reported the following problem while testing CFS: SysRq-N is not always effective in normalizing tasks back to SCHED_OTHER. The reason for that turns out to be the following bug: - normalize_rt_tasks() uses for_each_process() to iterate through all tasks in the system. The problem is, this method does not iterate through all tasks, it iterates through all thread groups. The proper mechanism to enumerate over all threads is to use a do_each_thread() + while_each_thread() loop. Reported-by: Gene Heskett <gene.heskett@gmail.com> Signed-off-by: Ingo Molnar <mingo@elte.hu> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2007-06-18 11:52:55 -07:00
Thomas Gleixner	98d8256739	Prevent going idle with softirq pending The NOHZ patch contains a check for softirqs pending when a CPU goes idle. The BUG is unrelated to NOHZ, it just was made visible by the NOHZ patch. The BUG showed up mainly on P4 / hyperthreading enabled machines which lead the investigations into the wrong direction in the first place. The real cause is in cond_resched_softirq(): cond_resched_softirq() is enabling softirqs without invoking the softirq daemon when softirqs are pending. This leads to the warning message in the NOHZ idle code: t1 runs softirq disabled code on CPU#0 interrupt happens, softirq is raised, but deferred (softirqs disabled) t1 calls cond_resched_softirq() enables softirqs via _local_bh_enable() calls schedule() t2 runs t1 is migrated to CPU#1 t2 is done and invokes idle() NOHZ detects the pending softirq Fix: change _local_bh_enable() to local_bh_enable() so the softirq daemon is invoked. Thanks to Anant Nitya for debugging this with great patience ! Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Signed-off-by: Ingo Molnar <mingo@elte.hu> Cc: <stable@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2007-05-23 20:14:15 -07:00
Rafael J. Wysocki	8bb7844286	Add suspend-related notifications for CPU hotplug Since nonboot CPUs are now disabled after tasks and devices have been frozen and the CPU hotplug infrastructure is used for this purpose, we need special CPU hotplug notifications that will help the CPU-hotplug-aware subsystems distinguish normal CPU hotplug events from CPU hotplug events related to a system-wide suspend or resume operation in progress. This patch introduces such notifications and causes them to be used during suspend and resume transitions. It also changes all of the CPU-hotplug-aware subsystems to take these notifications into consideration (for now they are handled in the same way as the corresponding "normal" ones). [oleg@tv-sign.ru: cleanups] Signed-off-by: Rafael J. Wysocki <rjw@sisk.pl> Cc: Gautham R Shenoy <ego@in.ibm.com> Cc: Pavel Machek <pavel@ucw.cz> Signed-off-by: Oleg Nesterov <oleg@tv-sign.ru> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2007-05-09 12:30:56 -07:00
Gautham R Shenoy	5be9361cdf	Eliminate lock_cpu_hotplug in kernel/schedc Eliminate lock_cpu_hotplug from kernel/sched.c and use sched_hotcpu_mutex instead to postpone a hotplug event. In the migration_call hotcpu callback function, take sched_hotcpu_mutex while handling the event CPU_LOCK_ACQUIRE and release it while handling CPU_LOCK_RELEASE event. [akpm@linux-foundation.org: fix deadlock] Signed-off-by: Gautham R Shenoy <ego@in.ibm.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2007-05-09 12:30:51 -07:00
Andrew Morton	d5f9f942c6	revert 'sched: redundant reschedule when set_user_nice() boosts a prio of a task from the "expired" array' Revert commit `bd53f96ca5`. Con says: This is no good, sorry. The one I saw originally was with the staircase deadline cpu scheduler in situ and was different. #define TASK_PREEMPTS_CURR(p, rq) \ ((p)->prio < (rq)->curr->prio) (((p)->prio < (rq)->curr->prio) && ((p)->array == (rq)->active)) This will fail to wake up a runqueue for a task that has been migrated to the expired array of a runqueue which is otherwise idle which can happen with smp balancing, Cc: Dmitry Adamushko <dmitry.adamushko@gmail.com> Cc: Con Kolivas <kernel@kolivas.org> Cc: Ingo Molnar <mingo@elte.hu> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2007-05-08 20:41:15 -07:00

... 4 5 6 7 8 ...

774 Commits