linux

Commit Graph

Author	SHA1	Message	Date
Thomas Gleixner	b5faba21a6	genirq: Prepare the handling of shared oneshot interrupts For level type interrupts we need to track how many threads are on flight to avoid useless interrupt storms when not all thread handlers have finished yet. Keep track of the woken threads and only unmask when there are no more threads in flight. Yes, I'm lazy and using a bitfield. But not only because I'm lazy, the main reason is that it's way simpler than using a refcount. A refcount based solution would need to keep track of various things like crashing the irq thread, spurious interrupts coming in, disables/enables, free_irq() and some more. The bitfield keeps the tracking simple and makes things just work. It's also nicely confined to the thread code pathes and does not require additional checks all over the place. Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Cc: Peter Zijlstra <peterz@infradead.org> LKML-Reference: <20110223234956.388095876@linutronix.de>	2011-02-25 20:24:21 +01:00
Thomas Gleixner	1204e95689	genirq: Make warning in handle_percpu_event useful The WARN_ON_ONCE in handle_percpu_event() which emits a warning when an action handler returns with interrupts enabled is not really useful. It does not reveal the interrupt number and handler function which caused it. Make it WARN_ONCE() and add the information. Reported-by: Tony Luck <tony.luck@intel.com> Signed-off-by: Thomas Gleixner <tglx@linutronix.de>	2011-02-25 17:17:18 +01:00
Heiko Carstens	7e9498705e	sched: Add #ifdef around irq time accounting functions Get rid of this: kernel/sched.c:3731:13: warning: 'irqtime_account_idle_ticks' defined but not used kernel/sched.c:3732:13: warning: 'irqtime_account_process_tick' defined but not used Signed-off-by: Heiko Carstens <heiko.carstens@de.ibm.com> Cc: Venkatesh Pallipadi <venki@google.com> Cc: Peter Zijlstra <a.p.zijlstra@chello.nl> LKML-Reference: <20110225133228.GD7469@osiris.boeblingen.de.ibm.com> Signed-off-by: Ingo Molnar <mingo@elte.hu>	2011-02-25 14:39:48 +01:00
Peter Zijlstra	768a06e2ca	perf: Simplify task_clock_event_read() There is no point in us having different code paths for nmi and !nmi here, so remove the !nmi one. Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl> Cc: Stephane Eranian <eranian@google.com> LKML-Reference: <new-submission> Signed-off-by: Ingo Molnar <mingo@elte.hu>	2011-02-23 11:35:47 +01:00
Stephane Eranian	3f7cce3c18	perf_events: Fix rcu and locking issues with cgroup support This patches ensures that we do not end up calling perf_cgroup_from_task() when there is no cgroup event. This avoids potential RCU and locking issues. The change in perf_cgroup_set_timestamp() ensures we check against ctx->nr_cgroups. It also avoids calling perf_clock() tiwce in a row. It also ensures we do need to grab ctx->lock before calling the function. We drop update_cgrp_time() from task_clock_event_read() because it is not needed. This also avoids having to deal with perf_cgroup_from_task(). Thanks to Peter Zijlstra for his help on this. Signed-off-by: Stephane Eranian <eranian@gmail.com> Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl> LKML-Reference: <4d5e76b8.815bdf0a.7ac3.774f@mx.google.com> Signed-off-by: Ingo Molnar <mingo@elte.hu>	2011-02-23 11:35:46 +01:00
Mike Galbraith	511f67a599	sched, autogroup: Stop claiming ownership of the root task group Disown it, and only display autogroup association if one exists. Signed-off-by: Mike Galbraith <efault@gmx.de> Reviewed-by: Yong Zhang <yong.zhang0@gmail.com> Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl> LKML-Reference: <1298383320.8036.5.camel@marge.simson.net> Signed-off-by: Ingo Molnar <mingo@elte.hu>	2011-02-23 11:34:03 +01:00
Yong Zhang	800d4d30c8	sched, autogroup: Stop going ahead if autogroup is disabled when autogroup is disable from the beginning, sched_autogroup_create_attach() autogroup_move_group() <== 1 sched_move_task() <== 2 task_move_group_fair() set_task_rq() task_group() autogroup_task_group() We go the whole path without doing anything useful. Then stop going further if autogroup is disabled. But there will be a race window between 1 and 2, in which sysctl_sched_autogroup_enabled is enabled. This issue will be toke by following patch. Signed-off-by: Yong Zhang <yong.zhang0@gmail.com> Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl> Cc: Mike Galbraith <efault@gmx.de> LKML-Reference: <1298185696-4403-4-git-send-email-yong.zhang0@gmail.com> Signed-off-by: Ingo Molnar <mingo@elte.hu>	2011-02-23 11:33:59 +01:00
Yong Zhang	1747b21fec	sched, autogroup, sysctl: Use proc_dointvec_minmax() instead sched_autogroup_enabled has min/max value, proc_dointvec_minmax() is be used for this case. Signed-off-by: Yong Zhang <yong.zhang0@gmail.com> Cc: Mike Galbraith <efault@gmx.de> Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl> LKML-Reference: <1298185696-4403-2-git-send-email-yong.zhang0@gmail.com> Signed-off-by: Ingo Molnar <mingo@elte.hu>	2011-02-23 11:33:58 +01:00
Peter Zijlstra	866ab43efd	sched: Fix the group_imb logic On a 262 machine something like: taskset -c 3-11 bash -c 'for ((i=0;i<9;i++)) do while :; do :; done & done' _should_ result in 9 busy CPUs, each running 1 task. However it didn't quite work reliably, most of the time one cpu of the second socket (6-11) would be idle and one cpu of the first socket (0-5) would have two tasks on it. The group_imb logic is supposed to deal with this and detect when a particular group is imbalanced (like in our case, 0-2 are idle but 3-5 will have 4 tasks on it). The detection phase needed a bit of a tweak as it was too weak and required more than 2 avg weight tasks difference between idle and busy cpus in the group which won't trigger for our test-case. So cure that to be one or more avg task weight difference between cpus. Once the detection phase worked, it was then defeated by the f_b_g() tests trying to avoid ping-pongs. In particular, this_load >= max_load triggered because the pulling cpu (the (first) idle cpu in on the second socket, say 6) would find this_load to be 5 and max_load to be 4 (there'd be 5 tasks running on our socket and only 4 on the other socket). Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl> Cc: Nikhil Rao <ncrao@google.com> Cc: Venkatesh Pallipadi <venki@google.com> Cc: Suresh Siddha <suresh.b.siddha@intel.com> Cc: Mike Galbraith <efault@gmx.de> LKML-Reference: <new-submission> Signed-off-by: Ingo Molnar <mingo@elte.hu>	2011-02-23 11:33:57 +01:00
Peter Zijlstra	cc57aa8f4b	sched: Clean up some f_b_g() comments The existing comment tends to grow state (as it already has), split it up and place it near the actual tests. Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl> Cc: Nikhil Rao <ncrao@google.com> Cc: Venkatesh Pallipadi <venki@google.com> Cc: Suresh Siddha <suresh.b.siddha@intel.com> Cc: Mike Galbraith <efault@gmx.de> LKML-Reference: <new-submission> Signed-off-by: Ingo Molnar <mingo@elte.hu>	2011-02-23 11:33:56 +01:00
Peter Zijlstra	c186fafe9a	sched: Clean up remnants of sd_idle With the wholesale removal of the sd_idle SMT logic we can clean up some more. Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl> Cc: Nikhil Rao <ncrao@google.com> Cc: Venkatesh Pallipadi <venki@google.com> Cc: Suresh Siddha <suresh.b.siddha@intel.com> Cc: Mike Galbraith <efault@gmx.de> LKML-Reference: <new-submission> Signed-off-by: Ingo Molnar <mingo@elte.hu>	2011-02-23 11:33:55 +01:00
Ingo Molnar	d927dc9379	Merge commit 'v2.6.38-rc6' into sched/core Merge reason: Pick up the latest fixes before queueing up new changes. Signed-off-by: Ingo Molnar <mingo@elte.hu>	2011-02-23 11:31:38 +01:00
Jan Beulich	fd4afaf333	genirq: Streamline kernel/irq/Kconfig "def_bool n" without prompt is pointless, these should be just "bool". [ tglx: Adapted to latest changes ] Signed-off-by: Jan Beulich <jbeulich@novell.com> LKML-Reference: <4D5D3309020000780003264A@vpn.id2.novell.com> Signed-off-by: Thomas Gleixner <tglx@linutronix.de>	2011-02-22 22:33:10 +01:00
Thomas Gleixner	dbebbfbb16	rtmutex: tester: Remove the remaining BKL leftovers We just leave the numbers assinged as commemoration and in case that someone was crazy enough to reimplement the test stuff out of tree. Signed-off-by: Thomas Gleixner <tglx@linutronix.de>	2011-02-22 22:07:22 +01:00
Linus Torvalds	571020df6f	Merge branch 'irq-fixes-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip * 'irq-fixes-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip: genirq: Disable the SHIRQ_DEBUG call in request_threaded_irq for now genirq: Prevent access beyond allocated_irqs bitmap	2011-02-22 09:26:17 -08:00
Linus Torvalds	ee88347755	Merge branch 'perf-fixes-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip * 'perf-fixes-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip: perf: Fix throttle logic perf, x86: P4 PMU: Fix spurious NMI messages	2011-02-22 09:25:55 -08:00
Thomas Gleixner	70433c0161	genirq: Use the correct variable for note_interrupt note_interrupt wants to be called with the combined result of all handlers called, not with the last one. If it's a shared interrupt then the last handler might return IRQ_NONE often enough to trigger the spurious dectector which turns off a perfectly fine working interrupt line. Bug was introduced in commit 1277a532(genirq: Simplify handle_irq_event()). Yes, I really messed up there. First the variable ret should not have been named differently to avoid similarity with retval. Second it should have been declared in the do {} loop. Rename it to res and move it into the do {} loop and vanish under a huge brown paperbag. Reported-bisected-tested-by: Ingo Molnar <mingo@elte.hu> Signed-off-by: Thomas Gleixner <tglx@linutronix.de>	2011-02-22 13:02:03 +01:00
John Stultz	7fdd7f8900	timers: Export CLOCK_BOOTTIME via the posix timers interface This patch exports CLOCK_BOOTTIME through the posix timers interface CC: Jamie Lokier <jamie@shareable.org> CC: Thomas Gleixner <tglx@linutronix.de> CC: Alexander Shishkin <virtuoso@slind.org> CC: Arve Hjønnevåg <arve@android.com> Signed-off-by: John Stultz <john.stultz@linaro.org>	2011-02-21 12:53:09 -08:00
John Stultz	70a08cca12	timers: Add CLOCK_BOOTTIME hrtimer base CLOCK_MONOTONIC stops while the system is in suspend. This is because to applications system suspend is invisible. However, there is a growing set of applications that are wanting to be suspend-aware, but do not want to deal with the complications of CLOCK_REALTIME (which might jump around if settimeofday is called). For these applications, I propose a new clockid: CLOCK_BOOTTIME. CLOCK_BOOTTIME is idential to CLOCK_MONOTONIC, except it also includes any time spent in suspend. This patch add hrtimer base for CLOCK_BOOTTIME, using get_monotonic_boottime/ktime_get_boottime, to allow in kernel users to set timers against. CC: Jamie Lokier <jamie@shareable.org> CC: Thomas Gleixner <tglx@linutronix.de> CC: Alexander Shishkin <virtuoso@slind.org> CC: Arve Hjønnevåg <arve@android.com> Signed-off-by: John Stultz <john.stultz@linaro.org>	2011-02-21 12:53:08 -08:00
John Stultz	314ac37150	time: Extend get_xtime_and_monotonic_offset() to also return sleep Extend get_xtime_and_monotonic_offset to get_xtime_and_monotonic_and_sleep_offset(). CC: Jamie Lokier <jamie@shareable.org> CC: Thomas Gleixner <tglx@linutronix.de> CC: Alexander Shishkin <virtuoso@slind.org> CC: Arve Hjønnevåg <arve@android.com> Signed-off-by: John Stultz <john.stultz@linaro.org>	2011-02-21 12:53:07 -08:00
John Stultz	abb3a4ea2e	time: Introduce get_monotonic_boottime and ktime_get_boottime This adds new functions that return the monotonic time since boot (in other words, CLOCK_MONOTONIC + suspend time). CC: Jamie Lokier <jamie@shareable.org> CC: Thomas Gleixner <tglx@linutronix.de> CC: Alexander Shishkin <virtuoso@slind.org> CC: Arve Hjønnevåg <arve@android.com> Signed-off-by: John Stultz <john.stultz@linaro.org>	2011-02-21 12:53:05 -08:00
John Stultz	e06383db9e	hrtimers: extend hrtimer base code to handle more then 2 clockids The hrtimer code is written mainly with CLOCK_REALTIME and CLOCK_MONOTONIC in mind. These are clockids 0 and 1 resepctively. However, if we are to introduce any new hrtimer bases, using new clockids, we have to skip the cputimers (clockids 2,3) as well as other clockids that may not impelement timers. This patch adds a little bit of indirection between the clockid and the base, so that we can extend the base by one when we add a new clockid at number 7 or so. CC: Jamie Lokier <jamie@shareable.org> CC: Thomas Gleixner <tglx@linutronix.de> CC: Alexander Shishkin <virtuoso@slind.org> CC: Arve Hjønnevåg <arve@android.com> Signed-off-by: John Stultz <john.stultz@linaro.org>	2011-02-21 12:53:04 -08:00
Thomas Gleixner	8fff39e069	genirq: Add missing break in __irq_set_trigger() The switch case in __irq_set_trigger() lacks a break, which emits a pr_err unconditionally on success. Reported-by: Lars-Peter Clausen <lars@metafoo.de> Signed-off-by: Thomas Gleixner <tglx@linutronix.de>	2011-02-21 21:20:00 +01:00
Yinghai Lu	ed4dea6e0e	genirq: Use IRQ_BITMAP_BITS as search size in irq_alloc_descs() The runtime expansion of nr_irqs does not take into account that bitmap_find_next_zero_area() returns "start" + size in case the search for an matching zero area fails. That results in a start value which can be completely off and is not covered by the following expand_nr_irqs() and possibly outside of the absolute limit. But we use it without further checking. Use IRQ_BITMAP_BITS as the limit for the bitmap search and expand nr_irqs when the start bit is beyond nr_irqs. So start is always pointing to the correct area in the bitmap. nr_irqs is just the limit for irq enumerations, not the real limit for the irq space. [ tglx: Let irq_expand_nr_irqs() take the new upper end so we do not expand nr_irqs more than necessary. Made changelog readable ] Signed-off-by: Yinghai Lu <yinghai@kernel.org> LKML-Reference: <4D6014F9.8040605@kernel.org> Signed-off-by: Thomas Gleixner <tglx@linutronix.de>	2011-02-21 21:20:00 +01:00
Thomas Gleixner	a61d825808	genirq: Fix misplaced status update in irq_disable() We lazy disable interrupt lines, so only mark the line masked, when the chip provides an irq_disable callback. Signed-off-by: Thomas Gleixner <tglx@linutronix.de>	2011-02-21 21:19:59 +01:00
Tejun Heo	24d51add74	workqueue: fix build failure introduced by s/freezeable/freezable/ wq:fixes-2.6.38 does s/WQ_FREEZEABLE/WQ_FREEZABLE and wq:for-2.6.39 adds new usage of the flag. The combination of the two creates a build failure after merge. Fix it by renaming all freezeables to freezables. Signed-off-by: Tejun Heo <tj@kernel.org> Reported-by: Stephen Rothwell <sfr@canb.auug.org.au>	2011-02-21 10:07:23 +01:00
Tejun Heo	43d133c18b	Merge branch 'master' into for-2.6.39	2011-02-21 09:43:56 +01:00
David S. Miller	da935c66ba	Merge branch 'master' of master.kernel.org:/pub/scm/linux/kernel/git/davem/net-2.6 Conflicts: Documentation/feature-removal-schedule.txt drivers/net/e1000e/netdev.c net/xfrm/xfrm_policy.c	2011-02-19 19:17:35 -08:00
Thomas Gleixner	a439520f8b	genirq: Implement irq_data based move_*_irq() versions No need to lookup the irq descriptor when calling from a chip callback function which has irq_data already handy. Signed-off-by: Thomas Gleixner <tglx@linutronix.de>	2011-02-19 12:58:25 +01:00
Thomas Gleixner	77694b408a	genirq; Add fasteoi irq_chip quirk Some chips want irq_eoi() only called when an interrupt is actually handled. So they have checks for INPROGRESS and DISABLED in their irq_eoi callbacks. Add a chip flag, which allows to handle that in the generic code. No impact on the fastpath. Signed-off-by: Thomas Gleixner <tglx@linutronix.de>	2011-02-19 12:58:24 +01:00
Thomas Gleixner	781295762d	genirq: Add preflow handler support sparc64 needs to call a preflow handler on certain interrupts befor calling the action chain. Integrate it into handle_fasteoi_irq. Must be enabled via CONFIG_IRQ_FASTEOI_PREFLOW. No impact when disabled. Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Cc: David S. Miller <davem@davemloft.net>	2011-02-19 12:58:24 +01:00
Thomas Gleixner	3836ca08aa	genirq: Consolidate set_chip_handler functions No need to have separate functions if we have one plus inline wrappers. Signed-off-by: Thomas Gleixner <tglx@linutronix.de>	2011-02-19 12:58:23 +01:00
Thomas Gleixner	02725e7471	genirq: Use irq_get/put functions Convert the management functions to use the common irq_get/put function. Signed-off-by: Thomas Gleixner <tglx@linutronix.de>	2011-02-19 12:58:23 +01:00
Thomas Gleixner	d5eb4ad2df	genirq: Implement irq_get/put_desc_[bus]locked/unlock() Most of the managing functions get the irq descriptor and lock it - either with or without buslock. Instead of open coding this over and over provide a common function to do that. Signed-off-by: Thomas Gleixner <tglx@linutronix.de>	2011-02-19 12:58:23 +01:00
Thomas Gleixner	091738a266	genirq: Remove real old transition functions These transition helpers are stale for years now. Remove them. Signed-off-by: Thomas Gleixner <tglx@linutronix.de>	2011-02-19 12:58:23 +01:00
Thomas Gleixner	a6967caf00	genirq: Remove desc->status when GENERIC_HARDIRQS_NO_COMPAT=y If everything uses the right accessors, then enabling GENERIC_HARDIRQS_NO_COMPAT should just work. If not it will tell you. Don't be lazy and use the trick which I use in the core code! git grep status_use_accessors will unearth it in a split second. Offenders are tracked down and not slapped with stinking trouts. This time we use frozen shark for a better educational value. Signed-off-by: Thomas Gleixner <tglx@linutronix.de>	2011-02-19 12:58:22 +01:00
Thomas Gleixner	e1ef824146	genirq: Reflect IRQ_MOVE_PCNTXT in irq_data state Required by x86. Signed-off-by: Thomas Gleixner <tglx@linutronix.de>	2011-02-19 12:58:22 +01:00
Thomas Gleixner	7f94226f03	genirq: Move wakeup state to irq_data Some irq_chips need to know the state of wakeup mode for setting the trigger type etc. Reflect it in irq_data state. Signed-off-by: Thomas Gleixner <tglx@linutronix.de>	2011-02-19 12:58:22 +01:00
Thomas Gleixner	d4d5e08960	genirq: Add IRQCHIP_SET_TYPE_MASKED flag irq_chips, which require to mask the chip before changing the trigger type should set this flag. So the core takes care of it and the requirement for looking into desc->status in the chip goes away. Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Cc: Linus Walleij <linus.walleij@stericsson.com> Cc: Lars-Peter Clausen <lars@metafoo.de>	2011-02-19 12:58:22 +01:00
Thomas Gleixner	5d4d8fc9ac	genirq: Cleanup irq.h Put the constants into an enum and document them. Signed-off-by: Thomas Gleixner <tglx@linutronix.de>	2011-02-19 12:58:21 +01:00
Thomas Gleixner	f9e4989eb8	genirq: Force wrapped access to desc->status in core code Force the usage of wrappers by another nasty CPP substitution. Signed-off-by: Thomas Gleixner <tglx@linutronix.de>	2011-02-19 12:58:21 +01:00
Thomas Gleixner	1ccb4e612f	genirq: Wrap the remaning IRQ_* flags Use wrappers to keep them away from the core code. Signed-off-by: Thomas Gleixner <tglx@linutronix.de>	2011-02-19 12:58:21 +01:00
Thomas Gleixner	876dbd4cc1	genirq: Mirror irq trigger type bits in irq_data.state That's the data structure chip functions get provided. Also allow them to signal the core code that they updated the flags in irq_data.state by returning IRQ_SET_MASK_OK_NOCOPY. The default is unchanged. The type bits should be accessed via: val = irqd_get_trigger_type(irqdata); and irqd_set_trigger_type(irqdata, val); Coders who access them directly will be tracked down and slapped with stinking trouts. Signed-off-by: Thomas Gleixner <tglx@linutronix.de>	2011-02-19 12:58:20 +01:00
Thomas Gleixner	2bdd10558c	genirq: Move IRQ_AFFINITY_SET to core Keep status in sync until last abuser is gone. Signed-off-by: Thomas Gleixner <tglx@linutronix.de>	2011-02-19 12:58:20 +01:00
Thomas Gleixner	bce43032ad	genirq: Reuse existing can set affinty check Add a !desc check while at it. Signed-off-by: Thomas Gleixner <tglx@linutronix.de>	2011-02-19 12:58:20 +01:00
Thomas Gleixner	a005677b3d	genirq: Mirror IRQ_PER_CPU and IRQ_NO_BALANCING in irq_data.state That's the right data structure to look at for arch code. Accessor functions are provided. irqd_is_per_cpu(irqdata); irqd_can_balance(irqdata); Coders who access them directly will be tracked down and slapped with stinking trouts. Signed-off-by: Thomas Gleixner <tglx@linutronix.de>	2011-02-19 12:58:20 +01:00
Thomas Gleixner	1ce6068dac	genirq: Move debug code to separate header It'll break when I'm going to undefine the constants. Signed-off-by: Thomas Gleixner <tglx@linutronix.de>	2011-02-19 12:58:19 +01:00
Thomas Gleixner	fae581e588	genirq: Remove CHECK_IRQ_PER_CPU from core code Signed-off-by: Thomas Gleixner <tglx@linutronix.de>	2011-02-19 12:58:19 +01:00
Thomas Gleixner	6a58fb3bad	genirq: Remove CONFIG_IRQ_PER_CPU The saving of this switch is minimal versus the ifdef mess it creates. Simple enable PER_CPU unconditionally and remove the config switch. Signed-off-by: Thomas Gleixner <tglx@linutronix.de>	2011-02-19 12:58:19 +01:00
Thomas Gleixner	f230b6d5c4	genirq: Add IRQ_MOVE_PENDING to irq_data.state chip implementations need to know about it. Keep status in sync until all users are fixed. Accessor function: irqd_is_setaffinity_pending(irqdata) Coders who access them directly will be tracked down and slapped with stinking trouts. Signed-off-by: Thomas Gleixner <tglx@linutronix.de>	2011-02-19 12:58:18 +01:00
Thomas Gleixner	6d2cd17fde	genirq: Move IRQ_WAKEUP to core No users outside of core. Signed-off-by: Thomas Gleixner <tglx@linutronix.de>	2011-02-19 12:58:18 +01:00
Thomas Gleixner	c531e8361f	genirq: Move IRQ_SUSPENDED to core No users outside of core. Signed-off-by: Thomas Gleixner <tglx@linutronix.de>	2011-02-19 12:58:18 +01:00
Thomas Gleixner	6e40262ea4	genirq: Move IRQ_MASKED to core Keep status in sync until all users are fixed. Signed-off-by: Thomas Gleixner <tglx@linutronix.de>	2011-02-19 12:58:17 +01:00
Thomas Gleixner	2a0d6fb335	genirq: Move IRQ_PENDING flag to core Keep status in sync until all users are fixed. Signed-off-by: Thomas Gleixner <tglx@linutronix.de>	2011-02-19 12:58:17 +01:00
Thomas Gleixner	c1594b77e4	genirq: Move IRQ_DISABLED to core Keep status in sync until all abusers are fixed. Signed-off-by: Thomas Gleixner <tglx@linutronix.de>	2011-02-19 12:58:17 +01:00
Thomas Gleixner	163ef30911	genirq: Move IRQ_REPLAY and IRQ_WAITING to core No users outside of core. Signed-off-by: Thomas Gleixner <tglx@linutronix.de>	2011-02-19 12:58:16 +01:00
Thomas Gleixner	3d67baec7f	genirq: Move IRQ_ONESHOT to core No users outside of core. Signed-off-by: Thomas Gleixner <tglx@linutronix.de>	2011-02-19 12:58:16 +01:00
Thomas Gleixner	009b4c3b8a	genirq: Add IRQ_INPROGRESS to core We need to maintain the flag for now in both fields status and istate. Add a CONFIG_GENERIC_HARDIRQS_NO_COMPAT switch to allow testing w/o the status one. Wrap the access to status IRQ_INPROGRESS in a inline which can be turned of with CONFIG_GENERIC_HARDIRQS_NO_COMPAT along with the define. There is no reason that anything outside of core looks at this. That needs some modifications, but we'll get there. Signed-off-by: Thomas Gleixner <tglx@linutronix.de>	2011-02-19 12:58:16 +01:00
Thomas Gleixner	6954b75b48	genirq: Move IRQ_POLL_INPROGRESS to core No users outside of core. Signed-off-by: Thomas Gleixner <tglx@linutronix.de>	2011-02-19 12:58:15 +01:00
Thomas Gleixner	6f91a52d9b	genirq: Use modify_status for set_irq_nested_thread No need for a separate function in the core code. Signed-off-by: Thomas Gleixner <tglx@linutronix.de>	2011-02-19 12:58:15 +01:00
Thomas Gleixner	7acdd53e5b	genirq: Move IRQ_SPURIOUS_DISABLED to core state No users outside. Signed-off-by: Thomas Gleixner <tglx@linutronix.de>	2011-02-19 12:58:15 +01:00
Thomas Gleixner	bd062e7667	genirq: Move IRQ_AUTODETECT to internal state No users outside of core Signed-off-by: Thomas Gleixner <tglx@linutronix.de>	2011-02-19 12:58:15 +01:00
Thomas Gleixner	e6bea9c404	genirq: Protect tglx from tripping over his own feet The irq_desc.status field will either go away or renamed to settings. Anyway we need to maintain compatibility to avoid breaking the world and some more. While moving bits into the core, I need to avoid that I use any of the still existing IRQ_ bits in the core code by typos. So that file will hold the inline wrappers and some nasty CPP tricks to break the build when typoed. Signed-off-by: Thomas Gleixner <tglx@linutronix.de>	2011-02-19 12:58:14 +01:00
Thomas Gleixner	dbec07bac6	genirq: Add internal state field to irq_desc That field will contain internal state information which is not going to be exposed to anything outside the core code - except via accessor functions. I'm tired of everyone fiddling in irq_desc.status. core_internal_state__do_not_mess_with_it is clear enough, annoying to type and easy to grep for. Offenders will be tracked down and slapped with stinking trouts. Signed-off-by: Thomas Gleixner <tglx@linutronix.de>	2011-02-19 12:58:14 +01:00
Thomas Gleixner	35e857cbeb	genirq: Fixup core code namespace fallout Signed-off-by: Thomas Gleixner <tglx@linutronix.de>	2011-02-19 12:58:13 +01:00
Thomas Gleixner	c78b9b65fa	genirq: Implement generic irq_show_interrupts() All archs implement show_interrupts() in more or less the same way. That's tons of duplicated code with different bugs with no value. Implement a generic version and deprecate show_interrupts() Unfortunately we need some ifdeffery for !GENERIC_HARDIRQ archs. Signed-off-by: Thomas Gleixner <tglx@linutronix.de>	2011-02-19 12:58:13 +01:00
Thomas Gleixner	1277a5325a	genirq: Simplify handle_irq_event() Now that all core users are converted one layer can go. Signed-off-by: Thomas Gleixner <tglx@linutronix.de>	2011-02-19 12:58:13 +01:00
Thomas Gleixner	0877d66257	genirq: Use handle_irq_event() in the spurious poll code Signed-off-by: Thomas Gleixner <tglx@linutronix.de>	2011-02-19 12:58:13 +01:00
Thomas Gleixner	849f061c25	genirq: Use handle_perpcu_event() in handle_percpu_irq() Signed-off-by: Thomas Gleixner <tglx@linutronix.de>	2011-02-19 12:58:12 +01:00
Thomas Gleixner	a60a5dc2db	genirq: Use handle_irq_event() in handle_edge_irq() It's safe to drop the IRQ_INPROGRESS flag between action chain walks as we are protected by desc->lock. Signed-off-by: Thomas Gleixner <tglx@linutronix.de>	2011-02-19 12:58:12 +01:00
Thomas Gleixner	a7ae4de5c8	genirq: Use handle_irq_event() in handle_fasteoi_irq() Signed-off-by: Thomas Gleixner <tglx@linutronix.de>	2011-02-19 12:58:12 +01:00
Thomas Gleixner	1529866c63	genirq: Use handle_irq_event() in handle_level_irq() Signed-off-by: Thomas Gleixner <tglx@linutronix.de>	2011-02-19 12:58:12 +01:00
Thomas Gleixner	107781e721	genirq: Use handle_irq_event() in handle_simple_irq() Signed-off-by: Thomas Gleixner <tglx@linutronix.de>	2011-02-19 12:58:11 +01:00
Thomas Gleixner	4912609f22	genirq: Implement handle_irq_event() Core code replacement for the ugly camel case. It contains all the code which is shared in all handlers. clear status flags set INPROGRESS flag unlock call action chain note_interrupt lock clr INPROGRESS flag Signed-off-by: Thomas Gleixner <tglx@linutronix.de>	2011-02-19 12:58:11 +01:00
Thomas Gleixner	d78f8dd36b	genirq: Do not fiddle with IRQ_MASKED in handle_edge_irq() IRQ_MASKED is set in mask_ack_irq() anyway. Remove it from handle_edge_irq() to allow simpler ab^HHreuse of that function. Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Cc: Peter Zijlstra <peterz@infradead.org> LKML-Reference: <20110202212551.918484270@linutronix.de>	2011-02-19 12:58:11 +01:00
Thomas Gleixner	3aae994fb0	genirq: Consolidate IRQ_DISABLED Handle IRQ_DISABLED consistent. Signed-off-by: Thomas Gleixner <tglx@linutronix.de>	2011-02-19 12:58:11 +01:00
Thomas Gleixner	50f7c03275	genirq: Remove default magic Now that everything uses the wrappers, we can remove the default functions. None of those functions is performance critical. That makes the IRQ_MASKED flag tracking fully consistent. Signed-off-by: Thomas Gleixner <tglx@linutronix.de>	2011-02-19 12:58:10 +01:00
Thomas Gleixner	87923470c7	genirq: Consolidate disable/enable Create irq_disable/enable and use them to keep the flags consistent. Signed-off-by: Thomas Gleixner <tglx@linutronix.de>	2011-02-19 12:58:10 +01:00
Thomas Gleixner	4699923861	genirq: Consolidate startup/shutdown of interrupts Aside of duplicated code some of the startup/shutdown sites do not handle the MASKED/DISABLED flags and the depth field at all. Move that to a helper function and take care of it there. Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Cc: Peter Zijlstra <peterz@infradead.org> LKML-Reference: <20110202212551.787481468@linutronix.de>	2011-02-19 12:58:10 +01:00
Thomas Gleixner	3b56f0585f	genirq: Remove bogus conditional The if (chip->irq_shutdown) check will always evaluate to true, as we fill in chip->irq_shutdown with default_shutdown in irq_chip_set_defaults() if the chip does not provide its own function. Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Cc: Peter Zijlstra <peterz@infradead.org> LKML-Reference: <20110202212551.667607458@linutronix.de>	2011-02-19 12:58:10 +01:00
Thomas Gleixner	1535dfacbf	genirq: Move irq thread flags to core Soleley used in core code. Signed-off-by: Thomas Gleixner <tglx@linutronix.de>	2011-02-19 12:58:09 +01:00
Thomas Gleixner	fe200ae48e	genirq: Mark polled irqs and defer the real handler With the chip.end() function gone we might run into a situation where a poll call runs and the real interrupt comes in, sees IRQ_INPROGRESS and disables the line. That might be a perfect working one, which will then be masked forever. So mark them polled while the poll runs. When the real handler sees IRQ_INPROGRESS it checks the poll flag and waits for the polling to complete. Add the necessary amount of sanity checks to it to avoid deadlocks. Signed-off-by: Thomas Gleixner <tglx@linutronix.de>	2011-02-19 12:58:09 +01:00
Thomas Gleixner	d05c65fff0	genirq: spurious: Run only one poller at a time No point in running concurrent pollers which confuse each other by setting PENDING. Signed-off-by: Thomas Gleixner <tglx@linutronix.de>	2011-02-19 12:58:09 +01:00
Thomas Gleixner	c7259cd7af	genirq: Do not poll disabled, percpu and timer interrupts There is no point in polling disabled lines. percpu does not make sense at all because we only poll on the cpu we're currently running on. Also polling per_cpu interrupts is racy as hell. The handler runs without locking so we might get a huge surprise. If the timer interrupt needs polling, then we wont get there anyway. Signed-off-by: Thomas Gleixner <tglx@linutronix.de>	2011-02-19 12:58:09 +01:00
Thomas Gleixner	fa27271bc8	genirq: Fixup poll handling try_one_irq() contains redundant code and lots of useless checks for shared interrupts. Check for shared before setting IRQ_INPROGRESS and then call handle_IRQ_event() while pending. Shorter version with the same functionality. Signed-off-by: Thomas Gleixner <tglx@linutronix.de>	2011-02-19 12:58:08 +01:00
Thomas Gleixner	b738a50a20	genirq: Warn when handler enables interrupts We run all handlers with interrupts disabled and expect them not to enable them. Warn when we catch one who does. Signed-off-by: Thomas Gleixner <tglx@linutronix.de>	2011-02-19 12:58:08 +01:00
Thomas Gleixner	1082687e8d	genirq: Plug race in report_bad_irq() We cannot walk the action chain unlocked. Even if IRQ_INPROGRESS is set an action can be removed and we follow a null pointer. It's safe to take the lock there, because the code which removes the action will call synchronize_irq() which waits unlocked for IRQ_INPROGRESS going away. Signed-off-by: Thomas Gleixner <tglx@linutronix.de>	2011-02-19 12:58:08 +01:00
Thomas Gleixner	2b879eaf09	genirq: Remove redundant thread affinity setting Thread affinity is already set by setup_affinity(). Signed-off-by: Thomas Gleixner <tglx@linutronix.de>	2011-02-19 12:58:08 +01:00
Thomas Gleixner	3b8249e759	genirq: Do not copy affinity before set While rumaging through arch code I found that there are a few workarounds which deal with the fact that the initial affinity setting from request_irq() copies the mask into irq_data->affinity before the chip code is called. In the normal path we unconditionally copy the mask when the chip code returns 0. Copy after the code is called and add a return code IRQ_SET_MASK_OK_NOCOPY for the chip functions, which prevents the copy. That way we see the real mask when the chip function decided to truncate it further as some arches do. IRQ_SET_MASK_OK is 0, which is the current behaviour. Signed-off-by: Thomas Gleixner <tglx@linutronix.de>	2011-02-19 12:58:07 +01:00
Thomas Gleixner	569bda8df1	genirq: Always apply cpu online mask If the affinity had been set by the user, then a later request_irq() will honour that setting. But online cpus can have changed. So apply the online mask and for this case as well. Signed-off-by: Thomas Gleixner <tglx@linutronix.de>	2011-02-19 12:58:07 +01:00
Thomas Gleixner	b008207cbd	genirq: Rremove redundant check IRQ_NO_BALANCING is already checked in irq_can_set_affinity() above, no need to check it again. Signed-off-by: Thomas Gleixner <tglx@linutronix.de>	2011-02-19 12:58:07 +01:00
Thomas Gleixner	1fa46f1f07	genirq: Simplify affinity related code There is lot of #ifdef CONFIG_GENERIC_PENDING_IRQ along with duplicated code in the irq core. Move the #ifdeffery into one place and cleanup the code so it's readable. No functional change. Signed-off-by: Thomas Gleixner <tglx@linutronix.de>	2011-02-19 12:58:07 +01:00
Thomas Gleixner	a0cd9ca2b9	genirq: Namespace cleanup The irq namespace has become quite convoluted. My bad. Clean it up and deprecate the old functions. All new functions follow the scheme: irq number based: irq_set/get/xxx/_xxx(unsigned int irq, ...) irq_data based: irq_data_set/get/xxx/_xxx(struct irq_data d, ....) irq_desc based: irq_desc_get_xxx(struct irq_desc desc) Signed-off-by: Thomas Gleixner <tglx@linutronix.de>	2011-02-19 12:58:06 +01:00
Thomas Gleixner	43abe43ce0	genirq: Add missing buslock to set_irq_type(), set_irq_wake() chips behind a slow bus cannot update the chip under desc->lock, but we miss the chip_buslock/chip_bus_sync_unlock() calls around the set type and set wake functions. Signed-off-by: Thomas Gleixner <tglx@linutronix.de>	2011-02-19 12:58:06 +01:00
Thomas Gleixner	e7bcecb7b1	genirq: Make nr_irqs runtime expandable We face more and more the requirement to expand nr_irqs at runtime. The reason are irq expanders which can not be detected in the early boot stage. So we speculate nr_irqs to have enough room. Further Xen needs extra irq numbers and we really want to avoid adding more "detection" code into the early boot. There is no real good reason why we need to limit nr_irqs at early boot. Allow the allocation code to expand nr_irqs. We have already 8k extra number space in the allocation bitmap, so lets use it. Signed-off-by: Thomas Gleixner <tglx@linutronix.de>	2011-02-19 12:58:06 +01:00
Thomas Gleixner	218502bfe6	Merge branch 'irq/urgent' into irq/core Reason: Further patches are conflicting with mainline fixes Signed-off-by: Thomas Gleixner <tglx@linutronix.de>	2011-02-19 12:56:43 +01:00
Thomas Gleixner	6d83f94db9	genirq: Disable the SHIRQ_DEBUG call in request_threaded_irq for now With CONFIG_SHIRQ_DEBUG=y we call a newly installed interrupt handler in request_threaded_irq(). The original implementation (commit `a304e1b8`) called the handler _BEFORE_ it was installed, but that caused problems with handlers calling disable_irq_nosync(). See commit `377bf1e4`. It's braindead in the first place to call disable_irq_nosync in shared handlers, but .... Moving this call after we installed the handler looks innocent, but it is very subtle broken on SMP. Interrupt handlers rely on the fact, that the irq core prevents reentrancy. Now this debug call violates that promise because we run the handler w/o the IRQ_INPROGRESS protection - which we cannot apply here because that would result in a possibly forever masked interrupt line. A concurrent real hardware interrupt on a different CPU results in handler reentrancy and can lead to complete wreckage, which was unfortunately observed in reality and took a fricking long time to debug. Leave the code here for now. We want this debug feature, but that's not easy to fix. We really should get rid of those disable_irq_nosync() abusers and remove that function completely. Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Cc: Anton Vorontsov <avorontsov@ru.mvista.com> Cc: David Woodhouse <dwmw2@infradead.org> Cc: Arjan van de Ven <arjan@infradead.org> Cc: stable@kernel.org # .28 -> .37	2011-02-19 12:11:13 +01:00
Thomas Gleixner	c1ee626428	genirq: Prevent access beyond allocated_irqs bitmap Lars-Peter Clausen pointed out: I stumbled upon this while looking through the existing archs using SPARSE_IRQ. Even with SPARSE_IRQ the NR_IRQS is still the upper limit for the number of IRQs. Both PXA and MMP set NR_IRQS to IRQ_BOARD_START, with IRQ_BOARD_START being the number of IRQs used by the core. In various machine files the nr_irqs field of the ARM machine defintion struct is then set to "IRQ_BOARD_START + NR_BOARD_IRQS". As a result "nr_irqs" will greater then NR_IRQS which then again causes the "allocated_irqs" bitmap in the core irq code to be accessed beyond its size overwriting unrelated data. The core code really misses a sanity check there. This went unnoticed so far as by chance the compiler/linker places data behind that bitmap which gets initialized later on those affected platforms. So the obvious fix would be to add a sanity check in early_irq_init() and break all affected platforms. Though that check wants to be backported to stable as well, which will require to fix all known problematic platforms and probably some more yet not known ones as well. Lots of churn. A way simpler solution is to allocate a slightly larger bitmap and avoid the whole churn w/o breaking anything. Add a few warnings when an arch returns utter crap. Reported-by: Lars-Peter Clausen <lars@metafoo.de> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Cc: stable@kernel.org # .37 Cc: Haojian Zhuang <haojian.zhuang@marvell.com> Cc: Eric Miao <eric.y.miao@gmail.com> Cc: Peter Zijlstra <peterz@infradead.org>	2011-02-19 12:10:51 +01:00
Linus Torvalds	bc3adfc670	Merge branch 'fixes-2.6.38' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/wq * 'fixes-2.6.38' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/wq: workqueue: make sure MAYDAY_INITIAL_TIMEOUT is at least 2 jiffies long workqueue, freezer: unify spelling of 'freeze' + 'able' to 'freezable' workqueue: wake up a worker when a rescuer is leaving a gcwq	2011-02-18 12:36:06 -08:00
Richard Cochran	db1c1cce4a	ntp: Remove redundant and incorrect parameter check The ADJ_SETOFFSET code redundantly checks the range of the nanoseconds field of the time value. This field is checked again in the subsequent call to timekeeping_inject_offset(). Also, as is, the check will not detect whether the number of microseconds is out of range. Let timekeeping_inject_offset() do the error checking. Signed-off-by: Richard Cochran <richard.cochran@omicron.at> Cc: johnstul@us.ibm.com LKML-Reference: <20110218090724.GA2924@riccoc20.at.omicron.at> Signed-off-by: Thomas Gleixner <tglx@linutronix.de>	2011-02-18 17:01:12 +01:00
Ingo Molnar	e9345aab67	Revert "tracing: Add unstable sched clock note to the warning" This reverts commit `5e38ca8f3e`. Breaks the build of several !CONFIG_HAVE_UNSTABLE_SCHED_CLOCK architectures. Cc: Jiri Olsa <jolsa@redhat.com> Cc: Steven Rostedt <rostedt@goodmis.org> Message-ID: <20110217171823.GB17058@elte.hu> Signed-off-by: Ingo Molnar <mingo@elte.hu>	2011-02-18 08:09:49 +01:00
Ingo Molnar	5beda5f6e4	Merge branch 'tip/perf/core' of git://git.kernel.org/pub/scm/linux/kernel/git/rostedt/linux-2.6-trace into perf/core	2011-02-17 14:11:15 +01:00
Stanislaw Gruszka	2e725a065b	PM / Hibernate: Return error code when alloc_image_page() fails Currently we return 0 in swsusp_alloc() when alloc_image_page() fails. Fix that. Also remove unneeded "error" variable since the only useful value of error is -ENOMEM. [rjw: Fixed up the changelog and changed subject.] Signed-off-by: Stanislaw Gruszka <stf_xl@wp.pl> Cc: stable@kernel.org Signed-off-by: Rafael J. Wysocki <rjw@sisk.pl>	2011-02-16 21:53:52 +01:00
Tejun Heo	3233cdbd9f	workqueue: make sure MAYDAY_INITIAL_TIMEOUT is at least 2 jiffies long MAYDAY_INITIAL_TIMEOUT is defined as HZ / 100 and depending on configuration may end up 0 or 1. Even when it's 1, depending on when the mayday timer is added in the current jiffy interval, it may expire way before a jiffy has passed. Make sure MAYDAY_INITIAL_TIMEOUT is at least two to guarantee that at least a full jiffy has passed before calling rescuers. Signed-off-by: Tejun Heo <tj@kernel.org> Reported-by: Ray Jui <rjui@broadcom.com> Cc: stable@kernel.org	2011-02-16 18:10:19 +01:00
Tejun Heo	58a69cb47e	workqueue, freezer: unify spelling of 'freeze' + 'able' to 'freezable' There are two spellings in use for 'freeze' + 'able' - 'freezable' and 'freezeable'. The former is the more prominent one. The latter is mostly used by workqueue and in a few other odd places. Unify the spelling to 'freezable'. Signed-off-by: Tejun Heo <tj@kernel.org> Reported-by: Alan Stern <stern@rowland.harvard.edu> Acked-by: "Rafael J. Wysocki" <rjw@sisk.pl> Acked-by: Greg Kroah-Hartman <gregkh@suse.de> Acked-by: Dmitry Torokhov <dtor@mail.ru> Cc: David Woodhouse <dwmw2@infradead.org> Cc: Alex Dubov <oakad@yahoo.com> Cc: "David S. Miller" <davem@davemloft.net> Cc: Steven Whitehouse <swhiteho@redhat.com>	2011-02-16 17:48:59 +01:00
Steven Rostedt	48228f7b47	lockdep/timers: Explain in detail the locking problems del_timer_sync() may cause Twice I had to explain the output about why lockdep gives an error with locks in IRQ context and with del_timer_sync(). Might as well write it up and place it in the comments above the code in del_timer_sync(). Perhaps the next time this lockdep dump triggers people will understand the issues. It is a ticky issue and very subtle, explaining it in detail in the code may help others understand the issue when they stumble upon the bug again. Signed-off-by: Steven Rostedt <rostedt@goodmis.org> Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl> LKML-Reference: <1297186794.23343.19.camel@gandalf.stny.rr.com> Signed-off-by: Ingo Molnar <mingo@elte.hu>	2011-02-16 13:35:08 +01:00
Ingo Molnar	a3ec4a603f	Merge commit 'v2.6.38-rc5' into core/locking Merge reason: pick up upstream fixes. Signed-off-by: Ingo Molnar <mingo@elte.hu>	2011-02-16 13:33:41 +01:00
Venkatesh Pallipadi	46e49b3836	sched: Wholesale removal of sd_idle logic sd_idle logic was introduced way back in 2005 (commit `5969fe06`), as an HT optimization. As per the discussion in the thread here: lkml - sched: Resolve sd_idle and first_idle_cpu Catch-22 - v1 https://patchwork.kernel.org/patch/532501/ The capacity based logic in the load balancer right now handles this in a much cleaner way, handling more than 2 SMT siblings etc, and sd_idle does not seem to bring any additional benefits. sd_idle logic also has some bugs that has performance impact. Here is the patch that removes the sd_idle logic altogether. Also, there was a dependency of sched_mc_power_savings == 2, with sd_idle logic. Signed-off-by: Venkatesh Pallipadi <venki@google.com> Acked-by: Vaidyanathan Srinivasan <svaidy@linux.vnet.ibm.com> Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl> LKML-Reference: <1297723130-693-1-git-send-email-venki@google.com> Signed-off-by: Ingo Molnar <mingo@elte.hu>	2011-02-16 13:33:20 +01:00
Ingo Molnar	48fa4b8ecf	Merge commit 'v2.6.38-rc5' into sched/core Merge reason: Pick up upstream fixes. Signed-off-by: Ingo Molnar <mingo@elte.hu>	2011-02-16 13:31:55 +01:00
Peter Zijlstra	ba3dd36c67	perf: Optimize hrtimer events There is no need to re-initialize the hrtimer every time we start it, so don't do that (shaves a few cycles). Also, since we know hrtimers run at a fixed rate (nanoseconds) we can pre-compute the desired frequency at which they tick. This avoids us having to go through the whole adaptive frequency feedback logic (shaves another few cycles). Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl> LKML-Reference: <1297448589.5226.47.camel@laptop> Signed-off-by: Ingo Molnar <mingo@elte.hu>	2011-02-16 13:30:57 +01:00
Peter Zijlstra	163ec4354a	perf: Optimize throttling code By pre-computing the maximum number of samples per tick we can avoid a multiplication and a conditional since MAX_INTERRUPTS > max_samples_per_tick. Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl> LKML-Reference: <new-submission> Signed-off-by: Ingo Molnar <mingo@elte.hu>	2011-02-16 13:30:55 +01:00
Stephane Eranian	e5d1367f17	perf: Add cgroup support This kernel patch adds the ability to filter monitoring based on container groups (cgroups). This is for use in per-cpu mode only. The cgroup to monitor is passed as a file descriptor in the pid argument to the syscall. The file descriptor must be opened to the cgroup name in the cgroup filesystem. For instance, if the cgroup name is foo and cgroupfs is mounted in /cgroup, then the file descriptor is opened to /cgroup/foo. Cgroup mode is activated by passing PERF_FLAG_PID_CGROUP in the flags argument to the syscall. For instance to measure in cgroup foo on CPU1 assuming cgroupfs is mounted under /cgroup: struct perf_event_attr attr; int cgroup_fd, fd; cgroup_fd = open("/cgroup/foo", O_RDONLY); fd = perf_event_open(&attr, cgroup_fd, 1, -1, PERF_FLAG_PID_CGROUP); close(cgroup_fd); Signed-off-by: Stephane Eranian <eranian@google.com> [ added perf_cgroup_{exit,attach} ] Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl> LKML-Reference: <4d590250.114ddf0a.689e.4482@mx.google.com> Signed-off-by: Ingo Molnar <mingo@elte.hu>	2011-02-16 13:30:48 +01:00
Peter Zijlstra	d41d5a0163	cgroup: Fix cgroup_subsys::exit callback Make the ::exit method act like ::attach, it is after all very nearly the same thing. The bug had no effect on correctness - fixing it is an optimization for the scheduler. Also, later perf-cgroups patches rely on it. Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl> Acked-by: Paul Menage <menage@google.com> LKML-Reference: <1297160655.13327.92.camel@laptop> Signed-off-by: Ingo Molnar <mingo@elte.hu>	2011-02-16 13:30:47 +01:00
Ingo Molnar	b00560f2d4	Merge branch 'perf/urgent' into perf/core Merge reason: we need to queue up dependent patch Signed-off-by: Ingo Molnar <mingo@elte.hu>	2011-02-16 13:27:23 +01:00
Peter Zijlstra	4fe757dd48	perf: Fix throttle logic It was possible to call pmu::start() on an already running event. In particular this lead so some wreckage as the hrtimer events would re-initialize active timers. This was due to throttled events being activated again by scheduling. Scheduling in a context would add and force start events, resulting in running events with a possible throttle status. The next tick to hit that task will then try to unthrottle the event and call ->start() on an already running event. Reported-by: Jeff Moyer <jmoyer@redhat.com> Cc: <stable@kernel.org> Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl> LKML-Reference: <new-submission> Signed-off-by: Ingo Molnar <mingo@elte.hu>	2011-02-16 13:25:29 +01:00
Linus Torvalds	a1213b091c	Merge branches 'core-fixes-for-linus' and 'timers-fixes-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip * 'core-fixes-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip: Revert "lockdep, timer: Fix del_timer_sync() annotation" * 'timers-fixes-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip: timer debug: Hide kernel addresses via %pK in /proc/timer_list	2011-02-15 10:19:18 -08:00
Linus Torvalds	1cecd791f2	Merge branch 'perf-fixes-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip * 'perf-fixes-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip: x86: Fix text_poke_smp_batch() deadlock perf tools: Fix thread_map event synthesizing in top and record watchdog, nmi: Lower the severity of error messages ARM: oprofile: Fix backtraces in timer mode oprofile: Fix usage of CONFIG_HW_PERF_EVENTS for oprofile_perf_init and friends	2011-02-15 10:18:48 -08:00
Jiri Kosina	0a9d59a246	Merge branch 'master' into for-next	2011-02-15 10:24:31 +01:00
Chris Smith	d4f7e51323	sh: Enable CONFIG_GCOV_PROFILE_ALL for sh This patch enables gcov kernel profiling over the whole kernel for sh. Profiling of specific files individually already worked. A handful of files have to be explicitly excluded from the profiling to avoid breaking things, notably pmb.c. Signed-off-by: Chris Smith <chris.smith@st.com> Signed-off-by: Stuart Menefy <stuart.menefy@st.com> Signed-off-by: Paul Mundt <lethal@linux-sh.org>	2011-02-15 16:47:17 +09:00
Ingo Molnar	bf1af3a809	Merge branch 'tip/perf/urgent' of git://git.kernel.org/pub/scm/linux/kernel/git/rostedt/linux-2.6-trace into perf/core	2011-02-14 15:15:16 +01:00
Tejun Heo	7576958a9d	workqueue: wake up a worker when a rescuer is leaving a gcwq After executing the matching works, a rescuer leaves the gcwq whether there are more pending works or not. This may decrease the concurrency level to zero and stall execution until a new work item is queued on the gcwq. Make rescuer wake up a regular worker when it leaves a gcwq if there are more works to execute, so that execution isn't stalled. Signed-off-by: Tejun Heo <tj@kernel.org> Reported-by: Ray Jui <rjui@broadcom.com> Cc: stable@kernel.org	2011-02-14 14:04:46 +01:00
Masami Hiramatsu	0de4b34d46	tracing/kprobe: Fix NULL pointer deref check Add NULL check for avoiding NULL pointer deref. This bug has been introduced by: 1ff511e35ed8: tracing/kprobes: Add bitfield type which causes a null pointer dereference bug when kprobe-tracer parses an argument without type. Reported-by: Arnaldo Carvalho de Melo <acme@redhat.com> Signed-off-by: Masami Hiramatsu <masami.hiramatsu.pt@hitachi.com> Cc: 2nddept-manager@sdl.hitachi.co.jp Cc: Peter Zijlstra <peterz@infradead.org> LKML-Reference: <20110214054807.8919.69740.stgit@ltc236.sdl.hitachi.co.jp> Signed-off-by: Ingo Molnar <mingo@elte.hu> Reported-by: Arnaldo Carvalho de Melo <acme@ghostprotocols.net>	2011-02-14 12:09:53 +01:00
Kees Cook	f590308536	timer debug: Hide kernel addresses via %pK in /proc/timer_list In the continuing effort to avoid kernel addresses leaking to unprivileged users, this patch switches to %pK for /proc/timer_list reporting. Signed-off-by: Kees Cook <kees.cook@canonical.com> Cc: John Stultz <johnstul@us.ibm.com> Cc: Dan Rosenberg <drosenberg@vsecurity.com> Cc: Eugene Teo <eugeneteo@kernel.org> Cc: Linus Torvalds <torvalds@linux-foundation.org> LKML-Reference: <20110212032125.GA23571@outflux.net> Signed-off-by: Ingo Molnar <mingo@elte.hu>	2011-02-12 14:11:56 +01:00
Linus Torvalds	f7909fb835	Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/jmorris/security-testing-2.6 * 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/jmorris/security-testing-2.6: pci: use security_capable() when checking capablities during config space read security: add cred argument to security_capable() tpm_tis: Use timeouts returned from TPM	2011-02-11 16:16:03 -08:00
Tejun Heo	01e05e9a90	ptrace: use safer wake up on ptrace_detach() The wake_up_process() call in ptrace_detach() is spurious and not interlocked with the tracee state. IOW, the tracee could be running or sleeping in any place in the kernel by the time wake_up_process() is called. This can lead to the tracee waking up unexpectedly which can be dangerous. The wake_up is spurious and should be removed but for now reduce its toxicity by only waking up if the tracee is in TRACED or STOPPED state. This bug can possibly be used as an attack vector. I don't think it will take too much effort to come up with an attack which triggers oops somewhere. Most sleeps are wrapped in condition test loops and should be safe but we have quite a number of places where sleep and wakeup conditions are expected to be interlocked. Although the window of opportunity is tiny, ptrace can be used by non-privileged users and with some loading the window can definitely be extended and exploited. Signed-off-by: Tejun Heo <tj@kernel.org> Acked-by: Roland McGrath <roland@redhat.com> Acked-by: Oleg Nesterov <oleg@redhat.com> Cc: <stable@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2011-02-11 16:12:19 -08:00
Steven Rostedt	868baf07b1	ftrace: Fix memory leak with function graph and cpu hotplug When the fuction graph tracer starts, it needs to make a special stack for each task to save the real return values of the tasks. All running tasks have this stack created, as well as any new tasks. On CPU hot plug, the new idle task will allocate a stack as well when init_idle() is called. The problem is that cpu hotplug does not create a new idle_task. Instead it uses the idle task that existed when the cpu went down. ftrace_graph_init_task() will add a new ret_stack to the task that is given to it. Because a clone will make the task have a stack of its parent it does not check if the task's ret_stack is already NULL or not. When the CPU hotplug code starts a CPU up again, it will allocate a new stack even though one already existed for it. The solution is to treat the idle_task specially. In fact, the function_graph code already does, just not at init_idle(). Instead of using the ftrace_graph_init_task() for the idle task, which that function expects the task to be a clone, have a separate ftrace_graph_init_idle_task(). Also, we will create a per_cpu ret_stack that is used by the idle task. When we call ftrace_graph_init_idle_task() it will check if the idle task's ret_stack is NULL, if it is, then it will assign it the per_cpu ret_stack. Reported-by: Benjamin Herrenschmidt <benh@kernel.crashing.org> Suggested-by: Peter Zijlstra <a.p.zijlstra@chello.nl> Cc: Stable Tree <stable@kernel.org> Signed-off-by: Steven Rostedt <rostedt@goodmis.org>	2011-02-11 16:23:33 -05:00
Arnaldo Carvalho de Melo	7c940c18c5	Merge remote branch 'acme/perf/urgent' into perf/core Fixups due to rename of event_t routines from event__ to perf_event__ done in perf/core. Conflicts: tools/perf/builtin-record.c tools/perf/builtin-top.c tools/perf/util/event.c tools/perf/util/event.h Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com>	2011-02-11 11:45:54 -02:00
Chris Wright	6037b715d6	security: add cred argument to security_capable() Expand security_capable() to include cred, so that it can be usable in a wider range of call sites. Signed-off-by: Chris Wright <chrisw@sous-sol.org> Acked-by: Serge Hallyn <serge.hallyn@canonical.com> Signed-off-by: James Morris <jmorris@namei.org>	2011-02-11 17:41:58 +11:00
Linus Torvalds	ee24aebffb	cap_syslog: accept CAP_SYS_ADMIN for now In commit `ce6ada35bd` ("security: Define CAP_SYSLOG") Serge Hallyn introduced CAP_SYSLOG, but broke backwards compatibility by no longer accepting CAP_SYS_ADMIN as an override (it would cause a warning and then reject the operation). Re-instate CAP_SYS_ADMIN - but keeping the warning - as an acceptable capability until any legacy applications have been updated. There are apparently applications out there that drop all capabilities except for CAP_SYS_ADMIN in order to access the syslog. (This is a re-implementation of a patch by Serge, cleaning the logic up and making the code more readable) Acked-by: Serge Hallyn <serge@hallyn.com> Reviewed-by: James Morris <jmorris@namei.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2011-02-10 17:53:55 -08:00
Thomas Gleixner	51327ada71	Merge branch 'irq/for-mips' into irq/core Reason: irq/for-mips is provided for mips to make core independent progress. Merge it into irq/core to avoid conflicts Signed-off-by: Thomas Gleixner <tglx@linutronix.de>	2011-02-11 00:27:03 +01:00
David Daney	986c011ddb	genirq: Call bus_lock/unlock functions in setup_irq() irq_chips that supply .irq_bus_lock/.irq_bus_sync_unlock functions, expect that the other chip methods will be called inside of calls to the pair. If this expectation is not met, things tend to not work. Make setup_irq() call chip_bus_lock()/chip_bus_sync_unlock() too. For the vast majority of irq_chips, this will be a NOP as most don't have these bus lock functions. [ tglx: No we don't want to call that in __setup_irq(). Way too many error exit pathes. ] Signed-off-by: David Daney <ddaney@caviumnetworks.com> LKML-Reference: <1297296265-18680-1-git-send-email-ddaney@caviumnetworks.com> Signed-off-by: Thomas Gleixner <tglx@linutronix.de>	2011-02-11 00:25:13 +01:00
Don Zickus	5651f7f47d	watchdog, nmi: Lower the severity of error messages During boot if the hardlockup detector fails to initialize, it complains very loudly. Some failures should be expected under certain situations, ie no lapics, or resource in-use. Tone those error messages down a bit. Keep the rest at a high level. Reported-by: Paul Bolle <pebolle@tiscali.nl> Tested-by: Paul Bolle <pebolle@tiscali.nl> Signed-off-by: Don Zickus <dzickus@redhat.com> Cc: Peter Zijlstra <peterz@infradead.org> LKML-Reference: <1297278153-21111-1-git-send-email-dzickus@redhat.com> Signed-off-by: Ingo Molnar <mingo@elte.hu>	2011-02-10 13:21:59 +01:00
Linus Torvalds	aceb91cd35	Merge branch 'for-linus' of git://git.kernel.dk/linux-2.6-block * 'for-linus' of git://git.kernel.dk/linux-2.6-block: cdrom: support devices that have check_events but not media_changed cfq-iosched: Don't wait if queue already has requests. blkio-throttle: Avoid calling blkiocg_lookup_group() for root group cfq: rename a function to give it more appropriate name cciss: make cciss_revalidate not loop through CISS_MAX_LUNS volumes unnecessarily. drivers/block/aoe/Makefile: replace the use of <module>-objs with <module>-y loop: queue_lock NULL pointer derefence in blk_throtl_exit drivers/block/Makefile: replace the use of <module>-objs with <module>-y blktrace: Don't output messages if NOTIFY isn't set.	2011-02-09 11:45:21 -08:00
Tejun Heo	4149efb22d	workqueue: add system_freezeable_wq Add system wide freezeable workqueue. Signed-off-by: Tejun Heo <tj@kernel.org> Acked-by: Dmitry Torokhov <dmitry.torokhov@gmail.com> Cc: "Rafael J. Wysocki" <rjw@sisk.pl>	2011-02-09 09:37:49 +01:00
Steven Rostedt	6752ab4a9c	tracing: Deprecate tracing_enabled for tracing_on tracing_enabled should not be used, it is heavy weight and does not do much in helping lower the overhead. tracing_on should be used instead. Warn users to use tracing_on when tracing_enabled is used as it will soon be removed from the tracing directory. Signed-off-by: Steven Rostedt <rostedt@goodmis.org>	2011-02-08 17:14:58 -05:00
Steven Rostedt	87d80de280	tracing: Remove obsolete sched_switch tracer The trace events sched_switch and sched_wakeup do the same thing as the stand alone sched_switch tracer does. It is no longer needed. Signed-off-by: Steven Rostedt <rostedt@goodmis.org>	2011-02-08 17:14:56 -05:00
Thomas Gleixner	44951a60ff	genirq: Remove dead code CONFIG_KSTAT_IRQS_ONDEMAND does not exist. It's not worth to implement it. Use sparse irqs if you care about memory consumption of the interrupt layer. Found by undertaker: http://vamos.informatik.uni-erlangen.de/trac/undertaker Signed-off-by: Thomas Gleixner <tglx@linutronix.de>	2011-02-08 19:37:13 +01:00
Thomas Gleixner	c305d524e5	softirq: Avoid stack switch from ksoftirqd ksoftirqd() calls do_softirq() which switches stacks on several architectures. That makes no sense at all. ksoftirqd's stack is sufficient. Call __do_softirq() directly. Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Acked-by: Peter Zijlstra <peterz@infradead.org> Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org> Cc: Heiko Carstens <heiko.carstens@de.ibm.com> Acked-by: David Miller <davem@davemloft.net> Cc: Paul Mundt <lethal@linux-sh.org> Reviewed-by: Frank Rowand <frank.rowand@am.sony.com> LKML-Reference: <alpine.LFD.2.00.1102021704530.31804@localhost6.localdomain6>	2011-02-08 19:37:12 +01:00
Jiri Olsa	5e38ca8f3e	tracing: Add unstable sched clock note to the warning The warning "Delta way too big" warning might appear on a system with unstable shed clock right after the system is resumed and tracing was enabled during the suspend. Since it's not realy bug, and the unstable sched clock is working fast and reliable otherwise, Steven suggested to keep using the sched clock in any case and just to make note in the warning itself. Signed-off-by: Jiri Olsa <jolsa@redhat.com> LKML-Reference: <1296649698-6003-1-git-send-email-jolsa@redhat.com> Signed-off-by: Steven Rostedt <rostedt@goodmis.org>	2011-02-08 11:57:22 -05:00
Thomas Gleixner	c9a443cdf7	Merge branch 'irq/for-xen' into irq/core irq/for-xen contains new functionality to avoid Xen private irq hackery. That branch has a single irq commit and is pulled by Xen to base their new features on. Merge it into irq/core as other patches modify the same code. Signed-off-by: Thomas Gleixner <tglx@linutronix.de>	2011-02-08 16:38:21 +01:00
Thomas Gleixner	dc5f219e88	genirq: Add IRQF_FORCE_RESUME Xen needs to reenable interrupts which are marked IRQF_NO_SUSPEND in the resume path. Add a flag to force the reenabling in the resume code. Tested-and-acked-by: Ian Campbell <Ian.Campbell@eu.citrix.com> Signed-off-by: Thomas Gleixner <tglx@linutronix.de>	2011-02-08 16:36:47 +01:00
Peter Zijlstra	7ff207928e	Revert "lockdep, timer: Fix del_timer_sync() annotation" Both attempts at trying to allow softirq usage for del_timer_sync() failed (produced bogus warnings), so revert the commit for this release: f266a5110d45: lockdep, timer: Fix del_timer_sync() annotation and try again later. Reported-by: Borislav Petkov <bp@alien8.de> Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Yong Zhang <yong.zhang0@gmail.com> Cc: Thomas Gleixner <tglx@linutronix.de> LKML-Reference: <1297174680.13327.107.camel@laptop> Signed-off-by: Ingo Molnar <mingo@elte.hu>	2011-02-08 16:18:39 +01:00
Ian Munsie	ae07f551c4	tracing/syscalls: Early terminate search for sys_ni_syscall Many system calls are unimplemented and mapped to sys_ni_syscall, but at boot ftrace would still search through every syscall metadata entry for a match which wouldn't be there. This patch adds causes the search to terminate early if the system call is not mapped. Signed-off-by: Ian Munsie <imunsie@au1.ibm.com> LKML-Reference: <1296703645-18718-7-git-send-email-imunsie@au1.ibm.com> Signed-off-by: Steven Rostedt <rostedt@goodmis.org>	2011-02-07 21:30:14 -05:00
Ian Munsie	b2d5549681	tracing/syscalls: Allow arch specific syscall symbol matching Some architectures have unusual symbol names and the generic code to match the symbol name with the function name for the syscall metadata will fail. For example, symbols on PPC64 start with a period and the generic code will fail to match them. This patch moves the match logic out into a separate function which an arch can override by defining ARCH_HAS_SYSCALL_MATCH_SYM_NAME in asm/ftrace.h and implementing arch_syscall_match_sym_name. Signed-off-by: Ian Munsie <imunsie@au1.ibm.com> LKML-Reference: <1296703645-18718-5-git-send-email-imunsie@au1.ibm.com> Signed-off-by: Steven Rostedt <rostedt@goodmis.org>	2011-02-07 21:29:03 -05:00
Ian Munsie	c763ba06bd	tracing/syscalls: Make arch_syscall_addr weak Some architectures use non-trivial system call tables and will not work with the generic arch_syscall_addr code. For example, PowerPC64 uses a table of twin long longs. This patch makes the generic arch_syscall_addr weak to allow architectures with non-trivial system call tables to override it. Signed-off-by: Ian Munsie <imunsie@au1.ibm.com> LKML-Reference: <1296703645-18718-4-git-send-email-imunsie@au1.ibm.com> Signed-off-by: Steven Rostedt <rostedt@goodmis.org>	2011-02-07 21:26:22 -05:00
Ian Munsie	3773b389b6	tracing/syscalls: Convert redundant syscall_nr checks into WARN_ON With the ftrace events now checking if the syscall_nr is valid upon initialisation it should no longer be possible to register or unregister a syscall event without a valid syscall_nr since they should not be created. This adds a WARN_ON_ONCE in the register and unregister functions to locate potential regressions in the future. Signed-off-by: Ian Munsie <imunsie@au1.ibm.com> LKML-Reference: <1296703645-18718-3-git-send-email-imunsie@au1.ibm.com> Signed-off-by: Steven Rostedt <rostedt@goodmis.org>	2011-02-07 21:25:52 -05:00
Ian Munsie	ba976970c7	tracing/syscalls: Don't add events for unmapped syscalls FTRACE_SYSCALLS would create events for each and every system call, even if it had failed to map the system call's name with it's number. This resulted in a number of events being created that would not behave as expected. This could happen, for example, on architectures who's symbol names are unusual and will not match the system call name. It could also happen with system calls which were mapped to sys_ni_syscall. This patch changes the default system call number in the metadata to -1. If the system call name from the metadata is not successfully mapped to a system call number during boot, than the event initialisation routine will now return an error, preventing the event from being created. Signed-off-by: Ian Munsie <imunsie@au1.ibm.com> LKML-Reference: <1296703645-18718-2-git-send-email-imunsie@au1.ibm.com> Signed-off-by: Steven Rostedt <rostedt@goodmis.org>	2011-02-07 21:24:44 -05:00
Steven Rostedt	4defe682d8	tracing/filter: Remove synchronize_sched() from __alloc_preds() Because the filters are processed first and then activated (added to the call), we no longer need to worry about the preds of the filter in __alloc_preds() being used. As the filter that is allocating preds is not activated yet. Signed-off-by: Steven Rostedt <rostedt@goodmis.org>	2011-02-07 20:56:20 -05:00
Steven Rostedt	75b8e98263	tracing/filter: Swap entire filter of events When creating a new filter, instead of allocating the filter to the event call first and then processing the filter, it is easier to process a temporary filter and then just swap it with the call filter. By doing this, it simplifies the code. A filter is allocated and processed, when it is done, it is swapped with the call filter, synchronize_sched() is called to make sure all callers are done with the old filter (filters are called with premption disabled), and then the old filter is freed. Cc: Tom Zanussi <tzanussi@gmail.com> Signed-off-by: Steven Rostedt <rostedt@goodmis.org>	2011-02-07 20:56:20 -05:00
Steven Rostedt	bf93f9ed3a	tracing/filter: Increase the max preds to 2^14 Now that the filter logic does not require to save the pred results on the stack, we can increase the max number of preds we allow. As the preds are index by a short value, and we use the MSBs as flags we can increase the max preds to 2^14 (16384) which should be way more than enough. Cc: Tom Zanussi <tzanussi@gmail.com> Signed-off-by: Steven Rostedt <rostedt@goodmis.org>	2011-02-07 20:56:20 -05:00
Steven Rostedt	4a3d27e98a	tracing/filter: Move MAX_FILTER_PRED to local tracing directory The MAX_FILTER_PRED is only needed by the kernel/trace/*.c files. Move it to kernel/trace/trace.h. Cc: Tom Zanussi <tzanussi@gmail.com> Signed-off-by: Steven Rostedt <rostedt@goodmis.org>	2011-02-07 20:56:20 -05:00
Steven Rostedt	43cd414552	tracing/filter: Optimize filter by folding the tree There are many cases that a filter will contain multiple ORs or ANDs together near the leafs. Walking up and down the tree to get to the next compare can be a waste. If there are several ORs or ANDs together, fold them into a single pred and allocate an array of the conditions that they check. This will speed up the filter by linearly walking an array and can still break out if a short circuit condition is met. Cc: Tom Zanussi <tzanussi@gmail.com> Signed-off-by: Steven Rostedt <rostedt@goodmis.org>	2011-02-07 20:56:19 -05:00
Steven Rostedt	ec126cac23	tracing/filter: Check the created pred tree Since the filter walks a tree to determine if a match is made or not, if the tree was incorrectly created, it could cause an infinite loop. Add a check to walk the entire tree before assigning it as a filter to make sure the tree is correct. Cc: Tom Zanussi <tzanussi@gmail.com> Signed-off-by: Steven Rostedt <rostedt@goodmis.org>	2011-02-07 20:56:19 -05:00
Steven Rostedt	5571927418	tracing/filter: Optimize short ciruit check The test if we should break out early for OR and AND operations can be optimized by comparing the current result with (pred->op == OP_OR) That is if the result is true and the op is an OP_OR, or if the result is false and the op is not an OP_OR (thus an OP_AND) we can break out early in either case. Otherwise we continue processing. Cc: Tom Zanussi <tzanussi@gmail.com> Signed-off-by: Steven Rostedt <rostedt@goodmis.org>	2011-02-07 20:56:19 -05:00
Steven Rostedt	61e9dea20e	tracing/filter: Use a tree instead of stack for filter_match_preds() Currently the filter_match_preds() requires a stack to push and pop the preds to determine if the filter matches the record or not. This has two drawbacks: 1) It requires a stack to store state information. As this is done in fast paths we can't allocate the storage for this stack, and we can't use a global as it must be re-entrant. The stack is stored on the kernel stack and this greatly limits how many preds we may allow. 2) All conditions are calculated even when a short circuit exists. a \|\| b will always calculate a and b even though a was determined to be true. Using a tree we can walk a constant structure that will save the state as we go. The algorithm is simply: pred = root; do { switch (move) { case MOVE_DOWN: if (OR or AND) { pred = left; continue; } if (pred == root) break; match = pred->fn(); pred = pred->parent; move = left child ? MOVE_UP_FROM_LEFT : MOVE_UP_FROM_RIGHT; continue; case MOVE_UP_FROM_LEFT: /* Only OR or AND can be a parent / if (match && OR \|\| !match && AND) { / short circuit */ if (pred == root) break; pred = pred->parent; move = left child ? MOVE_UP_FROM_LEFT : MOVE_UP_FROM_RIGHT; continue; } pred = pred->right; move = MOVE_DOWN; continue; case MOVE_UP_FROM_RIGHT: if (pred == root) break; pred = pred->parent; move = left child ? MOVE_UP_FROM_LEFT : MOVE_UP_FROM_RIGHT; continue; } done = 1; } while (!done); This way there's no strict limit to how many preds we allow and it also will short circuit the logical operations when possible. Cc: Tom Zanussi <tzanussi@gmail.com> Signed-off-by: Steven Rostedt <rostedt@goodmis.org>	2011-02-07 20:56:19 -05:00
Steven Rostedt	f76690afd0	tracing/filter: Free pred array on disabling of filter When a filter is disabled, free the preds. Cc: Tom Zanussi <tzanussi@gmail.com> Signed-off-by: Steven Rostedt <rostedt@goodmis.org>	2011-02-07 20:56:18 -05:00
Steven Rostedt	74e9e58c35	tracing/filter: Allocate the preds in an array Currently we allocate an array of pointers to filter_preds, and then allocate a separate filter_pred for each item in the array. This adds slight overhead in the filters as it needs to derefernce twice to get to the op condition. Allocating the preds themselves in a single array removes a dereference as well as helps on the cache footprint. Cc: Tom Zanussi <tzanussi@gmail.com> Signed-off-by: Steven Rostedt <rostedt@goodmis.org>	2011-02-07 20:56:18 -05:00
Steven Rostedt	0fc3ca9a10	tracing/filter: Call synchronize_sched() just once for system filters By separating out the reseting of the filter->n_preds to zero from the reallocation of preds for the filter, we can reset groups of filters first, call synchronize_sched() just once, and then reallocate each of the filters in the system group. Cc: Tom Zanussi <tzanussi@gmail.com> Signed-off-by: Steven Rostedt <rostedt@goodmis.org>	2011-02-07 20:56:18 -05:00
Steven Rostedt	c9c53ca03d	tracing/filter: Dynamically allocate preds For every filter that is made, we create predicates to hold every operation within the filter. We have a max of 32 predicates that we can hold. Currently, we allocate all 32 even if we only need to use one. Part of the reason we do this is that the filter can be used at any moment by any event. Fortunately, the filter is only used with preemption disabled. By reseting the count of preds used "n_preds" to zero, then performing a synchronize_sched(), we can safely free and reallocate a new array of preds. Cc: Tom Zanussi <tzanussi@gmail.com> Signed-off-by: Steven Rostedt <rostedt@goodmis.org>	2011-02-07 20:56:18 -05:00
Steven Rostedt	58d9a597c4	tracing/filter: Move OR and AND logic out of fn() method The ops OR and AND act different from the other ops, as they are the only ones to take other ops as their arguements. These ops als change the logic of the filter_match_preds. By removing the OR and AND fn's we can also remove the val1 and val2 that is passed to all other fn's and are unused. Cc: Tom Zanussi <tzanussi@gmail.com> Signed-off-by: Steven Rostedt <rostedt@goodmis.org>	2011-02-07 20:56:18 -05:00
Steven Rostedt	6d54057d76	tracing/filter: Have no filter return a match The n_preds field of a file can change at anytime, and even can become zero, just as the filter is about to be processed by an event. In the case that is zero on entering the filter, return 1, telling the caller the event matchs and should be trace. Also use a variable and assign it with ACCESS_ONCE() such that the count stays consistent within the function. Cc: Tom Zanussi <tzanussi@gmail.com> Signed-off-by: Steven Rostedt <rostedt@goodmis.org>	2011-02-07 20:56:17 -05:00
Tetsuo Handa	fb2b2a1d37	CRED: Fix memory and refcount leaks upon security_prepare_creds() failure In prepare_kernel_cred() since 2.6.29, put_cred(new) is called without assigning new->usage when security_prepare_creds() returned an error. As a result, memory for new and refcount for new->{user,group_info,tgcred} are leaked because put_cred(new) won't call __put_cred() unless old->usage == 1. Fix these leaks by assigning new->usage (and new->subscribers which was added in 2.6.32) before calling security_prepare_creds(). Signed-off-by: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp> Signed-off-by: David Howells <dhowells@redhat.com> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2011-02-07 14:04:00 -08:00
Tetsuo Handa	2edeaa34a6	CRED: Fix BUG() upon security_cred_alloc_blank() failure In cred_alloc_blank() since 2.6.32, abort_creds(new) is called with new->security == NULL and new->magic == 0 when security_cred_alloc_blank() returns an error. As a result, BUG() will be triggered if SELinux is enabled or CONFIG_DEBUG_CREDENTIALS=y. If CONFIG_DEBUG_CREDENTIALS=y, BUG() is called from __invalid_creds() because cred->magic == 0. Failing that, BUG() is called from selinux_cred_free() because selinux_cred_free() is not expecting cred->security == NULL. This does not affect smack_cred_free(), tomoyo_cred_free() or apparmor_cred_free(). Fix these bugs by (1) Set new->magic before calling security_cred_alloc_blank(). (2) Handle null cred->security in creds_are_invalid() and selinux_cred_free(). Signed-off-by: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp> Signed-off-by: David Howells <dhowells@redhat.com> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2011-02-07 14:04:00 -08:00
Masami Hiramatsu	1ff511e35e	tracing/kprobes: Add bitfield type Add bitfield type for tracing arguments on kprobe-tracer. The syntax of a bitfield type is: b<bit-size>@<bit-offset>/<container-size> e.g. Accessing 2 bits-width field with 4 bits-offset in 32 bits-width data at 4 bytes offseted from the address pointed by AX register: +4(%ax):b2@4/32 Since the width of container data depends on the arch, so I just added the container-size at the end. Cc: 2nddept-manager@sdl.hitachi.co.jp Cc: Frederic Weisbecker <fweisbec@gmail.com> Cc: Ingo Molnar <mingo@elte.hu> Cc: Srikar Dronamraju <srikar@linux.vnet.ibm.com> Cc: Steven Rostedt <rostedt@goodmis.org> LKML-Reference: <20110204125205.9507.11363.stgit@ltc236.sdl.hitachi.co.jp> Signed-off-by: Masami Hiramatsu <masami.hiramatsu.pt@hitachi.com> Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com>	2011-02-07 09:18:11 -02:00
Masami Hiramatsu	e374536998	tracing/kprobes: Support longer (>128 bytes) command Expand command line buffer of kprobe-tracer to 4096 bytes. Reported-by: Arnaldo Carvalho de Melo <acme@ghostprotocols.net> Cc: 2nddept-manager@sdl.hitachi.co.jp Cc: Frederic Weisbecker <fweisbec@gmail.com> Cc: Ingo Molnar <mingo@elte.hu> Cc: Srikar Dronamraju <srikar@linux.vnet.ibm.com> Cc: Steven Rostedt <rostedt@goodmis.org> LKML-Reference: <20110204125159.9507.20895.stgit@ltc236.sdl.hitachi.co.jp> Signed-off-by: Masami Hiramatsu <masami.hiramatsu.pt@hitachi.com> Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com>	2011-02-07 09:15:59 -02:00
Masami Hiramatsu	76022db323	tracing/kprobes: Cleanup strict_strtol() using code Since strict_strtol() accepts minus digits started with '-', it doesn't need to invert after converting. Cc: 2nddept-manager@sdl.hitachi.co.jp Cc: Frederic Weisbecker <fweisbec@gmail.com> Cc: Ingo Molnar <mingo@elte.hu> Cc: Srikar Dronamraju <srikar@linux.vnet.ibm.com> Cc: Steven Rostedt <rostedt@goodmis.org> LKML-Reference: <20110204125153.9507.49335.stgit@ltc236.sdl.hitachi.co.jp> Signed-off-by: Masami Hiramatsu <masami.hiramatsu.pt@hitachi.com> Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com>	2011-02-07 09:13:36 -02:00
Ingo Molnar	c7f9a6f377	Merge branch 'linus' into perf/core Merge reason: Pick up perf fixes that are now upstream Signed-off-by: Ingo Molnar <mingo@elte.hu>	2011-02-07 08:44:26 +01:00
Linus Torvalds	f0adc82064	Merge branch 'timers-fixes-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip * 'timers-fixes-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip: lockdep, timer: Fix del_timer_sync() annotation RTC: Prevents a division by zero in kernel code.	2011-02-06 12:05:15 -08:00
Thomas Gleixner	285c1a2c3a	Merge branch 'irq/urgent' into irq/core Reason: Get mainline fixes integrated. Further patches conflict with them Signed-off-by: Thomas Gleixner <tglx@linutronix.de>	2011-02-05 21:50:05 +01:00
Ingo Molnar	862b6f62bf	Merge branch 'tip/perf/urgent-2' of git://git.kernel.org/pub/scm/linux/kernel/git/rostedt/linux-2.6-trace into perf/urgent	2011-02-04 19:02:53 +01:00
Peter Zijlstra	f266a5110d	lockdep, timer: Fix del_timer_sync() annotation Calling local_bh_enable() will want to actually start processing softirqs, which isn't a good idea since this can get called with IRQs disabled. Cure this by using _local_bh_enable() which doesn't start processing softirqs, and use raw_local_irq_save() to avoid any softirqs from happening without letting lockdep think IRQs are in fact disabled. Reported-by: Nick Bowler <nbowler@elliptictech.com> Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl> Reviewed-by: Yong Zhang <yong.zhang0@gmail.com> LKML-Reference: <20110203141548.039540914@chello.nl> Signed-off-by: Thomas Gleixner <tglx@linutronix.de>	2011-02-04 10:31:22 +01:00
Linus Torvalds	aba99437f5	Merge branch 'irq-fixes-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip * 'irq-fixes-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip: genirq: Prevent irq storm on migration	2011-02-03 09:17:41 -08:00
Linus Torvalds	49abda9892	Merge branch 'sched-fixes-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip * 'sched-fixes-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip: sched: Fix update_curr_rt() sched, docs: Update schedstats documentation to version 15	2011-02-03 08:55:07 -08:00
Linus Torvalds	eb487ab4d5	Merge branch 'perf-fixes-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip * 'perf-fixes-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip: perf: Fix reading in perf_event_read() watchdog: Don't change watchdog state on read of sysctl watchdog: Fix sysctl consistency watchdog: Fix broken nowatchdog logic perf: Fix Pentium4 raw event validation perf: Fix alloc_callchain_buffers()	2011-02-03 08:52:05 -08:00
Steven Rostedt	3d56e331b6	tracing: Replace syscall_meta_data struct array with pointer array Currently the syscall_meta structures for the syscall tracepoints are placed in the __syscall_metadata section, and at link time, the linker makes one large array of all these syscall metadata structures. On boot up, this array is read (much like the initcall sections) and the syscall data is processed. The problem is that there is no guarantee that gcc will place complex structures nicely together in an array format. Two structures in the same file may be placed awkwardly, because gcc has no clue that they are suppose to be in an array. A hack was used previous to force the alignment to 4, to pack the structures together. But this caused alignment issues with other architectures (sparc). Instead of packing the structures into an array, the structures' addresses are now put into the __syscall_metadata section. As pointers are always the natural alignment, gcc should always pack them tightly together (otherwise initcall, extable, etc would also fail). By having the pointers to the structures in the section, we can still iterate the trace_events without causing unnecessary alignment problems with other architectures, or depending on the current behaviour of gcc that will likely change in the future just to tick us kernel developers off a little more. The __syscall_metadata section is also moved into the .init.data section as it is now only needed at boot up. Suggested-by: David Miller <davem@davemloft.net> Acked-by: David S. Miller <davem@davemloft.net> Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com> Signed-off-by: Steven Rostedt <rostedt@goodmis.org>	2011-02-03 09:29:06 -05:00
Mathieu Desnoyers	6549864629	tracepoints: Fix section alignment using pointer array Make the tracepoints more robust, making them solid enough to handle compiler changes by not relying on anything based on compiler-specific behavior with respect to structure alignment. Implement an approach proposed by David Miller: use an array of const pointers to refer to the individual structures, and export this pointer array through the linker script rather than the structures per se. It will consume 32 extra bytes per tracepoint (24 for structure padding and 8 for the pointers), but are less likely to break due to compiler changes. History: commit `7e066fb8` tracepoints: add DECLARE_TRACE() and DEFINE_TRACE() added the aligned(32) type and variable attribute to the tracepoint structures to deal with gcc happily aligning statically defined structures on 32-byte multiples. One attempt was to use a 8-byte alignment for tracepoint structures by applying both the variable and type attribute to tracepoint structures definitions and declarations. It worked fine with gcc 4.5.1, but broke with gcc 4.4.4 and 4.4.5. The reason is that the "aligned" attribute only specify the _minimum_ alignment for a structure, leaving both the compiler and the linker free to align on larger multiples. Because tracepoint.c expects the structures to be placed as an array within each section, up-alignment cause NULL-pointer exceptions due to the extra unexpected padding. (this patch applies on top of -tip) Signed-off-by: Mathieu Desnoyers <mathieu.desnoyers@efficios.com> Acked-by: David S. Miller <davem@davemloft.net> LKML-Reference: <20110126222622.GA10794@Krystal> CC: Frederic Weisbecker <fweisbec@gmail.com> CC: Ingo Molnar <mingo@elte.hu> CC: Thomas Gleixner <tglx@linutronix.de> CC: Andrew Morton <akpm@linux-foundation.org> CC: Peter Zijlstra <peterz@infradead.org> CC: Rusty Russell <rusty@rustcorp.com.au> Signed-off-by: Steven Rostedt <rostedt@goodmis.org>	2011-02-03 09:28:46 -05:00
Mike Galbraith	d95f412200	sched: Add yield_to(task, preempt) functionality Currently only implemented for fair class tasks. Add a yield_to_task method() to the fair scheduling class. allowing the caller of yield_to() to accelerate another thread in it's thread group, task group. Implemented via a scheduler hint, using cfs_rq->next to encourage the target being selected. We can rely on pick_next_entity to keep things fair, so noone can accelerate a thread that has already used its fair share of CPU time. This also means callers should only call yield_to when they really mean it. Calling it too often can result in the scheduler just ignoring the hint. Signed-off-by: Rik van Riel <riel@redhat.com> Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com> Signed-off-by: Mike Galbraith <efault@gmx.de> Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl> LKML-Reference: <20110201095051.4ddb7738@annuminas.surriel.com> Signed-off-by: Ingo Molnar <mingo@elte.hu>	2011-02-03 14:20:33 +01:00
Rik van Riel	ac53db596c	sched: Use a buddy to implement yield_task_fair() Use the buddy mechanism to implement yield_task_fair. This allows us to skip onto the next highest priority se at every level in the CFS tree, unless doing so would introduce gross unfairness in CPU time distribution. We order the buddy selection in pick_next_entity to check yield first, then last, then next. We need next to be able to override yield, because it is possible for the "next" and "yield" task to be different processen in the same sub-tree of the CFS tree. When they are, we need to go into that sub-tree regardless of the "yield" hint, and pick the correct entity once we get to the right level. Signed-off-by: Rik van Riel <riel@redhat.com> Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl> LKML-Reference: <20110201095103.3a79e92a@annuminas.surriel.com> Signed-off-by: Ingo Molnar <mingo@elte.hu>	2011-02-03 14:20:33 +01:00
Rik van Riel	2c13c919d9	sched: Limit the scope of clear_buddies The clear_buddies function does not seem to play well with the concept of hierarchical runqueues. In the following tree, task groups are represented by 'G', tasks by 'T', next by 'n' and last by 'l'. (nl) / \ G(nl) G / \ \ T(l) T(n) T This situation can arise when a task is woken up T(n), and the previously running task T(l) is marked last. When clear_buddies is called from either T(l) or T(n), the next and last buddies of the group G(nl) will be cleared. This is not the desired result, since we would like to be able to find the other type of buddy in many cases. This especially a worry when implementing yield_task_fair through the buddy system. The fix is simple: only clear the buddy type that the task itself is indicated to be. As an added bonus, we stop walking up the tree when the buddy has already been cleared or pointed elsewhere. Signed-off-by: Rik van Riel <riel@redhat.coM> Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl> LKML-Reference: <20110201094837.6b0962a9@annuminas.surriel.com> Signed-off-by: Ingo Molnar <mingo@elte.hu>	2011-02-03 14:20:32 +01:00
Rik van Riel	725e7580aa	sched: Check the right ->nr_running in yield_task_fair() With CONFIG_FAIR_GROUP_SCHED, each task_group has its own cfs_rq. Yielding to a task from another cfs_rq may be worthwhile, since a process calling yield typically cannot use the CPU right now. Therefor, we want to check the per-cpu nr_running, not the cgroup local one. Signed-off-by: Rik van Riel <riel@redhat.com> Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl> LKML-Reference: <20110201094715.798c4f86@annuminas.surriel.com> Signed-off-by: Ingo Molnar <mingo@elte.hu>	2011-02-03 14:20:32 +01:00
Peter Zijlstra	06c3bc6556	sched: Fix update_curr_rt() cpu_stopper_thread() migration_cpu_stop() __migrate_task() deactivate_task() dequeue_task() dequeue_task_rq() update_curr_rt() Will call update_curr_rt() on rq->curr, which at that time is rq->stop. The problem is that rq->stop.prio matches an RT prio and thus falsely assumes its a rt_sched_class task. Reported-Debuged-Tested-Acked-by: Thomas Gleixner <tglx@linutronix.de> Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl> LKML-Reference: <new-submission> Cc: stable@kernel.org # .37 Signed-off-by: Ingo Molnar <mingo@elte.hu>	2011-02-03 12:21:33 +01:00
Peter Zijlstra	542e72fc90	perf: Fix reading in perf_event_read() It is quite possible for the event to have been disabled between perf_event_read() sending the IPI and the CPU servicing the IPI and calling __perf_event_read(), hence revalidate the state. Reported-by: Stephane Eranian <eranian@google.com> Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl> LKML-Reference: <new-submission> Signed-off-by: Ingo Molnar <mingo@elte.hu>	2011-02-03 12:15:46 +01:00
Peter Zijlstra	fe4b04fa31	perf: Cure task_oncpu_function_call() races Oleg reported that on architectures with __ARCH_WANT_INTERRUPTS_ON_CTXSW the IPI from task_oncpu_function_call() can land before perf_event_task_sched_in() and cause interesting situations for eg. perf_install_in_context(). This patch reworks the task_oncpu_function_call() interface to give a more usable primitive as well as rework all its users to hopefully be more obvious as well as remove the races. While looking at the code I also found a number of races against perf_event_task_sched_out() which can flip contexts between tasks so plug those too. Reported-and-reviewed-by: Oleg Nesterov <oleg@redhat.com> Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl> LKML-Reference: <new-submission> Signed-off-by: Ingo Molnar <mingo@elte.hu>	2011-02-03 12:14:43 +01:00
Steven Rostedt	e4a9ea5ee7	tracing: Replace trace_event struct array with pointer array Currently the trace_event structures are placed in the _ftrace_events section, and at link time, the linker makes one large array of all the trace_event structures. On boot up, this array is read (much like the initcall sections) and the events are processed. The problem is that there is no guarantee that gcc will place complex structures nicely together in an array format. Two structures in the same file may be placed awkwardly, because gcc has no clue that they are suppose to be in an array. A hack was used previous to force the alignment to 4, to pack the structures together. But this caused alignment issues with other architectures (sparc). Instead of packing the structures into an array, the structures' addresses are now put into the _ftrace_event section. As pointers are always the natural alignment, gcc should always pack them tightly together (otherwise initcall, extable, etc would also fail). By having the pointers to the structures in the section, we can still iterate the trace_events without causing unnecessary alignment problems with other architectures, or depending on the current behaviour of gcc that will likely change in the future just to tick us kernel developers off a little more. The _ftrace_event section is also moved into the .init.data section as it is now only needed at boot up. Suggested-by: David Miller <davem@davemloft.net> Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com> Acked-by: David S. Miller <davem@davemloft.net> Signed-off-by: Steven Rostedt <rostedt@goodmis.org>	2011-02-02 21:37:13 -05:00
Thomas Gleixner	f1a06390d0	genirq: Prevent irq storm on migration move_native_irq() masks and unmasks the interrupt line unconditionally, but the interrupt line might be masked due to a threaded oneshot handler in progress. Unmasking the line in that case can lead to interrupt storms. Observed on PREEMPT_RT. Originally-from: Ingo Molnar <mingo@elte.hu> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Cc: stable@kernel.org	2011-02-02 22:15:08 +01:00
Richard Cochran	0606f422b4	posix clocks: Introduce dynamic clocks This patch adds support for adding and removing posix clocks. The clock lifetime cycle is patterned after usb devices. Each clock is represented by a standard character device. In addition, the driver may optionally implement custom character device operations. The posix clock and timer system calls listed below now work with dynamic posix clocks, as well as the traditional static clocks. The following system calls are affected: - clock_adjtime (brand new syscall) - clock_gettime - clock_getres - clock_settime - timer_create - timer_delete - timer_gettime - timer_settime [ tglx: Adapted to the posix-timer cleanup. Moved clock_posix_dynamic to posix-clock.c and made all referenced functions static ] Signed-off-by: Richard Cochran <richard.cochran@omicron.at> Acked-by: John Stultz <johnstul@us.ibm.com> LKML-Reference: <20110201134420.164172635@linutronix.de> Signed-off-by: Thomas Gleixner <tglx@linutronix.de>	2011-02-02 15:28:20 +01:00
Thomas Gleixner	527087374f	posix-timers: Cleanup namespace Rename register_posix_clock() to posix_timers_register_clock(). That's what the function really does. As a side effect this cleans up the posix_clock namespace for the upcoming dynamic posix_clock infrastructure. Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Tested-by: Richard Cochran <richard.cochran@omicron.at> Cc: John Stultz <johnstul@us.ibm.com> LKML-Reference: <alpine.LFD.2.00.1102021222240.31804@localhost6.localdomain6>	2011-02-02 15:28:19 +01:00
Richard Cochran	81e294cba2	posix-timers: Add support for fd based clocks Extend the negative clockids which are currently used by posix cpu timers to encode the PID with a file descriptor based type which encodes the fd in the upper bits. Originally-from: Richard Cochran <richard.cochran@omicron.at> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Acked-by: John Stultz <johnstul@us.ibm.com> LKML-Reference: <20110201134420.062860200@linutronix.de>	2011-02-02 15:28:19 +01:00
Richard Cochran	f1f1d5ebd1	posix-timers: Introduce a syscall for clock tuning. A new syscall is introduced that allows tuning of a POSIX clock. The new call, clock_adjtime, takes two parameters, the clock ID and a pointer to a struct timex. Any ADJTIMEX(2) operation may be requested via this system call, but various POSIX clocks may or may not support tuning. [ tglx: Adapted to the posix-timer cleanup series. Avoid copy_to_user in the error case ] Signed-off-by: Richard Cochran <richard.cochran@omicron.at> Acked-by: John Stultz <johnstul@us.ibm.com> LKML-Reference: <20110201134419.869804645@linutronix.de> Signed-off-by: Thomas Gleixner <tglx@linutronix.de>	2011-02-02 15:28:19 +01:00
Richard Cochran	65f5d80bdf	time: Splitout compat timex accessors Split out the compat timex accessors into separate functions. Preparatory patch for a new syscall. [ tglx: Split that patch from Richards "posix-timers: Introduce a syscall for clock tuning.". Keeps the changes strictly separate ] Originally-from: Richard Cochran <richardcochran@gmail.com> Acked-by: John Stultz <johnstul@us.ibm.com> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> LKML-Reference: <20110201134419.772343089@linutronix.de>	2011-02-02 15:28:18 +01:00
Richard Cochran	094aa1881f	ntp: Add ADJ_SETOFFSET mode bit This patch adds a new mode bit into the timex structure. When set, the bit instructs the kernel to add the given time value to the current time. Signed-off-by: Richard Cochran <richard.cochran@omicron.at> Acked-by: John Stultz <johnstul@us.ibm.com> LKML-Reference: <20110201134320.688829863@linutronix.de> Signed-off-by: Thomas Gleixner <tglx@linutronix.de>	2011-02-02 15:28:18 +01:00
John Stultz	c528f7c6c2	time: Introduce timekeeping_inject_offset This adds a kernel-internal timekeeping interface to add or subtract a fixed amount from CLOCK_REALTIME. This makes it so kernel users or interfaces trying to do so do not have to read the time, then add an offset and then call settimeofday(), which adds some extra error in comparision to just simply adding the offset in the kernel timekeeping core. Signed-off-by: John Stultz <john.stultz@linaro.org> Signed-off-by: Richard Cochran <richard.cochran@omicron.at> LKML-Reference: <20110201134419.584311693@linutronix.de> Signed-off-by: Thomas Gleixner <tglx@linutronix.de>	2011-02-02 15:28:18 +01:00
Richard Cochran	0061748dd2	posix-timer: Update comment Pick the cleanup to the comment in posix-timers.c from Richards all in one conversion patch. Originally-from: Richard Cochran <richardcochran@gmail.com> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Acked-by: John Stultz <johnstul@us.ibm.com> LKML-Reference: <20110201134419.487708516@linutronix.de>	2011-02-02 15:28:18 +01:00
Thomas Gleixner	bc2c8ea483	posix-timers: Make posix-cpu-timers functions static All functions are accessed via clock_posix_cpu now. So make them static. Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Acked-by: John Stultz <johnstul@us.ibm.com> Tested-by: Richard Cochran <richard.cochran@omicron.at> LKML-Reference: <20110201134419.389755466@linutronix.de>	2011-02-02 15:28:17 +01:00
Thomas Gleixner	0aa3975f02	posix-timers: Remove CLOCK_DISPATCH leftovers All users gone. Remove the cruft. Huge thanks to Richard Cochran who tackled that maze first. Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Acked-by: John Stultz <johnstul@us.ibm.com> Tested-by: Richard Cochran <richard.cochran@omicron.at> LKML-Reference: <20110201134419.294620613@linutronix.de>	2011-02-02 15:28:17 +01:00
Thomas Gleixner	6761c6702e	posix-timers: Convert timer_delete() to clockid_to_kclock() Set the common function for CLOCK_MONOTONIC and CLOCK_REALTIME kclocks and use the new decoding function. Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Acked-by: John Stultz <johnstul@us.ibm.com> Tested-by: Richard Cochran <richard.cochran@omicron.at> LKML-Reference: <20110201134419.198999420@linutronix.de>	2011-02-02 15:28:17 +01:00
Thomas Gleixner	a7319fa253	posix-timers: Convert timer_gettime() to clockid_to_kclock() Set the common function for CLOCK_MONOTONIC and CLOCK_REALTIME kclocks and use the new decoding function. Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Acked-by: John Stultz <johnstul@us.ibm.com> Tested-by: Richard Cochran <richard.cochran@omicron.at> LKML-Reference: <20110201134419.101243181@linutronix.de>	2011-02-02 15:28:16 +01:00
Thomas Gleixner	27722df16e	posix-timers: Convert timer_settime() to clockid_to_kclock() Set the common function for CLOCK_MONOTONIC and CLOCK_REALTIME kclocks and use the new decoding function. Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Acked-by: John Stultz <johnstul@us.ibm.com> Tested-by: Richard Cochran <richard.cochran@omicron.at> LKML-Reference: <20110201134419.001863714@linutronix.de>	2011-02-02 15:28:16 +01:00
Thomas Gleixner	838394fbf9	posix-timers: Convert timer_create() to clockid_to_kclock() Setup timer_create for CLOCK_MONOTONIC and CLOCK_REALTIME kclocks and remove the no_timer_create() implementation. Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Acked-by: John Stultz <johnstul@us.ibm.com> Tested-by: Richard Cochran <richard.cochran@omicron.at> LKML-Reference: <20110201134418.903604289@linutronix.de>	2011-02-02 15:28:15 +01:00
Thomas Gleixner	ebaac757ac	posix-timers: Remove useless res field from k_clock The res member of kclock is only used by mmtimer.c, but even there it contains redundant information. Remove the field and fixup mmtimer. Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Acked-by: John Stultz <johnstul@us.ibm.com> Tested-by: Richard Cochran <richard.cochran@omicron.at> LKML-Reference: <20110201134418.808714587@linutronix.de>	2011-02-02 15:28:15 +01:00
Thomas Gleixner	e5e542eea9	posix-timers: Convert clock_getres() to clockid_to_kclock() Use the new kclock decoding. Fixup the fallout in mmtimer.c Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Acked-by: John Stultz <johnstul@us.ibm.com> Tested-by: Richard Cochran <richard.cochran@omicron.at> LKML-Reference: <20110201134418.709802797@linutronix.de>	2011-02-02 15:28:15 +01:00
Thomas Gleixner	4359ac0ace	posix-timers: Make clock_getres and clock_get mandatory Richard said: "I would think that we can require k_clocks to provide the read function. This could be checked and enforced in register_posix_clock()." Add checks for clock_getres and clock_get in the register function. Suggested-by: Richard Cochran <richardcochran@gmail.com> Cc: John Stultz <johnstul@us.ibm.com> Signed-off-by: Thomas Gleixner <tglx@linutronix.de>	2011-02-02 15:28:15 +01:00
Thomas Gleixner	4228577763	posix-timers: Convert clock_gettime() to clockid_to_kclock() Use the new kclock decoding mechanism and rename the misnomed common_clock_get() to posix_clock_realtime_get(). Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Acked-by: John Stultz <johnstul@us.ibm.com> Tested-by: Richard Cochran <richard.cochran@omicron.at> LKML-Reference: <20110201134418.611097203@linutronix.de>	2011-02-02 15:28:14 +01:00
Thomas Gleixner	26f9a4796a	posix-timers: Convert clock_settime to clockid_to_kclock() Use the new kclock decoding function in clock_settime and cleanup all kclocks which use the default functions. Rename the misnomed common_clock_set() to posix_clock_realtime_set(). Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Acked-by: John Stultz <johnstul@us.ibm.com> Tested-by: Richard Cochran <richard.cochran@omicron.at> LKML-Reference: <20110201134418.518851246@linutronix.de>	2011-02-02 15:28:14 +01:00
Thomas Gleixner	79c9da0d05	posix-cpu-timers: Remove the stub nanosleep functions CLOCK_THREAD_CPUTIME_ID implements stub functions for nanosleep and nanosleep_restart, which return -EINVAL. That return value is wrong. The correct return value is -ENOTSUP. Remove the stubs and let the new dispatch code return the correct error code. Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Acked-by: John Stultz <johnstul@us.ibm.com> Tested-by: Richard Cochran <richard.cochran@omicron.at> LKML-Reference: <20110201134418.422446502@linutronix.de>	2011-02-02 15:28:14 +01:00
Thomas Gleixner	3751f9f29b	posix-timers: Cleanup restart_block usage posix timers still use the legacy arg0-arg3 members of restart_block. Use restart_block.nanosleep instead Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Acked-by: John Stultz <johnstul@us.ibm.com> Tested-by: Richard Cochran <richard.cochran@omicron.at> LKML-Reference: <20110201134418.232288779@linutronix.de>	2011-02-02 15:28:13 +01:00
Thomas Gleixner	59bd5bc24a	posix-timers: Convert clock_nanosleep_restart to clockid_to_kclock() Use the new kclock decoding function in clock_nanosleep_restart. Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Acked-by: John Stultz <johnstul@us.ibm.com> Tested-by: Richard Cochran <richard.cochran@omicron.at> LKML-Reference: <20110201134418.131263211@linutronix.de>	2011-02-02 15:28:13 +01:00
Thomas Gleixner	a5cd288010	posix-timers: Convert clock_nanosleep to clockid_to_kclock() Use the new kclock decoding function in clock_nanosleep and cleanup all kclocks which use the default functions. Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Acked-by: John Stultz <johnstul@us.ibm.com> Tested-by: Richard Cochran <richard.cochran@omicron.at> LKML-Reference: <20110201134418.034175556@linutronix.de>	2011-02-02 15:28:13 +01:00
Thomas Gleixner	cc785ac22b	posix-timers: Introduce clockid_to_kclock() New function to find the kclock for a given clockid. Returns a pointer to clock_posix_cpu if clockid < 0. If clockid >= MAXCLOCK or if the clock_getres pointer is not set it returns NULL. For valid clocks it returns a pointer to the matching posix_clock. Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Cc: John Stultz <johnstul@us.ibm.com> Acked-by: Richard Cochran <richard.cochran@omicron.at> LKML-Reference: <20110201134417.938447839@linutronix.de>	2011-02-02 15:28:12 +01:00
Thomas Gleixner	1976945eea	posix-timers: Introduce clock_posix_cpu The CLOCK_DISPATCH() macro is a horrible magic. We call common functions if a function pointer is not set. That's just backwards. To support dynamic file decriptor based clocks we need to cleanup that dispatch logic. Create a k_clock struct clock_posix_cpu which has all the posix-cpu-timer functions filled in. After the cleanup the functions can be made static. Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Acked-by: John Stultz <johnstul@us.ibm.com> Tested-by: Richard Cochran <richard.cochran@omicron.at> LKML-Reference: <20110201134417.841974553@linutronix.de>	2011-02-02 15:28:12 +01:00
Thomas Gleixner	2fd1f04089	posix-timers: Cleanup struct initializers Cosmetic. No functional change Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Acked-by: John Stultz <johnstul@us.ibm.com> Tested-by: Richard Cochran <richard.cochran@omicron.at> LKML-Reference: <20110201134417.745627057@linutronix.de>	2011-02-02 15:28:12 +01:00
Thomas Gleixner	65da528d7c	posix-timers: Define nanosleep not supported error separate Define the conditional nanosleep not supported error value outside of do_posix_clock_nonanosleep(). Preparatory patch for further cleanups. Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Acked-by: John Stultz <johnstul@us.ibm.com> Tested-by: Richard Cochran <richard.cochran@omicron.at> LKML-Reference: <20110201134417.643486574@linutronix.de>	2011-02-02 15:28:11 +01:00
Richard Cochran	1e6d767924	time: Correct the settime parameters Both settimeofday() and clock_settime() promise with a 'const' attribute not to alter the arguments passed in. This patch adds the missing 'const' attribute into the various kernel functions implementing these calls. Signed-off-by: Richard Cochran <richard.cochran@omicron.at> Acked-by: John Stultz <johnstul@us.ibm.com> LKML-Reference: <20110201134417.545698637@linutronix.de> Signed-off-by: Thomas Gleixner <tglx@linutronix.de>	2011-02-02 15:28:11 +01:00
Lucian Adrian Grijincu	4916ca401e	security: remove unused security_sysctl hook The only user for this hook was selinux. sysctl routes every call through /proc/sys/. Selinux and other security modules use the file system checks for sysctl too, so no need for this hook any more. Signed-off-by: Lucian Adrian Grijincu <lucian.grijincu@gmail.com> Signed-off-by: Eric Paris <eparis@redhat.com>	2011-02-01 11:54:02 -05:00
Thomas Gleixner	7cf37e87dd	time: Fix legacy arch fallout The xtime/dotimer cleanup broke architectures which do not implement clockevents. Time to send out another __do_IRQ threat. Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Reported-by: Ingo Molnar <mingo@elte.hu> Cc: Torben Hohn <torbenh@gmx.de> Cc: Peter Zijlstra <peterz@infradead.org> Cc: johnstul@us.ibm.com Cc: yong.zhang0@gmail.com Cc: hch@infradead.org LKML-Reference: <20110127145905.23248.30458.stgit@localhost> Signed-off-by: Ingo Molnar <mingo@elte.hu>	2011-02-01 09:46:47 +01:00
Torben Hohn	e2830b5c1b	time: Make do_timer() and xtime_lock local to kernel/time/ All callers of do_timer() are converted to xtime_update(). The only users of xtime_lock are in kernel/time/. Make both local to kernel/time/ and remove them from the global header files. [ tglx: Reuse tick-internal.h instead of creating another local header file. Massaged changelog ] Signed-off-by: Torben Hohn <torbenh@gmx.de> Cc: Peter Zijlstra <peterz@infradead.org> Cc: johnstul@us.ibm.com Cc: yong.zhang0@gmail.com Cc: hch@infradead.org Signed-off-by: Thomas Gleixner <tglx@linutronix.de>	2011-01-31 19:26:50 +01:00
Thomas Gleixner	51563cd53c	Merge branch 'tip/rtmutex' of git://git.kernel.org/pub/scm/linux/kernel/git/rostedt/linux-2.6-trace into core/locking *git://git.kernel.org/pub/scm/linux/kernel/git/rostedt/linux-2.6-trace tip/rtmutex: rtmutex: Simplify PI algorithm and make highest prio task get lock	2011-01-31 15:09:14 +01:00
Torben Hohn	f0af911a9d	time: Provide xtime_update() xtime_update() takes xtime_lock write locked and calls do_timer(). Provided to replace the do_timer() calls in the architecture code. Signed-off-by: Torben Hohn <torbenh@gmx.de> Cc: Peter Zijlstra <peterz@infradead.org> Cc: johnstul@us.ibm.com Cc: yong.zhang0@gmail.com Cc: hch@infradead.org LKML-Reference: <20110127145910.23248.21379.stgit@localhost> Signed-off-by: Thomas Gleixner <tglx@linutronix.de>	2011-01-31 14:55:43 +01:00
Thomas Gleixner	79ecaf0d15	time: Remove unused __get_wall_to_monotonic() No users left. Remove it. Signed-off-by: Thomas Gleixner <tglx@linutronix.de>	2011-01-31 14:55:43 +01:00
Torben Hohn	48cf76f710	time: Provide get_xtime_and_monotonic_offset() The hrtimer code accesses timekeeping variables under xtime_lock. Provide a sensible accessor function and use it. [ tglx: Removed the conditionals, unused variable, fixed codingstyle and massaged changelog ] Signed-off-by: Torben Hohn <torbenh@gmx.de> Cc: Peter Zijlstra <peterz@infradead.org> Cc: johnstul@us.ibm.com Cc: yong.zhang0@gmail.com Cc: hch@infradead.org LKML-Reference: <20110127145905.23248.30458.stgit@localhost> Signed-off-by: Thomas Gleixner <tglx@linutronix.de>	2011-01-31 14:55:42 +01:00
Torben Hohn	fbad1ea941	time: Move get_jiffies_64 to kernel/time/jiffies.c Move the jiffies access functions to the jiffies clocksource code. [ tglx: Add missing include ] Signed-off-by: Torben Hohn <torbenh@gmx.de> Cc: Peter Zijlstra <peterz@infradead.org> Cc: johnstul@us.ibm.com Cc: yong.zhang0@gmail.com Cc: hch@infradead.org LKML-Reference: <20110127145900.23248.73352.stgit@localhost> Signed-off-by: Thomas Gleixner <tglx@linutronix.de>	2011-01-31 14:55:42 +01:00
Torben Hohn	871cf1e5f2	time: Move do_timer() to kernel/time/timekeeping.c do_timer() is primary timekeeping related. calc_global_load() is called from do_timer() as well, but that's more for historical reasons. [ tglx: Fixed up the calc_global_load() reject andmassaged changelog ] Signed-off-by: Torben Hohn <torbenh@gmx.de> Cc: Peter Zijlstra <peterz@infradead.org> Cc: johnstul@us.ibm.com Cc: yong.zhang0@gmail.com Cc: hch@infradead.org LKML-Reference: <20110127145855.23248.56933.stgit@localhost> Signed-off-by: Thomas Gleixner <tglx@linutronix.de>	2011-01-31 14:55:41 +01:00
Marcin Slusarz	9ffdc6c37d	watchdog: Don't change watchdog state on read of sysctl Signed-off-by: Marcin Slusarz <marcin.slusarz@gmail.com> [ add {}'s to fix a warning ] Signed-off-by: Don Zickus <dzickus@redhat.com> Cc: Peter Zijlstra <a.p.zijlstra@chello.nl> Cc: Frederic Weisbecker <fweisbec@gmail.com> Cc: <stable@kernel.org> LKML-Reference: <1296230433-6261-3-git-send-email-dzickus@redhat.com> Signed-off-by: Ingo Molnar <mingo@elte.hu>	2011-01-31 13:22:43 +01:00
Marcin Slusarz	397357666d	watchdog: Fix sysctl consistency If it was not possible to enable watchdog for any cpu, switch watchdog_enabled back to 0, because it's visible via kernel.watchdog sysctl. Signed-off-by: Marcin Slusarz <marcin.slusarz@gmail.com> Signed-off-by: Don Zickus <dzickus@redhat.com> Cc: Peter Zijlstra <a.p.zijlstra@chello.nl> Cc: Frederic Weisbecker <fweisbec@gmail.com> Cc: <stable@kernel.org> LKML-Reference: <1296230433-6261-2-git-send-email-dzickus@redhat.com> Signed-off-by: Ingo Molnar <mingo@elte.hu>	2011-01-31 13:22:43 +01:00
Marcin Slusarz	4135038a58	watchdog: Fix broken nowatchdog logic Passing nowatchdog to kernel disables 2 things: creation of watchdog threads AND initialization of percpu watchdog_hrtimer. As hrtimers are initialized only at boot it's not possible to enable watchdog later - for me all watchdog threads started to eat 100% of CPU time, but they could just crash. Additionally, even if these threads would start properly, watchdog_disable_all_cpus was guarded by no_watchdog check, so you couldn't disable watchdog. To fix this, remove no_watchdog variable and use already existing watchdog_enabled variable. Signed-off-by: Marcin Slusarz <marcin.slusarz@gmail.com> [ removed another no_watchdog instance ] Signed-off-by: Don Zickus <dzickus@redhat.com> Cc: Stephane Eranian <eranian@google.com> Cc: Peter Zijlstra <a.p.zijlstra@chello.nl> Cc: Frederic Weisbecker <fweisbec@gmail.com> Cc: <stable@kernel.org> LKML-Reference: <1296230433-6261-1-git-send-email-dzickus@redhat.com> Signed-off-by: Ingo Molnar <mingo@elte.hu>	2011-01-31 13:22:42 +01:00
Thomas Gleixner	1fb0ef31f4	genirq: Fix affinity notifier fallout The new code of commit cd7eab44e(genirq: Add IRQ affinity notifiers) references irq_desc.affinity which fails to compile with CONFIG_GENERIC_HARDIRQS_NO_DEPRECATED=y. Use irq_desc.irq_data.affinity instead. Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Cc: Ben Hutchings <bhutchings@solarflare.com>	2011-01-31 08:57:41 +01:00
Kacper Kornet	aa5bd67dcf	Fix prlimit64 for suid/sgid processes Since check_prlimit_permission always fails in the case of SUID/GUID processes, such processes are not able to read or set their own limits. This commit changes this by assuming that process can always read/change its own limits. Signed-off-by: Kacper Kornet <kornet@camk.edu.pl> Acked-by: Jiri Slaby <jslaby@suse.cz> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2011-01-31 13:01:27 +10:00
Lai Jiangshan	8161239a8b	rtmutex: Simplify PI algorithm and make highest prio task get lock In current rtmutex, the pending owner may be boosted by the tasks in the rtmutex's waitlist when the pending owner is deboosted or a task in the waitlist is boosted. This boosting is unrelated, because the pending owner does not really take the rtmutex. It is not reasonable. Example. time1: A(high prio) onwers the rtmutex. B(mid prio) and C (low prio) in the waitlist. time2 A release the lock, B becomes the pending owner A(or other high prio task) continues to run. B's prio is lower than A, so B is just queued at the runqueue. time3 A or other high prio task sleeps, but we have passed some time The B and C's prio are changed in the period (time2 ~ time3) due to boosting or deboosting. Now C has the priority higher than B. **Is it reasonable that C has to boost B and help B to get the rtmutex? NO!! I think, it is unrelated/unneed boosting before B really owns the rtmutex. We should give C a chance to beat B and win the rtmutex. This is the motivation of this patch. This patch ensures* only the top waiter or higher priority task can take the lock. How? 1) we don't dequeue the top waiter when unlock, if the top waiter is changed, the old top waiter will fail and go to sleep again. 2) when requiring lock, it will get the lock when the lock is not taken and: there is no waiter OR higher priority than waiters OR it is top waiter. 3) In any time, the top waiter is changed, the top waiter will be woken up. The algorithm is much simpler than before, no pending owner, no boosting for pending owner. Other advantage of this patch: 1) The states of a rtmutex are reduced a half, easier to read the code. 2) the codes become shorter. 3) top waiter is not dequeued until it really take the lock: they will retain FIFO when it is stolen. Not advantage nor disadvantage 1) Even we may wakeup multiple waiters(any time when top waiter changed), we hardly cause "thundering herd", the number of wokenup task is likely 1 or very little. 2) two APIs are changed. rt_mutex_owner() will not return pending owner, it will return NULL when the top waiter is going to take the lock. rt_mutex_next_owner() always return the top waiter. will not return NULL if we have waiters because the top waiter is not dequeued. I have fixed the code that use these APIs. need updated after this patch is accepted 1) Document/* 2) the testcase scripts/rt-tester/t4-l2-pi-deboost.tst Signed-off-by: Lai Jiangshan <laijs@cn.fujitsu.com> LKML-Reference: <4D3012D5.4060709@cn.fujitsu.com> Reviewed-by: Steven Rostedt <rostedt@goodmis.org> Signed-off-by: Steven Rostedt <rostedt@goodmis.org>	2011-01-27 21:13:51 -05:00
Linus Torvalds	bffb276fff	Merge branch 'sched-fixes-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip * 'sched-fixes-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip: sched: Use rq->clock_task instead of rq->clock for correctly maintaining load averages sched: Fix/remove redundant cfs_rq checks sched: Fix sign under-flows in wake_affine	2011-01-28 06:45:04 +10:00
Eric Dumazet	88d4f0db7f	perf: Fix alloc_callchain_buffers() Commit `927c7a9e92` ("perf: Fix race in callchains") introduced a mismatch in the sizing of struct callchain_cpus_entries. nr_cpu_ids must be used instead of num_possible_cpus(), or we might get out of bound memory accesses on some machines. Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com> Cc: Frederic Weisbecker <fweisbec@gmail.com> Cc: Peter Zijlstra <a.p.zijlstra@chello.nl> Cc: Arnaldo Carvalho de Melo <acme@redhat.com> Cc: David Miller <davem@davemloft.net> Cc: Stephane Eranian <eranian@google.com> CC: stable@kernel.org LKML-Reference: <1295980851.3588.351.camel@edumazet-laptop> Signed-off-by: Ingo Molnar <mingo@elte.hu>	2011-01-27 19:21:50 +01:00
Peter Zijlstra	6ea72f1206	sched: Avoid expensive initial update_cfs_load(), on UP too Fix the build on UP. Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl> Cc: Paul Turner <pjt@google.com> LKML-Reference: <20110122044852.102126037@google.com> Signed-off-by: Ingo Molnar <mingo@elte.hu>	2011-01-27 12:48:14 +01:00
Thomas Gleixner	10389a15e2	cred: Replace deprecated spinlock initialization SPIN_LOCK_UNLOCK is deprecated. Use the lockdep capable variant instead. Signed-off-by: Thomas Gleixner <tglx@linutronix.de>	2011-01-27 12:30:37 +01:00
Thomas Gleixner	f97b12cce6	Merge commit 'v2.6.38-rc2' into core/locking Reason: Update to mainline before adding the locking cleanup Signed-off-by: Thomas Gleixner <tglx@linutronix.de>	2011-01-27 12:29:37 +01:00
Arnd Bergmann	ccaa8d6571	rtmutex-tester: Remove BKL tests The BKL is going away, no need to test it any more. I left the definitions of the test case numbers in, so that the other tests do not get renumbered. Signed-off-by: Arnd Bergmann <arnd@arndb.de> Cc: Arjan van de Ven <arjan@linux.intel.com> Cc: Ingo Molnar <mingo@elte.hu> Cc: Andrew Morton <akpm@linux-foundation.org> LKML-Reference: <1295993854-4971-19-git-send-email-arnd@arndb.de> Signed-off-by: Thomas Gleixner <tglx@linutronix.de>	2011-01-26 15:58:12 +01:00
Peter Zijlstra	da7a735e51	sched: Fix switch_from_fair() When a task is taken out of the fair class we must ensure the vruntime is properly normalized because when we put it back in it will assume to be normalized. The case that goes wrong is when changing away from the fair class while sleeping. Sleeping tasks have non-normalized vruntime in order to make sleeper-fairness work. So treat the switch away from fair as a wakeup and preserve the relative vruntime. Also update sysrq-n to call the ->switch_{to,from} methods. Reported-by: Onkalo Samu <samu.p.onkalo@nokia.com> Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl> LKML-Reference: <new-submission> Signed-off-by: Ingo Molnar <mingo@elte.hu>	2011-01-26 12:33:22 +01:00
Peter Zijlstra	a8941d7ec8	sched: Simplify the idle scheduling class Since commit `48c5ccae88` (sched: Simplify cpu-hot-unplug task migration) this should no longer happen, so remove the code. Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl> LKML-Reference: <new-submission> Signed-off-by: Ingo Molnar <mingo@elte.hu>	2011-01-26 12:33:22 +01:00
Venkatesh Pallipadi	414bee9ba6	softirqs: Account ksoftirqd time as cpustat softirq softirq time in ksoftirqd context is not accounted in ns granularity per cpu softirq stats, as we want that to be a part of ksoftirqd exec_runtime. Accounting them as softirq on /proc/stat separately. Tested-by: Shaun Ruffell <sruffell@digium.com> Signed-off-by: Venkatesh Pallipadi <venki@google.com> Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl> LKML-Reference: <1292980144-28796-6-git-send-email-venki@google.com> Signed-off-by: Ingo Molnar <mingo@elte.hu>	2011-01-26 12:33:22 +01:00
Venkatesh Pallipadi	abb74cefa9	sched: Export ns irqtimes through /proc/stat CONFIG_IRQ_TIME_ACCOUNTING adds ns granularity irq time on each CPU. This info is already used in scheduler to do proper task chargeback (earlier patches). This patch retro-fits this ns granularity hardirq and softirq information to /proc/stat irq and softirq fields. The update is still done on timer tick, where we look at accumulated ns hardirq/softirq time and account the tick to user/system/irq/hardirq/guest accordingly. No new interface added. Earlier versions looked at adding this as new fields in some /proc files. This one seems to be the best in terms of impact to existing apps, even though it has somewhat more kernel code than earlier versions. Tested-by: Shaun Ruffell <sruffell@digium.com> Signed-off-by: Venkatesh Pallipadi <venki@google.com> Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl> LKML-Reference: <1292980144-28796-5-git-send-email-venki@google.com> Signed-off-by: Ingo Molnar <mingo@elte.hu>	2011-01-26 12:33:21 +01:00
Venkatesh Pallipadi	70a89a6620	sched: Refactor account_system_time separating id-update Refactor account_system_time, to separate out the logic of identifying the update needed and code that does actual update. This is used by following patch for IRQ_TIME_ACCOUNTING, which has different identification logic and same update logic. Tested-by: Shaun Ruffell <sruffell@digium.com> Signed-off-by: Venkatesh Pallipadi <venki@google.com> Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl> LKML-Reference: <1292980144-28796-4-git-send-email-venki@google.com> Signed-off-by: Ingo Molnar <mingo@elte.hu>	2011-01-26 12:33:21 +01:00
Venkatesh Pallipadi	a1dabb6bff	time: Add nsecs_to_cputime64 interface for asm-generic Add nsecs_to_cputime64 interface. This is used in following patches that updates cpu irq stat based on ns granularity info in IRQ_TIME_ACCOUNTING. Tested-by: Shaun Ruffell <sruffell@digium.com> Signed-off-by: Venkatesh Pallipadi <venki@google.com> Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl> LKML-Reference: <1292980144-28796-3-git-send-email-venki@google.com> Signed-off-by: Ingo Molnar <mingo@elte.hu>	2011-01-26 12:33:20 +01:00
Venkatesh Pallipadi	4dd53d891c	softirqs: Free up pf flag PF_KSOFTIRQD Cleanup patch, freeing up PF_KSOFTIRQD and use per_cpu ksoftirqd pointer instead, as suggested by Eric Dumazet. Tested-by: Shaun Ruffell <sruffell@digium.com> Signed-off-by: Venkatesh Pallipadi <venki@google.com> Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl> LKML-Reference: <1292980144-28796-2-git-send-email-venki@google.com> Signed-off-by: Ingo Molnar <mingo@elte.hu>	2011-01-26 12:33:20 +01:00
Paul Turner	f07333bf6e	sched: Avoid expensive initial update_cfs_load() Since cfs->{load_stamp,load_last} are zero-initalized the initial load update will consider the delta to be 'since the beginning of time'. This results in a lot of pointless divisions to bring this large period to be within the sysctl_sched_shares_window. Fix this by initializing load_stamp to be 1 at cfs_rq initialization, this allows for an initial load_stamp > load_last which then lets standard idle truncation proceed. We avoid spinning (and slightly improve consistency) by fixing delta to be [period - 1] in this path resulting in a slightly more predictable shares ramp. (Previously the amount of idle time preserved by the overflow would range between [period/2,period-1].) Signed-off-by: Paul Turner <pjt@google.com> Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl> LKML-Reference: <20110122044852.102126037@google.com> Signed-off-by: Ingo Molnar <mingo@elte.hu>	2011-01-26 12:33:19 +01:00
Paul Turner	6d5ab2932a	sched: Simplify update_cfs_shares parameters Re-visiting this: Since update_cfs_shares will now only ever re-weight an entity that is a relative parent of the current entity in enqueue_entity; we can safely issue the account_entity_enqueue relative to that cfs_rq and avoid the requirement for special handling of the enqueue case in update_cfs_shares. Signed-off-by: Paul Turner <pjt@google.com> Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl> LKML-Reference: <20110122044851.915214637@google.com> Signed-off-by: Ingo Molnar <mingo@elte.hu>	2011-01-26 12:33:19 +01:00
Paul Turner	05ca62c6ca	sched: Use rq->clock_task instead of rq->clock for correctly maintaining load averages The delta in clock_task is a more fair attribution of how much time a tg has been contributing load to the current cpu. While not really important it also means we're more in sync (by magnitude) with respect to periodic updates (since __update_curr deltas are clock_task based). Signed-off-by: Paul Turner <pjt@google.com> Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl> LKML-Reference: <20110122044852.007092349@google.com> Signed-off-by: Ingo Molnar <mingo@elte.hu>	2011-01-26 12:31:03 +01:00
Paul Turner	b815f1963e	sched: Fix/remove redundant cfs_rq checks Since updates are against an entity's queuing cfs_rq it's not possible to enter update_cfs_{shares,load} with a NULL cfs_rq. (Indeed, update_cfs_load would crash prior to the check if we did anyway since we load is examined during the initializers). Also, in the update_cfs_load case there's no point in maintaining averages for rq->cfs_rq since we don't perform shares distribution at that level -- NULL check is replaced accordingly. Thanks to Dan Carpenter for pointing out the deference before NULL check. Signed-off-by: Paul Turner <pjt@google.com> Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl> LKML-Reference: <20110122044851.825284940@google.com> Signed-off-by: Ingo Molnar <mingo@elte.hu>	2011-01-26 12:31:02 +01:00
Paul Turner	e37b6a7b27	sched: Fix sign under-flows in wake_affine While care is taken around the zero-point in effective_load to not exceed the instantaneous rq->weight, it's still possible (e.g. using wake_idx != 0) for (load + effective_load) to underflow. In this case the comparing the unsigned values can result in incorrect balanced decisions. Signed-off-by: Paul Turner <pjt@google.com> Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl> LKML-Reference: <20110122044851.734245014@google.com> Signed-off-by: Ingo Molnar <mingo@elte.hu>	2011-01-26 12:31:01 +01:00
Linus Torvalds	6fb1b30425	Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/dtor/input * 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/dtor/input: Input: wacom - pass touch resolution to clients through input_absinfo Input: wacom - add 2 Bamboo Pen and touch models Input: sysrq - ensure sysrq_enabled and __sysrq_enabled are consistent Input: sparse-keymap - fix KEY_VSW handling in sparse_keymap_setup Input: tegra-kbc - add tegra keyboard driver Input: gpio_keys - switch to using request_any_context_irq Input: serio - allow registered drivers to get status flag Input: ct82710c - return proper error code for ct82c710_open Input: bu21013_ts - added regulator support Input: bu21013_ts - remove duplicate resolution parameters Input: tnetv107x-ts - don't treat NULL clk as an error Input: tnetv107x-keypad - don't treat NULL clk as an error Fix up trivial conflicts in drivers/input/keyboard/Makefile due to additions of tc3589x/Tegra drivers	2011-01-26 16:31:44 +10:00
Torben Hohn	ac751efa6a	console: rename acquire/release_console_sem() to console_lock/unlock() The -rt patches change the console_semaphore to console_mutex. As a result, a quite large chunk of the patches changes all acquire/release_console_sem() to acquire/release_console_mutex() This commit makes things use more neutral function names which dont make implications about the underlying lock. The only real change is the return value of console_trylock which is inverted from try_acquire_console_sem() This patch also paves the way to switching console_sem from a semaphore to a mutex. [akpm@linux-foundation.org: coding-style fixes] [akpm@linux-foundation.org: make console_trylock return 1 on success, per Geert] Signed-off-by: Torben Hohn <torbenh@gmx.de> Cc: Thomas Gleixner <tglx@tglx.de> Cc: Greg KH <gregkh@suse.de> Cc: Ingo Molnar <mingo@elte.hu> Cc: Geert Uytterhoeven <geert@linux-m68k.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2011-01-26 10:50:06 +10:00
David S. Miller	c2df88cbb4	Merge branch 'irq/numa' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip	2011-01-24 14:30:13 -08:00
David S. Miller	5bdc22a565	Merge branch 'master' of master.kernel.org:/pub/scm/linux/kernel/git/davem/net-2.6 Conflicts: net/sched/sch_hfsc.c net/sched/sch_htb.c net/sched/sch_tbf.c	2011-01-24 14:09:35 -08:00

... 3 4 5 6 7 ...

11406 Commits