linux_old1

Commit Graph

Author	SHA1	Message	Date
Linus Torvalds	90d3417a3a	Merge branch 'modules' of git://git.kernel.org/pub/scm/linux/kernel/git/rusty/linux-2.6-for-linus * 'modules' of git://git.kernel.org/pub/scm/linux/kernel/git/rusty/linux-2.6-for-linus: module: cleanup comments, remove noinline module: group post-relocation functions into post_relocation() module: move module args strndup_user to just before use module: pass load_info into other functions module: fix sysfs cleanup for !CONFIG_SYSFS module: sysfs cleanup module: layout_and_allocate module: fix crash in get_ksymbol() when oopsing in module init module: kallsyms functions take struct load_info module: refactor out section header rewriting: FIX modversions module: refactor out section header rewriting module: add load_info module: reduce stack usage for each_symbol() module: refactor load_module part 5 module: refactor load_module part 4 module: refactor load_module part 3 module: refactor load_module part 2 module: refactor load_module module: module_unload_init() cleanup	2010-08-05 13:49:55 -07:00
Linus Torvalds	cdd854bc42	Merge branch 'next' of git://git.kernel.org/pub/scm/linux/kernel/git/benh/powerpc * 'next' of git://git.kernel.org/pub/scm/linux/kernel/git/benh/powerpc: (79 commits) powerpc/8xx: Add support for the MPC8xx based boards from TQC powerpc/85xx: Introduce support for the Freescale P1022DS reference board powerpc/85xx: Adding DTS for the STx GP3-SSA MPC8555 board powerpc/85xx: Change deprecated binding for 85xx-based boards powerpc/tqm85xx: add a quirk for ti1520 PCMCIA bridge powerpc/tqm85xx: update PCI interrupt-map attribute powerpc/mpc8308rdb: support for MPC8308RDB board from Freescale powerpc/fsl_pci: add quirk for mpc8308 pcie bridge powerpc/85xx: Cleanup QE initialization for MPC85xxMDS boards powerpc/85xx: Fix booting for P1021MDS boards powerpc/85xx: Fix SWIOTLB initalization for MPC85xxMDS boards powerpc/85xx: kexec for SMP 85xx BookE systems powerpc/5200/i2c: improve i2c bus error recovery of/xilinxfb: update tft compatible versions powerpc/fsl-diu-fb: Support setting display mode using EDID powerpc/5121: doc/dts-bindings: update doc of FSL DIU bindings powerpc/5121: shared DIU framebuffer support powerpc/5121: move fsl-diu-fb.h to include/linux powerpc/5121: fsl-diu-fb: fix issue with re-enabling DIU area descriptor powerpc/512x: add clock structure for Video-IN (VIU) unit ...	2010-08-05 09:03:46 -07:00
Linus Torvalds	c3d1f1746b	Merge branch 'upstream' of git://git.linux-mips.org/pub/scm/upstream-linus * 'upstream' of git://git.linux-mips.org/pub/scm/upstream-linus: (150 commits) MIPS: PowerTV: Separate PowerTV USB support from non-USB code MIPS: strip the un-needed sections of vmlinuz MIPS: Clean up the calculation of VMLINUZ_LOAD_ADDRESS MIPS: Clean up arch/mips/boot/compressed/decompress.c MIPS: Clean up arch/mips/boot/compressed/ld.script MIPS: Unify the suffix of compressed vmlinux.bin MIPS: PowerTV: Add Gaia platform definitions. MIPS: BCM47xx: Fix nvram_getenv return value. MIPS: Octeon: Allow more than 3.75GB of memory with PCIe MIPS: Clean up notify_die() usage. MIPS: Remove unused task_struct.trap_no field. Documentation: Mention that KProbes is supported on MIPS SAMPLES: kprobe_example: Make it print something on MIPS. MIPS: kprobe: Add support. MIPS: Add instrunction format for BREAK and SYSCALL MIPS: kprobes: Define regs_return_value() MIPS: Ritually kill stupid printk. MIPS: Octeon: Disallow MSI-X interrupt and fall back to MSI interrupts. MIPS: Octeon: Support 256 MSI on PCIe MIPS: Decode core number for R2 CPUs. ...	2010-08-05 08:53:20 -07:00
Jason Wessel	81d4450732	vt,console,kdb: automatically set kdb LINES variable The kernel console interface stores the number of lines it is configured to use. The kdb debugger can greatly benefit by knowing how many lines there are on the console for the pager functionality without having the end user compile in the setting or have to repeatedly change it at run time. Signed-off-by: Jason Wessel <jason.wessel@windriver.com> Signed-off-by: Jesse Barnes <jbarnes@virtuousgeek.org> CC: David Airlie <airlied@linux.ie> CC: Andrew Morton <akpm@linux-foundation.org>	2010-08-05 09:22:30 -05:00
Jason Wessel	3fa43aba08	debug_core,kdb: fix crash when arch does not have single step When an arch such as mips and microblaze does not implement either HW or software single stepping the debug core should re-enter kdb. The kdb code will properly ignore the single step operation. Attempting to single step the kernel without software or hardware support causes unpredictable kernel crashes. Signed-off-by: Jason Wessel <jason.wessel@windriver.com>	2010-08-05 09:22:25 -05:00
Jason Wessel	19063c776f	ftrace,kdb: Allow dumping a specific cpu's buffer with ftdump In systems with more than one processor it is desirable to look at the per cpu trace buffers. Signed-off-by: Jason Wessel <jason.wessel@windriver.com> Acked-by: Steven Rostedt <rostedt@goodmis.org> CC: Frederic Weisbecker <fweisbec@gmail.com>	2010-08-05 09:22:23 -05:00
Jason Wessel	955b61e597	ftrace,kdb: Extend kdb to be able to dump the ftrace buffer Add in a helper function to allow the kdb shell to dump the ftrace buffer. Modify trace.c to expose the capability to iterate over the ftrace buffer in a read only capacity. Signed-off-by: Jason Wessel <jason.wessel@windriver.com> Acked-by: Steven Rostedt <rostedt@goodmis.org> CC: Frederic Weisbecker <fweisbec@gmail.com>	2010-08-05 09:22:23 -05:00
Jason Wessel	6d855b1d83	gdbstub: do not directly use dbg_reg_def[] in gdb_cmd_reg_set() Presently the usable registers definitions on x86 are not contiguous for kgdb. The x86 kgdb uses a case statement for the sparse register accesses. The array which defines the registers (dbg_reg_def) should not be used directly in order to safely work with sparse register definitions. Specifically there was a problem when gdb accesses ORIG_AX, which is accessed only through the case statement. This patch encodes register memory using the size information provided from the debugger which avoids the need to look up the size of the register. The dbg_set_reg() function always further validates the inputs from the debugger. Signed-off-by: Jason Wessel <jason.wessel@windriver.com> Signed-off-by: Dongdong Deng <dongdong.deng@windriver.com>	2010-08-05 09:22:22 -05:00
Jason Wessel	55751145dc	gdbstub: Implement gdbserial 'p' and 'P' packets The gdbserial 'p' and 'P' packets allow gdb to individually get and set registers instead of querying for all the available registers. Signed-off-by: Jason Wessel <jason.wessel@windriver.com>	2010-08-05 09:22:21 -05:00
Jason Wessel	534af10823	kgdb,kdb: individual register set and and get API The kdb shell specification includes the ability to get and set architecture specific registers by name. For the time being individual register get and set will be implemented on a per architecture basis. If an architecture defines DBG_MAX_REG_NUM > 0 then kdb and the gdbstub will use the capability for individually getting and setting architecture specific registers. Signed-off-by: Jason Wessel <jason.wessel@windriver.com>	2010-08-05 09:22:20 -05:00
Jason Wessel	84a0bd5b28	gdbstub: Optimize kgdb's "thread:" response for the gdb serial protocol The gdb debugger understands how to parse short versions of the thread reference string as long as the bytes are paired in sets of two characters. The kgdb implementation was always sending 8 leading zeros which could be omitted, and further optimized in the case of non-negative thread numbers. The negative numbers are used to reference a specific cpu in the case of kgdb. An example of the previous i386 stop packet looks like: T05thread:00000000000003bb; New stop packet response: T05thread:03bb; The previous ThreadInfo response looks like: m00000000fffffffe,0000000000000001,0000000000000002,0000000000000003,0000000000000004,0000000000000005,0000000000000006,0000000000000007,000000000000000c,0000000000000088,000000000000008a,000000000000008b,000000000000008c,000000000000008d,000000000000008e,00000000000000d4,00000000000000d5,00000000000000dd New ThreadInfo response: mfffffffe,01,02,03,04,05,06,07,0c,88,8a,8b,8c,8d,8e,d4,d5,dd A few bytes saved means better response time when using kgdb over a serial line. Signed-off-by: Jason Wessel <jason.wessel@windriver.com>	2010-08-05 09:22:19 -05:00
Andy Shevchenko	a9fa20a7af	kgdb: remove custom hex_to_bin()implementation Signed-off-by: Andy Shevchenko <ext-andriy.shevchenko@nokia.com> Signed-off-by: Jason Wessel <jason.wessel@windriver.com>	2010-08-05 09:22:19 -05:00
Kevin Cernekee	034260d677	printk: fix delayed messages from CPU hotplug events When a secondary CPU is being brought up, it is not uncommon for printk() to be invoked when cpu_online(smp_processor_id()) == 0. The case that I witnessed personally was on MIPS: http://lkml.org/lkml/2010/5/30/4 If (can_use_console() == 0), printk() will spool its output to log_buf and it will be visible in "dmesg", but that output will NOT be echoed to the console until somebody calls release_console_sem() from a CPU that is online. Therefore, the boot time messages from the new CPU can get stuck in "limbo" for a long time, and might suddenly appear on the screen when a completely unrelated event (e.g. "eth0: link is down") occurs. This patch modifies the console code so that any pending messages are automatically flushed out to the console whenever a CPU hotplug operation completes successfully or aborts. The issue was seen on 2.6.34. Original patch by Kevin Cernekee with cleanups by akpm and additional fixes by Santosh Shilimkar. This patch supersedes https://patchwork.linux-mips.org/patch/1357/. Signed-off-by: Kevin Cernekee <cernekee@gmail.com> To: <mingo@elte.hu> To: <akpm@linux-foundation.org> To: <simon.kagstrom@netinsight.net> To: <David.Woodhouse@intel.com> To: <lethal@linux-sh.org> Cc: <linux-kernel@vger.kernel.org> Cc: <linux-mips@linux-mips.org> Reviewed-by: Paul Mundt <lethal@linux-sh.org> Signed-off-by: Kevin Cernekee <cernekee@gmail.com> Patchwork: https://patchwork.linux-mips.org/patch/1534/ LKML-Reference: <ede63b5a20af951c755736f035d1e787772d7c28@localhost> LKML-Reference: <EAF47CD23C76F840A9E7FCE10091EFAB02C5DB6D1F@dbde02.ent.ti.com> Signed-off-by: Ralf Baechle <ralf@linux-mips.org>	2010-08-05 13:25:59 +01:00
Ingo Molnar	0bcfe75807	Merge branch 'sched/urgent' into sched/core Conflicts: include/linux/sched.h Merge reason: Add the leftover .35 urgent bits, fix the conflict. Signed-off-by: Ingo Molnar <mingo@elte.hu>	2010-08-05 09:46:29 +02:00
Ingo Molnar	fc9ea5a1e5	Merge branch 'perf/core' of git://git.kernel.org/pub/scm/linux/kernel/git/acme/linux-2.6 into perf/core	2010-08-05 08:46:15 +02:00
Ingo Molnar	61be7fdec2	Merge branch 'perf/nmi' into perf/core Conflicts: kernel/Makefile Merge reason: Add the now complete topic, fix the conflict. Signed-off-by: Ingo Molnar <mingo@elte.hu>	2010-08-05 08:45:05 +02:00
Rusty Russell	51f3d0f474	module: cleanup comments, remove noinline On my (32-bit x86) machine, sys_init_module() uses 124 bytes of stack once load_module() is inlined. This effectively reverts `ffb4ba76` which inlined it due to stack pressure. Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>	2010-08-05 12:59:13 +09:30
Rusty Russell	811d66a0e1	module: group post-relocation functions into post_relocation() This simply hoists more code out of load_module; we also put the identification of the extable and dynamic debug table in with the others in find_module_sections(). We move the taint check to the actual add/remove of the dynamic debug info: this is certain (find_module_sections is too early). Signed-off-by: Rusty Russell <rusty@rustcorp.com.au> Cc: Yehuda Sadeh <yehuda@hq.newdream.net>	2010-08-05 12:59:13 +09:30
Rusty Russell	6526c534b2	module: move module args strndup_user to just before use Instead of copying and allocating the args and storing it in load_info, we can just allocate them right before we need them. Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>	2010-08-05 12:59:12 +09:30
Rusty Russell	49668688dd	module: pass load_info into other functions Pass the struct load_info into all the other functions in module loading. This neatens things and makes them more consistent. Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>	2010-08-05 12:59:10 +09:30
Rusty Russell	36b0360d17	module: fix sysfs cleanup for !CONFIG_SYSFS Restore the stub module_remove_modinfo_attrs, remove the now-unused !CONFIG_SYSFS module_sysfs_init. Also, rename mod_kobject_remove() to mod_sysfs_teardown() as it is the logical counterpart to mod_sysfs_setup now. Reported-by: Randy Dunlap <randy.dunlap@oracle.com> Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>	2010-08-05 12:59:10 +09:30
Rusty Russell	8f6d037815	module: sysfs cleanup We change the sysfs functions to take struct load_info, and call them all in mod_sysfs_setup(). We also clean up the #ifdefs a little. Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>	2010-08-05 12:59:09 +09:30
Rusty Russell	d913188c75	module: layout_and_allocate layout_and_allocate() does everything up to and including the final struct module placement inside the allocated module memory. We have to store the symbol layout information in our struct load_info though. This avoids the nasty code we had before where 'mod' pointed first to the version inside the temporary allocation containing the entire file, then later was moved to point to the real struct module: now the main code only ever sees the final module address. (Includes fix for the Tony Luck-found Linus-diagnosed failure path error). Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>	2010-08-05 12:59:09 +09:30
Rusty Russell	511ca6ae43	module: fix crash in get_ksymbol() when oopsing in module init Andrew had the sole pleasure of tickling this bug in linux-next; when we set up "info->strtab" it's pointing into the temporary copy of the module. For most uses that is fine, but kallsyms keeps a pointer around during module load (inside mod->strtab). If we oops for some reason inside a module's init function, kallsyms will use the mod->strtab pointer into the now-freed temporary module copy. (Later oopses work fine: after init we overwrite mod->strtab to point to a compacted core-only strtab). Reported-by: Andrew "Grumpy" Morton <akpm@linux-foundation.org> Signed-off-by: Rusty "Buggy" Russell <rusty@rustcorp.com.au> Tested-by: Andrew "Happy" Morton <akpm@linux-foundation.org> Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>	2010-08-05 12:59:08 +09:30
Rusty Russell	eded41c1c6	module: kallsyms functions take struct load_info Simple refactor causes us to lift struct definition to top of file. Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>	2010-08-05 12:59:08 +09:30
Rusty Russell	d6df72a06e	module: refactor out section header rewriting: FIX modversions We can't do the find_sec after removing the SHF_ALLOC flags; it won't find the sections. Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>	2010-08-05 12:59:07 +09:30
Rusty Russell	8b5f61a795	module: refactor out section header rewriting Put all the "rewrite and check section headers" in one place. This adds another iteration over the sections, but it's far clearer. We iterate once for every find_section() so we already iterate over many times. Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>	2010-08-05 12:59:06 +09:30
Linus Torvalds	3264d3f9dd	module: add load_info Btw, here's a patch that _looks_ large, but it's really pretty trivial, and sets things up so that it would be way easier to split off pieces of the module loading. The reason it looks large is that it creates a "module_info" structure that contains all the module state that we're building up while loading, instead of having individual variables for all the indices etc. So the patch ends up being large, because every "symindex" access instead becomes "info.index.sym" etc. That may be a few characters longer, but it then means that we can just pass a pointer to that "info" structure around, and let all the pieces fill it in very naturally. As an example of that, the patch also moves the initialization of all those convenience variables into a "setup_module_info()" function. And at this point it really does become very natural to start to peel off some of the error labels and move them into the helper functions - now the "truncated" case is gone, and is handled inside that setup function instead. So maybe you don't like this approach, and it does make the variable accesses a bit longer, but I don't think unreadably so. And the patch really does look big and scary, but there really should be absolutely no semantic changes - most of it was a trivial and mindless rename. In fact, it was so mindless that I on purpose kept the existing helper functions looking like this: - err = check_modinfo(mod, sechdrs, infoindex, versindex); + err = check_modinfo(mod, info.sechdrs, info.index.info, info.index.vers); rather than changing them to just take the "info" pointer. IOW, a second phase (if you think the approach is ok) would change that calling convention to just do err = check_modinfo(mod, &info); (and same for "layout_sections()", "layout_symtabs()" etc.) 
Similarly, while right now it makes things _look_ bigger, with things like this: versindex = find_sec(hdr, sechdrs, secstrings, "__versions"); becoming info->index.vers = find_sec(info->hdr, info->sechdrs, info->secstrings, "__versions"); in the new "setup_module_info()" function, that's again just a result of it being a search-and-replace patch. By using the 'info' pointer, we could just change the 'find_sec()' interface so that it ends up being info->index.vers = find_sec(info, "__versions"); instead, and then we'd actually have a shorter and more readable line. So for a lot of those mindless variable name expansions there would be room for separate cleanups. I didn't move quite everything in there - if we do this to layout_symtabs, for example, we'd want to move the percpu, symoffs, stroffs, *strmap variables to be fields in that module_info structure too. But that's a much smaller patch, I moved just the really core stuff that is currently being set up and used in various parts. But even in this rough form, it removes close to 70 lines from that function (but adds 22 lines overall, of course - the structure definition, the helper function declarations and call-sites etc etc). Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>	2010-08-05 12:59:06 +09:30
Linus Torvalds	44032e6316	module: reduce stack usage for each_symbol() And now that I'm looking at that call-chain (to see if it would make sense to use some other more specific lock - doesn't look like it: all the readers are using RCU and this is the only writer), I also give you this trivial one-liner. It changes each_symbol() to not put that constant array on the stack, resulting in changing movq $C.388.31095, %rsi #, tmp85 subq $376, %rsp #, movq %rdi, %rbx # fn, fn leaq -208(%rbp), %rdi #, tmp84 movq %rbx, %rdx # fn, rep movsl xorl %esi, %esi # leaq -208(%rbp), %rdi #, tmp87 movq %r12, %rcx # data, call each_symbol_in_section.clone.0 # into xorl %esi, %esi # subq $216, %rsp #, movq %rdi, %rbx # fn, fn movq $arr.31078, %rdi #, call each_symbol_in_section.clone.0 # which is not so much about being obviously shorter and simpler because we don't unnecessarily copy that constant array around onto the stack, but also about having a much smaller stack footprint (376 vs 216 bytes - see the update of 'rsp'). Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>	2010-08-05 12:59:06 +09:30
Rusty Russell	22e268ebec	module: refactor load_module part 5 1) Extract out the relocation loop into apply_relocations 2) Extract license and version checks into check_module_license_and_versions 3) Extract icache flushing into flush_module_icache 4) Move __obsparm warning into find_module_sections 5) Move license setting into check_modinfo. Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>	2010-08-05 12:59:05 +09:30
Rusty Russell	9f85a4bbb1	module: refactor load_module part 4 Allocate references inside module_unload_init(), clean up inside module_unload_free(). This version fixed to do allocation before __this_cpu_write, thanks to bug reports from linux-next from Dave Young <hidave.darkstar@gmail.com> and Stephen Rothwell <sfr@canb.auug.org.au>. Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>	2010-08-05 12:59:05 +09:30
Rusty Russell	40dd2560ec	module: refactor load_module part 3 Extract out the allocation and copying in from userspace, and the first set of modinfo checks. Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>	2010-08-05 12:59:04 +09:30
Linus Torvalds	65b8a9b4d5	module: refactor load_module part 2 Here's a second one. It's slightly less trivial - since we now have error cases - and equally untested so it may well be totally broken. But it also cleans up a bit more, and avoids one of the goto targets, because the "move_module()" helper now does both allocations or none. Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>	2010-08-05 12:59:03 +09:30
Linus Torvalds	f91a13bb99	module: refactor load_module I'd start from the trivial stuff. There's a fair amount of straight-line code that just makes the function hard to read just because you have to page up and down so far. Some of it is trivial to just create a helper function for. Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>	2010-08-05 12:59:02 +09:30
Eric Dumazet	2409e74278	module: module_unload_init() cleanup No need to clear mod->refptr in module_unload_init(), since alloc_percpu() already clears allocated chunks. Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com> Signed-off-by: Rusty Russell <rusty@rustcorp.com.au> (removed unused var)	2010-08-05 12:59:02 +09:30
Linus Torvalds	3cfc2c42c1	Merge branch 'for-next' of git://git.kernel.org/pub/scm/linux/kernel/git/jikos/trivial * 'for-next' of git://git.kernel.org/pub/scm/linux/kernel/git/jikos/trivial: (48 commits) Documentation: update broken web addresses. fix comment typo "choosed" -> "chosen" hostap:hostap_hw.c Fix typo in comment Fix spelling contorller -> controller in comments Kconfig.debug: FAIL_IO_TIMEOUT: typo Faul -> Fault fs/Kconfig: Fix typo Userpace -> Userspace Removing dead MACH_U300_BS26 drivers/infiniband: Remove unnecessary casts of private_data fs/ocfs2: Remove unnecessary casts of private_data libfc: use ARRAY_SIZE scsi: bfa: use ARRAY_SIZE drm: i915: use ARRAY_SIZE drm: drm_edid: use ARRAY_SIZE synclink: use ARRAY_SIZE block: cciss: use ARRAY_SIZE comment typo fixes: charater => character fix comment typos concerning "challenge" arm: plat-spear: fix typo in kerneldoc reiserfs: typo comment fix update email address ...	2010-08-04 15:31:02 -07:00
Linus Torvalds	b7c8e55db7	Merge git://git.kernel.org/pub/scm/linux/kernel/git/herbert/crypto-2.6 * git://git.kernel.org/pub/scm/linux/kernel/git/herbert/crypto-2.6: (39 commits) random: Reorder struct entropy_store to remove padding on 64bits padata: update API documentation padata: Remove padata_get_cpumask crypto: pcrypt - Update pcrypt cpumask according to the padata cpumask notifier crypto: pcrypt - Rename pcrypt_instance padata: Pass the padata cpumasks to the cpumask_change_notifier chain padata: Rearrange set_cpumask functions padata: Rename padata_alloc functions crypto: pcrypt - Dont calulate a callback cpu on empty callback cpumask padata: Check for valid cpumasks padata: Allocate cpumask dependend recources in any case padata: Fix cpu index counting crypto: geode_aes - Convert pci_table entries to PCI_VDEVICE (if PCI_ANY_ID is used) pcrypt: Added sysfs interface to pcrypt padata: Added sysfs primitives to padata subsystem padata: Make two separate cpumasks padata: update documentation padata: simplify serialization mechanism padata: make padata_do_parallel to return zero on success padata: Handle empty padata cpumasks ...	2010-08-04 15:23:14 -07:00
Linus Torvalds	6ba74014c1	Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net-next-2.6 * git://git.kernel.org/pub/scm/linux/kernel/git/davem/net-next-2.6: (1443 commits) phy/marvell: add 88ec048 support igb: Program MDICNFG register prior to PHY init e1000e: correct MAC-PHY interconnect register offset for 82579 hso: Add new product ID can: Add driver for esd CAN-USB/2 device l2tp: fix export of header file for userspace can-raw: Fix skb_orphan_try handling Revert "net: remove zap_completion_queue" net: cleanup inclusion phy/marvell: add 88e1121 interface mode support u32: negative offset fix net: Fix a typo from "dev" to "ndev" igb: Use irq_synchronize per vector when using MSI-X ixgbevf: fix null pointer dereference due to filter being set for VLAN 0 e1000e: Fix irq_synchronize in MSI-X case e1000e: register pm_qos request on hardware activation ip_fragment: fix subtracting PPPOE_SES_HLEN from mtu twice net: Add getsockopt support for TCP thin-streams cxgb4: update driver version cxgb4: add new PCI IDs ... Manually fix up conflicts in: - drivers/net/e1000e/netdev.c: due to pm_qos registration infrastructure changes - drivers/net/phy/marvell.c: conflict between adding 88ec048 support and cleaning up the IDs - drivers/net/wireless/ipw2x00/ipw2100.c: trivial ipw2100_pm_qos_req conflict (registration change vs marking it static)	2010-08-04 11:47:58 -07:00
David Howells	694f690d27	CRED: Fix RCU warning due to previous patch fixing __task_cred()'s checks Commit `8f92054e7c` ("CRED: Fix __task_cred()'s lockdep check and banner comment") fixed the lockdep checks on __task_cred(). This has shown up a place in the signalling code where a lock should be held - namely that check_kill_permission() requires its callers to hold the RCU lock. Fix group_send_sig_info() to get the RCU read lock around its call to check_kill_permission(). Without this patch, the following warning can occur: =================================================== [ INFO: suspicious rcu_dereference_check() usage. ] --------------------------------------------------- kernel/signal.c:660 invoked rcu_dereference_check() without protection! ... Reported-by: Tetsuo Handa <penguin-kernel@i-love.sakura.ne.jp> Signed-off-by: David Howells <dhowells@redhat.com> Acked-by: Oleg Nesterov <oleg@redhat.com> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2010-08-04 11:17:10 -07:00
Linus Torvalds	f46e9913fa	Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/rafael/suspend-2.6 * 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/rafael/suspend-2.6: PM / Runtime: Add runtime PM statistics (v3) PM / Runtime: Make runtime_status attribute not debug-only (v. 2) PM: Do not use dynamically allocated objects in pm_wakeup_event() PM / Suspend: Fix ordering of calls in suspend error paths PM / Hibernate: Fix snapshot error code path PM / Hibernate: Fix hibernation_platform_enter() pm_qos: Get rid of the allocation in pm_qos_add_request() pm_qos: Reimplement using plists plist: Add plist_last PM: Make it possible to avoid races between wakeup and system sleep PNPACPI: Add support for remote wakeup PM: describe kernel policy regarding wakeup defaults (v. 2) PM / Hibernate: Fix typos in comments in kernel/power/swap.c	2010-08-04 11:14:36 -07:00
Srikar Dronamraju	9da79ab83e	tracing/kprobes: unregister_trace_probe needs to be called under mutex Comment in unregister_trace_probe() says probe_lock will be held when it gets called. However there is a case where it might be called without the probe_lock being held. Also since we are traversing the probe_list and deleting an element from the probe_list, probe_lock should be held. This was first pointed out in uprobes traceevent review by Frederic Weisbecker here. (http://lkml.org/lkml/2010/5/12/106) Cc: Ingo Molnar <mingo@elte.hu> Cc: Masami Hiramatsu <masami.hiramatsu.pt@hitachi.com> Acked-by: Masami Hiramatsu <masami.hiramatsu.pt@hitachi.com> Acked-by: Steven Rostedt <rostedt@goodmis.org> LKML-Reference: <20100630084548.GA10325@linux.vnet.ibm.com> Signed-off-by: Srikar Dronamraju <srikar@linux.vnet.ibm.com> Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com>	2010-08-04 12:41:23 -03:00
Jiri Kosina	d790d4d583	Merge branch 'master' into for-next	2010-08-04 15:14:38 +02:00
Patrick Pannuto	5e7f5a178b	timer: Added usleep_range timer usleep_range is a finer precision implementation of msleep and is designed to be a drop-in replacement for udelay where a precise sleep / busy-wait is unnecessary. Since an easy interface to hrtimers could lead to an undesired proliferation of interrupts, we provide only a "range" API, forcing the caller to think about an acceptable tolerance on both ends and hopefully avoiding introducing another interrupt. INTRO As discussed here ( http://lkml.org/lkml/2007/8/3/250 ), msleep(1) is not precise enough for many drivers (yes, sleep precision is an unfair notion, but consistently sleeping for ~an order of magnitude greater than requested is worth fixing). This patch adds a usleep API so that udelay does not have to be used. Obviously not every udelay can be replaced (those in atomic contexts or being used for simple bitbanging come to mind), but there are many, many examples of mydriver_write(...) /* Wait for hardware to latch */ udelay(100) in various drivers where a busy-wait loop is neither beneficial nor necessary, but msleep simply does not provide enough precision and people are using a busy-wait loop instead. CONCERNS FROM THE RFC Why is udelay a problem / necessary? Most callers of udelay are in device/ driver initialization code, which is serial... As I see it, there is only benefit to sleeping over a delay; the notion of "refactoring" areas that use udelay was presented, but I see usleep as the refactoring. Consider i2c, if the bus is busy, you need to wait a bit (say 100us) before trying again, your current options are: * udelay(100) * msleep(1) <-- As noted above, actually as high as ~20ms on some platforms, so not really an option * Manually set up an hrtimer to try again in 100us (which is what usleep does anyway...) People choose the udelay route because it is EASY; we need to provide a better easy route. 
Device / driver / boot code is currently serial, but every few months someone makes noise about parallelizing boot, and IMHO, a little forward-thinking now is one less thing to worry about if/when that ever happens udelay's could be preempted Sure, but if udelay plans on looping 1000 times, and it gets preempted on loop 200, whenever it's scheduled again, it is going to do the next 800 loops. Is the interruptible case needed? Probably not, but I see usleep as a very logical parallel to msleep, so it made sense to include the "full" API. Processors are getting faster (albeit not as quickly as they are becoming more parallel), so if someone wanted to be interruptible for a few usecs, why not let them? If this is a contentious point, I'm happy to remove it. OTHER THOUGHTS I believe there is also value in exposing the usleep_range option; it gives the scheduler a lot more flexibility and allows the programmer to express his intent much more clearly; it's something I would hope future driver writers will take advantage of. To get the results in the NUMBERS section below, I literally s/udelay/usleep the kernel tree; I had to go in and undo the changes to the USB drivers, but everything else booted successfully; I find that extremely telling in and of itself -- many people are using a delay API where a sleep will suit them just fine. SOME ATTEMPTS AT NUMBERS It turns out that calculating quantifiable benefit on this is challenging, so instead I will simply present the current state of things, and I hope this to be sufficient: How many udelay calls are there in 2.6.35-rc5? udelay(ARG) >= \| COUNT 1000 \| 319 500 \| 414 100 \| 1146 20 \| 1832 I am working on Android, so that is my focus for this. 
The following table is a modified usleep that simply printk's the amount of time requested to sleep; these tests were run on a kernel with udelay >= 20 --> usleep "boot" is power-on to lock screen "power collapse" is when the power button is pushed and the device suspends "resume" is when the power button is pushed and the lock screen is displayed (no touchscreen events or anything, just turning on the display) "use device" is from the unlock swipe to clicking around a bit; there is no sd card in this phone, so fail loading music, video, camera ACTION \| TOTAL NUMBER OF USLEEP CALLS \| NET TIME (us) boot \| 22 \| 1250 power-collapse \| 9 \| 1200 resume \| 5 \| 500 use device \| 59 \| 7700 The most interesting category to me is the "use device" field; 7700us of busy-wait time that could be put towards better responsiveness, or at the least less power usage. Signed-off-by: Patrick Pannuto <ppannuto@codeaurora.org> Cc: apw@canonical.com Cc: corbet@lwn.net Cc: arjan@linux.intel.com Cc: Randy Dunlap <rdunlap@xenotime.net> Cc: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Thomas Gleixner <tglx@linutronix.de>	2010-08-04 11:00:45 +02:00
Thomas Gleixner	e1b004c3ef	Revert "timer: Added usleep[_range] timer" This reverts commit `22b8f15c2f` to merge an advanced version. Signed-off-by: Thomas Gleixner <tglx@linutronix.de>	2010-08-04 10:53:00 +02:00
Benjamin Herrenschmidt	412a4ac5e9	Merge commit 'gcl/next' into next	2010-08-04 10:26:03 +10:00
Jesse Barnes	8cadd2831b	timer: add on-stack deferrable timer interfaces In some cases (for instance with kernel threads) it may be desirable to use on-stack deferrable timers to get their power saving benefits. Add interfaces to support this for the IPS driver. Signed-off-by: Jesse Barnes <jbarnes@virtuousgeek.org> Signed-off-by: Matthew Garrett <mjg@redhat.com>	2010-08-03 09:48:45 -04:00
Arjan van de Ven	af5ab277de	clockevents: Remove the per cpu tick skew Historically, Linux has tried to make the regular timer tick on the various CPUs not happen at the same time, to avoid contention on xtime_lock. Nowadays, with the tickless kernel, this contention no longer happens since time keeping and updating are done differently. In addition, this skew is actually hurting power consumption in a measurable way on many-core systems. Signed-off-by: Arjan van de Ven <arjan@linux.intel.com> LKML-Reference: <20100727210210.58d3118c@infradead.org> Signed-off-by: Thomas Gleixner <tglx@linutronix.de>	2010-08-02 21:45:58 +02:00
Ingo Molnar	3772b73472	Merge commit 'v2.6.35' into perf/core Conflicts: tools/perf/Makefile tools/perf/util/hist.c Merge reason: Resolve the conflicts and update to latest upstream. Signed-off-by: Ingo Molnar <mingo@elte.hu>	2010-08-02 08:31:54 +02:00
Frederic Weisbecker	669336e4cf	perf: Use tracepoint_synchronize_unregister() to flush any pending tracepoint call We use synchronize_sched() to ensure a tracepoint won't be called while/after we release the perf buffers it references. But the tracepoint API has its own API for that: tracepoint_synchronize_unregister(). Use it instead as it's self-explanatory and eases maintenance. Signed-off-by: Frederic Weisbecker <fweisbec@gmail.com> Cc: Ingo Molnar <mingo@elte.hu> Cc: Peter Zijlstra <a.p.zijlstra@chello.nl> Cc: Arnaldo Carvalho de Melo <acme@redhat.com> Cc: Mathieu Desnoyers <mathieu.desnoyers@polymtl.ca> Cc: Steven Rostedt <rostedt@goodmis.org> Cc: Li Zefan <lizf@cn.fujitsu.com>	2010-08-02 01:30:56 +02:00
Suresh Siddha	6ee0578b4d	workqueue: mark init_workqueues() as early_initcall() Mark init_workqueues() as early_initcall() and thus it will be initialized before smp bringup. init_workqueues() registers for the hotcpu notifier and thus it should cope with the processors that are brought online after the workqueues are initialized. x86 smp bringup code uses workqueues and uses a workaround for the cold boot process (as the workqueues are initialized post smp_init()). Marking init_workqueues() as early_initcall() will pave the way for cleaning up this code. Signed-off-by: Suresh Siddha <suresh.b.siddha@intel.com> Signed-off-by: Tejun Heo <tj@kernel.org> Cc: Oleg Nesterov <oleg@redhat.com> Cc: Andrew Morton <akpm@linux-foundation.org>	2010-08-01 13:05:29 +02:00
Tejun Heo	098849516d	workqueue: explain for_each_cwq_cpu() iterators for_each_cwq_cpu() iterators are similar to regular CPU iterators except that they also consider the pseudo CPU number used for unbound workqueues. Explain them. Signed-off-by: Tejun Heo <tj@kernel.org> Cc: Andrew Morton <akpm@linux-foundation.org>	2010-08-01 11:50:12 +02:00
Steffen Klassert	0500e9b3f1	padata: Remove padata_get_cpumask A function that copies the padata cpumasks to a user buffer is a bit error prone. The cpumask can change any time so we can't be sure to have the right cpumask when using this function. A user who is interested in the padata cpumasks should register to the padata cpumask notifier chain instead. Users of padata_get_cpumask are already updated, so we can remove it. Signed-off-by: Steffen Klassert <steffen.klassert@secunet.com> Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>	2010-07-31 19:53:06 +08:00
Steffen Klassert	c635696c7c	padata: Pass the padata cpumasks to the cpumask_change_notifier chain We pass a pointer to the new padata cpumasks to the cpumask_change_notifier chain. So users can access the cpumasks without the need of an extra padata_get_cpumask function. Signed-off-by: Steffen Klassert <steffen.klassert@secunet.com> Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>	2010-07-31 19:53:05 +08:00
Steffen Klassert	65ff577e6b	padata: Rearrange set_cpumask functions padata_set_cpumask needs to be protected by a lock. We make __padata_set_cpumasks unlocked and static. So this function can be used by the exported and locked padata_set_cpumask and padata_set_cpumasks functions. Signed-off-by: Steffen Klassert <steffen.klassert@secunet.com> Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>	2010-07-31 19:53:04 +08:00
Steffen Klassert	e6cc117076	padata: Rename padata_alloc functions We rename padata_alloc to padata_alloc_possible because this function allocates a padata_instance and uses the cpu_possible mask for parallel and serial workers. We also rename __padata_alloc to padata_alloc to avoid exporting underscore-prefixed functions. Underscore-prefixed functions are considered to be private to padata. Users are updated accordingly. Signed-off-by: Steffen Klassert <steffen.klassert@secunet.com> Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>	2010-07-31 19:53:04 +08:00
David Howells	de09a9771a	CRED: Fix get_task_cred() and task_state() to not resurrect dead credentials It's possible for get_task_cred() as it currently stands to 'corrupt' a set of credentials by incrementing their usage count after their replacement by the task being accessed. What happens is that get_task_cred() can race with commit_creds(): TASK_1 TASK_2 RCU_CLEANER -->get_task_cred(TASK_2) rcu_read_lock() __cred = __task_cred(TASK_2) -->commit_creds() old_cred = TASK_2->real_cred TASK_2->real_cred = ... put_cred(old_cred) call_rcu(old_cred) [__cred->usage == 0] get_cred(__cred) [__cred->usage == 1] rcu_read_unlock() -->put_cred_rcu() [__cred->usage == 1] panic() However, since a task's credentials are generally not changed very often, we can reasonably make use of a loop involving reading the creds pointer and using atomic_inc_not_zero() to attempt to increment it if it hasn't already hit zero. If successful, we can safely return the credentials in the knowledge that, even if the task we're accessing has released them, they haven't gone to the RCU cleanup code. We then change task_state() in procfs to use get_task_cred() rather than calling get_cred() on the result of __task_cred(), as that suffers from the same problem. Without this change, a BUG_ON in __put_cred() or in put_cred_rcu() can be tripped when it is noticed that the usage count is not zero as it ought to be, for example: kernel BUG at kernel/cred.c:168! 
invalid opcode: 0000 [#1] SMP last sysfs file: /sys/kernel/mm/ksm/run CPU 0 Pid: 2436, comm: master Not tainted 2.6.33.3-85.fc13.x86_64 #1 0HR330/OptiPlex 745 RIP: 0010:[<ffffffff81069881>] [<ffffffff81069881>] __put_cred+0xc/0x45 RSP: 0018:ffff88019e7e9eb8 EFLAGS: 00010202 RAX: 0000000000000001 RBX: ffff880161514480 RCX: 00000000ffffffff RDX: 00000000ffffffff RSI: ffff880140c690c0 RDI: ffff880140c690c0 RBP: ffff88019e7e9eb8 R08: 00000000000000d0 R09: 0000000000000000 R10: 0000000000000001 R11: 0000000000000040 R12: ffff880140c690c0 R13: ffff88019e77aea0 R14: 00007fff336b0a5c R15: 0000000000000001 FS: 00007f12f50d97c0(0000) GS:ffff880007400000(0000) knlGS:0000000000000000 CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 CR2: 00007f8f461bc000 CR3: 00000001b26ce000 CR4: 00000000000006f0 DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000 DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400 Process master (pid: 2436, threadinfo ffff88019e7e8000, task ffff88019e77aea0) Stack: ffff88019e7e9ec8 ffffffff810698cd ffff88019e7e9ef8 ffffffff81069b45 <0> ffff880161514180 ffff880161514480 ffff880161514180 0000000000000000 <0> ffff88019e7e9f28 ffffffff8106aace 0000000000000001 0000000000000246 Call Trace: [<ffffffff810698cd>] put_cred+0x13/0x15 [<ffffffff81069b45>] commit_creds+0x16b/0x175 [<ffffffff8106aace>] set_current_groups+0x47/0x4e [<ffffffff8106ac89>] sys_setgroups+0xf6/0x105 [<ffffffff81009b02>] system_call_fastpath+0x16/0x1b Code: 48 8d 71 ff e8 7e 4e 15 00 85 c0 78 0b 8b 75 ec 48 89 df e8 ef 4a 15 00 48 83 c4 18 5b c9 c3 55 8b 07 8b 07 48 89 e5 85 c0 74 04 <0f> 0b eb fe 65 48 8b 04 25 00 cc 00 00 48 3b b8 58 04 00 00 75 RIP [<ffffffff81069881>] __put_cred+0xc/0x45 RSP <ffff88019e7e9eb8> ---[ end trace df391256a100ebdd ]--- Signed-off-by: David Howells <dhowells@redhat.com> Acked-by: Jiri Olsa <jolsa@redhat.com> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2010-07-29 15:16:17 -07:00
Ian Campbell	685fd0b4ea	irq: Add new IRQ flag IRQF_NO_SUSPEND A small number of users of IRQF_TIMER are using it for the implied no suspend behaviour on interrupts which are not timer interrupts. Therefore add a new IRQF_NO_SUSPEND flag, rename IRQF_TIMER to __IRQF_TIMER and redefine IRQF_TIMER in terms of these new flags. Signed-off-by: Ian Campbell <ian.campbell@citrix.com> Cc: Jeremy Fitzhardinge <jeremy@goop.org> Cc: Dmitry Torokhov <dmitry.torokhov@gmail.com> Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org> Cc: Paul Mackerras <paulus@samba.org> Cc: Grant Likely <grant.likely@secretlab.ca> Cc: xen-devel@lists.xensource.com Cc: linux-input@vger.kernel.org Cc: linuxppc-dev@ozlabs.org Cc: devicetree-discuss@lists.ozlabs.org LKML-Reference: <1280398595-29708-1-git-send-email-ian.campbell@citrix.com> Signed-off-by: Thomas Gleixner <tglx@linutronix.de>	2010-07-29 13:24:57 +02:00
Thomas Gleixner	157b1a2385	kgdb: Do not access xtime directly The xtime cleanup missed the kgdb access to xtime. Fix it. Signed-off-by: Thomas Gleixner <tglx@linutronix.de>	2010-07-29 10:29:39 +02:00
Eric Paris	1968f5eed5	fanotify: use both marks when possible fanotify currently, when given a vfsmount_mark will look up (if it exists) the corresponding inode mark. This patch drops that lookup and uses the mark provided. Signed-off-by: Eric Paris <eparis@redhat.com>	2010-07-28 10:18:55 -04:00
Eric Paris	ce8f76fb73	fsnotify: pass both the vfsmount mark and inode mark should_send_event() and handle_event() will both need to look up the inode event if they get a vfsmount event. Let's just pass both at the same time since we have them both after walking the lists in lockstep. Signed-off-by: Eric Paris <eparis@redhat.com>	2010-07-28 10:18:54 -04:00
Eric Paris	43709a288e	fsnotify: remove group->mask group->mask is now useless. It was originally a shortcut for fsnotify to save on performance. These checks are now redundant, so we remove them. Signed-off-by: Eric Paris <eparis@redhat.com>	2010-07-28 10:18:54 -04:00
Eric Paris	2612abb51b	fsnotify: cleanup should_send_event The change to use srcu and walk the object list rather than the global fsnotify_group list means that should_send_event is no longer needed for a number of groups and can be simplified for others. Do that. Signed-off-by: Eric Paris <eparis@redhat.com>	2010-07-28 10:18:53 -04:00
Eric Paris	4cd76a4792	audit: use the mark in handler functions audit now gets a mark in the should_send_event and handle_event functions. Rather than look up the mark themselves audit should just use the mark it was handed. Signed-off-by: Eric Paris <eparis@redhat.com>	2010-07-28 10:18:53 -04:00
Eric Paris	3a9b16b407	fsnotify: send fsnotify_mark to groups in event handling functions With the change of fsnotify to use srcu walking the marks list instead of walking the global groups list we now know the mark in question. The code can send the mark to the group's handling functions and the groups won't have to find those marks themselves. Signed-off-by: Eric Paris <eparis@redhat.com>	2010-07-28 10:18:52 -04:00
Eric Paris	3bcf3860a4	fsnotify: store struct file not struct path Al explains that calling dentry_open() with a mnt/dentry pair is only guaranteed to be safe if they are already used in an open struct file. To make sure this is the case don't store and use a struct path in fsnotify, always use a struct file. Signed-off-by: Eric Paris <eparis@redhat.com>	2010-07-28 10:18:51 -04:00
Dave Young	d14f172948	sysctl extern cleanup: inotify Extern declarations in sysctl.c should be moved to their own header files, which are then included in the relevant .c files. Move inotify_table extern declaration to linux/inotify.h Signed-off-by: Dave Young <hidave.darkstar@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Eric Paris <eparis@redhat.com>	2010-07-28 09:59:01 -04:00
Alexey Dobriyan	6e006701cc	dnotify: move dir_notify_enable declaration Move dir_notify_enable declaration to where it belongs -- dnotify.h . Signed-off-by: Alexey Dobriyan <adobriyan@gmail.com> Signed-off-by: Eric Paris <eparis@redhat.com>	2010-07-28 09:59:01 -04:00
Eric Paris	5444e2981c	fsnotify: split generic and inode specific mark code currently all marking is done by functions in inode-mark.c. Some of this is pretty generic and should be instead done in a generic function and we should only put the inode specific code in inode-mark.c Signed-off-by: Eric Paris <eparis@redhat.com>	2010-07-28 09:58:57 -04:00
Eric Paris	bbaa4168b2	fanotify: sys_fanotify_mark declaration This patch simply declares the new sys_fanotify_mark syscall int fanotify_mark(int fanotify_fd, unsigned int flags, u64 mask, int dfd, const char *pathname) Signed-off-by: Eric Paris <eparis@redhat.com>	2010-07-28 09:58:55 -04:00
Eric Paris	11637e4b7d	fanotify: fanotify_init syscall declaration This patch defines a new syscall fanotify_init() of the form: int sys_fanotify_init(unsigned int flags, unsigned int event_f_flags, unsigned int priority) This syscall is used to create an fanotify group. This is very similar to the inotify_init() syscall. Signed-off-by: Eric Paris <eparis@redhat.com>	2010-07-28 09:58:55 -04:00
Andreas Gruenbacher	3556608709	fsnotify: take inode->i_lock inside fsnotify_find_mark_entry() All callers to fsnotify_find_mark_entry() except one take and release inode->i_lock around the call. Take the lock inside fsnotify_find_mark_entry() instead. Signed-off-by: Andreas Gruenbacher <agruen@suse.de> Signed-off-by: Eric Paris <eparis@redhat.com>	2010-07-28 09:58:54 -04:00
Eric Paris	d07754412f	fsnotify: rename fsnotify_find_mark_entry to fsnotify_find_mark the _entry portion of fsnotify functions is useless. Drop it. Signed-off-by: Eric Paris <eparis@redhat.com>	2010-07-28 09:58:53 -04:00
Eric Paris	e61ce86737	fsnotify: rename fsnotify_mark_entry to just fsnotify_mark The name is long and it serves no real purpose. So rename fsnotify_mark_entry to just fsnotify_mark. Signed-off-by: Eric Paris <eparis@redhat.com>	2010-07-28 09:58:53 -04:00
Eric Paris	2823e04de4	fsnotify: put inode specific fields in an fsnotify_mark in a union The addition of marks on vfs mounts will be simplified if the inode specific parts of a mark and the vfsmnt specific parts of a mark are actually in a union so naming can be easy. This patch just implements the inode struct and the union. Signed-off-by: Eric Paris <eparis@redhat.com>	2010-07-28 09:58:52 -04:00
Eric Paris	3a9fb89f4c	fsnotify: include vfsmount in should_send_event when appropriate To ensure that a group will not duplicate events when it receives it based on the vfsmount and the inode should_send_event test we should distinguish those two cases. We pass a vfsmount to this function so groups can make their own determinations. Signed-off-by: Eric Paris <eparis@redhat.com>	2010-07-28 09:58:52 -04:00
Eric Paris	0d2e2a1d00	fsnotify: drop mask argument from fsnotify_alloc_group Nothing uses the mask argument to fsnotify_alloc_group. This patch drops that argument. Signed-off-by: Eric Paris <eparis@redhat.com>	2010-07-28 09:58:51 -04:00
Eric Paris	220d14df0d	Audit: only set group mask when something is being watched Currently the audit watch group always sets a mask equal to all events it might care about. We instead should only set the group mask if we are actually watching inodes. This should be a perf win when audit watches are compiled in. Signed-off-by: Eric Paris <eparis@redhat.com>	2010-07-28 09:58:51 -04:00
Eric Paris	ffab83402f	fsnotify: fsnotify_obtain_group should be fsnotify_alloc_group fsnotify_obtain_group was intended to be able to find an already existing group. Nothing uses that functionality. This just renames it to fsnotify_alloc_group so it is clear what it is doing. Signed-off-by: Eric Paris <eparis@redhat.com>	2010-07-28 09:58:50 -04:00
Eric Paris	74be0cc828	fsnotify: remove group_num altogether The original fsnotify interface has a group-num which was intended to be able to find a group after it was added. I no longer think this is a necessary thing to do and so we remove the group_num. Signed-off-by: Eric Paris <eparis@redhat.com>	2010-07-28 09:58:50 -04:00
Eric Paris	8112e2d6a7	fsnotify: include data in should_send calls fanotify is going to need to look at file->private_data to know if an event should be sent or not. This passes the data (which might be a file, dentry, inode, or none) to the should_send function calls so fanotify can get that information when available Signed-off-by: Eric Paris <eparis@redhat.com>	2010-07-28 09:58:31 -04:00
Eric Paris	7b0a04fbfb	fsnotify: provide the data type to should_send_event fanotify is only interested in event types which contain enough information to open the original file in the context of the fanotify listener. Since fanotify may not want to send events if that data isn't present we pass the data type to the should_send_event function call so fanotify can express its lack of interest. Signed-off-by: Eric Paris <eparis@redhat.com>	2010-07-28 09:58:31 -04:00
Eric Paris	2dfc1cae4c	inotify: remove inotify in kernel interface nothing uses inotify in the kernel, drop it! Signed-off-by: Eric Paris <eparis@redhat.com>	2010-07-28 09:58:31 -04:00
Eric Paris	1a3aedbce4	Audit: audit watch init should not be before fsnotify init Audit watch init and fsnotify init both use subsys_initcall() but since the audit watch code is linked in before the fsnotify code the audit watch code would be using the fsnotify srcu struct before it was initialized. This patch fixes that problem by moving audit watch init to device_initcall() so it happens after fsnotify is ready. Reported-by: Stephen Rothwell <sfr@canb.auug.org.au> Signed-off-by: Eric Paris <eparis@redhat.com> Tested-by: Sachin Sant <sachinp@in.ibm.com>	2010-07-28 09:58:19 -04:00
Eric Paris	939a67fc4c	Audit: split audit watch Kconfig Audit watch should depend on CONFIG_AUDIT_SYSCALL and should select FSNOTIFY. This splits the spaghetti-like mixing of audit_watch and audit_filter code so they can be configured separately. Signed-off-by: Eric Paris <eparis@redhat.com>	2010-07-28 09:58:19 -04:00
Eric Paris	28a3a7eb3b	audit: reimplement audit_trees using fsnotify rather than inotify Simply switch audit_trees from using inotify to using fsnotify for its inode pinning and disappearing act information. Signed-off-by: Eric Paris <eparis@redhat.com>	2010-07-28 09:58:17 -04:00
Eric Paris	40554c3dae	fsnotify: allow addition of duplicate fsnotify marks This patch allows a task to add a second fsnotify mark to an inode for the same group. This mark will be added to the end of the inode's list and this will never be found by the standard fsnotify_find_mark() function. This is useful if a user wants to add a new mark before removing the old one. Signed-off-by: Eric Paris <eparis@redhat.com>	2010-07-28 09:58:17 -04:00
Eric Paris	a05fb6cc57	audit: do not get and put just to free a watch deleting audit watch rules is not currently done under audit_filter_mutex. It was done this way because we could not hold the mutex during inotify manipulation. Since we are using fsnotify we don't need to do the extra get/put pair nor do we need the private list on which to store the parents while they are about to be freed. Signed-off-by: Eric Paris <eparis@redhat.com>	2010-07-28 09:58:17 -04:00
Eric Paris	e118e9c563	audit: redo audit watch locking and refcnt in light of fsnotify fsnotify can handle mutexes to be held across all fsnotify operations since it deals strictly in spinlocks. This can simplify and reduce some of the audit_filter_mutex taking and dropping. Signed-off-by: Eric Paris <eparis@redhat.com>	2010-07-28 09:58:16 -04:00
Eric Paris	e9fd702a58	audit: convert audit watches to use fsnotify instead of inotify Audit currently uses inotify to pin inodes in core and to detect when watched inodes are deleted or unmounted. This patch uses fsnotify instead of inotify. Signed-off-by: Eric Paris <eparis@redhat.com>	2010-07-28 09:58:16 -04:00
Eric Paris	ae7b8f4108	Audit: clean up the audit_watch split No real changes, just cleanup to the audit_watch split patch which was done with minimal code changes for easy review. Now fix interfaces to make things work better. Signed-off-by: Eric Paris <eparis@redhat.com>	2010-07-28 09:58:16 -04:00
Sridhar Samudrala	d7926ee38f	cgroups: Add an API to attach a task to current task's cgroup Add a new kernel API to attach a task to current task's cgroup in all the active hierarchies. Signed-off-by: Sridhar Samudrala <sri@us.ibm.com> Signed-off-by: Michael S. Tsirkin <mst@redhat.com> Reviewed-by: Paul Menage <menage@google.com> Acked-by: Li Zefan <lizf@cn.fujitsu.com>	2010-07-28 15:45:12 +03:00
Jason Baron	b82bab4bbe	dynamic debug: move ddebug_remove_module() down into free_module() The command echo "file ec.c +p" >/sys/kernel/debug/dynamic_debug/control causes an oops. Move the call to ddebug_remove_module() down into free_module(). In this way it should be called from all error paths. Currently, we are missing the remove if the module init routine fails. Signed-off-by: Jason Baron <jbaron@redhat.com> Reported-by: Thomas Renninger <trenn@suse.de> Tested-by: Thomas Renninger <trenn@suse.de> Cc: <stable@kernel.org> [2.6.32+] Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2010-07-27 14:32:06 -07:00
John Stultz	852db46d55	clocksource: Add __clocksource_updatefreq_hz/khz methods To properly handle clocksources that change frequencies at the clocksource->enable() point, this patch adds a method that will update the clocksource's mult/shift and max_idle_ns values. Signed-off-by: John Stultz <johnstul@us.ibm.com> LKML-Reference: <1279068988-21864-12-git-send-email-johnstul@us.ibm.com> Signed-off-by: Thomas Gleixner <tglx@linutronix.de>	2010-07-27 12:40:55 +02:00
John Stultz	0fb86b0629	timekeeping: Make xtime and wall_to_monotonic static This patch makes xtime and wall_to_monotonic static, as planned in Documentation/feature-removal-schedule.txt. This will allow for further cleanups to the timekeeping core. Signed-off-by: John Stultz <johnstul@us.ibm.com> LKML-Reference: <1279068988-21864-10-git-send-email-johnstul@us.ibm.com> Signed-off-by: Thomas Gleixner <tglx@linutronix.de>	2010-07-27 12:40:55 +02:00
John Stultz	8ab4351a4c	hrtimer: Cleanup direct access to wall_to_monotonic Provides an accessor function to replace hrtimer.c's direct access of wall_to_monotonic. This will allow wall_to_monotonic to be made static as planned in Documentation/feature-removal-schedule.txt Signed-off-by: John Stultz <johnstul@us.ibm.com> LKML-Reference: <1279068988-21864-9-git-send-email-johnstul@us.ibm.com> Signed-off-by: Thomas Gleixner <tglx@linutronix.de>	2010-07-27 12:40:55 +02:00
John Stultz	7615856ebf	timekeeping: Fix update_vsyscall to provide wall_to_monotonic offset update_vsyscall() did not provide the wall_to_monotonic offset, so arch specific implementations tend to reference wall_to_monotonic directly. This limits future cleanups in the timekeeping core, so this patch fixes the update_vsyscall interface to provide wall_to_monotonic, allowing wall_to_monotonic to be made static as planned in Documentation/feature-removal-schedule.txt Signed-off-by: John Stultz <johnstul@us.ibm.com> Cc: Martin Schwidefsky <schwidefsky@de.ibm.com> Cc: Anton Blanchard <anton@samba.org> Cc: Paul Mackerras <paulus@samba.org> Cc: Tony Luck <tony.luck@intel.com> LKML-Reference: <1279068988-21864-7-git-send-email-johnstul@us.ibm.com> Signed-off-by: Thomas Gleixner <tglx@linutronix.de>	2010-07-27 12:40:54 +02:00
John Stultz	592913ecb8	time: Kill off CONFIG_GENERIC_TIME Now that all arches have been converted over to use generic time via clocksources or arch_gettimeoffset(), we can remove the GENERIC_TIME config option and simplify the generic code. Signed-off-by: John Stultz <johnstul@us.ibm.com> LKML-Reference: <1279068988-21864-4-git-send-email-johnstul@us.ibm.com> Signed-off-by: Thomas Gleixner <tglx@linutronix.de>	2010-07-27 12:40:54 +02:00
John Stultz	ce3bf7ab22	time: Implement timespec_add After accidentally misusing timespec_add_safe, I wanted to make sure we don't accidentally trip over that issue again, so I created a simple timespec_add() function which we can use to replace the instances of timespec_add_safe() that don't want the overflow detection. Signed-off-by: John Stultz <johnstul@us.ibm.com> LKML-Reference: <1279068988-21864-3-git-send-email-johnstul@us.ibm.com> Signed-off-by: Thomas Gleixner <tglx@linutronix.de>	2010-07-27 12:40:53 +02:00
Steffen Klassert	7424713b83	padata: Check for valid cpumasks Now that we allow to change the cpumasks from userspace, we have to check for valid cpumasks in padata_do_parallel. This patch adds the necessary check. This fixes a division by zero crash if the parallel cpumask contains no active cpu. Signed-off-by: Steffen Klassert <steffen.klassert@secunet.com> Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>	2010-07-26 14:13:58 +08:00
Steffen Klassert	b89661dff5	padata: Allocate cpumask dependent resources in any case The cpumask separation work assumes the cpumask dependent resources are present regardless of valid or invalid cpumasks. With this patch we allocate the cpumask dependent resources in any case. This fixes two NULL pointer dereference crashes in padata_replace and in padata_get_cpumask. Signed-off-by: Steffen Klassert <steffen.klassert@secunet.com> Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>	2010-07-26 14:13:58 +08:00
Steffen Klassert	fad3a906d3	padata: Fix cpu index counting The counting of the cpu index got lost with a recent commit. This patch restores it. This fixes a hang of the parallel worker threads on cpu hotplug. Signed-off-by: Steffen Klassert <steffen.klassert@secunet.com> Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>	2010-07-26 14:13:57 +08:00
Andrey Vagin	2b08de0073	posix_timer: Move copy_to_user(created_timer_id) down in timer_create() According to Oleg Nesterov: We can move copy_to_user(created_timer_id) down after "if (timer_event_spec)" block too. (but before CLOCK_DISPATCH(), of course). Signed-off-by: Andrey Vagin <avagin@openvz.org> Cc: Oleg Nesterov <oleg@tv-sign.ru> Cc: Pavel Emelyanov <xemul@openvz.org> Cc: Stanislaw Gruszka <sgruszka@redhat.com> Cc: Andrey Vagin <avagin@openvz.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Thomas Gleixner <tglx@linutronix.de>	2010-07-23 15:08:12 +02:00
Patrick Pannuto	22b8f15c2f	timer: Added usleep[_range] timer usleep[_range] are finer precision implementations of msleep and are designed to be drop-in replacements for udelay where a precise sleep / busy-wait is unnecessary. They also allow an easy interface to specify slack when a precise (ish) wakeup is unnecessary to help minimize wakeups Signed-off-by: Patrick Pannuto <ppannuto@codeaurora.org> Cc: akinobu.mita@gmail.com Cc: sboyd@codeaurora.org Acked-by: Arjan van de Ven <arjan@linux.intel.com> LKML-Reference: <4C44CDD2.1070708@codeaurora.org> Signed-off-by: Thomas Gleixner <tglx@linutronix.de>	2010-07-23 15:08:12 +02:00
J. Bruce Fields	866e26115c	timers: Document meaning of deferrable timer Steal some text from `6e453a6751` "Add support for deferrable timers". A reader shouldn't have to dig through the git logs for the basic description of a deferrable timer. Signed-off-by: J. Bruce Fields <bfields@citi.umich.edu> Cc: johnstul@us.ibm.com Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Thomas Gleixner <tglx@linutronix.de>	2010-07-23 15:08:12 +02:00
Tejun Heo	181a51f6e0	slow-work: kill it slow-work doesn't have any user left. Kill it. Signed-off-by: Tejun Heo <tj@kernel.org> Acked-by: David Howells <dhowells@redhat.com>	2010-07-23 13:14:49 +02:00
Ingo Molnar	3a01736e70	Merge branch 'tip/perf/core' of git://git.kernel.org/pub/scm/linux/kernel/git/rostedt/linux-2.6-trace into perf/core	2010-07-23 09:10:29 +02:00
Tejun Heo	e120153ddf	workqueue: fix how cpu number is stored in work->data Once a work starts execution, its data contains the cpu number it was on instead of pointing to cwq. This is added by commit `7a22ad75` (workqueue: carry cpu number in work data once execution starts) to reliably determine which cpu the work was last on even if the workqueue itself was destroyed in between. Whether data points to a cwq or contains a cpu number was distinguished by comparing the value against PAGE_OFFSET. The assumption was that a cpu number should be below PAGE_OFFSET while a pointer to cwq should be above it. However, on architectures which use separate address spaces for user and kernel spaces, this doesn't hold as PAGE_OFFSET is zero. Fix it by using an explicit flag, WORK_STRUCT_CWQ, to mark what the data field contains. If the flag is set, it's pointing to a cwq; otherwise, it contains a cpu number. Reported on s390 and microblaze during linux-next testing. Signed-off-by: Tejun Heo <tj@kernel.org> Reported-by: Sachin Sant <sachinp@in.ibm.com> Reported-by: Michal Simek <michal.simek@petalogix.com> Reported-by: Martin Schwidefsky <schwidefsky@de.ibm.com> Tested-by: Martin Schwidefsky <schwidefsky@de.ibm.com> Tested-by: Michal Simek <monstr@monstr.eu>	2010-07-22 22:39:22 +02:00
Dan Carpenter	24a461d537	trace: strlen() return doesn't account for the NULL We need to add one to the strlen() return because of the NULL character. The type->name here generally comes from the kernel and I don't think any of them come close to being MAX_TRACER_SIZE (100) characters long so this is basically a cleanup. Signed-off-by: Dan Carpenter <error27@gmail.com> LKML-Reference: <20100710100644.GV19184@bicker> Signed-off-by: Steven Rostedt <rostedt@goodmis.org>	2010-07-22 14:56:41 -04:00
Jason Wessel	edd63cb6b9	sysrq,kdb: Use __handle_sysrq() for kdb's sysrq function The kdb code should not toggle the sysrq state in case an end user wants to try and resume the normal kernel execution. Signed-off-by: Jason Wessel <jason.wessel@windriver.com> Acked-by: Dmitry Torokhov <dmitry.torokhov@gmail.com>	2010-07-21 19:27:07 -05:00
Jason Wessel	b0679c63db	debug_core,kdb: fix kgdb_connected bit set in the wrong place Immediately following an exit from the kdb shell the kgdb_connected variable should be set to zero, unless there are breakpoints planted. If the kgdb_connected variable is not zeroed out with kdb, it is impossible to turn off kdb. This patch is merely a work around for now, the real fix will check for the breakpoints. Signed-off-by: Jason Wessel <jason.wessel@windriver.com>	2010-07-21 19:27:07 -05:00
Jason Wessel	9e8b624fca	Fix merge regression from external kdb to upstream kdb In the process of merging kdb to the mainline, the kdb lsmod command stopped printing the base load address of kernel modules. This is needed for using kdb in conjunction with external tools such as gdb. Simply restore the functionality by adding a kdb_printf for the base load address of the kernel modules. Signed-off-by: Jason Wessel <jason.wessel@windriver.com>	2010-07-21 19:27:06 -05:00
Jason Wessel	fb82c0ff27	repair gdbstub to match the gdbserial protocol specification The gdbserial protocol handler should return an empty packet instead of an error string when ever it responds to a command it does not implement. The problem cases come from a debugger client sending qTBuffer, qTStatus, qSearch, qSupported. The incorrect response from the gdbstub leads the debugger clients to not function correctly. Recent versions of gdb will not detach correctly as a result of this behavior. Signed-off-by: Jason Wessel <jason.wessel@windriver.com> Signed-off-by: Dongdong Deng <dongdong.deng@windriver.com>	2010-07-21 19:27:05 -05:00
Martin Hicks	1396a21ba0	kdb: break out of kdb_ll() when command is terminated Without this patch the "ll" linked-list traversal command won't terminate when you hit q/Q. Signed-off-by: Martin Hicks <mort@sgi.com> Signed-off-by: Jason Wessel <jason.wessel@windriver.com>	2010-07-21 19:27:05 -05:00
Josh Hunt	eebef74695	sched: Use correct macro to display sched_child_runs_first in /proc/sched_debug The sched_child_runs_first value in /proc/sched_debug is currently displayed using a macro meant to split ns time values. This patch uses the correct macro to display it as a plain decimal value. Signed-off-by: Josh Hunt <johunt@akamai.com> Cc: peterz@infradead.org Cc: juhlenko@akamai.com Cc: efault@gmx.de LKML-Reference: <1279567876-25418-1-git-send-email-johunt@akamai.com> Signed-off-by: Ingo Molnar <mingo@elte.hu>	2010-07-21 21:46:12 +02:00
Ingo Molnar	dca45ad8af	Merge branch 'linus' into sched/core Merge reason: Move from the -rc3 to the almost-rc6 base. Signed-off-by: Ingo Molnar <mingo@elte.hu>	2010-07-21 21:45:08 +02:00
Ingo Molnar	23c2875725	Merge branch 'perf/core' of git://git.kernel.org/pub/scm/linux/kernel/git/frederic/random-tracing into perf/core	2010-07-21 21:44:18 +02:00
Ingo Molnar	9dcdbf7a33	Merge branch 'linus' into perf/core Merge reason: Pick up the latest perf fixes. Signed-off-by: Ingo Molnar <mingo@elte.hu>	2010-07-21 21:43:06 +02:00
KOSAKI Motohiro	ef710e100c	tracing: Shrink max latency ringbuffer if unnecessary Documentation/trace/ftrace.txt says buffer_size_kb: This sets or displays the number of kilobytes each CPU buffer can hold. The tracer buffers are the same size for each CPU. The displayed number is the size of the CPU buffer and not total size of all buffers. The trace buffers are allocated in pages (blocks of memory that the kernel uses for allocation, usually 4 KB in size). If the last page allocated has room for more bytes than requested, the rest of the page will be used, making the actual allocation bigger than requested. ( Note, the size may not be a multiple of the page size due to buffer management overhead. ) This can only be updated when the current_tracer is set to "nop". But it's incorrect. Currently the total memory consumption is 'buffer_size_kb x CPUs x 2'. Why the factor of two? Because ftrace implicitly allocates a buffer for max latency too. That gives a sad result when an admin wants to use a large buffer (e.g. for full logging and detailed analysis). For example, if an admin has a 24-CPU machine and writes 200MB to buffer_size_kb, the system consumes ~10GB of memory (200MB x 24 x 2). Roughly 5GB of that is wasted, which is usually unacceptable. Fortunately, almost all users don't use the max latency feature, and the max latency buffer can be disabled easily. This patch shrinks the max latency buffer if it is unnecessary. Signed-off-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com> LKML-Reference: <20100701104554.DA2D.A69D9226@jp.fujitsu.com> Signed-off-by: Steven Rostedt <rostedt@goodmis.org>	2010-07-21 10:20:17 -04:00
Lai Jiangshan	bc289ae98b	tracing: Reduce latency and remove percpu trace_seq __print_flags() and __print_symbolic() use percpu trace_seq: 1) Its memory is allocated at compile time, it wastes memory if we don't use tracing. 2) It is percpu data and it wastes more memory for multi-cpus system. 3) It disables preemption when it executes its core routine "trace_seq_printf(s, "%s: ", #call);" and introduces latency. So we move this trace_seq to struct trace_iterator. Signed-off-by: Lai Jiangshan <laijs@cn.fujitsu.com> LKML-Reference: <4C078350.7090106@cn.fujitsu.com> Signed-off-by: Steven Rostedt <rostedt@goodmis.org>	2010-07-20 22:05:34 -04:00
Richard Kennedy	985023dee6	trace: Reorder struct ring_buffer_per_cpu to remove padding on 64bit Reorder structure to remove 8 bytes of padding on 64 bit builds. This shrinks the size to 128 bytes, allowing allocation from a smaller slab and needing one fewer cache line. Signed-off-by: Richard Kennedy <richard@rsk.demon.co.uk> LKML-Reference: <1269516456.2054.8.camel@localhost> Signed-off-by: Steven Rostedt <rostedt@goodmis.org>	2010-07-20 21:58:44 -04:00
Li Zefan	e870e9a124	tracing: Allow to disable cmdline recording We found that even enabling a single trace event that will rarely be triggered can add big overhead to context switch. (lmbench context switch test) ------------------------------------------------- 2p/0K 2p/16K 2p/64K 8p/16K 8p/64K 16p/16K 16p/64K ctxsw ctxsw ctxsw ctxsw ctxsw ctxsw ctxsw ------ ------ ------ ------ ------ ------- ------- 2.19 2.3 2.21 2.56 2.13 2.54 2.07 2.39 2.51 2.35 2.75 2.27 2.81 2.24 The overhead is 6% ~ 11%. It's because when a trace event is enabled 3 tracepoints (sched_switch, sched_wakeup, sched_wakeup_new) will be activated to map pid to cmdname. We'd like to avoid this overhead, so add a trace option '(no)record-cmd' to allow to disable cmdline recording. Signed-off-by: Li Zefan <lizf@cn.fujitsu.com> LKML-Reference: <4C2D57F4.2050204@cn.fujitsu.com> Signed-off-by: Steven Rostedt <rostedt@goodmis.org>	2010-07-20 21:52:33 -04:00
Neil Horman	70d4bf6d46	drop_monitor: convert some kfree_skb call sites to consume_skb Convert a few calls from kfree_skb to consume_skb Noticed while I was working on dropwatch that I was detecting lots of internal skb drops in several places. While some are legitimate, several were not, freeing skbs that were at the end of their life, rather than being discarded due to an error. This patch converts those calls sites from using kfree_skb to consume_skb, which quiets the in-kernel drop_monitor code from detecting them as drops. Tested successfully by myself Signed-off-by: Neil Horman <nhorman@tuxdriver.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2010-07-20 13:28:05 -07:00
Tejun Heo	f2e005aaff	workqueue: fix mayday_mask handling on UP All cpumasks are assumed to have cpu 0 permanently set on UP, so it can't be used to signify whether there's something to be done for the CPU. workqueue was using cpumask to track which CPU requested rescuer assistance and this led rescuer thread to think there always are pending mayday requests on UP, which resulted in infinite busy loops. This patch fixes the problem by introducing mayday_mask_t and associated helpers which wrap cpumask on SMP and emulates its behavior using bitops and unsigned long on UP. Signed-off-by: Tejun Heo <tj@kernel.org> Cc: Rusty Russell <rusty@rustcorp.com.au>	2010-07-20 15:59:09 +02:00
Arnd Bergmann	b444786f1a	tracing: Use generic_file_llseek for debugfs The default for llseek will change to no_llseek, so the tracing debugfs files need to add explicit .llseek assignments. Since we're dealing with regular files from a VFS perspective, use generic_file_llseek. Signed-off-by: Arnd Bergmann <arnd@arndb.de> Cc: Steven Rostedt <rostedt@goodmis.org> Cc: Ingo Molnar <mingo@elte.hu> Cc: John Kacur <jkacur@redhat.com> Cc: Li Zefan <lizf@cn.fujitsu.com> LKML-Reference: <1278538820-1392-10-git-send-email-arnd@arndb.de> Signed-off-by: Frederic Weisbecker <fweisbec@gmail.com>	2010-07-20 14:31:24 +02:00
Frederic Weisbecker	eb7beb5c09	tracing: Remove special traces Special traces type was only used by sysprof. Lets remove it now that sysprof ftrace plugin has been dropped. Signed-off-by: Frederic Weisbecker <fweisbec@gmail.com> Acked-by: Soeren Sandmann <sandmann@daimi.au.dk> Cc: Peter Zijlstra <a.p.zijlstra@chello.nl> Cc: Ingo Molnar <mingo@elte.hu> Cc: Steven Rostedt <rostedt@goodmis.org> Cc: Li Zefan <lizf@cn.fujitsu.com>	2010-07-20 14:31:07 +02:00
Frederic Weisbecker	f376bf5ffb	tracing: Remove sysprof ftrace plugin The sysprof ftrace plugin doesn't seem to be seriously used somewhere. There is a branch in the sysprof tree that makes an interface to it, but the real sysprof tool uses either its own module or perf events. Drop the sysprof ftrace plugin then, as it's mostly useless. Signed-off-by: Frederic Weisbecker <fweisbec@gmail.com> Acked-by: Soeren Sandmann <sandmann@daimi.au.dk> Cc: Peter Zijlstra <a.p.zijlstra@chello.nl> Cc: Ingo Molnar <mingo@elte.hu> Cc: Steven Rostedt <rostedt@goodmis.org> Cc: Li Zefan <lizf@cn.fujitsu.com>	2010-07-20 14:29:46 +02:00
Tejun Heo	931ac77ef6	workqueue: fix build problem on !CONFIG_SMP Commit `f3421797` (workqueue: implement unbound workqueue) incorrectly tested CONFIG_SMP as part of a C expression in alloc/free_cwqs(). As CONFIG_SMP is not defined in UP, this breaks build. Fix it by using Found during linux-next build test. Signed-off-by: Tejun Heo <tj@kernel.org> Reported-by: Stephen Rothwell <sfr@canb.auug.org.au>	2010-07-20 11:15:14 +02:00
Catalin Marinas	9078370c0d	kmemleak: Add support for NO_BOOTMEM configurations With commits `08677214` and `59be5a8e`, alloc_bootmem()/free_bootmem() and friends use the early_res functions for memory management when NO_BOOTMEM is enabled. This patch adds the kmemleak calls in the corresponding code paths for bootmem allocations. Signed-off-by: Catalin Marinas <catalin.marinas@arm.com> Acked-by: Pekka Enberg <penberg@cs.helsinki.fi> Acked-by: Yinghai Lu <yinghai@kernel.org> Cc: H. Peter Anvin <hpa@zytor.com> Cc: stable@kernel.org	2010-07-19 11:54:15 +01:00
Pavel Machek	a2531293db	update email address pavel@suse.cz no longer works, replace it with working address. Signed-off-by: Pavel Machek <pavel@ucw.cz> Signed-off-by: Jiri Kosina <jkosina@suse.cz>	2010-07-19 10:56:54 +02:00
Dan Kruchinin	5e017dc3f8	padata: Added sysfs primitives to padata subsystem Added sysfs primitives to padata subsystem. Now an API user may embed the kobject that each padata instance contains into any sysfs hierarchy. For now the padata sysfs interface provides only two objects: serial_cpumask [RW] - cpumask for serial workers; parallel_cpumask [RW] - cpumask for parallel workers. Signed-off-by: Dan Kruchinin <dkruchinin@acm.org> Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>	2010-07-19 13:50:19 +08:00
Dan Kruchinin	e15bacbebb	padata: Make two separate cpumasks The aim of this patch is to make two separate cpumasks for padata parallel and serial workers respectively. It allows the user to make more fine-grained and sophisticated configurations of the padata framework. For example, the user may bind parallel and serial workers to non-intersecting CPU groups to gain better performance. Also, each padata instance now has a notifier chain for its cpumasks. If the parallel mask, the serial mask, or both are changed, all interested subsystems will be notified. This is especially useful if a padata user selects the callback CPU according to the serial cpumask. Signed-off-by: Dan Kruchinin <dkruchinin@acm.org> Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>	2010-07-19 13:50:19 +08:00
Rafael J. Wysocki	ce4410116c	PM / Suspend: Fix ordering of calls in suspend error paths The ACPI suspend code calls suspend_nvs_free() at a wrong place, which may lead to a memory leak if there's an error executing acpi_pm_prepare(), because acpi_pm_finish() will not be called in that case. However, the root cause of this problem is the apparently confusing ordering of calls in suspend error paths that needs to be fixed. In addition to that, fix a typo in a label name in suspend.c. Signed-off-by: Rafael J. Wysocki <rjw@sisk.pl> Acked-by: Len Brown <len.brown@intel.com>	2010-07-19 02:00:35 +02:00
Rafael J. Wysocki	d074ee023f	PM / Hibernate: Fix snapshot error code path There is an inconsistency between hibernation_platform_enter() and hibernation_snapshot(), because the latter calls hibernation_ops->end() after failing hibernation_ops->begin(), while the former doesn't do that. Make hibernation_snapshot() behave in the same way as hibernation_platform_enter() in that respect. Signed-off-by: Rafael J. Wysocki <rjw@sisk.pl> Acked-by: Len Brown <len.brown@intel.com>	2010-07-19 02:00:35 +02:00
Rafael J. Wysocki	f6f71f1875	PM / Hibernate: Fix hibernation_platform_enter() The hibernation_platform_enter() function calls dpm_suspend_noirq() instead of dpm_resume_noirq() by mistake. Fix this. Signed-off-by: Rafael J. Wysocki <rjw@sisk.pl> Acked-by: Len Brown <len.brown@intel.com>	2010-07-19 02:00:35 +02:00
James Bottomley	82f682514a	pm_qos: Get rid of the allocation in pm_qos_add_request() All current users of pm_qos_add_request() have the ability to supply the memory required by the pm_qos routines, so make them do this and eliminate the kmalloc() with pm_qos_add_request(). This has the double benefit of making the call never fail and allowing it to be called from atomic context. Signed-off-by: James Bottomley <James.Bottomley@suse.de> Signed-off-by: mark gross <markgross@thegnar.org> Signed-off-by: Rafael J. Wysocki <rjw@sisk.pl>	2010-07-19 02:00:34 +02:00
James Bottomley	5f279845f9	pm_qos: Reimplement using plists A lot of the pm_qos extremal value handling is really duplicating what a priority ordered list does, just in a less efficient fashion. Simply redoing the implementation in terms of a plist gets rid of a lot of this junk (although there are several other strange things that could do with tidying up, like pm_qos_request_list has to carry the pm_qos_class with every node, simply because it doesn't get passed in to pm_qos_update_request even though every caller knows full well what parameter it's updating). I think this redo is a win independent of android, so we should do something like this now. Signed-off-by: James Bottomley <James.Bottomley@suse.de> Signed-off-by: mark gross <markgross@thegnar.org> Signed-off-by: Rafael J. Wysocki <rjw@sisk.pl>	2010-07-19 02:00:18 +02:00
Rafael J. Wysocki	c125e96f04	PM: Make it possible to avoid races between wakeup and system sleep One of the arguments during the suspend blockers discussion was that the mainline kernel didn't contain any mechanisms making it possible to avoid races between wakeup and system suspend. Generally, there are two problems in that area. First, if a wakeup event occurs exactly when /sys/power/state is being written to, it may be delivered to user space right before the freezer kicks in, so the user space consumer of the event may not be able to process it before the system is suspended. Second, if a wakeup event occurs after user space has been frozen, it is not generally guaranteed that the ongoing transition of the system into a sleep state will be aborted. To address these issues introduce a new global sysfs attribute, /sys/power/wakeup_count, associated with a running counter of wakeup events and three helper functions, pm_stay_awake(), pm_relax(), and pm_wakeup_event(), that may be used by kernel subsystems to control the behavior of this attribute and to request the PM core to abort system transitions into a sleep state already in progress. The /sys/power/wakeup_count file may be read from or written to by user space. Reads will always succeed (unless interrupted by a signal) and return the current value of the wakeup events counter. Writes, however, will only succeed if the written number is equal to the current value of the wakeup events counter. If a write is successful, it will cause the kernel to save the current value of the wakeup events counter and to abort the subsequent system transition into a sleep state if any wakeup events are reported after the write has returned. [The assumption is that before writing to /sys/power/state user space will first read from /sys/power/wakeup_count. Next, user space consumers of wakeup events will have a chance to acknowledge or veto the upcoming system transition to a sleep state. Finally, if the transition is allowed to proceed, /sys/power/wakeup_count will be written to and if that succeeds, /sys/power/state will be written to as well. Still, if any wakeup events are reported to the PM core by kernel subsystems after that point, the transition will be aborted.] Additionally, put a wakeup events counter into struct dev_pm_info and make these per-device wakeup event counters available via sysfs, so that it's possible to check the activity of various wakeup event sources within the kernel. To illustrate how subsystems can use pm_wakeup_event(), make the low-level PCI runtime PM wakeup-handling code use it. Signed-off-by: Rafael J. Wysocki <rjw@sisk.pl> Acked-by: Jesse Barnes <jbarnes@virtuousgeek.org> Acked-by: Greg Kroah-Hartman <gregkh@suse.de> Acked-by: markgross <markgross@thegnar.org> Reviewed-by: Alan Stern <stern@rowland.harvard.edu>	2010-07-19 01:58:48 +02:00
Cesar Eduardo Barros	9013367339	PM / Hibernate: Fix typos in comments in kernel/power/swap.c There are a few typos in kernel/power/swap.c. Fix them. Signed-off-by: Cesar Eduardo Barros <cesarb@cesarb.net> Acked-by: Pavel Machek <pavel@ucw.cz> Signed-off-by: Rafael J. Wysocki <rjw@sisk.pl>	2010-07-19 01:58:47 +02:00
Pekka Enberg	68c38fc3cb	sched: No need for bootmem special cases As of commit `dcce284` ("mm: Extend gfp masking to the page allocator") and commit `7e85ee0` ("slab,slub: don't enable interrupts during early boot"), the slab allocator makes sure we don't attempt to sleep during boot. Therefore, remove bootmem special cases from the scheduler and use plain GFP_KERNEL instead. Signed-off-by: Pekka Enberg <penberg@cs.helsinki.fi> Cc: Peter Zijlstra <a.p.zijlstra@chello.nl> LKML-Reference: <1279225102-2572-1-git-send-email-penberg@cs.helsinki.fi> Signed-off-by: Ingo Molnar <mingo@elte.hu>	2010-07-17 12:06:22 +02:00
Peter Zijlstra	396e894d28	sched: Revert nohz_ratelimit() for now Norbert reported that nohz_ratelimit() causes his laptop to burn about 4W (40%) extra. For now back out the change and see if we can adjust the power management code to make better decisions. Reported-by: Norbert Preining <preining@logic.at> Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl> Acked-by: Mike Galbraith <efault@gmx.de> Cc: Arjan van de Ven <arjan@infradead.org> LKML-Reference: <new-submission> Signed-off-by: Ingo Molnar <mingo@elte.hu>	2010-07-17 12:05:44 +02:00
Peter Zijlstra	bbc8cb5bae	sched: Reduce update_group_power() calls Currently we update cpu_power() too often, update_group_power() only updates the local group's cpu_power but it gets called for all groups. Furthermore, CPU_NEWLY_IDLE invocations will result in all cpus calling it, even though a slow update of cpu_power is sufficient. Therefore move the update under 'idle != CPU_NEWLY_IDLE && local_group' to reduce superfluous invocations. Reported-by: Venkatesh Pallipadi <venki@google.com> Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl> Acked-by: Suresh Siddha <suresh.b.siddha@intel.com> LKML-Reference: <1278612989.1900.176.camel@laptop> Signed-off-by: Ingo Molnar <mingo@elte.hu>	2010-07-17 12:05:14 +02:00
Suresh Siddha	5343bdb8fd	sched: Update rq->clock for nohz balanced cpus Suresh spotted that we don't update the rq->clock in the nohz load-balancer path. Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl> LKML-Reference: <1278626014.2834.74.camel@sbs-t61.sc.intel.com> Signed-off-by: Ingo Molnar <mingo@elte.hu>	2010-07-17 12:02:08 +02:00
Jiri Slaby	c022a0acad	rlimits: implement prlimit64 syscall This patch adds the code to support the sys_prlimit64 syscall which modifies-and-returns the rlim values of a selected process atomically. The first parameter, pid, being 0 means current process. Unlike the current implementation, it is a generic, architecture-independent interface, so that we needn't handle compat stuff anymore. In the future, after glibc starts to use this, we can deprecate sys_setrlimit and sys_getrlimit and finally clean up the code. It also adds the possibility of changing limits of other processes. We check the user's permissions to do that and if it succeeds, the new limits are propagated online. This is good for large scale applications such as SAP or databases where administrators need to change limits from time to time (e.g. increase core size on crashes) and where it is unacceptable to restart the service. For safety, all rlim users now either use accessors or don't need them, due to either locking or the fact that a process was just forked and nobody else knows about it yet (and thus nobody can read/write its limits), hence it is safe to modify limits now. The limitation is that we currently stay at ulong internal representation. So the rlim64_is_infinity check is used where the value is compared against ULONG_MAX on 32-bit, which is the maximum value there. And since internally the limits are held in struct rlimit, converters used before and after the do_prlimit call in sys_prlimit64 are introduced. Signed-off-by: Jiri Slaby <jslaby@suse.cz>	2010-07-16 09:48:48 +02:00
Jiri Slaby	b95183453a	rlimits: switch more rlimit syscalls to do_prlimit After we added more generic do_prlimit, switch sys_getrlimit to that. Also switch compat handling, so we can get rid of ugly __user casts and avoid setting process' address limit to kernel data and back. Signed-off-by: Jiri Slaby <jslaby@suse.cz>	2010-07-16 09:48:48 +02:00
Jiri Slaby	5b41535aac	rlimits: redo do_setrlimit to more generic do_prlimit It now allows also reading of limits. I.e. all read and writes will later use this function. It takes two parameters, new and old limits which can be both NULL. If new is non-NULL, the value in it is set to rlimits. If old is non-NULL, current rlimits are stored there. If both are non-NULL, old are stored prior to setting the new ones, atomically. (Similar to sigaction.) Signed-off-by: Jiri Slaby <jslaby@suse.cz>	2010-07-16 09:48:48 +02:00
Jiri Slaby	86f162f4c7	rlimits: do security check under task_lock Do security_task_setrlimit under task_lock. Other tasks may change limits under our hands while we are checking limits inside the function. From now on, they can't. Note that all the security work is done under a spinlock here now. Security hooks count with that, they are called from interrupt context (like security_task_kill) and with spinlocks already held (e.g. capable->security_capable). Signed-off-by: Jiri Slaby <jslaby@suse.cz> Acked-by: James Morris <jmorris@namei.org> Cc: Heiko Carstens <heiko.carstens@de.ibm.com>	2010-07-16 09:48:47 +02:00
Jiri Slaby	1c1e618ddd	rlimits: allow setrlimit to non-current tasks Add locking to allow setrlimit accept task parameter other than current. Namely, lock tasklist_lock for read and check whether the task structure has sighand non-null. Do all the signal processing under that lock still held. There are some points: 1) security_task_setrlimit is now called with that lock held. This is not new, many security_* functions are called with this lock held already so it doesn't harm (all this security_* stuff does almost the same). 2) task->sighand->siglock (in update_rlimit_cpu) is nested in tasklist_lock. This dependence is already existing. 3) tsk->alloc_lock is nested in tasklist_lock. This is OK too, already existing dependence. Signed-off-by: Jiri Slaby <jirislaby@gmail.com> Cc: Oleg Nesterov <oleg@redhat.com>	2010-07-16 09:48:47 +02:00
Jiri Slaby	7855c35da7	rlimits: split sys_setrlimit Create do_setrlimit from sys_setrlimit and declare do_setrlimit in the resource header. This is the first phase to have generic do_prlimit which allows to be called from read, write and compat rlimits code. The new do_setrlimit also accepts a task pointer to change the limits of. Currently, it cannot be other than current, but this will change with locking later. Also pass tsk->group_leader to security_task_setrlimit to check whether current is allowed to change rlimits of the process and not its arbitrary thread because it makes more sense given that rlimit are per process and not per-thread. Signed-off-by: Jiri Slaby <jslaby@suse.cz>	2010-07-16 09:48:46 +02:00
Oleg Nesterov	2fb9d2689a	rlimits: make sure ->rlim_max never grows in sys_setrlimit Mostly preparation for Jiri's changes, but probably makes sense anyway. sys_setrlimit() checks new_rlim.rlim_max <= old_rlim->rlim_max, but when it takes task_lock() old_rlim->rlim_max can be already lowered. Move this check under task_lock(). Currently this is not important, we can only race with our sub-thread, this means the application is stupid. But when we change the code to allow the update of !current task's limits, it becomes important to make sure ->rlim_max can be lowered "reliably" even if we race with the application doing sys_setrlimit(). Signed-off-by: Oleg Nesterov <oleg@redhat.com> Signed-off-by: Jiri Slaby <jslaby@suse.cz>	2010-07-16 09:48:46 +02:00
Jiri Slaby	5ab46b345e	rlimits: add task_struct to update_rlimit_cpu Add task_struct as a parameter to update_rlimit_cpu to be able to set rlimit_cpu of different task than current. Signed-off-by: Jiri Slaby <jirislaby@gmail.com> Acked-by: James Morris <jmorris@namei.org>	2010-07-16 09:48:45 +02:00
Jiri Slaby	8fd00b4d70	rlimits: security, add task_struct to setrlimit Add task_struct to task_setrlimit of security_operations to be able to set rlimit of task other than current. Signed-off-by: Jiri Slaby <jirislaby@gmail.com> Acked-by: Eric Paris <eparis@redhat.com> Acked-by: James Morris <jmorris@namei.org>	2010-07-16 09:48:45 +02:00
Frederic Weisbecker	5d550467b9	tracing: Remove ksym tracer The ksym (breakpoint) ftrace plugin has been superseded by perf tools that are much more powerful in using the cpu breakpoints. This tracer doesn't bring any additional features. It has been deprecated for a while now, let's remove it. Signed-off-by: Frederic Weisbecker <fweisbec@gmail.com> Cc: Steven Rostedt <rostedt@goodmis.org> Cc: Prasad <prasad@linux.vnet.ibm.com> Cc: Ingo Molnar <mingo@elte.hu>	2010-07-15 23:59:33 +02:00
Steffen Klassert	5f1a8c1bc7	padata: simplify serialization mechanism We count the number of processed objects on a percpu basis, so we need to go through all the percpu reorder queues to calculate the sequence number of the next object that needs serialization. This patch changes this to count the number of processed objects global. So we can calculate the sequence number and the percpu reorder queue of the next object that needs serialization without searching through the percpu reorder queues. This avoids some accesses to memory of foreign cpus. Signed-off-by: Steffen Klassert <steffen.klassert@secunet.com> Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>	2010-07-14 20:29:30 +08:00
Steffen Klassert	83f619f3c8	padata: make padata_do_parallel to return zero on success To return -EINPROGRESS on success in padata_do_parallel was considered to be odd. This patch changes this to return zero on success. Also the only user of padata, pcrypt, is adapted to convert a return of zero to -EINPROGRESS within the crypto layer. This also removes the pcrypt fallback if padata_do_parallel was called on a not running padata instance as we can't handle it anymore. This fallback was unused, so it's safe to remove it. Signed-off-by: Steffen Klassert <steffen.klassert@secunet.com> Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>	2010-07-14 20:29:29 +08:00
Steffen Klassert	33e5445068	padata: Handle empty padata cpumasks This patch fixes a bug when the padata cpumask does not intersect with the active cpumask. In this case we get a division by zero in padata_alloc_pd and we end up with a useless padata instance. Padata can end up with an empty cpumask for two reasons: 1. A user removed the last cpu that belongs to the padata cpumask and the active cpumask. 2. The last cpu that belongs to the padata cpumask and the active cpumask goes offline. We introduce a function padata_validate_cpumask to check if the padata cpumask does intersect with the active cpumask. If the cpumasks do not intersect we mark the instance as invalid, so it can't be used. We do not allocate the cpumask dependent resources in this case. This fixes the division by zero and keeps the padata instance in a consistent state. It's not possible to trigger this bug by now because the only padata user, pcrypt, always uses the possible cpumask. Reported-by: Dan Kruchinin <dkruchinin@acm.org> Signed-off-by: Steffen Klassert <steffen.klassert@secunet.com> Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>	2010-07-14 20:29:29 +08:00
Steffen Klassert	ee83655512	padata: Block until the instance is unused on stop This patch makes padata_stop block until the padata instance is unused. Also we split padata_stop into a locked and an unlocked version. This is in preparation to be able to change the cpumask after a call to padata_stop. Signed-off-by: Steffen Klassert <steffen.klassert@secunet.com> Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>	2010-07-14 20:29:29 +08:00
Steffen Klassert	4c87917029	padata: Check for valid padata instance on start This patch introduces the PADATA_INVALID flag which is checked on padata start. This will be used to mark a padata instance as invalid, if the padata cpumask does not intersect with the active cpumask. We change padata_start to return an error if PADATA_INVALID is set. Also we adapt the only padata user, pcrypt, to this change. Signed-off-by: Steffen Klassert <steffen.klassert@secunet.com> Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>	2010-07-14 20:29:28 +08:00
Tejun Heo	9f9c23644b	workqueue: fix locking in retry path of maybe_create_worker() maybe_create_worker() mismanaged locking when worker creation fails and it has to retry. Fix locking and simplify lock manipulation. Signed-off-by: Tejun Heo <tj@kernel.org> Reported-by: Yong Zhang <yong.zhang@windriver.com>	2010-07-14 11:31:20 +02:00
Tejun Heo	083b804c4d	async: use workqueue for worker pool Replace private worker pool with system_unbound_wq. Signed-off-by: Tejun Heo <tj@kernel.org> Acked-by: Arjan van de Ven <arjan@infradead.org>	2010-07-14 11:29:46 +02:00
Uwe Kleine-König	698f93159a	fix comment/printk typos concerning "already" Signed-off-by: Uwe Kleine-König <u.kleine-koenig@pengutronix.de> Signed-off-by: Jiri Kosina <jkosina@suse.cz>	2010-07-11 21:45:40 +02:00
Arnd Bergmann	5e3d20a68f	init: Remove the BKL from startup code I have shown by code review that no driver takes the BKL at init time any more, so whatever the init code was locking against is no longer there and it is now safe to remove the BKL there. Signed-off-by: Arnd Bergmann <arnd@arndb.de> Acked-by: Steven Rostedt <rostedt@goodmis> Signed-off-by: Frederic Weisbecker <fweisbec@gmail.com>	2010-07-09 15:40:32 +02:00
Benjamin Herrenschmidt	5f07aa7524	Merge commit 'paulus-perf/master' into next	2010-07-09 11:25:48 +10:00
Kulikov Vasiliy	eb703f9819	kernel/watchdog: Initialize 'result' Variable on the stack is not initialized to zero, do it explicitly. This bug was found by a compiler warning: kernel/watchdog.c:463: warning: 'result' may be used uninitialized in this function Signed-off-by: Kulikov Vasiliy <segooon@gmail.com> Acked-by: Don Zickus <dzickus@redhat.com> Cc: Frederic Weisbecker <fweisbec@gmail.com> Cc: Peter Zijlstra <a.p.zijlstra@chello.nl> Cc: Arnaldo Carvalho de Melo <acme@redhat.com> Cc: Frederic Weisbecker <fweisbec@gmail.com> Cc: Paul Mackerras <paulus@samba.org> Cc: Mike Galbraith <efault@gmx.de> Cc: Steven Rostedt <rostedt@goodmis.org> LKML-Reference: <1278316854-28442-1-git-send-email-segooon@gmail.com> Signed-off-by: Ingo Molnar <mingo@elte.hu>	2010-07-07 08:46:42 +02:00
Ingo Molnar	8bd0e1be25	Merge branch 'perf/core' of git://git.kernel.org/pub/scm/linux/kernel/git/acme/linux-2.6 into perf/core	2010-07-06 09:00:32 +02:00
Masami Hiramatsu	e09c8614b3	tracing/kprobes: Support "string" type Support string type tracing and printing in kprobe-tracer. This allows user to trace string data in kernel including __user data. Note that sometimes __user data may not be accessed if it is paged-out (sorry, but kprobes operation should be done in atomic, we can not wait for page-in). Commiter note: Fixed up conflicts with `b7e2ece`. Cc: Ingo Molnar <mingo@elte.hu> Cc: Paul Mackerras <paulus@samba.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Mike Galbraith <efault@gmx.de> Cc: Frederic Weisbecker <fweisbec@gmail.com> LKML-Reference: <20100519195724.2885.18788.stgit@localhost6.localdomain6> Signed-off-by: Masami Hiramatsu <mhiramat@redhat.com> Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com>	2010-07-05 15:54:45 -03:00
Ingo Molnar	08f8ba0799	Merge commit 'v2.6.35-rc4' into perf/core Merge reason: Pick up the latest perf fixes Signed-off-by: Ingo Molnar <mingo@elte.hu>	2010-07-05 08:30:58 +02:00
Yehuda Sadeh	ff49d74ad3	module: initialize module dynamic debug later We should initialize the module dynamic debug data structures only after determining that the module is not loaded yet. This fixes a bug introduced in 2.6.35-rc2 where, when trying to load a module twice, we also load its dynamic printing data twice, which causes all sorts of nasty issues. Also handle the dynamic debug cleanup later on failure. Signed-off-by: Yehuda Sadeh <yehuda@hq.newdream.net> Signed-off-by: Rusty Russell <rusty@rustcorp.com.au> (removed a #ifdef) Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2010-07-04 20:17:22 -07:00
Linus Torvalds	123f94f22e	Merge branch 'sched-fixes-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip * 'sched-fixes-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip: sched: Cure nr_iowait_cpu() users init: Fix comment init, sched: Fix race between init and kthreadd	2010-07-02 09:52:58 -07:00
Tejun Heo	c7fc77f78f	workqueue: remove WQ_SINGLE_CPU and use WQ_UNBOUND instead WQ_SINGLE_CPU combined with @max_active of 1 is used to achieve full ordering among works queued to a workqueue. The same can be achieved using WQ_UNBOUND as unbound workqueues always use the gcwq for WORK_CPU_UNBOUND. As @max_active is always one and benefits from cpu locality isn't accessible anyway, serving them with unbound workqueues should be fine. Drop WQ_SINGLE_CPU support and use WQ_UNBOUND instead. Note that most single thread workqueue users will be converted to use multithread or non-reentrant instead and only the ones which require strict ordering will keep using WQ_UNBOUND + @max_active of 1. Signed-off-by: Tejun Heo <tj@kernel.org>	2010-07-02 11:00:08 +02:00
Tejun Heo	f34217977d	workqueue: implement unbound workqueue This patch implements unbound workqueue which can be specified with WQ_UNBOUND flag on creation. An unbound workqueue has the following properties. * It uses a dedicated gcwq with a pseudo CPU number WORK_CPU_UNBOUND. This gcwq is always online and disassociated. * Workers are not bound to any CPU and not concurrency managed. Works are dispatched to workers as soon as possible and the only applied limitation is @max_active. IOW, all unbound workqueues are implicitly high priority. Unbound workqueues can be used as a simple execution context provider. Contexts unbound to any cpu are served as soon as possible. Signed-off-by: Tejun Heo <tj@kernel.org> Cc: Arjan van de Ven <arjan@linux.intel.com> Cc: David Howells <dhowells@redhat.com>	2010-07-02 11:00:02 +02:00
Tejun Heo	bdbc5dd7de	workqueue: prepare for WQ_UNBOUND implementation In preparation of WQ_UNBOUND addition, make the following changes. * Add WORK_CPU_* constants for pseudo cpu id numbers used (currently only WORK_CPU_NONE) and use them instead of NR_CPUS. This is to allow another pseudo cpu id for unbound cpu. * Reorder WQ_* flags. * Make workqueue_struct->cpu_wq a union which contains a percpu pointer, regular pointer and an unsigned long value and use kzalloc/kfree() in UP allocation path. This will be used to implement unbound workqueues which will use only one cwq on SMPs. * Move alloc_cwqs() allocation after initialization of wq fields, so that alloc_cwqs() has access to wq->flags. * Trivial relocation of wq local variables in freeze functions. These changes don't cause any functional change. Signed-off-by: Tejun Heo <tj@kernel.org>	2010-07-02 10:59:57 +02:00
Tejun Heo	d313dd85ad	workqueue: fix worker management invocation without pending works When there's no pending work to do, worker_thread() goes back to sleep after waking up without checking whether worker management is necessary. This means that idle worker exit requests can be ignored if the gcwq stays empty. Fix it by making worker_thread() always check whether worker management is necessary before going to sleep. Signed-off-by: Tejun Heo <tj@kernel.org>	2010-07-02 10:03:51 +02:00
Tejun Heo	a1e453d279	workqueue: fix incorrect cpu number BUG_ON() in get_work_gcwq() get_work_gcwq() was incorrectly triggering BUG_ON() if cpu number is equal to or higher than num_possible_cpus() instead of nr_cpu_ids. Fix it. Signed-off-by: Tejun Heo <tj@kernel.org>	2010-07-02 10:03:51 +02:00
Tejun Heo	4ce48b37bf	workqueue: fix race condition in flush_workqueue() When one flusher is cascading to the next flusher, it first sets wq->first_flusher to the next one and sets up the next flush cycle. If there's nothing to do for the next cycle, it clears wq->first_flusher and proceeds to the one after that. If the woken up flusher checks wq->first_flusher before it gets cleared, it will incorrectly assume the role of the first flusher, which triggers the BUG_ON() sanity check. Fix it by checking wq->first_flusher again after grabbing the mutex. Signed-off-by: Tejun Heo <tj@kernel.org>	2010-07-02 10:03:51 +02:00
Tejun Heo	cb44476699	workqueue: use worker_set/clr_flags() only from worker itself worker_set/clr_flags() assume that if none of NOT_RUNNING flags is set the worker must be contributing to nr_running which is only true if the worker is actually running. As when called from self, it is guaranteed that the worker is running, those functions can be safely used from the worker itself and they aren't necessary from other places anyway. Make the following changes to fix the bug. * Make worker_set/clr_flags() whine if not called from self. * Convert all places which called those functions from other tasks to manipulate flags directly. * Make trustee_thread() directly clear nr_running after setting WORKER_ROGUE on all workers. This is the only place where nr_running manipulation is necessary outside of workers themselves. * While at it, add sanity check for nr_running in worker_enter_idle(). Signed-off-by: Tejun Heo <tj@kernel.org>	2010-07-02 10:03:50 +02:00
Peter Zijlstra	8c215bd389	sched: Cure nr_iowait_cpu() users Commit `0224cf4c5e` (sched: Intoduce get_cpu_iowait_time_us()) broke things by not making sure preemption was indeed disabled by the callers of nr_iowait_cpu() which took the iowait value of the current cpu. This resulted in a heap of preempt warnings. Cure this by making nr_iowait_cpu() take a cpu number and fix up the callers to pass in the right number. Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl> Cc: Arjan van de Ven <arjan@infradead.org> Cc: Sergey Senozhatsky <sergey.senozhatsky@gmail.com> Cc: Rafael J. Wysocki <rjw@sisk.pl> Cc: Maxim Levitsky <maximlevitsky@gmail.com> Cc: Len Brown <len.brown@intel.com> Cc: Pavel Machek <pavel@ucw.cz> Cc: Jiri Slaby <jslaby@suse.cz> Cc: linux-pm@lists.linux-foundation.org LKML-Reference: <1277968037.1868.120.camel@laptop> Signed-off-by: Ingo Molnar <mingo@elte.hu>	2010-07-01 09:39:48 +02:00
Ingo Molnar	0a54cec0c2	Merge branch 'linus' into core/rcu Conflicts: fs/fs-writeback.c Merge reason: Resolve the conflict Note, i picked the version from Linus's tree, which effectively reverts the fs-writeback.c bits of: b97181f: fs: remove all rcu head initializations, except on_stack initializations As the upstream changes to this file changed this code heavily and the first attempt to resolve the conflict resulted in a non-booting kernel. It's safer to re-try this portion of the commit cleanly. Signed-off-by: Ingo Molnar <mingo@elte.hu>	2010-07-01 09:31:25 +02:00
Michal Hocko	7a0ea09ad5	futex: futex_find_get_task remove credentails check futex_find_get_task is currently used (through lookup_pi_state) from two contexts, futex_requeue and futex_lock_pi_atomic. Neither path appears to need the credentials check, though. Different (e)uids shouldn't matter at all because the only thing that is important for a shared futex is the accessibility of the shared memory. The credentials check results in a glibc assert failure or a process hang (if glibc is compiled without assert support) for a shared robust pthread mutex with priority inheritance if a process tries to lock an already held lock owned by a process with a different euid: pthread_mutex_lock.c:312: __pthread_mutex_lock_full: Assertion `(-(e)) != 3 \|\| !robust' failed. The problem is that futex_lock_pi_atomic, which is called when we try to lock an already held lock, checks the current holder (tid is stored in the futex value) to get the PI state. It uses lookup_pi_state which in turn gets the task struct from futex_find_get_task. ESRCH is returned either when the task is not found or if the credentials check fails. futex_lock_pi_atomic simply returns if it gets ESRCH. glibc code, however, doesn't expect a robust lock to return with ESRCH because it should get either success or owner died. Signed-off-by: Michal Hocko <mhocko@suse.cz> Acked-by: Darren Hart <dvhltc@us.ibm.com> Cc: Ingo Molnar <mingo@elte.hu> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Nick Piggin <npiggin@suse.de> Cc: Alexey Kuznetsov <kuznet@ms2.inr.ac.ru> Cc: Peter Zijlstra <peterz@infradead.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2010-06-30 15:43:44 -07:00
Pavan Naregundi	e05bd3367b	kexec: fix Oops in crash_shrink_memory() When crashkernel is not enabled, "echo 0 > /sys/kernel/kexec_crash_size" OOPSes the kernel in crash_shrink_memory. This happens when crash_shrink_memory tries to release the 'crashk_res' resource which are not reserved. Also value of "/sys/kernel/kexec_crash_size" shows as 1, which should be 0. This patch fixes the OOPS in crash_shrink_memory and shows "/sys/kernel/kexec_crash_size" as 0 when crash kernel memory is not reserved. Signed-off-by: Pavan Naregundi <pavan@linux.vnet.ibm.com> Reviewed-by: WANG Cong <xiyou.wangcong@gmail.com> Cc: Simon Horman <horms@verge.net.au> Cc: Vivek Goyal <vgoyal@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2010-06-29 15:29:31 -07:00
Michael Neuling	2ec57d448b	sched: Fix spelling of sibling No logic changes, only spelling. Signed-off-by: Michael Neuling <mikey@neuling.org> Cc: linuxppc-dev@ozlabs.org Cc: David Howells <dhowells@redhat.com> Cc: Peter Zijlstra <peterz@infradead.org> LKML-Reference: <15249.1277776921@neuling.org> Signed-off-by: Ingo Molnar <mingo@elte.hu>	2010-06-29 10:44:29 +02:00
Tejun Heo	fb0e7beb5c	workqueue: implement cpu intensive workqueue This patch implements cpu intensive workqueue which can be specified with WQ_CPU_INTENSIVE flag on creation. Works queued to a cpu intensive workqueue don't participate in concurrency management. IOW, they don't contribute to gcwq->nr_running and thus don't delay execution of other works. Note that although cpu intensive works won't delay other works, they can be delayed by other works. Combine with WQ_HIGHPRI to avoid being delayed by other works too. As the name suggests this is useful when using workqueue for cpu intensive works. Workers executing cpu intensive works are not considered for workqueue concurrency management and left for the scheduler to manage. Signed-off-by: Tejun Heo <tj@kernel.org> Cc: Andrew Morton <akpm@linux-foundation.org>	2010-06-29 10:07:15 +02:00
Tejun Heo	649027d73a	workqueue: implement high priority workqueue This patch implements high priority workqueue which can be specified with WQ_HIGHPRI flag on creation. A high priority workqueue has the following properties. * A work queued to it is queued at the head of the worklist of the respective gcwq after other highpri works, while normal works are always appended at the end. * As long as there are highpri works on gcwq->worklist, [__]need_more_worker() remains %true and process_one_work() wakes up another worker before it start executing a work. The above two properties guarantee that works queued to high priority workqueues are dispatched to workers and start execution as soon as possible regardless of the state of other works. Signed-off-by: Tejun Heo <tj@kernel.org> Cc: Andi Kleen <andi@firstfloor.org> Cc: Andrew Morton <akpm@linux-foundation.org>	2010-06-29 10:07:14 +02:00
Tejun Heo	dcd989cb73	workqueue: implement several utility APIs Implement the following utility APIs. workqueue_set_max_active() : adjust max_active of a wq workqueue_congested() : test whether a wq is contested work_cpu() : determine the last / current cpu of a work work_busy() : query whether a work is busy * Anton Blanchard fixed missing ret initialization in work_busy(). Signed-off-by: Tejun Heo <tj@kernel.org> Cc: Anton Blanchard <anton@samba.org>	2010-06-29 10:07:14 +02:00
Tejun Heo	d320c03830	workqueue: s/__create_workqueue()/alloc_workqueue()/, and add system workqueues This patch makes changes to make new workqueue features available to its users. * Now that workqueue is more featureful, there should be a public workqueue creation function which takes parameters to control them. Rename __create_workqueue() to alloc_workqueue() and make 0 max_active mean WQ_DFL_ACTIVE. In the long run, all create_workqueue_() will be converted over to alloc_workqueue(). To further unify access interface, rename keventd_wq to system_wq and export it. * Add system_long_wq and system_nrt_wq. The former is to host long running works separately (so that flush_scheduled_work() doesn't take so long) and the latter guarantees any queued work item is never executed in parallel by multiple CPUs. These will be used by future patches to update workqueue users. Signed-off-by: Tejun Heo <tj@kernel.org>	2010-06-29 10:07:14 +02:00
Tejun Heo	b71ab8c202	workqueue: increase max_active of keventd and kill current_is_keventd() Define WQ_MAX_ACTIVE and create keventd with max_active set to half of it which means that keventd now can process up to WQ_MAX_ACTIVE / 2 - 1 works concurrently. Unless some combination can result in a dependency loop longer than max_active, deadlock won't happen and thus it's unnecessary to check whether current_is_keventd() before trying to schedule a work. Kill current_is_keventd(). (Lockdep annotations are broken. We need lock_map_acquire_read_norecurse()) Signed-off-by: Tejun Heo <tj@kernel.org> Cc: Ingo Molnar <mingo@elte.hu> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Christoph Lameter <cl@linux-foundation.org> Cc: Tony Luck <tony.luck@intel.com> Cc: Andi Kleen <ak@linux.intel.com> Cc: Oleg Nesterov <oleg@redhat.com>	2010-06-29 10:07:14 +02:00
Tejun Heo	e22bee782b	workqueue: implement concurrency managed dynamic worker pool Instead of creating a worker for each cwq and putting it into the shared pool, manage per-cpu workers dynamically. Works aren't supposed to be cpu cycle hogs and maintaining just enough concurrency to prevent work processing from stalling due to lack of processing context is optimal. gcwq keeps the number of concurrent active workers to minimum but no less. As long as there's one or more running workers on the cpu, no new worker is scheduled so that works can be processed in batch as much as possible but when the last running worker blocks, gcwq immediately schedules new worker so that the cpu doesn't sit idle while there are works to be processed. gcwq always keeps at least single idle worker around. When a new worker is necessary and the worker is the last idle one, the worker assumes the role of "manager" and manages the worker pool - ie. creates another worker. Forward-progress is guaranteed by having dedicated rescue workers for workqueues which may be necessary while creating a new worker. When the manager is having problem creating a new worker, mayday timer activates and rescue workers are summoned to the cpu and execute works which might be necessary to create new workers. Trustee is expanded to serve the role of manager while a CPU is being taken down and stays down. As no new works are supposed to be queued on a dead cpu, it just needs to drain all the existing ones. Trustee continues to try to create new workers and summon rescuers as long as there are pending works. If the CPU is brought back up while the trustee is still trying to drain the gcwq from the previous offlining, the trustee will kill all idles ones and tell workers which are still busy to rebind to the cpu, and pass control over to gcwq which assumes the manager role as necessary. Concurrency managed worker pool reduces the number of workers drastically. Only workers which are necessary to keep the processing going are created and kept. Also, it reduces cache footprint by avoiding unnecessarily switching contexts between different workers. Please note that this patch does not increase max_active of any workqueue. All workqueues can still only process one work per cpu. Signed-off-by: Tejun Heo <tj@kernel.org>	2010-06-29 10:07:14 +02:00
Tejun Heo	d302f01782	workqueue: implement worker_{set\|clr}_flags() Implement worker_{set\|clr}_flags() to manipulate worker flags. These are currently simple wrappers but logics to track the current worker state and the current level of concurrency will be added. Signed-off-by: Tejun Heo <tj@kernel.org>	2010-06-29 10:07:13 +02:00
Tejun Heo	7e11629d0e	workqueue: use shared worklist and pool all workers per cpu Use gcwq->worklist instead of cwq->worklist and break the strict association between a cwq and its worker. All works queued on a cpu are queued on gcwq->worklist and processed by any available worker on the gcwq. As there no longer is strict association between a cwq and its worker, whether a work is executing can now only be determined by calling [__]find_worker_executing_work(). After this change, the only association between a cwq and its worker is that a cwq puts a worker into shared worker pool on creation and kills it on destruction. As all workqueues are still limited to max_active of one, this means that there are always at least as many workers as active works and thus there's no danger for deadlock. The break of strong association between cwqs and workers requires somewhat clumsy changes to current_is_keventd() and destroy_workqueue(). Dynamic worker pool management will remove both clumsy changes. current_is_keventd() won't be necessary at all as the only reason it exists is to avoid queueing a work from a work which will be allowed just fine. The clumsy part of destroy_workqueue() is added because a worker can only be destroyed while idle and there's no guarantee a worker is idle when its wq is going down. With dynamic pool management, workers are not associated with workqueues at all and only idle ones will be submitted to destroy_workqueue() so the code won't be necessary anymore. Signed-off-by: Tejun Heo <tj@kernel.org>	2010-06-29 10:07:13 +02:00
Tejun Heo	18aa9effad	workqueue: implement WQ_NON_REENTRANT With gcwq managing all the workers and work->data pointing to the last gcwq it was on, non-reentrance can be easily implemented by checking whether the work is still running on the previous gcwq on queueing. Implement it. Signed-off-by: Tejun Heo <tj@kernel.org>	2010-06-29 10:07:13 +02:00
Tejun Heo	7a22ad757e	workqueue: carry cpu number in work data once execution starts To implement non-reentrant workqueue, the last gcwq a work was executed on must be reliably obtainable as long as the work structure is valid even if the previous workqueue has been destroyed. To achieve this, work->data will be overloaded to carry the last cpu number once execution starts so that the previous gcwq can be located reliably. This means that cwq can't be obtained from work after execution starts but only gcwq. Implement set_work_{cwq\|cpu}(), get_work_[g]cwq() and clear_work_data() to set work data to the cpu number when starting execution, access the overloaded work data and clear it after cancellation. queue_delayed_work_on() is updated to preserve the last cpu while in-flight in timer and other callers which depended on getting cwq from work after execution starts are converted to depend on gcwq instead. * Anton Blanchard fixed compile error on powerpc due to missing linux/threads.h include. Signed-off-by: Tejun Heo <tj@kernel.org> Cc: Anton Blanchard <anton@samba.org>	2010-06-29 10:07:13 +02:00
Tejun Heo	8cca0eea39	workqueue: add find_worker_executing_work() and track current_cwq Now that all the workers are tracked by gcwq, we can find which worker is executing a work from gcwq. Implement find_worker_executing_work() and make worker track its current_cwq so that we can find things the other way around. This will be used to implement non-reentrant wqs. Signed-off-by: Tejun Heo <tj@kernel.org>	2010-06-29 10:07:13 +02:00
Tejun Heo	502ca9d819	workqueue: make single thread workqueue shared worker pool friendly Reimplement st (single thread) workqueue so that it's friendly to shared worker pool. It was originally implemented by confining st workqueues to use cwq of a fixed cpu and always having a worker for the cpu. This implementation isn't very friendly to shared worker pool and suboptimal in that it ends up crossing cpu boundaries often. Reimplement st workqueue using dynamic single cpu binding and cwq->limit. WQ_SINGLE_THREAD is replaced with WQ_SINGLE_CPU. In a single cpu workqueue, at most single cwq is bound to the wq at any given time. Arbitration is done using atomic accesses to wq->single_cpu when queueing a work. Once bound, the binding stays till the workqueue is drained. Note that the binding is never broken while a workqueue is frozen. This is because idle cwqs may have works waiting in delayed_works queue while frozen. On thaw, the cwq is restarted if there are any delayed works or unbound otherwise. When combined with max_active limit of 1, single cpu workqueue has exactly the same execution properties as the original single thread workqueue while allowing sharing of per-cpu workers. Signed-off-by: Tejun Heo <tj@kernel.org>	2010-06-29 10:07:13 +02:00
Tejun Heo	db7bccf45c	workqueue: reimplement CPU hotplugging support using trustee Reimplement CPU hotplugging support using trustee thread. On CPU down, a trustee thread is created and each step of CPU down is executed by the trustee and workqueue_cpu_callback() simply drives and waits for trustee state transitions. CPU down operation no longer waits for works to be drained but trustee sticks around till all pending works have been completed. If CPU is brought back up while works are still draining, workqueue_cpu_callback() tells trustee to step down and tell workers to rebind to the cpu. As it's difficult to tell whether cwqs are empty if it's freezing or frozen, trustee doesn't consider draining to be complete while a gcwq is freezing or frozen (tracked by new GCWQ_FREEZING flag). Also, workers which get unbound from their cpu are marked with WORKER_ROGUE. Trustee based implementation doesn't bring any new feature at this point but it will be used to manage worker pool when dynamic shared worker pool is implemented. Signed-off-by: Tejun Heo <tj@kernel.org>	2010-06-29 10:07:12 +02:00
Tejun Heo	c8e55f3602	workqueue: implement worker states Implement worker states. After created, a worker is STARTED. While a worker isn't processing a work, it's IDLE and chained on gcwq->idle_list. While processing a work, a worker is BUSY and chained on gcwq->busy_hash. Also, gcwq now counts the number of all workers and idle ones. worker_thread() is restructured to reflect state transitions. cwq->more_work is removed and waking up a worker makes it check for events. A worker is killed by setting DIE flag while it's IDLE and waking it up. This gives gcwq better visibility of what's going on and allows it to find out whether a work is executing quickly which is necessary to have multiple workers processing the same cwq. Signed-off-by: Tejun Heo <tj@kernel.org>	2010-06-29 10:07:12 +02:00
Tejun Heo	8b03ae3cde	workqueue: introduce global cwq and unify cwq locks There is one gcwq (global cwq) per cpu and all cwqs on a cpu point to it. A gcwq contains a lock to be used by all cwqs on the cpu and an ida to give IDs to workers belonging to the cpu. This patch introduces gcwq, moves worker_ida into gcwq and makes all cwqs on the same cpu use the cpu's gcwq->lock instead of separate locks. gcwq->ida is now protected by gcwq->lock too. Signed-off-by: Tejun Heo <tj@kernel.org>	2010-06-29 10:07:12 +02:00
Tejun Heo	a0a1a5fd4f	workqueue: reimplement workqueue freeze using max_active Currently, workqueue freezing is implemented by marking the worker freezeable and calling try_to_freeze() from dispatch loop. Reimplement it using cwq->limit so that the workqueue is frozen instead of the worker. * workqueue_struct->saved_max_active is added which stores the specified max_active on initialization. * On freeze, all cwq->max_active's are quenched to zero. Freezing is complete when nr_active on all cwqs reach zero. * On thaw, all cwq->max_active's are restored to wq->saved_max_active and the worklist is repopulated. This new implementation allows having single shared pool of workers per cpu. Signed-off-by: Tejun Heo <tj@kernel.org>	2010-06-29 10:07:12 +02:00
Tejun Heo	1e19ffc63d	workqueue: implement per-cwq active work limit Add cwq->nr_active, cwq->max_active and cwq->delayed_work. nr_active counts the number of active works per cwq. A work is active if it's flushable (colored) and is on cwq's worklist. If nr_active reaches max_active, new works are queued on cwq->delayed_work and activated later as works on the cwq complete and decrement nr_active. cwq->max_active can be specified via the new @max_active parameter to __create_workqueue() and is set to 1 for all workqueues for now. As each cwq has only single worker now, this double queueing doesn't cause any behavior difference visible to its users. This will be used to reimplement freeze/thaw and implement shared worker pool. Signed-off-by: Tejun Heo <tj@kernel.org>	2010-06-29 10:07:12 +02:00
Tejun Heo	affee4b294	workqueue: reimplement work flushing using linked works A work is linked to the next one by having WORK_STRUCT_LINKED bit set and these links can be chained. When a linked work is dispatched to a worker, all linked works are dispatched to the worker's newly added ->scheduled queue and processed back-to-back. Currently, as there's only single worker per cwq, having linked works doesn't make any visible behavior difference. This change is to prepare for multiple shared workers per cpu. Signed-off-by: Tejun Heo <tj@kernel.org>	2010-06-29 10:07:12 +02:00
Tejun Heo	c34056a3fd	workqueue: introduce worker Separate out worker thread related information to struct worker from struct cpu_workqueue_struct and implement helper functions to deal with the new struct worker. The only change which is visible outside is that now workqueue worker are all named "kworker/CPUID:WORKERID" where WORKERID is allocated from per-cpu ida. This is in preparation of concurrency managed workqueue where shared multiple workers would be available per cpu. Signed-off-by: Tejun Heo <tj@kernel.org>	2010-06-29 10:07:11 +02:00
Tejun Heo	73f53c4aa7	workqueue: reimplement workqueue flushing using color coded works Reimplement workqueue flushing using color coded works. wq has the current work color which is painted on the works being issued via cwqs. Flushing a workqueue is achieved by advancing the current work colors of cwqs and waiting for all the works which have any of the previous colors to drain. Currently there are 16 possible colors, one is reserved for no color and 15 colors are useable allowing 14 concurrent flushes. When color space gets full, flush attempts are batched up and processed together when color frees up, so even with many concurrent flushers, the new implementation won't build up huge queue of flushers which has to be processed one after another. Only works which are queued via __queue_work() are colored. Works which are directly put on queue using insert_work() use NO_COLOR and don't participate in workqueue flushing. Currently only works used for work-specific flush fall in this category. This new implementation leaves only cleanup_workqueue_thread() as the user of flush_cpu_workqueue(). Just make its users use flush_workqueue() and kthread_stop() directly and kill cleanup_workqueue_thread(). As workqueue flushing doesn't use barrier request anymore, the comment describing the complex synchronization around it in cleanup_workqueue_thread() is removed together with the function. This new implementation is to allow having and sharing multiple workers per cpu. Please note that one more bit is reserved for a future work flag by this patch. This is to avoid shifting bits and updating comments later. Signed-off-by: Tejun Heo <tj@kernel.org>	2010-06-29 10:07:11 +02:00

10272 Commits