linux

Commit Graph

Author	SHA1	Message	Date
Suresh Siddha	1a8880a142	x86, apic: Make apic drivers static Apic probe now looks at the apic drivers listed in the .apicdrivers section. Remove apic_probe[] and make each apic driver static. Signed-off-by: Suresh Siddha <suresh.b.siddha@intel.com> Tested-by: Cyrill Gorcunov <gorcunov@openvz.org> Cc: steiner@sgi.com Cc: gorcunov@openvz.org Cc: yinghai@kernel.org Link: http://lkml.kernel.org/r/20110521005526.341718626@sbsiddha-MOBL3.sc.intel.com Signed-off-by: Ingo Molnar <mingo@elte.hu>	2011-05-22 11:48:04 +02:00
Suresh Siddha	69c252ffce	x86, apic: Clean up bigsmp apic selection code Make generic_bigsmp_probe() return struct apic *. This will avoid exporting apic_bigsmp, which will be consistent with others. Signed-off-by: Suresh Siddha <suresh.b.siddha@intel.com> Tested-by: Cyrill Gorcunov <gorcunov@openvz.org> Cc: steiner@sgi.com Cc: gorcunov@openvz.org Cc: yinghai@kernel.org Link: http://lkml.kernel.org/r/20110521005526.252703851@sbsiddha-MOBL3.sc.intel.com Signed-off-by: Ingo Molnar <mingo@elte.hu>	2011-05-22 11:48:03 +02:00
Suresh Siddha	107e0e0cd8	x86, apic: Introduce .apicdrivers section to find the list of apic drivers This will pave the way for each apic driver to be self-contained and eliminate the need for apic_probe[]. Order in which apic drivers are listed in the .apicdrivers section is important, as this determines the apic probe order. And this is enforced by the ordering of apic driver files in the Makefile and the macros apic_driver()/apic_drivers(). Signed-off-by: Suresh Siddha <suresh.b.siddha@intel.com> Tested-by: Cyrill Gorcunov <gorcunov@openvz.org> Cc: steiner@sgi.com Cc: gorcunov@openvz.org Cc: yinghai@kernel.org Link: http://lkml.kernel.org/r/20110521005526.068775085@sbsiddha-MOBL3.sc.intel.com Signed-off-by: Ingo Molnar <mingo@elte.hu>	2011-05-22 11:48:03 +02:00
Linus Torvalds	06f4e926d2	Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net-next-2.6 * git://git.kernel.org/pub/scm/linux/kernel/git/davem/net-next-2.6: (1446 commits) macvlan: fix panic if lowerdev in a bond tg3: Add braces around 5906 workaround. tg3: Fix NETIF_F_LOOPBACK error macvlan: remove one synchronize_rcu() call networking: NET_CLS_ROUTE4 depends on INET irda: Fix error propagation in ircomm_lmp_connect_response() irda: Kill set but unused variable 'bytes' in irlan_check_command_param() irda: Kill set but unused variable 'clen' in ircomm_connect_indication() rxrpc: Fix set but unused variable 'usage' in rxrpc_get_transport() be2net: Kill set but unused variable 'req' in lancer_fw_download() irda: Kill set but unused vars 'saddr' and 'daddr' in irlan_provider_connect_indication() atl1c: atl1c_resume() is only used when CONFIG_PM_SLEEP is defined. rxrpc: Fix set but unused variable 'usage' in rxrpc_get_peer(). rxrpc: Kill set but unused variable 'local' in rxrpc_UDP_error_handler() rxrpc: Kill set but unused variable 'sp' in rxrpc_process_connection() rxrpc: Kill set but unused variable 'sp' in rxrpc_rotate_tx_window() pkt_sched: Kill set but unused variable 'protocol' in tc_classify() isdn: capi: Use pr_debug() instead of ifdefs. tg3: Update version to 3.119 tg3: Apply rx_discards fix to 5719/5720 ... Fix up trivial conflicts in arch/x86/Kconfig and net/mac80211/agg-tx.c as per Davem.	2011-05-20 13:43:21 -07:00
Linus Torvalds	268bb0ce3e	sanitize <linux/prefetch.h> usage Commit `e66eed651f` ("list: remove prefetching from regular list iterators") removed the include of prefetch.h from list.h, which uncovered several cases that had apparently relied on that rather obscure header file dependency. So this fixes things up a bit, using grep -L linux/prefetch.h $(git grep -l '[^a-z_]prefetchw(' -- '.[ch]') grep -L 'prefetchw(' $(git grep -l 'linux/prefetch.h' -- '.[ch]') to guide us in finding files that either need <linux/prefetch.h> inclusion, or have it despite not needing it. There are more of them around (mostly network drivers), but this gets many core ones. Reported-by: Stephen Rothwell <sfr@canb.auug.org.au> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2011-05-20 12:50:29 -07:00
Cyrill Gorcunov	79deb8e511	x86, x2apic: Move the common bits to x2apic.h To eliminate code duplication. Signed-off-by: Cyrill Gorcunov <gorcunov@openvz.org> Signed-off-by: Suresh Siddha <suresh.b.siddha@intel.com> Cc: steiner@sgi.com Cc: yinghai@kernel.org Link: http://lkml.kernel.org/r/20110519234637.591426753@sbsiddha-MOBL3.sc.intel.com Signed-off-by: Ingo Molnar <mingo@elte.hu>	2011-05-20 13:41:11 +02:00
Suresh Siddha	c040aaeb86	x86, ioapic: Consolidate gsi routing info into 'struct ioapic' Signed-off-by: Suresh Siddha <suresh.b.siddha@intel.com> Cc: daniel.blueman@gmail.com Link: http://lkml.kernel.org/r/20110518233157.994002011@sbsiddha-MOBL3.sc.intel.com Signed-off-by: Ingo Molnar <mingo@elte.hu>	2011-05-20 13:41:01 +02:00
Suresh Siddha	d537143084	x86, ioapic: Consolidate mp_ioapics[] into 'struct ioapic' Signed-off-by: Suresh Siddha <suresh.b.siddha@intel.com> Cc: daniel.blueman@gmail.com Link: http://lkml.kernel.org/r/20110518233157.909013179@sbsiddha-MOBL3.sc.intel.com Signed-off-by: Ingo Molnar <mingo@elte.hu>	2011-05-20 13:40:59 +02:00
Suresh Siddha	b69c6c3bec	x86, ioapic: Add struct ioapic Introduce struct ioapic with nr_registers field. This will pave way for consolidating different MAX_IO_APICS arrays into it. Signed-off-by: Suresh Siddha <suresh.b.siddha@intel.com> Cc: daniel.blueman@gmail.com Link: http://lkml.kernel.org/r/20110518233157.744315519@sbsiddha-MOBL3.sc.intel.com Signed-off-by: Ingo Molnar <mingo@elte.hu>	2011-05-20 13:40:56 +02:00
Suresh Siddha	31dce14a32	x86, ioapic: Use ioapic_saved_data while enabling intr-remapping Code flow for enabling interrupt-remapping was allocating/freeing buffers for saving/restoring io-apic RTE's. ioapic suspend/resume code uses boot time allocated ioapic_saved_data that is a perfect match for reuse here. This will remove the unnecessary allocation/free of the temporary buffers during suspend/resume of interrupt-remapping enabled platforms aswell as paving the way for further code consolidation. Tested-by: Daniel J Blueman <daniel.blueman@gmail.com> Signed-off-by: Suresh Siddha <suresh.b.siddha@intel.com> Link: http://lkml.kernel.org/r/20110518233157.574469296@sbsiddha-MOBL3.sc.intel.com Signed-off-by: Ingo Molnar <mingo@elte.hu>	2011-05-20 13:40:52 +02:00
Linus Torvalds	39ab05c8e0	Merge branch 'driver-core-next' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/driver-core-2.6 * 'driver-core-next' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/driver-core-2.6: (44 commits) debugfs: Silence DEBUG_STRICT_USER_COPY_CHECKS=y warning sysfs: remove "last sysfs file:" line from the oops messages drivers/base/memory.c: fix warning due to "memory hotplug: Speed up add/remove when blocks are larger than PAGES_PER_SECTION" memory hotplug: Speed up add/remove when blocks are larger than PAGES_PER_SECTION SYSFS: Fix erroneous comments for sysfs_update_group(). driver core: remove the driver-model structures from the documentation driver core: Add the device driver-model structures to kerneldoc Translated Documentation/email-clients.txt RAW driver: Remove call to kobject_put(). reboot: disable usermodehelper to prevent fs access efivars: prevent oops on unload when efi is not enabled Allow setting of number of raw devices as a module parameter Introduce CONFIG_GOOGLE_FIRMWARE driver: Google Memory Console driver: Google EFI SMI x86: Better comments for get_bios_ebda() x86: get_bios_ebda_length() misc: fix ti-st build issues params.c: Use new strtobool function to process boolean inputs debugfs: move to new strtobool ... Fix up trivial conflicts in fs/debugfs/file.c due to the same patch being applied twice, and an unrelated cleanup nearby.	2011-05-19 18:24:11 -07:00
Linus Torvalds	5765040ebf	Merge branch 'x86-smep-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip * 'x86-smep-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip: x86, cpu: Enable/disable Supervisor Mode Execution Protection x86, cpu: Add SMEP CPU feature in CR4 x86, cpufeature: Add cpufeature flag for SMEP	2011-05-19 18:10:17 -07:00
Linus Torvalds	08b5d06ec6	Merge branch 'x86-platform-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip * 'x86-platform-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip: x86: Introduce pci_map_biosrom() x86, olpc: Use device tree for platform identification	2011-05-19 18:08:06 -07:00
Linus Torvalds	13588209aa	Merge branch 'x86-mm-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip * 'x86-mm-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip: (50 commits) x86, mm: Allow ZONE_DMA to be configurable x86, NUMA: Trim numa meminfo with max_pfn in a separate loop x86, NUMA: Rename setup_node_bootmem() to setup_node_data() x86, NUMA: Enable emulation on 32bit too x86, NUMA: Enable CONFIG_AMD_NUMA on 32bit too x86, NUMA: Rename amdtopology_64.c to amdtopology.c x86, NUMA: Make numa_init_array() static x86, NUMA: Make 32bit use common NUMA init path x86, NUMA: Initialize and use remap allocator from setup_node_bootmem() x86-32, NUMA: Add @start and @end to init_alloc_remap() x86, NUMA: Remove long 64bit assumption from numa.c x86, NUMA: Enable build of generic NUMA init code on 32bit x86, NUMA: Move NUMA init logic from numa_64.c to numa.c x86-32, NUMA: Update numaq to use new NUMA init protocol x86-32, NUMA: Replace srat_32.c with srat.c x86-32, NUMA: implement temporary NUMA init shims x86, NUMA: Move numa_nodes_parsed to numa.[hc] x86-32, NUMA: Move get_memcfg_numa() into numa_32.c x86, NUMA: make srat.c 32bit safe x86, NUMA: rename srat_64.c to srat.c ...	2011-05-19 18:07:31 -07:00
Linus Torvalds	ac2941f59a	Merge branches 'x86-efi-for-linus', 'x86-gart-for-linus', 'x86-irq-for-linus' and 'x86-mce-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip * 'x86-efi-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip: x86, efi: Ensure that the entirity of a region is mapped x86, efi: Pass a minimal map to SetVirtualAddressMap() x86, efi: Merge contiguous memory regions of the same type and attribute x86, efi: Consolidate EFI nx control x86, efi: Remove virtual-mode SetVirtualAddressMap call * 'x86-gart-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip: x86, gart: Don't enforce GART aperture lower-bound by alignment * 'x86-irq-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip: x86: Don't unmask disabled irqs when migrating them x86: Skip migrating IRQF_PER_CPU irqs in fixup_irqs() * 'x86-mce-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip: x86, mce: Drop the default decoding notifier x86, MCE: Do not taint when handling correctable errors	2011-05-19 18:03:56 -07:00
Linus Torvalds	0162818804	Merge branch 'x86-cpu-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip * 'x86-cpu-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip: x86, cpu: Fix detection of Celeron Covington stepping A1 and B0 Documentation, ABI: Update L3 cache index disable text x86, AMD, cacheinfo: Fix L3 cache index disable checks x86, AMD, cacheinfo: Fix fallout caused by max3 conversion x86, cpu: Change NOP selection for certain Intel CPUs x86, cpu: Clean up and unify the NOP selection infrastructure x86, percpu: Use ASM_NOP4 instead of hardcoding P6_NOP4 x86, cpu: Move AMD Elan Kconfig under "Processor family" Fix up trivial conflicts in alternative handling (commit `dc326fca2b` "x86, cpu: Clean up and unify the NOP selection infrastructure" removed some hacky 5-byte instruction stuff, while commit `d430d3d7e6` "jump label: Introduce static_branch() interface" renamed HAVE_JUMP_LABEL to CONFIG_JUMP_LABEL in the code that went away)	2011-05-19 17:55:12 -07:00
Linus Torvalds	17b141803c	Merge branches 'x86-apic-for-linus', 'x86-asm-for-linus' and 'x86-cleanups-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip * 'x86-apic-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip: x86, apic: Print verbose error interrupt reason on apic=debug * 'x86-asm-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip: x86: Demacro CONFIG_PARAVIRT cpu accessors * 'x86-cleanups-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip: x86: Fix mrst sparse complaints x86: Fix spelling error in the memcpy() source code comment x86, mpparse: Remove unnecessary variable	2011-05-19 17:49:35 -07:00
Linus Torvalds	0f1bdc1815	Merge branch 'timers-clocksource-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip * 'timers-clocksource-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip: clocksource: convert mips to generic i8253 clocksource clocksource: convert x86 to generic i8253 clocksource clocksource: convert footbridge to generic i8253 clocksource clocksource: add common i8253 PIT clocksource blackfin: convert to clocksource_register_hz mips: convert to clocksource_register_hz/khz sparc: convert to clocksource_register_hz/khz alpha: convert to clocksource_register_hz microblaze: convert to clocksource_register_hz/khz ia64: convert to clocksource_register_hz/khz x86: Convert remaining x86 clocksources to clocksource_register_hz/khz Make clocksource name const	2011-05-19 17:44:13 -07:00
Linus Torvalds	df48d8716e	Merge branch 'perf-core-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip * 'perf-core-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip: (107 commits) perf stat: Add more cache-miss percentage printouts perf stat: Add -d -d and -d -d -d options to show more CPU events ftrace/kbuild: Add recordmcount files to force full build ftrace: Add self-tests for multiple function trace users ftrace: Modify ftrace_set_filter/notrace to take ops ftrace: Allow dynamically allocated function tracers ftrace: Implement separate user function filtering ftrace: Free hash with call_rcu_sched() ftrace: Have global_ops store the functions that are to be traced ftrace: Add ops parameter to ftrace_startup/shutdown functions ftrace: Add enabled_functions file ftrace: Use counters to enable functions to trace ftrace: Separate hash allocation and assignment ftrace: Create a global_ops to hold the filter and notrace hashes ftrace: Use hash instead for FTRACE_FL_FILTER ftrace: Replace FTRACE_FL_NOTRACE flag with a hash of ignored functions perf bench, x86: Add alternatives-asm.h wrapper x86, 64-bit: Fix copy_[to/from]_user() checks for the userspace address limit x86, mem: memset_64.S: Optimize memset by enhanced REP MOVSB/STOSB x86, mem: memmove_64.S: Optimize memmove by enhanced REP MOVSB/STOSB ...	2011-05-19 17:36:08 -07:00
Linus Torvalds	cbdad8dc18	Merge branch 'core-iommu-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip * 'core-iommu-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip: x86, gart: Rename pci-gart_64.c to amd_gart_64.c x86/amd-iommu: Use threaded interupt handler arch/x86/kernel/pci-iommu_table.c: Convert sprintf_symbol to %pS x86/amd-iommu: Add support for invalidate_all command x86/amd-iommu: Add extended feature detection x86/amd-iommu: Add ATS enable/disable code x86/amd-iommu: Add flag to indicate IOTLB support x86/amd-iommu: Flush device IOTLB if ATS is enabled x86/amd-iommu: Select PCI_IOV with AMD IOMMU driver PCI: Move ATS declarations in seperate header file dma-debug: print information about leaked entry x86/amd-iommu: Flush all internal TLBs when IOMMUs are enabled x86/amd-iommu: Rename iommu_flush_device x86/amd-iommu: Improve handling of full command buffer x86/amd-iommu: Rename iommu_flush* to domain_flush* x86/amd-iommu: Remove command buffer resetting logic x86/amd-iommu: Cleanup completion-wait handling x86/amd-iommu: Cleanup inv_pages command handling x86/amd-iommu: Move inv-dte command building to own function x86/amd-iommu: Move compl-wait command building to own function	2011-05-19 17:28:58 -07:00
Linus Torvalds	5318991645	Merge branches 'stable/backend.base.v3' and 'stable/gntalloc.v7' of git://git.kernel.org/pub/scm/linux/kernel/git/konrad/xen * 'stable/backend.base.v3' of git://git.kernel.org/pub/scm/linux/kernel/git/konrad/xen: xen/pci: Fix compiler error when CONFIG_XEN_PRIVILEGED_GUEST is not set. xen/p2m: Add EXPORT_SYMBOL_GPL to the M2P override functions. xen/p2m/m2p/gnttab: Support GNTMAP_host_map in the M2P override. xen/irq: The Xen hypervisor cleans up the PIRQs if the other domain forgot. xen/irq: Export 'xen_pirq_from_irq' function. xen/irq: Add support to check if IRQ line is shared with other domains. xen/irq: Check if the PCI device is owned by a domain different than DOMID_SELF. xen/pci: Add xen_[find\|register\|unregister]_device_domain_owner functions. * 'stable/gntalloc.v7' of git://git.kernel.org/pub/scm/linux/kernel/git/konrad/xen: xen/gntdev,gntalloc: Remove unneeded VM flags	2011-05-19 16:14:25 -07:00
Ingo Molnar	01ed58abec	Merge branch 'x86/mem' into perf/core Merge reason: memcpy_64.S changes an assumption perf bench has, so merge this here so we can fix it. Signed-off-by: Ingo Molnar <mingo@elte.hu>	2011-05-18 20:59:30 +02:00
Jiri Olsa	26afb7c661	x86, 64-bit: Fix copy_[to/from]_user() checks for the userspace address limit As reported in BZ #30352: https://bugzilla.kernel.org/show_bug.cgi?id=30352 there's a kernel bug related to reading the last allowed page on x86_64. The _copy_to_user() and _copy_from_user() functions use the following check for address limit: if (buf + size >= limit) fail(); while it should be more permissive: if (buf + size > limit) fail(); That's because the size represents the number of bytes being read/write from/to buf address AND including the buf address. So the copy function will actually never touch the limit address even if "buf + size == limit". Following program fails to use the last page as buffer due to the wrong limit check: #include <sys/mman.h> #include <sys/socket.h> #include <assert.h> #define PAGE_SIZE (4096) #define LAST_PAGE ((void)(0x7fffffffe000)) int main() { int fds[2], err; void ptr = mmap(LAST_PAGE, PAGE_SIZE, PROT_READ \| PROT_WRITE, MAP_ANONYMOUS \| MAP_PRIVATE \| MAP_FIXED, -1, 0); assert(ptr == LAST_PAGE); err = socketpair(AF_LOCAL, SOCK_STREAM, 0, fds); assert(err == 0); err = send(fds[0], ptr, PAGE_SIZE, 0); perror("send"); assert(err == PAGE_SIZE); err = recv(fds[1], ptr, PAGE_SIZE, MSG_WAITALL); perror("recv"); assert(err == PAGE_SIZE); return 0; } The other place checking the addr limit is the access_ok() function, which is working properly. There's just a misleading comment for the __range_not_ok() macro - which this patch fixes as well. The last page of the user-space address range is a guard page and Brian Gerst observed that the guard page itself due to an erratum on K8 cpus (#121 Sequential Execution Across Non-Canonical Boundary Causes Processor Hang). However, the test code is using the last valid page before the guard page. The bug is that the last byte before the guard page can't be read because of the off-by-one error. The guard page is left in place. This bug would normally not show up because the last page is part of the process stack and never accessed via syscalls. Signed-off-by: Jiri Olsa <jolsa@redhat.com> Acked-by: Brian Gerst <brgerst@gmail.com> Acked-by: Linus Torvalds <torvalds@linux-foundation.org> Cc: <stable@kernel.org> Link: http://lkml.kernel.org/r/1305210630-7136-1-git-send-email-jolsa@redhat.com Signed-off-by: Ingo Molnar <mingo@elte.hu>	2011-05-18 12:49:00 +02:00
Linus Torvalds	39dcfa552c	Merge branch 'x86-fixes-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip * 'x86-fixes-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip: x86, AMD: Fix ARAT feature setting again Revert "x86, AMD: Fix APIC timer erratum 400 affecting K8 Rev.A-E processors" x86, apic: Fix spurious error interrupts triggering on all non-boot APs x86, mce, AMD: Fix leaving freed data in a list x86: Fix UV BAU for non-consecutive nasids x86, UV: Fix NMI handler for UV platforms	2011-05-18 03:14:34 -07:00
Fenghua Yu	dc23c0bccf	x86, cpu: Add SMEP CPU feature in CR4 Add support for newly documented SMEP (Supervisor Mode Execution Protection) CPU feature in CR4. Signed-off-by: Fenghua Yu <fenghua.yu@intel.com> LKML-Reference: <1305683069-25394-3-git-send-email-fenghua.yu@intel.com> Signed-off-by: H. Peter Anvin <hpa@linux.intel.com>	2011-05-17 21:06:42 -07:00
Fenghua Yu	d0281a257f	x86, cpufeature: Add cpufeature flag for SMEP Add support for newly documented SMEP (Supervisor Mode Execution Protection) CPU feature flag. SMEP prevents the CPU in kernel-mode to jump to an executable page that has the user flag set in the PTE. This prevents the kernel from executing user-space code accidentally or maliciously, so it for example prevents kernel exploits from jumping to specially prepared user-mode shell code. [ hpa: added better description by Ingo Molnar ] Signed-off-by: Fenghua Yu <fenghua.yu@intel.com> LKML-Reference: <1305683069-25394-2-git-send-email-fenghua.yu@intel.com> Signed-off-by: H. Peter Anvin <hpa@linux.intel.com>	2011-05-17 20:56:59 -07:00
Fenghua Yu	9072d11da1	x86, alternative: Add altinstruction_entry macro Add altinstruction_entry macro to generate .altinstructions section entries from assembly code. This should be less failure-prone than open-coding. Signed-off-by: Fenghua Yu <fenghua.yu@intel.com> Link: http://lkml.kernel.org/r/1305671358-14478-5-git-send-email-fenghua.yu@intel.com Signed-off-by: H. Peter Anvin <hpa@linux.intel.com>	2011-05-17 15:40:25 -07:00
Fenghua Yu	724a92ee45	x86, cpufeature: Add CPU feature bit for enhanced REP MOVSB/STOSB Intel processors are adding enhancements to REP MOVSB/STOSB and the use of REP MOVSB/STOSB for optimal memcpy/memset or similar functions is recommended. Enhancement availability is indicated by CPUID.7.0.EBX[9] (Enhanced REP MOVSB/ STOSB). Signed-off-by: Fenghua Yu <fenghua.yu@intel.com> Link: http://lkml.kernel.org/r/1305671358-14478-2-git-send-email-fenghua.yu@intel.com Signed-off-by: H. Peter Anvin <hpa@linux.intel.com>	2011-05-17 14:56:36 -07:00
Martin Schwidefsky	521ccb5c4a	ftrace/x86: mcount offset calculation Do the mcount offset adjustment in the recordmcount.pl/recordmcount.[ch] at compile time and not in ftrace_call_adjust at run time. Signed-off-by: Martin Schwidefsky <schwidefsky@de.ibm.com> Signed-off-by: Steven Rostedt <rostedt@goodmis.org>	2011-05-16 14:55:57 -04:00
Steven Rostedt	2895cd2ab8	ftrace/x86: Do not trace .discard.text section The section called .discard.text has tracing attached to it and is currently ignored by ftrace. But it does include a call to the mcount stub. Adding a notrace to the code keeps gcc from adding the useless mcount caller to it. Link: http://lkml.kernel.org/r/20110421023739.243651696@goodmis.org Signed-off-by: Steven Rostedt <rostedt@goodmis.org>	2011-05-16 14:47:13 -04:00
Youquan Song	e503f9e4b0	x86, apic: Fix spurious error interrupts triggering on all non-boot APs This patch fixes a bug reported by a customer, who found that many unreasonable error interrupts reported on all non-boot CPUs (APs) during the system boot stage. According to Chapter 10 of Intel Software Developer Manual Volume 3A, Local APIC may signal an illegal vector error when an LVT entry is set as an illegal vector value (0~15) under FIXED delivery mode (bits 8-11 is 0), regardless of whether the mask bit is set or an interrupt actually happen. These errors are seen as error interrupts. The initial value of thermal LVT entries on all APs always reads 0x10000 because APs are woken up by BSP issuing INIT-SIPI-SIPI sequence to them and LVT registers are reset to 0s except for the mask bits which are set to 1s when APs receive INIT IPI. When the BIOS takes over the thermal throttling interrupt, the LVT thermal deliver mode should be SMI and it is required from the kernel to keep AP's LVT thermal monitoring register programmed as such as well. This issue happens when BIOS does not take over thermal throttling interrupt, AP's LVT thermal monitor register will be restored to 0x10000 which means vector 0 and fixed deliver mode, so all APs will signal illegal vector error interrupts. This patch check if interrupt delivery mode is not fixed mode before restoring AP's LVT thermal monitor register. Signed-off-by: Youquan Song <youquan.song@intel.com> Acked-by: Suresh Siddha <suresh.b.siddha@intel.com> Acked-by: Yong Wang <yong.y.wang@intel.com> Cc: hpa@linux.intel.com Cc: joe@perches.com Cc: jbaron@redhat.com Cc: trenn@suse.de Cc: kent.liu@intel.com Cc: chaohong.guo@intel.com Cc: <stable@kernel.org> # As far back as possible Link: http://lkml.kernel.org/r/1303402963-17738-1-git-send-email-youquan.song@intel.com Signed-off-by: Ingo Molnar <mingo@elte.hu>	2011-05-16 13:48:25 +02:00
Russell King	82491451dd	clocksource: convert x86 to generic i8253 clocksource Convert x86 i8253 clocksource code to use generic i8253 clocksource. Acked-by: John Stultz <john.stultz@linaro.org> Acked-by: Thomas Gleixner <tglx@linutronix.de> Cc: "H. Peter Anvin" <hpa@zytor.com> Cc: Ingo Molnar <mingo@redhat.com> Signed-off-by: Russell King <rmk+kernel@arm.linux.org.uk>	2011-05-14 10:29:48 +01:00
Cliff Wickman	77ed23f8d9	x86: Fix UV BAU for non-consecutive nasids This is a fix for the SGI Altix-UV Broadcast Assist Unit code, which is used for TLB flushing. Certain hardware configurations (that customers are ordering) cause nasids (numa address space id's) to be non-consecutive. Specifically, once you have more than 4 blades in a IRU (Individual Rack Unit - or 1/2 rack) but less than the maximum of 16, the nasid numbering becomes non-consecutive. This currently results in a 'catastrophic error' (CATERR) detected by the firmware during OS boot. The BAU is generating an 'INTD' request that is targeting a non-existent nasid value. Such configurations may also occur when a blade is configured off because of hardware errors. (There is one UV hub per blade.) This patch is required to support such configurations. The problem with the tlb_uv.c code is that is using the consecutive hub numbers as indices to the BAU distribution bit map. These are simply the ordinal position of the hub or blade within its partition. It should be using physical node numbers (pnodes), which correspond to the physical nasid values. Use of the hub number only works as long as the nasids in the partition are consecutive and increase with a stride of 1. This patch changes the index to be the pnode number, thus allowing nasids to be non-consecutive. It also provides a table in local memory for each cpu to translate target cpu number to target pnode and nasid. And it improves naming to properly reflect 'node' and 'uvhub' versus 'nasid'. Signed-off-by: Cliff Wickman <cpw@sgi.com> Cc: <stable@kernel.org> Link: http://lkml.kernel.org/r/E1QJmxX-0002Mz-Fk@eag09.americas.sgi.com Signed-off-by: Ingo Molnar <mingo@elte.hu>	2011-05-12 23:45:42 +02:00
Stefano Stabellini	279b706bf8	x86,xen: introduce x86_init.mapping.pagetable_reserve Introduce a new x86_init hook called pagetable_reserve that at the end of init_memory_mapping is used to reserve a range of memory addresses for the kernel pagetable pages we used and free the other ones. On native it just calls memblock_x86_reserve_range while on xen it also takes care of setting the spare memory previously allocated for kernel pagetable pages from RO to RW, so that it can be used for other purposes. A detailed explanation of the reason why this hook is needed follows. As a consequence of the commit: commit `4b239f458c` Author: Yinghai Lu <yinghai@kernel.org> Date: Fri Dec 17 16:58:28 2010 -0800 x86-64, mm: Put early page table high at some point init_memory_mapping is going to reach the pagetable pages area and map those pages too (mapping them as normal memory that falls in the range of addresses passed to init_memory_mapping as argument). Some of those pages are already pagetable pages (they are in the range pgt_buf_start-pgt_buf_end) therefore they are going to be mapped RO and everything is fine. Some of these pages are not pagetable pages yet (they fall in the range pgt_buf_end-pgt_buf_top; for example the page at pgt_buf_end) so they are going to be mapped RW. When these pages become pagetable pages and are hooked into the pagetable, xen will find that the guest has already a RW mapping of them somewhere and fail the operation. The reason Xen requires pagetables to be RO is that the hypervisor needs to verify that the pagetables are valid before using them. The validation operations are called "pinning" (more details in arch/x86/xen/mmu.c). In order to fix the issue we mark all the pages in the entire range pgt_buf_start-pgt_buf_top as RO, however when the pagetable allocation is completed only the range pgt_buf_start-pgt_buf_end is reserved by init_memory_mapping. Hence the kernel is going to crash as soon as one of the pages in the range pgt_buf_end-pgt_buf_top is reused (b/c those ranges are RO). For this reason we need a hook to reserve the kernel pagetable pages we used and free the other ones so that they can be reused for other purposes. On native it just means calling memblock_x86_reserve_range, on Xen it also means marking RW the pagetable pages that we allocated before but that haven't been used before. Another way to fix this is without using the hook is by adding a 'if (xen_pv_domain)' in the 'init_memory_mapping' code and calling the Xen counterpart, but that is just nasty. Signed-off-by: Stefano Stabellini <stefano.stabellini@eu.citrix.com> Acked-by: Yinghai Lu <yinghai@kernel.org> Acked-by: H. Peter Anvin <hpa@zytor.com> Cc: Ingo Molnar <mingo@elte.hu> Signed-off-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>	2011-05-12 13:05:04 -04:00
Richard Weinberger	449a66fd1f	x86: Remove warning and warning_symbol from struct stacktrace_ops Both warning and warning_symbol are nowhere used. Let's get rid of them. Signed-off-by: Richard Weinberger <richard@nod.at> Cc: Oleg Nesterov <oleg@redhat.com> Cc: Andrew Morton <akpm@linux-foundation.org> Cc: Huang Ying <ying.huang@intel.com> Cc: Soeren Sandmann Pedersen <ssp@redhat.com> Cc: Namhyung Kim <namhyung@gmail.com> Cc: x86 <x86@kernel.org> Cc: H. Peter Anvin <hpa@zytor.com> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Robert Richter <robert.richter@amd.com> Cc: Paul Mundt <lethal@linux-sh.org> Link: http://lkml.kernel.org/r/1305205872-10321-2-git-send-email-richard@nod.at Signed-off-by: Frederic Weisbecker <fweisbec@gmail.com>	2011-05-12 15:31:28 +02:00
Avi Kivity	ca1d4a9e77	KVM: x86 emulator: drop vcpu argument from pio callbacks Making the emulator caller agnostic. Signed-off-by: Avi Kivity <avi@redhat.com>	2011-05-11 07:57:11 -04:00
Avi Kivity	0f65dd70a4	KVM: x86 emulator: drop vcpu argument from memory read/write callbacks Making the emulator caller agnostic. Signed-off-by: Avi Kivity <avi@redhat.com>	2011-05-11 07:57:10 -04:00
Nelson Elhage	3d9b938eef	KVM: emulator: Use linearize() when fetching instructions Since segments need to be handled slightly differently when fetching instructions, we add a __linearize helper that accepts a new 'fetch' boolean. [avi: fix oops caused by wrong segmented_address initialization order] Signed-off-by: Nelson Elhage <nelhage@ksplice.com> Signed-off-by: Avi Kivity <avi@redhat.com>	2011-05-11 07:57:10 -04:00
Joerg Roedel	7c4c0f4fd5	KVM: X86: Update last_guest_tsc in vcpu_put The last_guest_tsc is used in vcpu_load to adjust the tsc_offset since tsc-scaling is merged. So the last_guest_tsc needs to be updated in vcpu_put instead of the the last_host_tsc. This is fixed with this patch. Reported-by: Jan Kiszka <jan.kiszka@web.de> Tested-by: Jan Kiszka <jan.kiszka@siemens.com> Signed-off-by: Joerg Roedel <joerg.roedel@amd.com> Signed-off-by: Avi Kivity <avi@redhat.com>	2011-05-11 07:57:10 -04:00
Gleb Natapov	7ae441eac5	KVM: emulator: do not needlesly sync registers from emulator ctxt to vcpu Currently we sync registers back and forth before/after exiting to userspace for IO, but during IO device model shouldn't need to read/write the registers, so we can as well skip those sync points. The only exaception is broken vmware backdor interface. The new code sync registers content during IO only if registers are read from/written to by userspace in the middle of the IO operation and this almost never happens in practise. Signed-off-by: Gleb Natapov <gleb@redhat.com> Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>	2011-05-11 07:57:08 -04:00
Joerg Roedel	92a1f12d25	KVM: X86: Implement userspace interface to set virtual_tsc_khz This patch implements two new vm-ioctls to get and set the virtual_tsc_khz if the machine supports tsc-scaling. Setting the tsc-frequency is only possible before userspace creates any vcpu. Signed-off-by: Joerg Roedel <joerg.roedel@amd.com> Signed-off-by: Avi Kivity <avi@redhat.com>	2011-05-11 07:57:06 -04:00
Joerg Roedel	857e40999e	KVM: X86: Delegate tsc-offset calculation to architecture code With TSC scaling in SVM the tsc-offset needs to be calculated differently. This patch propagates this calculation into the architecture specific modules so that this complexity can be handled there. Signed-off-by: Joerg Roedel <joerg.roedel@amd.com> Signed-off-by: Avi Kivity <avi@redhat.com>	2011-05-11 07:57:05 -04:00
Joerg Roedel	4051b18801	KVM: X86: Implement call-back to propagate virtual_tsc_khz This patch implements a call-back into the architecture code to allow the propagation of changes to the virtual tsc_khz of the vcpu. On SVM it updates the tsc_ratio variable, on VMX it does nothing. Signed-off-by: Joerg Roedel <joerg.roedel@amd.com> Signed-off-by: Avi Kivity <avi@redhat.com>	2011-05-11 07:57:05 -04:00
Joerg Roedel	1e993611d0	KVM: X86: Let kvm-clock report the right tsc frequency This patch changes the kvm_guest_time_update function to use TSC frequency the guest actually has for updating its clock. Signed-off-by: Joerg Roedel <joerg.roedel@amd.com> Signed-off-by: Avi Kivity <avi@redhat.com>	2011-05-11 07:57:04 -04:00
Joerg Roedel	fbc0db76b7	KVM: SVM: Implement infrastructure for TSC_RATE_MSR This patch enhances the kvm_amd module with functions to support the TSC_RATE_MSR which can be used to set a given tsc frequency for the guest vcpu. Signed-off-by: Joerg Roedel <joerg.roedel@amd.com> Signed-off-by: Avi Kivity <avi@redhat.com>	2011-05-11 07:57:04 -04:00
Xiao Guangrong	7c5625227f	KVM: MMU: remove mmu_seq verification on pte update path The mmu_seq verification can be removed since we get the pfn in the protection of mmu_lock. Signed-off-by: Xiao Guangrong <xiaoguangrong@cn.fujitsu.com> Signed-off-by: Avi Kivity <avi@redhat.com>	2011-05-11 07:57:03 -04:00
Joerg Roedel	f6511935f4	KVM: SVM: Add checks for IO instructions This patch adds code to check for IOIO intercepts on instructions decoded by the KVM instruction emulator. [avi: fix build error due to missing #define D2bvIP] Signed-off-by: Joerg Roedel <joerg.roedel@amd.com> Signed-off-by: Avi Kivity <avi@redhat.com>	2011-05-11 07:57:03 -04:00
Joerg Roedel	8061252ee0	KVM: SVM: Add intercept checks for remaining twobyte instructions This patch adds intercepts checks for the remaining twobyte instructions to the KVM instruction emulator. Signed-off-by: Joerg Roedel <joerg.roedel@amd.com> Signed-off-by: Avi Kivity <avi@redhat.com>	2011-05-11 07:57:02 -04:00
Joerg Roedel	3b88e41a41	KVM: SVM: Add intercept check for accessing dr registers This patch adds the intercept checks for instruction accessing the debug registers. Signed-off-by: Joerg Roedel <joerg.roedel@amd.com> Signed-off-by: Avi Kivity <avi@redhat.com>	2011-05-11 07:57:01 -04:00
Joerg Roedel	cfec82cb7d	KVM: SVM: Add intercept check for emulated cr accesses This patch adds all necessary intercept checks for instructions that access the crX registers. Signed-off-by: Joerg Roedel <joerg.roedel@amd.com> Signed-off-by: Avi Kivity <avi@redhat.com>	2011-05-11 07:57:01 -04:00
Joerg Roedel	8a76d7f25f	KVM: x86: Add x86 callback for intercept check This patch adds a callback into kvm_x86_ops so that svm and vmx code can do intercept checks on emulated instructions. Signed-off-by: Joerg Roedel <joerg.roedel@amd.com> Signed-off-by: Avi Kivity <avi@redhat.com>	2011-05-11 07:57:01 -04:00
Joerg Roedel	8ea7d6aef8	KVM: x86 emulator: Add flag to check for protected mode instructions This patch adds a flag for the opcoded to tag instruction which are only recognized in protected mode. The necessary check is added too. Signed-off-by: Joerg Roedel <joerg.roedel@amd.com> Signed-off-by: Avi Kivity <avi@redhat.com>	2011-05-11 07:57:01 -04:00
Joerg Roedel	d09beabd7c	KVM: x86 emulator: Add check_perm callback This patch adds a check_perm callback for each opcode into the instruction emulator. This will be used to do all necessary permission checks on instructions before checking whether they are intercepted or not. Signed-off-by: Joerg Roedel <joerg.roedel@amd.com> Signed-off-by: Avi Kivity <avi@redhat.com>	2011-05-11 07:57:01 -04:00
Joerg Roedel	775fde8648	KVM: x86 emulator: Don't write-back cpu-state on X86EMUL_INTERCEPTED This patch prevents the changed CPU state to be written back when the emulator detected that the instruction was intercepted by the guest. Signed-off-by: Joerg Roedel <joerg.roedel@amd.com> Signed-off-by: Avi Kivity <avi@redhat.com>	2011-05-11 07:57:00 -04:00
Avi Kivity	3c6e276f22	KVM: x86 emulator: add SVM intercepts Add intercept codes for instructions defined by SVM as interceptable. Signed-off-by: Avi Kivity <avi@redhat.com> Signed-off-by: Joerg Roedel <joerg.roedel@amd.com> Signed-off-by: Avi Kivity <avi@redhat.com>	2011-05-11 07:57:00 -04:00
Avi Kivity	c4f035c60d	KVM: x86 emulator: add framework for instruction intercepts When running in guest mode, certain instructions can be intercepted by hardware. This also holds for nested guests running on emulated virtualization hardware, in particular instructions emulated by kvm itself. This patch adds a framework for intercepting instructions. If an instruction is marked for interception, and if we're running in guest mode, a callback is called to check whether an intercept is needed or not. The callback is called at three points in time: immediately after beginning execution, after checking privilge exceptions, and after checking memory exception. This suits the different interception points defined for different instructions and for the various virtualization instruction sets. In addition, a new X86EMUL_INTERCEPT is defined, which any callback or memory access may define, allowing the more complicated intercepts to be implemented in existing callbacks. Signed-off-by: Avi Kivity <avi@redhat.com> Signed-off-by: Joerg Roedel <joerg.roedel@amd.com> Signed-off-by: Avi Kivity <avi@redhat.com>	2011-05-11 07:57:00 -04:00
Avi Kivity	1253791df9	KVM: x86 emulator: SSE support Add support for marking an instruction as SSE, switching registers used to the SSE register file. Signed-off-by: Avi Kivity <avi@redhat.com>	2011-05-11 07:56:59 -04:00
Avi Kivity	5037f6f324	KVM: x86 emulator: define callbacks for using the guest fpu within the emulator Needed for emulating fpu instructions. Signed-off-by: Avi Kivity <avi@redhat.com>	2011-05-11 07:56:58 -04:00
Avi Kivity	1d6b114f20	KVM: x86 emulator: do not munge rep prefix Currently we store a rep prefix as 1 or 2 depending on whether it is a REPE or REPNE. Since sse instructions depend on the prefix value, store it as the original opcode to simplify things further on. Signed-off-by: Avi Kivity <avi@redhat.com>	2011-05-11 07:56:58 -04:00
Avi Kivity	cef4dea07f	KVM: 16-byte mmio support Since sse instructions can issue 16-byte mmios, we need to support them. We can't increase the kvm_run mmio buffer size to 16 bytes without breaking compatibility, so instead we break the large mmios into two smaller 8-byte ones. Since the bus is 64-bit we aren't breaking any atomicity guarantees. Signed-off-by: Avi Kivity <avi@redhat.com>	2011-05-11 07:56:58 -04:00
Avi Kivity	69c7302890	KVM: VMX: Cache cpl We may read the cpl quite often in the same vmexit (instruction privilege check, memory access checks for instruction and operands), so we gain a bit if we cache the value. Signed-off-by: Avi Kivity <avi@redhat.com>	2011-05-11 07:56:54 -04:00
Avi Kivity	6de12732c4	KVM: VMX: Optimize vmx_get_rflags() If called several times within the same exit, return cached results. Signed-off-by: Avi Kivity <avi@redhat.com>	2011-05-11 07:56:54 -04:00
Yinghai Lu	5491ff511d	x86/PCI: Remove dma32_reserve_bootmem This workaround holds a dma32 buffer at early boot to prevent later bootmem allocations from stealing it in the case of large RAM configs. Now that x86 is using memblock, and the nobootmem wrapper does top-down allocation, it's no longer necessary, so remove it. Signed-off-by: Yinghai Lu <yinghai@kernel.org> Signed-off-by: Jesse Barnes <jbarnes@virtuousgeek.org>	2011-05-10 15:43:32 -07:00
Ingo Molnar	932fed4e2e	Merge commit 'v2.6.39-rc7' into perf/core Merge reason: pull in the latest fixes. Signed-off-by: Ingo Molnar <mingo@elte.hu>	2011-05-10 17:05:45 +02:00
Joerg Roedel	72fe00f01f	x86/amd-iommu: Use threaded interupt handler Move the interupt handling for the iommu into the interupt thread to reduce latencies and prepare interupt handling for pri handling. Signed-off-by: Joerg Roedel <joerg.roedel@amd.com>	2011-05-10 11:07:58 +02:00
Joerg Roedel	604c307bf4	Merge branches 'dma-debug/next', 'amd-iommu/command-cleanups', 'amd-iommu/ats' and 'amd-iommu/extended-features' into iommu/2.6.40 Conflicts: arch/x86/include/asm/amd_iommu_types.h arch/x86/kernel/amd_iommu.c arch/x86/kernel/amd_iommu_init.c	2011-05-10 10:25:23 +02:00
Jack Steiner	1d44e8288a	x86, UV: Fix NMI handler for UV platforms This fixes problems seen on UV systems handling NMIs from the node controller. I isolated the "dazed..." messages that I saw earlier to a bug in the BMC on our platform. It was sending NMIs w/o properly setting a register that indicated the source of NMI. So rather than _assuming_ any unhandled NMI came from the UV system maintenance console (SMC), add a check to verify that the SMC actually sent the NMI. Signed-off-by: Jack Steiner <steiner@sgi.com> Cc: gorcunov@gmail.com Cc: dzickus@redhat.com Signed-off-by: Ingo Molnar <mingo@elte.hu>	2011-05-10 09:26:55 +02:00
Matthew Garrett	9cd2b07c19	x86, efi: Consolidate EFI nx control The core EFI code and 64-bit EFI code currently have independent implementations of code for setting memory regions as executable or not. Let's consolidate them. Signed-off-by: Matthew Garrett <mjg@redhat.com> Link: http://lkml.kernel.org/r/1304623186-18261-2-git-send-email-mjg@redhat.com Signed-off-by: H. Peter Anvin <hpa@linux.intel.com>	2011-05-09 12:14:29 -07:00
Anton Blanchard	228e548e60	net: Add sendmmsg socket system call This patch adds a multiple message send syscall and is the send version of the existing recvmmsg syscall. This is heavily based on the patch by Arnaldo that added recvmmsg. I wrote a microbenchmark to test the performance gains of using this new syscall: http://ozlabs.org/~anton/junkcode/sendmmsg_test.c The test was run on a ppc64 box with a 10 Gbit network card. The benchmark can send both UDP and RAW ethernet packets. 64B UDP batch pkts/sec 1 804570 2 872800 (+ 8 %) 4 916556 (+14 %) 8 939712 (+17 %) 16 952688 (+18 %) 32 956448 (+19 %) 64 964800 (+20 %) 64B raw socket batch pkts/sec 1 1201449 2 1350028 (+12 %) 4 1461416 (+22 %) 8 `1513080` (+26 %) 16 1541216 (+28 %) 32 1553440 (+29 %) 64 1557888 (+30 %) We see a 20% improvement in throughput on UDP send and 30% on raw socket send. [ Add sparc syscall entries. -DaveM ] Signed-off-by: Anton Blanchard <anton@samba.org> Signed-off-by: David S. Miller <davem@davemloft.net>	2011-05-05 11:10:14 -07:00
Tejun Heo	1b7e03ef75	x86, NUMA: Enable emulation on 32bit too Now that NUMA init path is unified, NUMA emulation can be enabled on 32bit. Make numa_emluation.c safe on 32bit by doing the followings. * Define MAX_DMA32_PFN on 32bit too. * Include bootmem.h for max_pfn declaration. * Use u64 explicitly and always use PFN_PHYS() when converting page number to address. * Avoid __udivdi3() generation on 32bit by doing number of pages calculation instead in split_nodes_interleave(). And drop X86_64 dependency from Kconfig. Signed-off-by: Tejun Heo <tj@kernel.org> Cc: Ingo Molnar <mingo@redhat.com> Cc: Yinghai Lu <yinghai@kernel.org> Cc: David Rientjes <rientjes@google.com> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: "H. Peter Anvin" <hpa@zytor.com>	2011-05-02 17:24:48 +02:00
Tejun Heo	752d4f372f	x86, NUMA: Make numa_init_array() static numa_init_array() no longer has users outside of numa.c. Make it static. Signed-off-by: Tejun Heo <tj@kernel.org> Cc: Ingo Molnar <mingo@redhat.com> Cc: Yinghai Lu <yinghai@kernel.org> Cc: David Rientjes <rientjes@google.com> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: "H. Peter Anvin" <hpa@zytor.com>	2011-05-02 17:24:48 +02:00
Tejun Heo	bd6709a91a	x86, NUMA: Make 32bit use common NUMA init path With both _numa_init() methods converted and the rest of init code adjusted, numa_32.c now can switch from the 32bit only init code to the common one in numa.c. * Shim get_memcfg_()'s are dropped and initmem_init() calls x86_numa_init(), which is updated to handle NUMAQ. All boilerplate operations including node range limiting, pgdat alloc/init are handled by numa_init(). 32bit only implementation is removed. * 32bit numa_add_memblk(), numa_set_distance() and memory_add_physaddr_to_nid() removed and common versions in numa_32.c enabled for 32bit. This change causes the following behavior changes. * NODE_DATA()->node_start_pfn/node_spanned_pages properly initialized for 32bit too. * Much more sanity checks and configuration cleanups. * Proper handling of node distances. * The same NUMA init messages as 64bit. Signed-off-by: Tejun Heo <tj@kernel.org> Cc: Ingo Molnar <mingo@redhat.com> Cc: Yinghai Lu <yinghai@kernel.org> Cc: David Rientjes <rientjes@google.com> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: "H. Peter Anvin" <hpa@zytor.com>	2011-05-02 17:24:48 +02:00
Tejun Heo	744baba0c4	x86, NUMA: Enable build of generic NUMA init code on 32bit Generic NUMA init code was moved to numa.c from numa_64.c but is still guaraded by CONFIG_X86_64. This patch removes the compile guard and enables compiling on 32bit. * numa_add_memblk() and numa_set_distance() clash with the shim implementation in numa_32.c and are left out. * memory_add_physaddr_to_nid() clashes with 32bit implementation and is left out. * MAX_DMA_PFN definition in dma.h moved out of !CONFIG_X86_32. * node_data definition in numa_32.c removed in favor of the one in numa.c. There are places where ulong is assumed to be 64bit. The next patch will fix them up. Note that although the code is compiled it isn't used yet and this patch doesn't cause any functional change. Signed-off-by: Tejun Heo <tj@kernel.org> Cc: Ingo Molnar <mingo@redhat.com> Cc: Yinghai Lu <yinghai@kernel.org> Cc: David Rientjes <rientjes@google.com> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: "H. Peter Anvin" <hpa@zytor.com>	2011-05-02 14:18:53 +02:00
Tejun Heo	a4106eae65	x86, NUMA: Move NUMA init logic from numa_64.c to numa.c Move the generic 64bit NUMA init machinery from numa_64.c to numa.c. * node_data[], numa_mem_info and numa_distance * numa_add_memblk[_to](), numa_remove_memblk[_from]() * numa_set_distance() and friends * numa_init() and all the numa_meminfo handling helpers called from it * dummy_numa_init() * memory_add_physaddr_to_nid() A new function x86_numa_init() is added and the content of numa_64.c::initmem_init() is moved into it. initmem_init() now simply calls x86_numa_init(). Constants and numa_off declaration are moved from numa_{32\|64}.h to numa.h. This is code reorganization and doesn't involve any functional change. Signed-off-by: Tejun Heo <tj@kernel.org> Cc: Ingo Molnar <mingo@redhat.com> Cc: Yinghai Lu <yinghai@kernel.org> Cc: David Rientjes <rientjes@google.com> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: "H. Peter Anvin" <hpa@zytor.com>	2011-05-02 14:18:53 +02:00
Tejun Heo	299a180aec	x86-32, NUMA: Update numaq to use new NUMA init protocol Update numaq such that it calls numa_add_memblk() and sets numa_nodes_parsed instead of directly diddling with NUMA states. The original get_memcfg_numaq() is renamed to numaq_numa_init() and new get_memcfg_numaq() is created in numa_32.c. The shim numa_add_memblk() implementation handles node_start/end_pfn[] and node_set_online() for nodes with memory. The new get_memcfg_numaq() exactly the same with get_memcfg_from_srat() other than calling the numaq init function. Things get_memcfgs_numaq() do are not strictly necessary for numaq but added for consistency and to help unifying NUMA init handling. Signed-off-by: Tejun Heo <tj@kernel.org> Cc: Ingo Molnar <mingo@redhat.com> Cc: Yinghai Lu <yinghai@kernel.org> Cc: David Rientjes <rientjes@google.com> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: "H. Peter Anvin" <hpa@zytor.com>	2011-05-02 14:18:53 +02:00
Tejun Heo	5acd91ab83	x86-32, NUMA: Replace srat_32.c with srat.c SRAT support implementation in srat_32.c and srat.c are generally similar; however, there are some differences. First of all, 64bit implementation supports more types of SRAT entries. 64bit supports x2apic, affinity, memory and SLIT. 32bit only supports processor and memory. Most other differences stem from different initialization protocols employed by 64bit and 32bit NUMA init paths. On 64bit, * Mappings among PXM, node and apicid are directly done in each SRAT entry callback. * Memory affinity information is passed to numa_add_memblk() which takes care of all interfacing with NUMA init. * Doesn't directly initialize NUMA configurations. All the information is recorded in numa_nodes_parsed and memblks. On 32bit, * Checks numa_off. * Things go through one more level of indirection via private tables but eventually end up initializing the same mappings. * node_start/end_pfn[] are initialized and memblock_x86_register_active_regions() is called for each memory chunk. * node_set_online() is called for each online node. * sort_node_map() is called. There are also other minor differences in sanity checking and messages but taking 64bit version should be good enough. This patch drops the 32bit specific implementation and makes the 64bit implementation common for both 32 and 64bit. The init protocol differences are dealt with in two places - the numa_add_memblk() shim added in the previous patch and new temporary numa_32.c:get_memcfg_from_srat() which wraps invocation of x86_acpi_numa_init(). The shim numa_add_memblk() handles the folowings. * node_start/end_pfn[] initialization. * node_set_online() for memory nodes. * Invocation of memblock_x86_register_active_regions(). The shim get_memcfg_from_srat() handles the followings. * numa_off check. * node_set_online() for CPU nodes. * sort_node_map() invocation. * Clearing of numa_nodes_parsed and active_ranges on failure. The shims are temporary and will be removed as the generic NUMA init path in 32bit is replaced with 64bit one. Signed-off-by: Tejun Heo <tj@kernel.org> Cc: Ingo Molnar <mingo@redhat.com> Cc: Yinghai Lu <yinghai@kernel.org> Cc: David Rientjes <rientjes@google.com> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: "H. Peter Anvin" <hpa@zytor.com>	2011-05-02 14:18:53 +02:00
Tejun Heo	b0d310801a	x86-32, NUMA: implement temporary NUMA init shims To help transition to common NUMA init, implement temporary 32bit shims for numa_add_memblk() and numa_set_distance(). numa_add_memblk() registers the memblk and adjusts node_start/end_pfn[]. numa_set_distance() is noop. These shims will allow using 64bit NUMA init functions on 32bit and gradual transition to common NUMA init path. For detailed description, please read description of commits which make use of the shim functions. Signed-off-by: Tejun Heo <tj@kernel.org> Cc: Ingo Molnar <mingo@redhat.com> Cc: Yinghai Lu <yinghai@kernel.org> Cc: David Rientjes <rientjes@google.com> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: "H. Peter Anvin" <hpa@zytor.com>	2011-05-02 14:18:53 +02:00
Tejun Heo	e6df595b37	x86, NUMA: Move numa_nodes_parsed to numa.[hc] Move numa_nodes_parsed from numa_64.[hc] to numa.[hc] to prepare for NUMA init path unification. Signed-off-by: Tejun Heo <tj@kernel.org> Cc: Ingo Molnar <mingo@redhat.com> Cc: Yinghai Lu <yinghai@kernel.org> Cc: David Rientjes <rientjes@google.com> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: "H. Peter Anvin" <hpa@zytor.com>	2011-05-02 14:18:53 +02:00
Tejun Heo	daf4f480ae	x86-32, NUMA: Move get_memcfg_numa() into numa_32.c There's no reason get_memcfg_numa() to be implemented inline in mmzone_32.h. Move it to numa_32.c and also make get_memcfg_numa_flag() static. Signed-off-by: Tejun Heo <tj@kernel.org> Cc: Ingo Molnar <mingo@redhat.com> Cc: Yinghai Lu <yinghai@kernel.org> Cc: David Rientjes <rientjes@google.com> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: "H. Peter Anvin" <hpa@zytor.com>	2011-05-02 14:18:53 +02:00
Tejun Heo	1201e10a09	x86, NUMA: trivial cleanups * Kill no longer used struct bootnode. * Kill dangling declaration of pxm_to_nid() in numa_32.h. * Make setup_node_bootmem() static. Signed-off-by: Tejun Heo <tj@kernel.org> Cc: Ingo Molnar <mingo@redhat.com> Cc: Yinghai Lu <yinghai@kernel.org> Cc: David Rientjes <rientjes@google.com> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: "H. Peter Anvin" <hpa@zytor.com>	2011-05-02 14:18:52 +02:00
Tejun Heo	84914ed0ec	x86-32, NUMA: Make apic->x86_32_numa_cpu_node() optional NUMAQ is the only meaningful user of this callback and setup_local_APIC() the only callsite. Stop torturing everyone else by making the callback optional and removing all the boilerplate implementations and assignments. Signed-off-by: Tejun Heo <tj@kernel.org> Cc: Ingo Molnar <mingo@redhat.com> Cc: Yinghai Lu <yinghai@kernel.org> Cc: David Rientjes <rientjes@google.com> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: "H. Peter Anvin" <hpa@zytor.com>	2011-05-02 14:18:52 +02:00
Tejun Heo	6bd262731b	x86, NUMA: Unify 32/64bit numa_cpu_node() implementation Currently, the only meaningful user of apic->x86_32_numa_cpu_node() is NUMAQ which returns valid mapping only after CPU is initialized during SMP bringup; thus, the previous patch to set apicid -> node in setup_local_APIC() makes __apicid_to_node[] always contain the correct mapping whether custom apic->x86_32_numa_cpu_node() is used or not. So, there is no reason to keep separate 32bit implementation. We can always consult __apicid_to_node[]. Move 64bit implementation from numa_64.c to numa.c and remove 32bit implementation from numa_32.c. Signed-off-by: Tejun Heo <tj@kernel.org> Cc: Ingo Molnar <mingo@redhat.com> Cc: Yinghai Lu <yinghai@kernel.org> Cc: David Rientjes <rientjes@google.com> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: "H. Peter Anvin" <hpa@zytor.com>	2011-05-02 14:18:52 +02:00
Tejun Heo	ba67cf5cf2	Merge branch 'x86/urgent' into x86-mm Merge reason: Pick up the following two fix commits. 2be19102b7: x86, NUMA: Fix empty memblk detection in numa_cleanup_meminfo() 765af22da8: x86-32, NUMA: Fix ACPI NUMA init broken by recent x86-64 change Scheduled NUMA init 32/64bit unification changes depend on these. Signed-off-by: Tejun Heo <tj@kernel.org>	2011-05-02 14:16:47 +02:00
Tejun Heo	aff364860a	Merge branch 'x86/numa' into x86-mm Merge reason: Pick up x86-32 remap allocator cleanup changes - 14 commits, 3fe14ab541^..993ba1585c. 3fe14ab541: x86-32, numa: Fix failure condition check in alloc_remap() 993ba1585c: x86-32, numa: Update remap allocator comments Scheduled NUMA init 32/64bit unification changes depend on them. Signed-off-by: Tejun Heo <tj@kernel.org>	2011-05-02 14:08:47 +02:00
Mike Waychison	f548ccd47d	x86: Better comments for get_bios_ebda() Make the comments a bit clearer for get_bios_ebda so that it actually tells us what it is returning. Signed-off-by: Mike Waychison <mikew@google.com> Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>	2011-04-29 14:13:15 -07:00
Mike Waychison	57d5f9f808	x86: get_bios_ebda_length() Add a wrapper routine that tells us the length of the EBDA if it is present. This guy also ensures that the returned length doesn't let the EBDA run past the 640KiB mark. Signed-off-by: Mike Waychison <mikew@google.com> Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>	2011-04-29 14:13:15 -07:00
Sebastian Andrzej Siewior	20443598d9	x86: devicetree: Configure IOAPIC pin only once We use io_apic_setup_irq_pin() in order to configure pin's interrupt number polarity and type. This is done on every irq_create_of_mapping() which happens for instance during pci enable calls. Level typed interrupts are masked by default, edge are unmasked. On the first ->xlate() call the level interrupt is configured and masked. The driver calls request_irq() and the line is unmasked. Lets assume the interrupt line is shared with another device and we call pci_enable_device() for this device. The ->xlate() configures the pin again and it is masked. request_irq() does not unmask the line because it _is_ already unmasked according to its internal state. So the interrupt will never be unmasked again. This patch is based on an earlier work by Torben Hohn and solves the problem by configuring the pin only once. Since all devices must agree on the same type and polarity there is no point in configuring the pin more than once. [ tglx: Split out the ce4100 part into a separate patch ] Cc: Torben Hohn <torbenh@linutronix.de> Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de> Link: http://lkml.kernel.org/r/%3C20110427143052.GA15211%40linutronix.de%3E Signed-off-by: Thomas Gleixner <tglx@linutronix.de>	2011-04-28 11:38:30 +02:00
Ingo Molnar	32673822e4	Merge branch 'tip/perf/core' of git://git.kernel.org/pub/scm/linux/kernel/git/rostedt/linux-2.6-trace into perf/core Conflicts: include/linux/perf_event.h Merge reason: pick up the latest jump-label enhancements, they are cooked ready. Signed-off-by: Ingo Molnar <mingo@elte.hu>	2011-04-27 10:40:21 +02:00
Avi Kivity	15d6aba24d	x86: Demacro CONFIG_PARAVIRT cpu accessors Recently, we had a build failure on !CONFIG_PARAVIRT due to a callback ->wbinvd() clashing with a macro wbinvd(). While we worked around the issue, avoid it in the future by changing the macro (and a few surrounding ones) to an inline function. Signed-off-by: Avi Kivity <avi@redhat.com> Cc: "H. Peter Anvin" <hpa@zytor.com> Link: http://lkml.kernel.org/r/1303632711-21662-1-git-send-email-avi@redhat.com Signed-off-by: Ingo Molnar <mingo@elte.hu>	2011-04-24 13:20:36 +02:00
Borislav Petkov	dffa4b2f62	x86, mce: Drop the default decoding notifier The default notifier doesn't make a lot of sense to call in the correctable errors case. Drop it and emit the mcelog decoding hint only in the uncorrectable errors case and when no notifier is registered. Also, limit issuing the "mcelog --ascii" message in the rare case when we dump unreported CEs before panicking. While at it, remove unused old x86_mce_decode_callback from the header. Signed-off-by: Borislav Petkov <borislav.petkov@amd.com> Signed-off-by: Prarit Bhargava <prarit@redhat.com> Cc: Tony Luck <tony.luck@intel.com> Cc: Nagananda Chumbalkar <Nagananda.Chumbalkar@hp.com> Cc: Russ Anderson <rja@sgi.com> Link: http://lkml.kernel.org/r/20110420102349.GB1361@aftab Signed-off-by: Ingo Molnar <mingo@elte.hu>	2011-04-21 11:35:10 +02:00
David Rientjes	7a6c654782	x86, numa: Fix cpu nodemasks for NUMA emulation and CONFIG_DEBUG_PER_CPU_MAPS The cpu<->node mappings under CONFIG_DEBUG_PER_CPU_MAPS=y when NUMA emulation is enabled is currently broken because it does not iterate through every emulated node and bind cpus that have affinity to it. NUMA emulation should bind each cpu to every local node to accurately represent the true NUMA topology of the underlying machine. debug_cpumask_set_cpu() needs to be fixed at the same time so that the debugging information that it emits shows the new cpumask of the node being assigned when the cpu is being added or removed. It can now take responsibility of setting or clearing the cpu itself to remove the need for duplicate code. Also change its last parameter, "enable", to have the correct bool type since it can only be true or false. -v2: Fix the return statements, by Kosaki Motohiro Acked-and-Tested-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com> Signed-off-by: David Rientjes <rientjes@google.com> Cc: Andreas Herrmann <herrmann.der.user@googlemail.com> Cc: Tejun Heo <tj@kernel.org> Cc: Linus Torvalds <torvalds@linux-foundation.org> Link: http://lkml.kernel.org/r/alpine.DEB.2.00.1104201918470.12634@chino.kir.corp.google.com Signed-off-by: Ingo Molnar <mingo@elte.hu>	2011-04-21 11:31:00 +02:00
H. Peter Anvin	dc326fca2b	x86, cpu: Clean up and unify the NOP selection infrastructure Clean up and unify the NOP selection infrastructure: - Make the atomic 5-byte NOP a part of the selection system. - Pick NOPs once during early boot and then be done with it. Signed-off-by: H. Peter Anvin <hpa@linux.intel.com> Cc: Tejun Heo <tj@kernel.org> Cc: Steven Rostedt <rostedt@goodmis.org> Cc: Frederic Weisbecker <fweisbec@gmail.com> Cc: Jason Baron <jbaron@redhat.com> Link: http://lkml.kernel.org/r/1303166160-10315-3-git-send-email-hpa@linux.intel.com	2011-04-18 16:40:21 -07:00
H. Peter Anvin	b1e7734f02	x86, percpu: Use ASM_NOP4 instead of hardcoding P6_NOP4 For use in assembly constants, use the ASM_NOP* defines. Signed-off-by: H. Peter Anvin <hpa@linux.intel.com> Cc: Christoph Lameter <cl@linux-foundation.org> Cc: Tejun Heo <tj@kernel.org> Link: http://lkml.kernel.org/r/1303166160-10315-2-git-send-email-hpa@linux.intel.com	2011-04-18 16:40:06 -07:00
Joerg Roedel	c34151a742	x86, gart: Set DISTLBWALKPRB bit always The DISTLBWALKPRB bit must be set for the GART because the gatt table is mapped UC. But the current code does not set the bit at boot when the BIOS setup the aperture correctly. Fix that by setting this bit when enabling the GART instead of the other places. Cc: <stable@kernel.org> Cc: Borislav Petkov <borislav.petkov@amd.com> Signed-off-by: Joerg Roedel <joerg.roedel@amd.com> Link: http://lkml.kernel.org/r/1303134346-5805-4-git-send-email-joerg.roedel@amd.com Signed-off-by: H. Peter Anvin <hpa@zytor.com>	2011-04-18 09:26:48 -07:00
Joerg Roedel	af289bfe15	x86, gart: Convert spaces to tabs in enable_gart_translation Probably by copy&paste this function was indented by spaces. Convert this to tabs. Signed-off-by: Joerg Roedel <joerg.roedel@amd.com> Link: http://lkml.kernel.org/r/1303134346-5805-3-git-send-email-joerg.roedel@amd.com Signed-off-by: H. Peter Anvin <hpa@zytor.com>	2011-04-18 09:26:48 -07:00
Konrad Rzeszutek Wilk	cf8d91633d	xen/p2m/m2p/gnttab: Support GNTMAP_host_map in the M2P override. We only supported the M2P (and P2M) override only for the GNTMAP_contains_pte type mappings. Meaning that we grants operations would "contain the machine address of the PTE to update" If the flag is unset, then the grant operation is "contains a host virtual address". The latter case means that the Hypervisor takes care of updating our page table (specifically the PTE entry) with the guest's MFN. As such we should not try to do anything with the PTE. Previous to this patch we would try to clear the PTE which resulted in Xen hypervisor being upset with us: (XEN) mm.c:1066:d0 Attempt to implicitly unmap a granted PTE c0100000ccc59067 (XEN) domain_crash called from mm.c:1067 (XEN) Domain 0 (vcpu#0) crashed on cpu#3: (XEN) ----[ Xen-4.0-110228 x86_64 debug=y Not tainted ]---- and crashing us. This patch allows us to inhibit the PTE clearing in the PV guest if the GNTMAP_contains_pte is not set. On the m2p_remove_override path we provide the same parameter. Sadly in the grant-table driver we do not have a mechanism to tell m2p_remove_override whether to clear the PTE or not. Since the grant-table driver is used by user-space, we can safely assume that it operates only on PTE's. Hence the implementation for it to work on !GNTMAP_contains_pte returns -EOPNOTSUPP. In the future we can implement the support for this. It will require some extra accounting structure to keep track of the page[i], and the flag. [v1: Added documentation details, made it return -EOPNOTSUPP instead of trying to do a half-way implementation] Signed-off-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>	2011-04-18 11:10:27 -04:00
Joerg Roedel	5bbc097d89	x86, amd: Disable GartTlbWlkErr when BIOS forgets it This patch disables GartTlbWlk errors on AMD Fam10h CPUs if the BIOS forgets to do is (or is just too old). Letting these errors enabled can cause a sync-flood on the CPU causing a reboot. The AMD BKDG recommends disabling GART TLB Wlk Error completely. This patch is the fix for https://bugzilla.kernel.org/show_bug.cgi?id=33012 on my machine. Signed-off-by: Joerg Roedel <joerg.roedel@amd.com> Link: http://lkml.kernel.org/r/20110415131152.GJ18463@8bytes.org Tested-by: Alexandre Demers <alexandre.f.demers@gmail.com> Cc: <stable@kernel.org> Signed-off-by: H. Peter Anvin <hpa@linux.intel.com>	2011-04-15 16:03:16 -07:00
Konrad Rzeszutek Wilk	c55fa78b13	xen/pci: Add xen_[find\|register\|unregister]_device_domain_owner functions. When the Xen PCI backend is told to enable or disable MSI/MSI-X functions, the initial domain performs these operations. The initial domain needs to know which domain (guest) is going to use the PCI device so when it makes the appropiate hypercall to retrieve the MSI/MSI-X vector it will also assign the PCI device to the appropiate domain (guest). This boils down to us needing a mechanism to find, set and unset the domain id that will be using the device. [v2: EXPORT_SYMBOL -> EXPORT_SYMBOL_GPL.] Signed-off-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>	2011-04-14 11:16:54 -04:00
Joerg Roedel	58fc7f1419	x86/amd-iommu: Add support for invalidate_all command This patch adds support for the invalidate_all command present in new versions of the AMD IOMMU. Signed-off-by: Joerg Roedel <joerg.roedel@amd.com>	2011-04-12 09:21:58 +02:00
Joerg Roedel	d99ddec3ee	x86/amd-iommu: Add extended feature detection This patch adds detection of the extended features of an AMD IOMMU. The available features are printed to dmesg on boot. Signed-off-by: Joerg Roedel <joerg.roedel@amd.com>	2011-04-12 09:21:51 +02:00
Joerg Roedel	fd7b5535e1	x86/amd-iommu: Add ATS enable/disable code This patch adds the necessary code to the AMD IOMMU driver for enabling and disabling the ATS capability on a device and to setup the IOMMU data structures correctly. Signed-off-by: Joerg Roedel <joerg.roedel@amd.com>	2011-04-11 09:04:04 +02:00
Joerg Roedel	60f723b411	x86/amd-iommu: Add flag to indicate IOTLB support This patch adds a flag to the AMD IOMMU driver to indicate that all IOMMUs present in the system support device IOTLBs. Signed-off-by: Joerg Roedel <joerg.roedel@amd.com>	2011-04-11 09:04:03 +02:00
Joerg Roedel	cb41ed85ef	x86/amd-iommu: Flush device IOTLB if ATS is enabled This patch implements a function to flush the IOTLB on devices supporting ATS and makes sure that this TLB is also flushed if necessary. Signed-off-by: Joerg Roedel <joerg.roedel@amd.com>	2011-04-11 09:04:03 +02:00
Ian Campbell	ce9c99af8d	x86, cpu: Move AMD Elan Kconfig under "Processor family" Currently the option resides under X86_EXTENDED_PLATFORM due to historical nonstandard A20M# handling. However that is no longer the case and so Elan can be treated as part of the standard processor choice Kconfig option. Signed-off-by: Ian Campbell <ian.campbell@citrix.com> Link: http://lkml.kernel.org/r/1302245177.31620.47.camel@localhost.localdomain Cc: H. Peter Anvin <hpa@zytor.com> Signed-off-by: H. Peter Anvin <hpa@linux.intel.com>	2011-04-08 13:01:25 -07:00
Joerg Roedel	7d0c5cc5be	x86/amd-iommu: Flush all internal TLBs when IOMMUs are enabled The old code only flushed a DTE or a domain TLB before it is actually used by the IOMMU driver. While this is efficient and works when done right it is more likely to introduce new bugs when changing code (which happened in the past). This patch adds code to flush all DTEs and all domain TLBs in each IOMMU right after it is enabled (at boot and after resume). This reduces the complexity of the driver and makes it less likely to introduce stale-TLB bugs in the future. Signed-off-by: Joerg Roedel <joerg.roedel@amd.com>	2011-04-07 11:04:32 +02:00
Joerg Roedel	61985a040f	x86/amd-iommu: Remove command buffer resetting logic The logic to reset the command buffer caused more problems than it actually helped. The logic jumped in when the IOMMU hardware doesn't execute commands anymore but the reasons for this are usually not fixed by just resetting the command buffer. So the code can be removed to reduce complexity. Signed-off-by: Joerg Roedel <joerg.roedel@amd.com>	2011-04-07 10:46:05 +02:00
Tejun Heo	7210cf9217	x86-32, numa: Calculate remap size in common code Only pgdat and memmap use remap area and there isn't much benefit in allowing per-node override. In addition, the use of node_remap_size[] is confusing in that it contains number of bytes before remap initialization and then number of pages afterwards. Move remap size calculation for memap from specific NUMA config implementations to init_alloc_remap() and make node_remap_size[] static. The only behavior difference is that, before this patch, numaq_32 didn't consider max_pfn when calculating the memmap size but it's enforced after this patch, which is the right thing to do. Signed-off-by: Tejun Heo <tj@kernel.org> Link: http://lkml.kernel.org/r/1301955840-7246-8-git-send-email-tj@kernel.org Acked-by: Yinghai Lu <yinghai@kernel.org> Cc: David Rientjes <rientjes@google.com> Signed-off-by: H. Peter Anvin <hpa@zytor.com>	2011-04-06 17:57:16 -07:00
Hans Rosenfeld	f994d99cf1	x86-32, fpu: Fix FPU exception handling on non-SSE systems On 32bit systems without SSE (that is, they use FSAVE/FRSTOR for FPU context switches), FPU exceptions in user mode cause Oopses, BUGs, recursive faults and other nasty things: fpu exception: 0000 [#1] last sysfs file: /sys/power/state Modules linked in: psmouse evdev pcspkr serio_raw [last unloaded: scsi_wait_scan] Pid: 1638, comm: fxsave-32-excep Not tainted 2.6.35-07798-g58a992b-dirty #633 VP3-596B-DD/VT82C597 EIP: 0060:[<c1003527>] EFLAGS: 00010202 CPU: 0 EIP is at math_error+0x1b4/0x1c8 EAX: 00000003 EBX: cf9be7e0 ECX: 00000000 EDX: cf9c5c00 ESI: cf9d9fb4 EDI: c1372db3 EBP: 00000010 ESP: cf9d9f1c DS: 007b ES: 007b FS: 0000 GS: 00e0 SS: 0068 Process fxsave-32-excep (pid: 1638, ti=cf9d8000 task=cf9be7e0 task.ti=cf9d8000) Stack: 00000000 00000301 00000004 00000000 00000000 cf9d3000 cf9da8f0 00000001 <0> 00000004 cf9b6b60 c1019a6b c1019a79 00000020 00000242 000001b6 cf9c5380 <0> cf806b40 cf791880 00000000 00000282 00000282 c108a213 00000020 cf9c5380 Call Trace: [<c1019a6b>] ? need_resched+0x11/0x1a [<c1019a79>] ? should_resched+0x5/0x1f [<c108a213>] ? do_sys_open+0xbd/0xc7 [<c108a213>] ? do_sys_open+0xbd/0xc7 [<c100353b>] ? do_coprocessor_error+0x0/0x11 [<c12d5965>] ? error_code+0x65/0x70 Code: a8 20 74 30 c7 44 24 0c 06 00 03 00 8d 54 24 04 89 d9 b8 08 00 00 00 e8 9b 6d 02 00 eb 16 8b 93 5c 02 00 00 eb 05 e9 04 ff ff ff <9b> dd 32 9b e9 16 ff ff ff 81 c4 84 00 00 00 5b 5e 5f 5d c3 c6 EIP: [<c1003527>] math_error+0x1b4/0x1c8 SS:ESP 0068:cf9d9f1c This usually continues in slight variations until the system is reset. This bug was introduced by commit 58a992b9cbaf449aeebd3575c3695a9eb5d95b5e: x86-32, fpu: Rewrite fpu_save_init() Signed-off-by: Hans Rosenfeld <hans.rosenfeld@amd.com> Link: http://lkml.kernel.org/r/1302106003-366952-1-git-send-email-hans.rosenfeld@amd.com Signed-off-by: H. Peter Anvin <hpa@zytor.com>	2011-04-06 16:53:01 -07:00
Jason Baron	ef64789413	jump label: Add _ASM_ALIGN for x86 and x86_64 The linker should not be adding holes to word size aligned pointers, but out of paranoia we are explicitly specifying that alignment. I have not seen any holes in the jump label section in practice. Signed-off-by: Jason Baron <jbaron@redhat.com> LKML-Reference: <e119fbd060c9452c56063ea6148ba1070e7434cc.1300299760.git.jbaron@redhat.com> Acked-by: Peter Zijlstra <a.p.zijlstra@chello.nl> Acked-by: Mathieu Desnoyers <mathieu.desnoyers@efficios.com> Signed-off-by: Steven Rostedt <rostedt@goodmis.org>	2011-04-04 13:42:51 -04:00
Jason Baron	d430d3d7e6	jump label: Introduce static_branch() interface Introduce: static __always_inline bool static_branch(struct jump_label_key key); instead of the old JUMP_LABEL(key, label) macro. In this way, jump labels become really easy to use: Define: struct jump_label_key jump_key; Can be used as: if (static_branch(&jump_key)) do unlikely code enable/disale via: jump_label_inc(&jump_key); jump_label_dec(&jump_key); that's it! For the jump labels disabled case, the static_branch() becomes an atomic_read(), and jump_label_inc()/dec() are simply atomic_inc(), atomic_dec() operations. We show testing results for this change below. Thanks to H. Peter Anvin for suggesting the 'static_branch()' construct. Since we now require a 'struct jump_label_key key', we can store a pointer into the jump table addresses. In this way, we can enable/disable jump labels, in basically constant time. This change allows us to completely remove the previous hashtable scheme. Thanks to Peter Zijlstra for this re-write. Testing: I ran a series of 'tbench 20' runs 5 times (with reboots) for 3 configurations, where tracepoints were disabled. jump label configured in avg: 815.6 jump label not configured in (using atomic reads) avg: 800.1 jump label not configured in (regular reads) avg: 803.4 Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl> LKML-Reference: <20110316212947.GA8792@redhat.com> Signed-off-by: Jason Baron <jbaron@redhat.com> Suggested-by: H. Peter Anvin <hpa@linux.intel.com> Tested-by: David Daney <ddaney@caviumnetworks.com> Acked-by: Ralf Baechle <ralf@linux-mips.org> Acked-by: David S. Miller <davem@davemloft.net> Acked-by: Mathieu Desnoyers <mathieu.desnoyers@efficios.com> Signed-off-by: Steven Rostedt <rostedt@goodmis.org>	2011-04-04 12:48:08 -04:00
Tejun Heo	052936080c	x86-64, NUMA: Remove custom phys_to_nid() implementation phys_to_nid() maps physical address to NUMA node id. This is implemented by building perfect hash in compute_hash_shift() during initialization. However, with SPARSE memory model, the nid is encoded in page flags. The perfect hash implementation was for DISCONTIG memory model which got removed years ago by `b263295dbf` (x86: 64-bit, make sparsemem vmemmap the only memory model). So, the perfect hash ends up being used only during initialization when the core SPARSE code already provides perfectly acceptable generic early_pfn_to_nid() implementation. Drop phys_to_nid() and use the generic ealry_pfn_to_nid() instead. Signed-off-by: Tejun Heo <tj@kernel.org> Reviewed-by: Christoph Lameter <cl@linux-foundation.org> Acked-by: Yinghai Lu <yinghai@kernel.org> Cc: Ingo Molnar <mingo@redhat.com> Cc: "H. Peter Anvin" <hpa@zytor.com> Cc: Thomas Gleixner <tglx@linutronix.de>	2011-04-01 11:15:12 +02:00
Christoph Lameter	349c004e3d	x86: A fast way to check capabilities of the current cpu Add this_cpu_has() which determines if the current cpu has a certain ability using a segment prefix and a bit test operation. For that we need to add bit operations to x86s percpu.h. Many uses of cpu_has use a pointer passed to a function to determine the current flags. That is no longer necessary after this patch. However, this patch only converts the straightforward cases where cpu_has is used with this_cpu_ptr. The rest is work for later. -tj: Rolled up patch to add x86_ prefix and use percpu_read() instead of percpu_read_stable(). Signed-off-by: Christoph Lameter <cl@linux.com> Acked-by: Tejun Heo <tj@kernel.org> Signed-off-by: Tejun Heo <tj@kernel.org>	2011-03-29 10:18:30 +02:00
Jean Delvare	ca444564a9	x86: Stop including <linux/delay.h> in two asm header files Stop including <linux/delay.h> in x86 header files which don't need it. This will let the compiler complain when this header is not included by source files when it should, so that contributors can fix the problem before building on other architectures starts to fail. Credits go to Geert for the idea. Signed-off-by: Jean Delvare <khali@linux-fr.org> Cc: James E.J. Bottomley <James.Bottomley@suse.de> Cc: Geert Uytterhoeven <geert@linux-m68k.org> Cc: Stephen Rothwell <sfr@canb.auug.org.au> LKML-Reference: <20110325152014.297890ec@endymion.delvare> [ this also fixes an upstream build bug in drivers/media/rc/ite-cir.c ] Signed-off-by: Ingo Molnar <mingo@elte.hu>	2011-03-29 09:37:42 +02:00
Eric Dumazet	5f55924dea	percpu: Avoid extra NOP in percpu_cmpxchg16b_double percpu_cmpxchg16b_double() uses alternative_io() and looks like : e8 .. .. .. .. call this_cpu_cmpxchg16b_emu X bytes NOPX or, once patched (if cpu supports native instruction) on SMP build : 65 48 0f c7 0e cmpxchg16b %gs:(%rsi) 0f 94 c0 sete %al on !SMP build : 48 0f c7 0e cmpxchg16b (%rsi) 0f 94 c0 sete %al Therefore, NOPX should be : P6_NOP3 on SMP P6_NOP2 on !SMP Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com> Acked-by: Christoph Lameter <cl@linux.com> Cc: Ingo Molnar <mingo@elte.hu> Cc: Pekka Enberg <penberg@kernel.org> Signed-off-by: Tejun Heo <tj@kernel.org>	2011-03-28 18:06:58 +02:00
Christoph Lameter	d7c3f8cee8	percpu: Omit segment prefix in the UP case for cmpxchg_double Omit the segment prefix in the UP case. GS is not used then and we will generate segfaults if cmpxchg16b is used otherwise. Signed-off-by: Christoph Lameter <cl@linux.com> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2011-03-27 19:25:36 -07:00
Linus Torvalds	047f61c5d1	Merge branch 'release' of git://git.kernel.org/pub/scm/linux/kernel/git/lenb/linux-acpi-2.6 * 'release' of git://git.kernel.org/pub/scm/linux/kernel/git/lenb/linux-acpi-2.6: (42 commits) ACPI: minor printk format change in acpi_pad ACPI: make acpi_pad /sys output more readable ACPICA: Update version to 20110316 ACPICA: Header support for SLIC table ACPI: Make sure the FADT is at least rev 2 before using the reset register ACPI: Bug compatibility for Windows on the ACPI reboot vector ACPICA: Fix access width for reset vector ACPI battery: fribble sysfs files from a resume notifier ACPI button: remove unused procfs I/F ACPI, APEI, Add PCIe AER error information printing support PCIe, AER, use pre-generated prefix in error information printing ACPI, APEI, Add ERST record ID cache ACPI: Use syscore_ops instead of sysdev class and sysdev ACPI: Remove the unused EC sysdev class ACPI: use __cpuinit for the acpi_processor_set_pdc() call tree ACPI: use __init where possible in processor driver Thermal_Framework-Fix_crash_during_hwmon_unregister ACPICA: Update version to 20110211. ACPICA: Add mechanism to defer _REG methods for some installed handlers ACPICA: Add support for FunctionalFixedHW in acpi_ut_get_region_name ...	2011-03-24 08:25:15 -07:00
Linus Torvalds	b81a618dcd	Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs-2.6 * 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs-2.6: deal with races in /proc//{syscall,stack,personality} proc: enable writing to /proc/pid/mem proc: make check_mem_permission() return an mm_struct on success proc: hold cred_guard_mutex in check_mem_permission() proc: disable mem_write after exec mm: implement access_remote_vm mm: factor out main logic of access_process_vm mm: use mm_struct to resolve gate vma's in __get_user_pages mm: arch: rename in_gate_area_no_task to in_gate_area_no_mm mm: arch: make in_gate_area take an mm_struct instead of a task_struct mm: arch: make get_gate_vma take an mm_struct instead of a task_struct x86: mark associated mm when running a task in 32 bit compatibility mode x86: add context tag to mark mm when running a task in 32-bit compatibility mode auxv: require the target to be tracable (or yourself) close race in /proc//environ report errors in /proc//map* sanely pagemap: close races with suid execve make sessionid permissions in /proc//task/ match those in /proc/* fix leaks in path_lookupat() Fix up trivial conflicts in fs/proc/base.c	2011-03-23 20:51:42 -07:00
FUJITA Tomonori	8547727756	remove dma64_addr_t There is no user now. Signed-off-by: FUJITA Tomonori <fujita.tomonori@lab.ntt.co.jp> Cc: David Miller <davem@davemloft.net> Cc: Ralf Baechle <ralf@linux-mips.org> Cc: Richard Henderson <rth@twiddle.net> Cc: Ivan Kokshaysky <ink@jurassic.park.msu.ru> Cc: Matt Turner <mattst88@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2011-03-23 19:47:18 -07:00
Akinobu Mita	61f2e7b0f4	bitops: remove minix bitops from asm/bitops.h minix bit operations are only used by minix filesystem and useless by other modules. Because byte order of inode and block bitmaps is different on each architecture like below: m68k: big-endian 16bit indexed bitmaps h8300, microblaze, s390, sparc, m68knommu: big-endian 32 or 64bit indexed bitmaps m32r, mips, sh, xtensa: big-endian 32 or 64bit indexed bitmaps for big-endian mode little-endian bitmaps for little-endian mode Others: little-endian bitmaps In order to move minix bit operations from asm/bitops.h to architecture independent code in minix filesystem, this provides two config options. CONFIG_MINIX_FS_BIG_ENDIAN_16BIT_INDEXED is only selected by m68k. CONFIG_MINIX_FS_NATIVE_ENDIAN is selected by the architectures which use native byte order bitmaps (h8300, microblaze, s390, sparc, m68knommu, m32r, mips, sh, xtensa). The architectures which always use little-endian bitmaps do not select these options. Finally, we can remove minix bit operations from asm/bitops.h for all architectures. Signed-off-by: Akinobu Mita <akinobu.mita@gmail.com> Acked-by: Arnd Bergmann <arnd@arndb.de> Acked-by: Greg Ungerer <gerg@uclinux.org> Cc: Geert Uytterhoeven <geert@linux-m68k.org> Cc: Roman Zippel <zippel@linux-m68k.org> Cc: Andreas Schwab <schwab@linux-m68k.org> Cc: Martin Schwidefsky <schwidefsky@de.ibm.com> Cc: Heiko Carstens <heiko.carstens@de.ibm.com> Cc: Yoshinori Sato <ysato@users.sourceforge.jp> Cc: Michal Simek <monstr@monstr.eu> Cc: "David S. Miller" <davem@davemloft.net> Cc: Hirokazu Takata <takata@linux-m32r.org> Acked-by: Ralf Baechle <ralf@linux-mips.org> Acked-by: Paul Mundt <lethal@linux-sh.org> Cc: Chris Zankel <chris@zankel.net> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2011-03-23 19:46:22 -07:00
Akinobu Mita	f312eff816	bitops: remove ext2 non-atomic bitops from asm/bitops.h As the result of conversions, there are no users of ext2 non-atomic bit operations except for ext2 filesystem itself. Now we can put them into architecture independent code in ext2 filesystem, and remove from asm/bitops.h for all architectures. Signed-off-by: Akinobu Mita <akinobu.mita@gmail.com> Cc: Jan Kara <jack@suse.cz> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2011-03-23 19:46:21 -07:00
Akinobu Mita	861b5ae7cd	bitops: introduce little-endian bitops for most architectures Introduce little-endian bit operations to the big-endian architectures which do not have native little-endian bit operations and the little-endian architectures. (alpha, avr32, blackfin, cris, frv, h8300, ia64, m32r, mips, mn10300, parisc, sh, sparc, tile, x86, xtensa) These architectures can just include generic implementation (asm-generic/bitops/le.h). Signed-off-by: Akinobu Mita <akinobu.mita@gmail.com> Cc: Richard Henderson <rth@twiddle.net> Cc: Ivan Kokshaysky <ink@jurassic.park.msu.ru> Cc: Mikael Starvik <starvik@axis.com> Cc: David Howells <dhowells@redhat.com> Cc: Yoshinori Sato <ysato@users.sourceforge.jp> Cc: "Luck, Tony" <tony.luck@intel.com> Cc: Ralf Baechle <ralf@linux-mips.org> Cc: Kyle McMartin <kyle@mcmartin.ca> Cc: Matthew Wilcox <willy@debian.org> Cc: Grant Grundler <grundler@parisc-linux.org> Cc: Paul Mundt <lethal@linux-sh.org> Cc: Kazumoto Kojima <kkojima@rr.iij4u.or.jp> Cc: Hirokazu Takata <takata@linux-m32r.org> Cc: "David S. Miller" <davem@davemloft.net> Cc: Chris Zankel <chris@zankel.net> Cc: Ingo Molnar <mingo@elte.hu> Cc: Thomas Gleixner <tglx@linutronix.de> Acked-by: Hans-Christian Egtvedt <hans-christian.egtvedt@atmel.com> Acked-by: "H. Peter Anvin" <hpa@zytor.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2011-03-23 19:46:15 -07:00
Stephen Wilson	c2ef45df3b	x86: add context tag to mark mm when running a task in 32-bit compatibility mode This tag is intended to mirror the thread info TIF_IA32 flag. Will be used to identify mm's which support 32 bit tasks running in compatibility mode without requiring a reference to the task itself. Signed-off-by: Stephen Wilson <wilsons@start.ca> Reviewed-by: Michel Lespinasse <walken@google.com> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Ingo Molnar <mingo@redhat.com> Cc: "H. Peter Anvin" <hpa@zytor.com> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>	2011-03-23 16:36:52 -04:00
Len Brown	02e2407858	Merge branch 'linus' into release Conflicts: arch/x86/kernel/acpi/sleep.c Signed-off-by: Len Brown <len.brown@intel.com>	2011-03-23 02:34:54 -04:00
Len Brown	5c129a8600	Merge commit 'v2.6.38' into release	2011-03-23 02:33:54 -04:00
David Rientjes	1c00f0161f	x86: allow CONFIG_ISA_DMA_API to be disabled Not all 64-bit systems require ISA-style DMA, so allow it to be configurable. x86 utilizes the generic ISA DMA allocator from kernel/dma.c, so require it only when CONFIG_ISA_DMA_API is enabled. Disabling CONFIG_ISA_DMA_API is dependent on x86_64 since those machines do not have ISA slots and benefit the most from disabling the option (and on CONFIG_EXPERT as required by H. Peter Anvin). When disabled, this also avoids declaring claim_dma_lock(), release_dma_lock(), request_dma(), and free_dma() since those interfaces will no longer be provided. Signed-off-by: David Rientjes <rientjes@google.com> Cc: Ingo Molnar <mingo@redhat.com> Cc: "H. Peter Anvin" <hpa@zytor.com> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Bjorn Helgaas <bjorn.helgaas@hp.com> Cc: Russell King <linux@arm.linux.org.uk> Cc: "Luck, Tony" <tony.luck@intel.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2011-03-22 17:44:16 -07:00
FUJITA Tomonori	3e50594e8e	add the common dma_addr_t typedef to include/linux/types.h All architectures can use the common dma_addr_t typedef now. We can remove the arch specific dma_addr_t. Signed-off-by: FUJITA Tomonori <fujita.tomonori@lab.ntt.co.jp> Acked-by: Arnd Bergmann <arnd@arndb.de> Cc: Ingo Molnar <mingo@elte.hu> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: "H. Peter Anvin" <hpa@zytor.com> Cc: Ivan Kokshaysky <ink@jurassic.park.msu.ru> Cc: Richard Henderson <rth@twiddle.net> Cc: Matt Turner <mattst88@gmail.com> Cc: "Luck, Tony" <tony.luck@intel.com> Cc: Ralf Baechle <ralf@linux-mips.org> Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org> Cc: Heiko Carstens <heiko.carstens@de.ibm.com> Cc: Martin Schwidefsky <schwidefsky@de.ibm.com> Cc: Chris Metcalf <cmetcalf@tilera.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2011-03-22 17:44:09 -07:00
Eric Dumazet	b6a84016bd	mm: NUMA aware alloc_thread_info_node() Add a node parameter to alloc_thread_info(), and change its name to alloc_thread_info_node() This change is needed to allow NUMA aware kthread_create_on_cpu() Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com> Acked-by: David S. Miller <davem@davemloft.net> Reviewed-by: Andi Kleen <ak@linux.intel.com> Acked-by: Rusty Russell <rusty@rustcorp.com.au> Cc: Tejun Heo <tj@kernel.org> Cc: Tony Luck <tony.luck@intel.com> Cc: Fenghua Yu <fenghua.yu@intel.com> Cc: David Howells <dhowells@redhat.com> Cc: <linux-arch@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2011-03-22 17:44:01 -07:00
Sage Weil	b7ed78f565	introduce sys_syncfs to sync a single file system It is frequently useful to sync a single file system, instead of all mounted file systems via sync(2): - On machines with many mounts, it is not at all uncommon for some of them to hang (e.g. unresponsive NFS server). sync(2) will get stuck on those and may never get to the one you do care about (e.g., /). - Some applications write lots of data to the file system and then want to make sure it is flushed to disk. Calling fsync(2) on each file introduces unnecessary ordering constraints that result in a large amount of sub-optimal writeback/flush/commit behavior by the file system. There are currently two ways (that I know of) to sync a single super_block: - BLKFLSBUF ioctl on the block device: That also invalidates the bdev mapping, which isn't usually desirable, and doesn't work for non-block file systems. - 'mount -o remount,rw' will call sync_filesystem as an artifact of the current implemention. Relying on this little-known side effect for something like data safety sounds foolish. Both of these approaches require root privileges, which some applications do not have (nor should they need?) given that sync(2) is an unprivileged operation. This patch introduces a new system call syncfs(2) that takes an fd and syncs only the file system it references. Maybe someday we can $ sync /some/path and not get sync: ignoring all arguments The syscall is motivated by comments by Al and Christoph at the last LSF. syncfs(2) seems like an appropriate name given statfs(2). A similar ioctl was also proposed a while back, see http://marc.info/?l=linux-fsdevel&m=127970513829285&w=2 Signed-off-by: Sage Weil <sage@newdream.net> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>	2011-03-21 00:40:29 -04:00
Len Brown	05534c9ffc	Merge branch 'acpica' into release	2011-03-18 18:06:08 -04:00
Linus Torvalds	f2e1fbb5f2	Merge branch 'x86-fixes-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip * 'x86-fixes-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip: x86: Flush TLB if PGD entry is changed in i386 PAE mode x86, dumpstack: Correct stack dump info when frame pointer is available x86: Clean up csum-copy_64.S a bit x86: Fix common misspellings x86: Fix misspelling and align params x86: Use PentiumPro-optimized partial_csum() on VIA C7	2011-03-18 10:45:21 -07:00
Shaohua Li	4981d01ead	x86: Flush TLB if PGD entry is changed in i386 PAE mode According to intel CPU manual, every time PGD entry is changed in i386 PAE mode, we need do a full TLB flush. Current code follows this and there is comment for this too in the code. But current code misses the multi-threaded case. A changed page table might be used by several CPUs, every such CPU should flush TLB. Usually this isn't a problem, because we prepopulate all PGD entries at process fork. But when the process does munmap and follows new mmap, this issue will be triggered. When it happens, some CPUs keep doing page faults: http://marc.info/?l=linux-kernel&m=129915020508238&w=2 Reported-by: Yasunori Goto<y-goto@jp.fujitsu.com> Tested-by: Yasunori Goto<y-goto@jp.fujitsu.com> Reviewed-by: Rik van Riel <riel@redhat.com> Signed-off-by: Shaohua Li<shaohua.li@intel.com> Cc: Mallick Asit K <asit.k.mallick@intel.com> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Andrew Morton <akpm@linux-foundation.org> Cc: linux-mm <linux-mm@kvack.org> Cc: stable <stable@kernel.org> LKML-Reference: <1300246649.2337.95.camel@sli10-conroe> Signed-off-by: Ingo Molnar <mingo@elte.hu>	2011-03-18 11:44:01 +01:00
Namhyung Kim	e8e999cf3c	x86, dumpstack: Correct stack dump info when frame pointer is available Current stack dump code scans entire stack and check each entry contains a pointer to kernel code. If CONFIG_FRAME_POINTER=y it could mark whether the pointer is valid or not based on value of the frame pointer. Invalid entries could be preceded by '?' sign. However this was not going to happen because scan start point was always higher than the frame pointer so that they could not meet. Commit `9c0729dc80` ("x86: Eliminate bp argument from the stack tracing routines") delayed bp acquisition point, so the bp was read in lower frame, thus all of the entries were marked invalid. This patch fixes this by reverting above commit while retaining stack_frame() helper as suggested by Frederic Weisbecker. End result looks like below: before: [ 3.508329] Call Trace: [ 3.508551] [<ffffffff814f35c9>] ? panic+0x91/0x199 [ 3.508662] [<ffffffff814f3739>] ? printk+0x68/0x6a [ 3.508770] [<ffffffff81a981b2>] ? mount_block_root+0x257/0x26e [ 3.508876] [<ffffffff81a9821f>] ? mount_root+0x56/0x5a [ 3.508975] [<ffffffff81a98393>] ? prepare_namespace+0x170/0x1a9 [ 3.509216] [<ffffffff81a9772b>] ? kernel_init+0x1d2/0x1e2 [ 3.509335] [<ffffffff81003894>] ? kernel_thread_helper+0x4/0x10 [ 3.509442] [<ffffffff814f6880>] ? restore_args+0x0/0x30 [ 3.509542] [<ffffffff81a97559>] ? kernel_init+0x0/0x1e2 [ 3.509641] [<ffffffff81003890>] ? kernel_thread_helper+0x0/0x10 after: [ 3.522991] Call Trace: [ 3.523351] [<ffffffff814f35b9>] panic+0x91/0x199 [ 3.523468] [<ffffffff814f3729>] ? printk+0x68/0x6a [ 3.523576] [<ffffffff81a981b2>] mount_block_root+0x257/0x26e [ 3.523681] [<ffffffff81a9821f>] mount_root+0x56/0x5a [ 3.523780] [<ffffffff81a98393>] prepare_namespace+0x170/0x1a9 [ 3.523885] [<ffffffff81a9772b>] kernel_init+0x1d2/0x1e2 [ 3.523987] [<ffffffff81003894>] kernel_thread_helper+0x4/0x10 [ 3.524228] [<ffffffff814f6880>] ? restore_args+0x0/0x30 [ 3.524345] [<ffffffff81a97559>] ? kernel_init+0x0/0x1e2 [ 3.524445] [<ffffffff81003890>] ? kernel_thread_helper+0x0/0x10 -v5: * fix build breakage with oprofile -v4: * use 0 instead of regs->bp * separate out printk changes -v3: * apply comment from Frederic * add a couple of printk fixes Signed-off-by: Namhyung Kim <namhyung@gmail.com> Acked-by: Peter Zijlstra <a.p.zijlstra@chello.nl> Acked-by: Frederic Weisbecker <fweisbec@gmail.com> Cc: Soren Sandmann <ssp@redhat.com> Cc: Paul Mackerras <paulus@samba.org> Cc: Arnaldo Carvalho de Melo <acme@redhat.com> Cc: Robert Richter <robert.richter@amd.com> LKML-Reference: <1300416006-3163-1-git-send-email-namhyung@gmail.com> Signed-off-by: Ingo Molnar <mingo@elte.hu>	2011-03-18 10:51:42 +01:00
Lucas De Marchi	0d2eb44f63	x86: Fix common misspellings They were generated by 'codespell' and then manually reviewed. Signed-off-by: Lucas De Marchi <lucas.demarchi@profusion.mobi> Cc: trivial@kernel.org LKML-Reference: <1300389856-1099-3-git-send-email-lucas.demarchi@profusion.mobi> Signed-off-by: Ingo Molnar <mingo@elte.hu>	2011-03-18 10:39:30 +01:00
Ingo Molnar	8dd8997d2c	Merge branch 'linus' into x86/urgent Merge reason: Merge upstream commits to avoid conflicts in upcoming patches. Signed-off-by: Ingo Molnar <mingo@elte.hu>	2011-03-18 10:39:00 +01:00
Xiao Guangrong	0f53b5b1c0	KVM: MMU: cleanup pte write path This patch does: - call vcpu->arch.mmu.update_pte directly - use gfn_to_pfn_atomic in update_pte path The suggestion is from Avi. Signed-off-by: Xiao Guangrong <xiaoguangrong@cn.fujitsu.com> Signed-off-by: Avi Kivity <avi@redhat.com>	2011-03-17 13:08:35 -03:00
Gleb Natapov	5601d05b8c	KVM: emulator: Fix io permission checking for 64bit guest Current implementation truncates upper 32bit of TR base address during IO permission bitmap check. The patch fixes this. Reported-and-tested-by: Francis Moreau <francis.moro@gmail.com> Signed-off-by: Gleb Natapov <gleb@redhat.com> Signed-off-by: Avi Kivity <avi@redhat.com>	2011-03-17 13:08:33 -03:00
Xiao Guangrong	49b26e26e4	KVM: MMU: do not record gfn in kvm_mmu_pte_write No need to record the gfn to verifier the pte has the same mode as current vcpu, it's because we only speculatively update the pte only if the pte and vcpu have the same mode Signed-off-by: Xiao Guangrong <xiaoguangrong@cn.fujitsu.com> Signed-off-by: Avi Kivity <avi@redhat.com>	2011-03-17 13:08:32 -03:00
Jan Kiszka	038f8c110e	KVM: x86: Convert tsc_write_lock to raw_spinlock Code under this lock requires non-preemptibility. Ensure this also over -rt by converting it to raw spinlock. Signed-off-by: Jan Kiszka <jan.kiszka@siemens.com> Signed-off-by: Avi Kivity <avi@redhat.com>	2011-03-17 13:08:30 -03:00
Jan Kiszka	e935b8372c	KVM: Convert kvm_lock to raw_spinlock Code under this lock requires non-preemptibility. Ensure this also over -rt by converting it to raw spinlock. Signed-off-by: Jan Kiszka <jan.kiszka@siemens.com> Signed-off-by: Avi Kivity <avi@redhat.com>	2011-03-17 13:08:30 -03:00
Avi Kivity	d867162c6d	KVM: x86 emulator: vendor specific instructions Mark some instructions as vendor specific, and allow the caller to request emulation only of vendor specific instructions. This is useful in some circumstances (responding to a #UD fault). Signed-off-by: Avi Kivity <avi@redhat.com> Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>	2011-03-17 13:08:28 -03:00
john cooper	91c9c3eda4	KVM: x86: handle guest access to BBL_CR_CTL3 MSR A correction to Intel cpu model CPUID data (patch queued) caused winxp to BSOD when booted with a Penryn model. This was traced to the CPUID "model" field correction from 6 -> 23 (as is proper for a Penryn class of cpu). Only in this case does the problem surface. The cause for this failure is winxp accessing the BBL_CR_CTL3 MSR which is unsupported by current kvm, appears to be a legacy MSR not fully characterized yet existing in current silicon, and is apparently carried forward in MSR space to accommodate vintage code as here. It is not yet conclusive whether this MSR implements any of its legacy functionality or is just an ornamental dud for compatibility. While I found no silicon version specific documentation link to this MSR, a general description exists in Intel's developer's reference which agrees with the functional behavior of other bootloader/kernel code I've examined accessing BBL_CR_CTL3. Regrettably winxp appears to be setting bit #19 called out as "reserved" in the above document. So to minimally accommodate this MSR, kvm msr get will provide the equivalent mock data and kvm msr write will simply toss the guest passed data without interpretation. While this treatment of BBL_CR_CTL3 addresses the immediate problem, the approach may be modified pending clarification from Intel. Signed-off-by: john cooper <john.cooper@redhat.com> Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>	2011-03-17 13:08:27 -03:00
Linus Torvalds	41e0e0738c	Merge branch 'x86-fixes-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip * 'x86-fixes-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip: x86, AMD: Set ARAT feature on AMD processors x86, quirk: Fix SB600 revision check x86: stop_machine_text_poke() should issue sync_core() x86, amd-nb: Misc cleanliness fixes	2011-03-16 10:14:56 -07:00
Linus Torvalds	e7fd3b4669	Merge branch 'x86-trampoline-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip * 'x86-trampoline-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip: x86: Fix binutils-2.21 symbol related build failures x86-64, trampoline: Remove unused variable x86, reboot: Fix the use of passed arguments in 32-bit BIOS reboot x86, reboot: Move the real-mode reboot code to an assembly file x86: Make the GDT_ENTRY() macro in <asm/segment.h> safe for assembly x86, trampoline: Use the unified trampoline setup for ACPI wakeup x86, trampoline: Common infrastructure for low memory trampolines Fix up trivial conflicts in arch/x86/kernel/Makefile	2011-03-16 10:10:02 -07:00
Ingo Molnar	344c21c322	Merge branch 'x86/amd-nb' into x86/urgent Merge reason: This is one followup commit that was not in x86/mm - merge it via the urgent path Signed-off-by: Ingo Molnar <mingo@elte.hu>	2011-03-16 16:34:01 +01:00
Linus Torvalds	79d8a8f736	Merge branch 'for-2.6.39' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/percpu * 'for-2.6.39' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/percpu: percpu, x86: Add arch-specific this_cpu_cmpxchg_double() support percpu: Generic support for this_cpu_cmpxchg_double() alpha: use L1_CACHE_BYTES for cacheline size in the linker script percpu: align percpu readmostly subsection to cacheline Fix up trivial conflict in arch/x86/kernel/vmlinux.lds.S due to the percpu alignment having changed ("x86: Reduce back the alignment of the per-CPU data section")	2011-03-16 08:22:41 -07:00
Linus Torvalds	d10902812c	Merge branch 'x86-platform-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip * 'x86-platform-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip: (27 commits) x86: Clean up apic.c and apic.h x86: Remove superflous goal definition of tsc_sync x86: dt: Correct local apic documentation in device tree bindings x86: dt: Cleanup local apic setup x86: dt: Fix OLPC=y/INTEL_CE=n build rtc: cmos: Add OF bindings x86: ce4100: Use OF to setup devices x86: ioapic: Add OF bindings for IO_APIC x86: dtb: Add generic bus probe x86: dtb: Add support for PCI devices backed by dtb nodes x86: dtb: Add device tree support for HPET x86: dtb: Add early parsing of IO_APIC x86: dtb: Add irq domain abstraction x86: dtb: Add a device tree for CE4100 x86: Add device tree support x86: e820: Remove conditional early mapping in parse_e820_ext x86: OLPC: Make OLPC=n build again x86: OLPC: Remove extra OLPC_OPENFIRMWARE_DT indirection x86: OLPC: Cleanup config maze completely x86: OLPC: Hide OLPC_OPENFIRMWARE config switch ... Fix up conflicts in arch/x86/platform/ce4100/ce4100.c	2011-03-15 20:01:36 -07:00
Linus Torvalds	181f977d13	Merge branch 'x86-mm-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip * 'x86-mm-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip: (93 commits) x86, tlb, UV: Do small micro-optimization for native_flush_tlb_others() x86-64, NUMA: Don't call numa_set_distanc() for all possible node combinations during emulation x86-64, NUMA: Don't assume phys node 0 is always online in numa_emulation() x86-64, NUMA: Clean up initmem_init() x86-64, NUMA: Fix numa_emulation code with node0 without RAM x86-64, NUMA: Revert NUMA affine page table allocation x86: Work around old gas bug x86-64, NUMA: Better explain numa_distance handling x86-64, NUMA: Fix distance table handling mm: Move early_node_map[] reverse scan helpers under HAVE_MEMBLOCK x86-64, NUMA: Fix size of numa_distance array x86: Rename e820_table_* to pgt_buf_* bootmem: Move __alloc_memory_core_early() to nobootmem.c bootmem: Move contig_page_data definition to bootmem.c/nobootmem.c bootmem: Separate out CONFIG_NO_BOOTMEM code into nobootmem.c x86-64, NUMA: Seperate out numa_alloc_distance() from numa_set_distance() x86-64, NUMA: Add proper function comments to global functions x86-64, NUMA: Move NUMA emulation into numa_emulation.c x86-64, NUMA: Prepare numa_emulation() for moving NUMA emulation into a separate file x86-64, NUMA: Do not scan two times for setup_node_bootmem() ... Fix up conflicts in arch/x86/kernel/smpboot.c	2011-03-15 19:49:10 -07:00
Linus Torvalds	5f6fb45466	Merge branch 'irq-core-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip * 'irq-core-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip: (116 commits) x86: Enable forced interrupt threading support x86: Mark low level interrupts IRQF_NO_THREAD x86: Use generic show_interrupts x86: ioapic: Avoid redundant lookup of irq_cfg x86: ioapic: Use new move_irq functions x86: Use the proper accessors in fixup_irqs() x86: ioapic: Use irq_data->state x86: ioapic: Simplify irq chip and handler setup x86: Cleanup the genirq name space genirq: Add chip flag to force mask on suspend genirq: Add desc->irq_data accessor genirq: Add comments to Kconfig switches genirq: Fixup fasteoi handler for oneshot mode genirq: Provide forced interrupt threading sched: Switch wait_task_inactive to schedule_hrtimeout() genirq: Add IRQF_NO_THREAD genirq: Allow shared oneshot interrupts genirq: Prepare the handling of shared oneshot interrupts genirq: Make warning in handle_percpu_event useful x86: ioapic: Move trigger defines to io_apic.h ... Fix up trivial(?) conflicts in arch/x86/pci/xen.c due to genirq name space changes clashing with the Xen cleanups. The set_irq_msi() had moved to xen_bind_pirq_msi_to_irq().	2011-03-15 19:23:40 -07:00
Linus Torvalds	502f4d4f74	Merge branch 'x86-cpu-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip * 'x86-cpu-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip: x86: Fix and clean up generic_processor_info() x86: Don't copy per_cpu cpuinfo for BSP two times x86: Move llc_shared_map out of cpu_info	2011-03-15 19:00:53 -07:00
Linus Torvalds	da849abeb8	Merge branch 'x86-asm-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip * 'x86-asm-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip: x86, binutils, xen: Fix another wrong size directive x86: Remove dead config option X86_CPU x86: Really print supported CPUs if PROCESSOR_SELECT=y x86: Fix a bogus unwind annotation in lib/semaphore_32.S um, x86-64: Fix UML build after adding CFI annotations to lib/rwsem_64.S x86: Remove unused bits from lib/thunk_*.S x86: Use {push,pop}_cfi in more places x86-64: Add CFI annotations to lib/rwsem_64.S x86, asm: Cleanup unnecssary macros in asm-offsets.c x86, system.h: Drop unused __SAVE/__RESTORE macros x86: Use bitmap library functions x86: Partly unify asm-offsets_{32,64}.c x86: Reduce back the alignment of the per-CPU data section	2011-03-15 18:59:56 -07:00
Linus Torvalds	420c1c572d	Merge branch 'timers-core-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip * 'timers-core-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip: (62 commits) posix-clocks: Check write permissions in posix syscalls hrtimer: Remove empty hrtimer_init_hres_timer() hrtimer: Update hrtimer->state documentation hrtimer: Update base[CLOCK_BOOTTIME].offset correctly timers: Export CLOCK_BOOTTIME via the posix timers interface timers: Add CLOCK_BOOTTIME hrtimer base time: Extend get_xtime_and_monotonic_offset() to also return sleep time: Introduce get_monotonic_boottime and ktime_get_boottime hrtimers: extend hrtimer base code to handle more then 2 clockids ntp: Remove redundant and incorrect parameter check mn10300: Switch do_timer() to xtimer_update() posix clocks: Introduce dynamic clocks posix-timers: Cleanup namespace posix-timers: Add support for fd based clocks x86: Add clock_adjtime for x86 posix-timers: Introduce a syscall for clock tuning. time: Splitout compat timex accessors ntp: Add ADJ_SETOFFSET mode bit time: Introduce timekeeping_inject_offset posix-timer: Update comment ... Fix up new system-call-related conflicts in arch/x86/ia32/ia32entry.S arch/x86/include/asm/unistd_32.h arch/x86/include/asm/unistd_64.h arch/x86/kernel/syscall_table_32.S (name_to_handle_at()/open_by_handle_at() vs clock_adjtime()), and some due to movement of get_jiffies_64() in: kernel/time.c	2011-03-15 18:53:35 -07:00
Linus Torvalds	a926021cb1	Merge branch 'perf-core-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip * 'perf-core-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip: (184 commits) perf probe: Clean up probe_point_lazy_walker() return value tracing: Fix irqoff selftest expanding max buffer tracing: Align 4 byte ints together in struct tracer tracing: Export trace_set_clr_event() tracing: Explain about unstable clock on resume with ring buffer warning ftrace/graph: Trace function entry before updating index ftrace: Add .ref.text as one of the safe areas to trace tracing: Adjust conditional expression latency formatting. tracing: Fix event alignment: skb:kfree_skb tracing: Fix event alignment: mce:mce_record tracing: Fix event alignment: kvm:kvm_hv_hypercall tracing: Fix event alignment: module:module_request tracing: Fix event alignment: ftrace:context_switch and ftrace:wakeup tracing: Remove lock_depth from event entry perf header: Stop using 'self' perf session: Use evlist/evsel for managing perf.data attributes perf top: Don't let events to eat up whole header line perf top: Fix events overflow in top command ring-buffer: Remove unused #include <linux/trace_irq.h> tracing: Add an 'overwrite' trace_option. ...	2011-03-15 18:31:30 -07:00
Linus Torvalds	0586bed3e8	Merge branch 'core-locking-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip * 'core-locking-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip: rtmutex: tester: Remove the remaining BKL leftovers lockdep/timers: Explain in detail the locking problems del_timer_sync() may cause rtmutex: Simplify PI algorithm and make highest prio task get lock rwsem: Remove redundant asmregparm annotation rwsem: Move duplicate function prototypes to linux/rwsem.h rwsem: Unify the duplicate rwsem_is_locked() inlines rwsem: Move duplicate init macros and functions to linux/rwsem.h rwsem: Move duplicate struct rwsem declaration to linux/rwsem.h x86: Cleanup rwsem_count_t typedef rwsem: Cleanup includes locking: Remove deprecated lock initializers cred: Replace deprecated spinlock initialization kthread: Replace deprecated spinlock initialization xtensa: Replace deprecated spinlock initialization um: Replace deprecated spinlock initialization sparc: Replace deprecated spinlock initialization mips: Replace deprecated spinlock initialization cris: Replace deprecated spinlock initialization alpha: Replace deprecated spinlock initialization rtmutex-tester: Remove BKL tests	2011-03-15 18:28:30 -07:00
Linus Torvalds	b80cd62b7d	Merge branch 'core-futexes-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip * 'core-futexes-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip: arm: Remove bogus comment in futex_atomic_cmpxchg_inatomic() futex: Deobfuscate handle_futex_death() plist: Add priority list test plist: Shrink struct plist_head futex,plist: Remove debug lock assignment from plist_node futex,plist: Pass the real head of the priority list to plist_del() futex: Sanitize futex ops argument types futex: Sanitize cmpxchg_futex_value_locked API futex: Remove redundant pagefault_disable in futex_atomic_cmpxchg_inatomic() futex: Avoid redudant evaluation of task_pid_vnr() futex: Update futex_wait_setup comments about locking	2011-03-15 18:23:52 -07:00
Linus Torvalds	422e6c4bc4	Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs-2.6 * 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs-2.6: (57 commits) tidy the trailing symlinks traversal up Turn resolution of trailing symlinks iterative everywhere simplify link_path_walk() tail Make trailing symlink resolution in path_lookupat() iterative update nd->inode in __do_follow_link() instead of after do_follow_link() pull handling of one pathname component into a helper fs: allow AT_EMPTY_PATH in linkat(), limit that to CAP_DAC_READ_SEARCH Allow passing O_PATH descriptors via SCM_RIGHTS datagrams readlinkat(), fchownat() and fstatat() with empty relative pathnames Allow O_PATH for symlinks New kind of open files - "location only". ext4: Copy fs UUID to superblock ext3: Copy fs UUID to superblock. vfs: Export file system uuid via /proc/<pid>/mountinfo unistd.h: Add new syscalls numbers to asm-generic x86: Add new syscalls for x86_64 x86: Add new syscalls for x86_32 fs: Remove i_nlink check from file system link callback fs: Don't allow to create hardlink for deleted file vfs: Add open by file handle support ...	2011-03-15 15:48:13 -07:00
Dan Williams	5d94e81f69	x86: Introduce pci_map_biosrom() The isci driver needs to retrieve its preboot OROM image which contains necessary runtime parameters like platform specific sas addresses and phy configuration. There is no ROM BAR associated with this area, instead we will need to scan legacy expansion ROM space. 1/ Promote the probe_roms_32 implementation to x86-64 2/ Add a facility to find and map an adapter rom by pci device (according to PCI Firmware Specification Revision 3.0) Signed-off-by: Dave Jiang <dave.jiang@intel.com> LKML-Reference: <20110308183226.6246.90354.stgit@localhost6.localdomain6> Signed-off-by: Dan Williams <dan.j.williams@intel.com> Signed-off-by: H. Peter Anvin <hpa@linux.intel.com>	2011-03-15 15:34:15 -07:00
Daniel Drake	45bb1674b9	x86, olpc: Use device tree for platform identification Make OLPC fully depend on device tree, and use it to identify the OLPC platform details. Some nodes are exposed as platform devices where we plan to use device tree for device probing. Signed-off-by: Daniel Drake <dsd@laptop.org> Acked-by: Grant Likely <grant.likely@secretlab.ca> LKML-Reference: <20110313151017.C255F9D401E@zog.reactivated.net> Signed-off-by: H. Peter Anvin <hpa@linux.intel.com>	2011-03-15 14:17:23 -07:00
Linus Torvalds	76ca078328	Merge branch 'for-linus' of git://xenbits.xen.org/people/sstabellini/linux-pvhvm * 'for-linus' of git://xenbits.xen.org/people/sstabellini/linux-pvhvm: xen: suspend: remove xen_hvm_suspend xen: suspend: pull pre/post suspend hooks out into suspend_info xen: suspend: move arch specific pre/post suspend hooks into generic hooks xen: suspend: refactor non-arch specific pre/post suspend hooks xen: suspend: add "arch" to pre/post suspend hooks xen: suspend: pass extra hypercall argument via suspend_info struct xen: suspend: refactor cancellation flag into a structure xen: suspend: use HYPERVISOR_suspend for PVHVM case instead of open coding xen: switch to new schedop hypercall by default. xen: use new schedop interface for suspend xen: do not respond to unknown xenstore control requests xen: fix compile issue if XEN is enabled but XEN_PVHVM is disabled xen: PV on HVM: support PV spinlocks and IPIs xen: make the ballon driver work for hvm domains xen-blkfront: handle Xen major numbers other than XENVBD xen: do not use xen_info on HVM, set pv_info name to "Xen HVM" xen: no need to delay xen_setup_shutdown_event for hvm guests anymore	2011-03-15 10:59:09 -07:00
Linus Torvalds	397fae0818	Merge branches 'stable/irq.rework' and 'stable/pcifront-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/konrad/xen * 'stable/irq.rework' of git://git.kernel.org/pub/scm/linux/kernel/git/konrad/xen: xen/irq: Cleanup up the pirq_to_irq for DomU PV PCI passthrough guests as well. xen: Use IRQF_FORCE_RESUME xen/timer: Missing IRQF_NO_SUSPEND in timer code broke suspend. xen: Fix compile error introduced by "switch to new irq_chip functions" xen: Switch to new irq_chip functions xen: Remove stale irq_chip.end xen: events: do not free legacy IRQs xen: events: allocate GSIs and dynamic IRQs from separate IRQ ranges. xen: events: add xen_allocate_irq_{dynamic, gsi} and xen_free_irq xen:events: move find_unbound_irq inside CONFIG_PCI_MSI xen: handled remapped IRQs when enabling a pcifront PCI device. genirq: Add IRQF_FORCE_RESUME * 'stable/pcifront-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/konrad/xen: pci/xen: When free-ing MSI-X/MSI irq->desc also use generic code. pci/xen: Cleanup: convert int** to int[] pci/xen: Use xen_allocate_pirq_msi instead of xen_allocate_pirq xen-pcifront: Sanity check the MSI/MSI-X values xen-pcifront: don't use flush_scheduled_work()	2011-03-15 10:47:16 -07:00
Linus Torvalds	c7146dd009	Merge branches 'stable/p2m-identity.v4.9.1' and 'stable/e820' of git://git.kernel.org/pub/scm/linux/kernel/git/konrad/xen * 'stable/p2m-identity.v4.9.1' of git://git.kernel.org/pub/scm/linux/kernel/git/konrad/xen: xen/m2p: Check whether the MFN has IDENTITY_FRAME bit set.. xen/m2p: No need to catch exceptions when we know that there is no RAM xen/debug: WARN_ON when identity PFN has no _PAGE_IOMAP flag set. xen/debugfs: Add 'p2m' file for printing out the P2M layout. xen/setup: Set identity mapping for non-RAM E820 and E820 gaps. xen/mmu: WARN_ON when racing to swap middle leaf. xen/mmu: Set _PAGE_IOMAP if PFN is an identity PFN. xen/mmu: Add the notion of identity (1-1) mapping. xen: Mark all initial reserved pages for the balloon as INVALID_P2M_ENTRY. * 'stable/e820' of git://git.kernel.org/pub/scm/linux/kernel/git/konrad/xen: xen/e820: Don't mark balloon memory as E820_UNUSABLE when running as guest and fix overflow. xen/setup: Inhibit resource API from using System RAM E820 gaps as PCI mem gaps.	2011-03-15 10:32:15 -07:00
Ingo Molnar	8460b3e5bc	Merge commit 'v2.6.38' into x86/mm Conflicts: arch/x86/mm/numa_64.c Merge reason: Resolve the conflict, update the branch to .38. Signed-off-by: Ingo Molnar <mingo@elte.hu>	2011-03-15 08:29:44 +01:00
Aneesh Kumar K.V	6aae5f2b20	x86: Add new syscalls for x86_64 This patch add new syscalls to x86_64 Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>	2011-03-15 02:21:44 -04:00
Aneesh Kumar K.V	7dadb755b0	x86: Add new syscalls for x86_32 This patch adds new syscalls to x86_32 Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>	2011-03-15 02:21:44 -04:00
Linus Torvalds	869c34f520	Merge branch 'x86-fixes-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip * 'x86-fixes-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip: x86: ce4100: Set pci ops via callback instead of module init x86/mm: Fix pgd_lock deadlock x86/mm: Handle mm_fault_error() in kernel space x86: Don't check for BIOS corruption in first 64K when there's no need to	2011-03-14 15:19:09 -07:00
Stefano Stabellini	706cc9d2a4	xen/m2p: Check whether the MFN has IDENTITY_FRAME bit set.. If there is no proper PFN value in the M2P for the MFN (so we get 0xFFFFF.. or 0x55555, or 0x0), we should consult the M2P override to see if there is an entry for this. [Note: we also consult the M2P override if the MFN is past our machine_to_phys size]. We consult the P2M with the PFN. In case the returned MFN is one of the special values: 0xFFF.., 0x5555 (which signify that the MFN can be either "missing" or it belongs to DOMID_IO) or the p2m(m2p(mfn)) != mfn, we check the M2P override. If we fail the M2P override check, we reset the PFN value to INVALID_P2M_ENTRY. Next we try to find the MFN in the P2M using the MFN value (not the PFN value) and if found, we know that this MFN is an identity value and return it as so. Otherwise we have exhausted all the posibilities and we return the PFN, which at this stage can either be a real PFN value found in the machine_to_phys.. array, or INVALID_P2M_ENTRY value. [v1: Added Review-by tag] Reviewed-by: Ian Campbell <ian.campbell@citrix.com> Signed-off-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>	2011-03-14 11:17:14 -04:00
Konrad Rzeszutek Wilk	146c4e5117	xen/m2p: No need to catch exceptions when we know that there is no RAM .. beyound what we think is the end of memory. However there might be more System RAM - but assigned to a guest. Hence jump to the M2P override check and consult. [v1: Added Review-by tag] Reviewed-by: Ian Campbell <ian.campbell@citrix.com> Signed-off-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>	2011-03-14 11:17:13 -04:00
Konrad Rzeszutek Wilk	2222e71bd6	xen/debugfs: Add 'p2m' file for printing out the P2M layout. We walk over the whole P2M tree and construct a simplified view of which PFN regions belong to what level and what type they are. Only enabled if CONFIG_XEN_DEBUG_FS is set. [v2: UNKN->UNKNOWN, use uninitialized_var] [v3: Rebased on top of mmu->p2m code split] [v4: Fixed the else if] Reviewed-by: Ian Campbell <Ian.Campbell@eu.citrix.com> Signed-off-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>	2011-03-14 11:17:11 -04:00
Konrad Rzeszutek Wilk	f4cec35b0d	xen/mmu: Add the notion of identity (1-1) mapping. Our P2M tree structure is a three-level. On the leaf nodes we set the Machine Frame Number (MFN) of the PFN. What this means is that when one does: pfn_to_mfn(pfn), which is used when creating PTE entries, you get the real MFN of the hardware. When Xen sets up a guest it initially populates a array which has descending (or ascending) MFN values, as so: idx: 0, 1, 2 [0x290F, 0x290E, 0x290D, ..] so pfn_to_mfn(2)==0x290D. If you start, restart many guests that list starts looking quite random. We graft this structure on our P2M tree structure and stick in those MFN in the leafs. But for all other leaf entries, or for the top root, or middle one, for which there is a void entry, we assume it is "missing". So pfn_to_mfn(0xc0000)=INVALID_P2M_ENTRY. We add the possibility of setting 1-1 mappings on certain regions, so that: pfn_to_mfn(0xc0000)=0xc0000 The benefit of this is, that we can assume for non-RAM regions (think PCI BARs, or ACPI spaces), we can create mappings easily b/c we get the PFN value to match the MFN. For this to work efficiently we introduce one new page p2m_identity and allocate (via reserved_brk) any other pages we need to cover the sides (1GB or 4MB boundary violations). All entries in p2m_identity are set to INVALID_P2M_ENTRY type (Xen toolstack only recognizes that and MFNs, no other fancy value). On lookup we spot that the entry points to p2m_identity and return the identity value instead of dereferencing and returning INVALID_P2M_ENTRY. If the entry points to an allocated page, we just proceed as before and return the PFN. If the PFN has IDENTITY_FRAME_BIT set we unmask that in appropriate functions (pfn_to_mfn). The reason for having the IDENTITY_FRAME_BIT instead of just returning the PFN is that we could find ourselves where pfn_to_mfn(pfn)==pfn for a non-identity pfn. To protect ourselves against we elect to set (and get) the IDENTITY_FRAME_BIT on all identity mapped PFNs. This simplistic diagram is used to explain the more subtle piece of code. There is also a digram of the P2M at the end that can help. Imagine your E820 looking as so: 1GB 2GB /-------------------+---------\/----\ /----------\ /---+-----\ \| System RAM \| Sys RAM \|\|ACPI\| \| reserved \| \| Sys RAM \| \-------------------+---------/\----/ \----------/ \---+-----/ ^- 1029MB ^- 2001MB [1029MB = 263424 (0x40500), 2001MB = 512256 (0x7D100), 2048MB = 524288 (0x80000)] And dom0_mem=max:3GB,1GB is passed in to the guest, meaning memory past 1GB is actually not present (would have to kick the balloon driver to put it in). When we are told to set the PFNs for identity mapping (see patch: "xen/setup: Set identity mapping for non-RAM E820 and E820 gaps.") we pass in the start of the PFN and the end PFN (263424 and 512256 respectively). The first step is to reserve_brk a top leaf page if the p2m[1] is missing. The top leaf page covers 512^2 of page estate (1GB) and in case the start or end PFN is not aligned on 512^2*PAGE_SIZE (1GB) we loop on aligned 1GB PFNs from start pfn to end pfn. We reserve_brk top leaf pages if they are missing (means they point to p2m_mid_missing). With the E820 example above, 263424 is not 1GB aligned so we allocate a reserve_brk page which will cover the PFNs estate from 0x40000 to 0x80000. Each entry in the allocate page is "missing" (points to p2m_missing). Next stage is to determine if we need to do a more granular boundary check on the 4MB (or 2MB depending on architecture) off the start and end pfn's. We check if the start pfn and end pfn violate that boundary check, and if so reserve_brk a middle (p2m[x][y]) leaf page. This way we have a much finer granularity of setting which PFNs are missing and which ones are identity. In our example 263424 and 512256 both fail the check so we reserve_brk two pages. Populate them with INVALID_P2M_ENTRY (so they both have "missing" values) and assign them to p2m[1][2] and p2m[1][488] respectively. At this point we would at minimum reserve_brk one page, but could be up to three. Each call to set_phys_range_identity has at maximum a three page cost. If we were to query the P2M at this stage, all those entries from start PFN through end PFN (so 1029MB -> 2001MB) would return INVALID_P2M_ENTRY ("missing"). The next step is to walk from the start pfn to the end pfn setting the IDENTITY_FRAME_BIT on each PFN. This is done in 'set_phys_range_identity'. If we find that the middle leaf is pointing to p2m_missing we can swap it over to p2m_identity - this way covering 4MB (or 2MB) PFN space. At this point we do not need to worry about boundary aligment (so no need to reserve_brk a middle page, figure out which PFNs are "missing" and which ones are identity), as that has been done earlier. If we find that the middle leaf is not occupied by p2m_identity or p2m_missing, we dereference that page (which covers 512 PFNs) and set the appropriate PFN with IDENTITY_FRAME_BIT. In our example 263424 and 512256 end up there, and we set from p2m[1][2][256->511] and p2m[1][488][0->256] with IDENTITY_FRAME_BIT set. All other regions that are void (or not filled) either point to p2m_missing (considered missing) or have the default value of INVALID_P2M_ENTRY (also considered missing). In our case, p2m[1][2][0->255] and p2m[1][488][257->511] contain the INVALID_P2M_ENTRY value and are considered "missing." This is what the p2m ends up looking (for the E820 above) with this fabulous drawing: p2m /--------------\ /-----\ \| &mfn_list[0],\| /-----------------\ \| 0 \|------>\| &mfn_list[1],\| /---------------\ \| ~0, ~0, .. \| \|-----\| \| ..., ~0, ~0 \| \| ~0, ~0, [x]---+----->\| IDENTITY [@256] \| \| 1 \|---\ \--------------/ \| [p2m_identity]+\ \| IDENTITY [@257] \| \|-----\| \ \| [p2m_identity]+\\ \| .... \| \| 2 \|--\ \-------------------->\| ... \| \\ \----------------/ \|-----\| \ \---------------/ \\ \| 3 \|\ \ \\ p2m_identity \|-----\| \ \-------------------->/---------------\ /-----------------\ \| .. +->+ \| [p2m_identity]+-->\| ~0, ~0, ~0, ... \| \-----/ / \| [p2m_identity]+-->\| ..., ~0 \| / /---------------\ \| .... \| \-----------------/ / \| IDENTITY[@0] \| /-+-[x], ~0, ~0.. \| / \| IDENTITY[@256]\|<----/ \---------------/ / \| ~0, ~0, .... \| \| \---------------/ \| p2m_missing p2m_missing /------------------\ /------------\ \| [p2m_mid_missing]+---->\| ~0, ~0, ~0 \| \| [p2m_mid_missing]+---->\| ..., ~0 \| \------------------/ \------------/ where ~0 is INVALID_P2M_ENTRY. IDENTITY is (PFN \| IDENTITY_BIT) Reviewed-by: Ian Campbell <ian.campbell@citrix.com> [v5: Changed code to use ranges, added ASCII art] [v6: Rebased on top of xen->p2m code split] [v4: Squished patches in just this one] [v7: Added RESERVE_BRK for potentially allocated pages] [v8: Fixed alignment problem] [v9: Changed 1<<3X to 1<<BITS_PER_LONG-X] [v10: Copied git commit description in the p2m code + Add Review tag] [v11: Title had '2-1' - should be '1-1' mapping] Signed-off-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>	2011-03-14 11:16:41 -04:00
Sebastian Andrzej Siewior	03150171dc	x86: ce4100: Set pci ops via callback instead of module init Setting the pci ops on subsys initcall unconditionally will break multi platform kernels on anything except ce4100. Use x86_init.pci.init ops to call this only on real ce4100 platforms. Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de> Cc: sodaville@linutronix.de LKML-Reference: <20110314093340.GA21026@www.tglx.de> Signed-off-by: Thomas Gleixner <tglx@linutronix.de>	2011-03-14 15:13:23 +01:00
Michel Lespinasse	8d7718aa08	futex: Sanitize futex ops argument types Change futex_atomic_op_inuser and futex_atomic_cmpxchg_inatomic prototypes to use u32 types for the futex as this is the data type the futex core code uses all over the place. Signed-off-by: Michel Lespinasse <walken@google.com> Cc: Darren Hart <darren@dvhart.com> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Matt Turner <mattst88@gmail.com> Cc: Russell King <linux@arm.linux.org.uk> Cc: David Howells <dhowells@redhat.com> Cc: Tony Luck <tony.luck@intel.com> Cc: Michal Simek <monstr@monstr.eu> Cc: Ralf Baechle <ralf@linux-mips.org> Cc: "James E.J. Bottomley" <jejb@parisc-linux.org> Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org> Cc: Martin Schwidefsky <schwidefsky@de.ibm.com> Cc: Paul Mundt <lethal@linux-sh.org> Cc: "David S. Miller" <davem@davemloft.net> Cc: Chris Metcalf <cmetcalf@tilera.com> Cc: Linus Torvalds <torvalds@linux-foundation.org> LKML-Reference: <20110311025058.GD26122@google.com> Signed-off-by: Thomas Gleixner <tglx@linutronix.de>	2011-03-11 12:23:31 +01:00
Michel Lespinasse	37a9d912b2	futex: Sanitize cmpxchg_futex_value_locked API The cmpxchg_futex_value_locked API was funny in that it returned either the original, user-exposed futex value OR an error code such as -EFAULT. This was confusing at best, and could be a source of livelocks in places that retry the cmpxchg_futex_value_locked after trying to fix the issue by running fault_in_user_writeable(). This change makes the cmpxchg_futex_value_locked API more similar to the get_futex_value_locked one, returning an error code and updating the original value through a reference argument. Signed-off-by: Michel Lespinasse <walken@google.com> Acked-by: Chris Metcalf <cmetcalf@tilera.com> [tile] Acked-by: Tony Luck <tony.luck@intel.com> [ia64] Acked-by: Thomas Gleixner <tglx@linutronix.de> Tested-by: Michal Simek <monstr@monstr.eu> [microblaze] Acked-by: David Howells <dhowells@redhat.com> [frv] Cc: Darren Hart <darren@dvhart.com> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Matt Turner <mattst88@gmail.com> Cc: Russell King <linux@arm.linux.org.uk> Cc: Ralf Baechle <ralf@linux-mips.org> Cc: "James E.J. Bottomley" <jejb@parisc-linux.org> Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org> Cc: Martin Schwidefsky <schwidefsky@de.ibm.com> Cc: Paul Mundt <lethal@linux-sh.org> Cc: "David S. Miller" <davem@davemloft.net> Cc: Linus Torvalds <torvalds@linux-foundation.org> LKML-Reference: <20110311024851.GC26122@google.com> Signed-off-by: Thomas Gleixner <tglx@linutronix.de>	2011-03-11 12:23:08 +01:00
Henrik Kretzschmar	25874a299e	x86: Clean up apic.c and apic.h This patch moves some functions and variables into init sections, makes a function static and removes some lines of cruft. Signed-off-by: Henrik Kretzschmar <henne@nachtwindheim.de> Acked-by: Cyrill Gorcunov <gorcunov@openvz.org> LKML-Reference: <1299826956-8607-2-git-send-email-henne@nachtwindheim.de> Signed-off-by: Ingo Molnar <mingo@elte.hu>	2011-03-11 08:13:59 +01:00
Linus Torvalds	b5562c9a55	Merge branch 'x86-fixes-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip * 'x86-fixes-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip: x86, UV: Initialize the broadcast assist unit base destination node id properly x86, numa: Fix numa_emulation code with memory-less node0 x86, build: Make sure mkpiggy fails on read error	2011-03-10 13:09:26 -08:00
Cliff Wickman	5471262290	x86, UV: Initialize the broadcast assist unit base destination node id properly The BAU's initialization of the broadcast description header is lacking the coherence domain (high bits) in the nasid. This causes a catastrophic system failure when running on a system with multiple coherence domains. Signed-off-by: Cliff Wickman <cpw@sgi.com> LKML-Reference: <E1PxKBB-0005F0-3U@eag09.americas.sgi.com> Signed-off-by: Ingo Molnar <mingo@elte.hu>	2011-03-09 16:36:16 +01:00
Ingo Molnar	c8b44163b7	Merge commit 'v2.6.38-rc8' into x86/asm Merge reason: Update with the latest fixes. Signed-off-by: Ingo Molnar <mingo@elte.hu>	2011-03-09 10:38:59 +01:00
Ingo Molnar	86cb2ec7b2	Merge commit 'v2.6.38-rc8' into perf/core Merge reason: Merge latest fixes. Signed-off-by: Ingo Molnar <mingo@elte.hu>	2011-03-08 17:21:52 +01:00
Ingo Molnar	ca764aaf02	Merge branch 'x86-mm' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/misc into x86/mm	2011-03-05 07:32:45 +01:00
Lin Ming	6909262429	perf: Avoid the percore allocations if the CPU is not HT capable Signed-off-by: Lin Ming <ming.m.lin@intel.com> Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl> LKML-Reference: <1299119690-13991-5-git-send-email-ming.m.lin@intel.com> Signed-off-by: Ingo Molnar <mingo@elte.hu>	2011-03-05 07:12:16 +01:00
Andi Kleen	a7e3ed1e47	perf: Add support for supplementary event registers Change logs against Andi's original version: - Extends perf_event_attr:config to config{,1,2} (Peter Zijlstra) - Fixed a major event scheduling issue. There cannot be a ref++ on an event that has already done ref++ once and without calling put_constraint() in between. (Stephane Eranian) - Use thread_cpumask for percore allocation. (Lin Ming) - Use MSR names in the extra reg lists. (Lin Ming) - Remove redundant "c = NULL" in intel_percore_constraints - Fix comment of perf_event_attr::config1 Intel Nehalem/Westmere have a special OFFCORE_RESPONSE event that can be used to monitor any offcore accesses from a core. This is a very useful event for various tunings, and it's also needed to implement the generic LLC-* events correctly. Unfortunately this event requires programming a mask in a separate register. And worse this separate register is per core, not per CPU thread. This patch: - Teaches perf_events that OFFCORE_RESPONSE needs extra parameters. The extra parameters are passed by user space in the perf_event_attr::config1 field. - Adds support to the Intel perf_event core to schedule per core resources. This adds fairly generic infrastructure that can be also used for other per core resources. The basic code has is patterned after the similar AMD northbridge constraints code. Thanks to Stephane Eranian who pointed out some problems in the original version and suggested improvements. Signed-off-by: Andi Kleen <ak@linux.intel.com> Signed-off-by: Lin Ming <ming.m.lin@intel.com> Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl> LKML-Reference: <1299119690-13991-2-git-send-email-ming.m.lin@intel.com> Signed-off-by: Ingo Molnar <mingo@elte.hu>	2011-03-04 11:32:53 +01:00
Tejun Heo	f891125028	x86-64, NUMA: Revert NUMA affine page table allocation This patch reverts NUMA affine page table allocation added by commit `1411e0ec31` (x86-64, numa: Put pgtable to local node memory). The commit made an undocumented change where the kernel linear mapping strictly follows intersection of e820 memory map and NUMA configuration. If the physical memory configuration has holes or NUMA nodes are not properly aligned, this leads to using unnecessarily smaller mapping size which leads to increased TLB pressure. For details, http://thread.gmane.org/gmane.linux.kernel/1104672 Patches to fix the problem have been proposed but the underlying code needs more cleanup and the approach itself seems a bit heavy handed and it has been determined to revert the feature for now and come back to it in the next developement cycle. http://thread.gmane.org/gmane.linux.kernel/1105959 As init_memory_mapping_high() callsites have been consolidated since the commit, reverting is done manually. Also, the RED-PEN comment in arch/x86/mm/init.c is not restored as the problem no longer exists with memblock based top-down early memory allocation. Signed-off-by: Tejun Heo <tj@kernel.org> Cc: Yinghai Lu <yinghai@kernel.org> Cc: Ingo Molnar <mingo@redhat.com> Cc: "H. Peter Anvin" <hpa@zytor.com> Cc: Thomas Gleixner <tglx@linutronix.de>	2011-03-04 10:26:36 +01:00
Konrad Rzeszutek Wilk	6eaa412f27	xen: Mark all initial reserved pages for the balloon as INVALID_P2M_ENTRY. With this patch, we diligently set regions that will be used by the balloon driver to be INVALID_P2M_ENTRY and under the ownership of the balloon driver. We are OK using the __set_phys_to_machine as we do not expect to be allocating any P2M middle or entries pages. The set_phys_to_machine has the side-effect of potentially allocating new pages and we do not want that at this stage. We can do this because xen_build_mfn_list_list will have already allocated all such pages up to xen_max_p2m_pfn. We also move the check for auto translated physmap down the stack so it is present in __set_phys_to_machine. [v2: Rebased with mmu->p2m code split] Reviewed-by: Ian Campbell <ian.campbell@citrix.com> Signed-off-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>	2011-03-03 11:52:48 -05:00
Borislav Petkov	84fd1d35cc	x86, amd-nb: Misc cleanliness fixes Make functions used strictly in bool context return bool. Also, fixup used types and comments, and make a local function static, while at it. Signed-off-by: Borislav Petkov <borislav.petkov@amd.com> Cc: Borislav Petkov <bp@amd64.org> LKML-Reference: <20110303115932.GA8603@aftab> Signed-off-by: Ingo Molnar <mingo@elte.hu>	2011-03-03 13:06:20 +01:00
Jan Beulich	d04c579f97	x86: Work around old gas bug Add extra parentheses around a couple of definitions introduced by "x86: Cleanup vector usage" and used in assembly macro arguments, and remove spaces. Without that old (2.16.1) gas would see more macro arguments than were actually specified. Reported-and-tested-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Jan Beulich <jbeulich@novell.com> Cc: Shaohua Li <shaohua.li@intel.com> LKML-Reference: <4D6F81B10200007800034B0B@vpn.id2.novell.com> Signed-off-by: Ingo Molnar <mingo@elte.hu>	2011-03-03 12:47:08 +01:00
Linus Torvalds	c7b01d3dc2	Merge branch 'idle-release' of git://git.kernel.org/pub/scm/linux/kernel/git/lenb/linux-idle-2.6 * 'idle-release' of git://git.kernel.org/pub/scm/linux/kernel/git/lenb/linux-idle-2.6: intel_idle: disable Atom/Lincroft HW C-state auto-demotion intel_idle: disable NHM/WSM HW C-state auto-demotion	2011-03-02 18:08:03 -08:00
Jan Beulich	60cf637a13	x86: Use {push,pop}_cfi in more places Cleaning up and shortening code... Signed-off-by: Jan Beulich <jbeulich@novell.com> Cc: Alexander van Heukelum <heukelum@fastmail.fm> LKML-Reference: <4D6BD35002000078000341DA@vpn.id2.novell.com> Signed-off-by: Ingo Molnar <mingo@elte.hu>	2011-02-28 18:06:22 +01:00
Don Zickus	299c56966a	x86: Use u32 instead of long to set reset vector back to 0 A customer of ours, complained that when setting the reset vector back to 0, it trashed other data and hung their box. They noticed when only 4 bytes were set to 0 instead of 8, everything worked correctly. Mathew pointed out: \| \| We're supposed to be resetting trampoline_phys_low and \| trampoline_phys_high here, which are two 16-bit values. \| Writing 64 bits is definitely going to overwrite space \| that we're not supposed to be touching. \| So limit the area modified to u32. Signed-off-by: Don Zickus <dzickus@redhat.com> Acked-by: Matthew Garrett <mjg@redhat.com> Cc: <stable@kernel.org> LKML-Reference: <1297139100-424-1-git-send-email-dzickus@redhat.com> Signed-off-by: Ingo Molnar <mingo@elte.hu>	2011-02-28 16:22:18 +01:00
Christoph Lameter	b9ec40af0e	percpu, x86: Add arch-specific this_cpu_cmpxchg_double() support Support this_cpu_cmpxchg_double() using the cmpxchg16b and cmpxchg8b instructions. -tj: s/percpu_cmpxchg16b/percpu_cmpxchg16b_double/ for consistency and other cosmetic changes. Signed-off-by: Christoph Lameter <cl@linux.com> Signed-off-by: Tejun Heo <tj@kernel.org>	2011-02-28 11:20:49 +01:00
Linus Torvalds	958ede7f1b	Merge branch 'x86-fixes-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip * 'x86-fixes-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip: x86 quirk: Fix polarity for IRQ0 pin2 override on SB800 systems x86/mrst: Fix apb timer rating when lapic timer is used x86: Fix reboot problem on VersaLogic Menlow boards	2011-02-25 14:02:33 -08:00
Ian Campbell	a8b7458363	xen: switch to new schedop hypercall by default. Rename old interface to sched_op_compat and rename sched_op_new to simply sched_op. Signed-off-by: Ian Campbell <ian.campbell@citrix.com> Reviewed-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>	2011-02-25 16:43:10 +00:00
Ian Campbell	8e15597fa4	xen: use new schedop interface for suspend Take the opportunity to comment on the semantics of the PV guest suspend hypercall arguments. Signed-off-by: Ian Campbell <ian.campbell@citrix.com> Reviewed-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>	2011-02-25 16:43:10 +00:00
Thomas Gleixner	a906fdaacc	x86: dt: Cleanup local apic setup Up to now we force enable the local apic in the devicetree setup uncoditionally and set smp_found_config unconditionally to 1 when a devicetree blob is available. This breaks, when local apic is disabled in the Kconfig. Make it consistent by initializing device tree explicitely before smp_get_config() so a non lapic configuration could be used as well. To be functional that would require to implement PIT as an interrupt host, but the only user of this code until now is ce4100 which requires apics to be available. So we leave this up to those who need it. Tested-by: Sebastian Siewior <bigeasy@linutronix.de> Signed-off-by: Thomas Gleixner <tglx@linutronix.de>	2011-02-25 16:18:52 +01:00
Andreas Herrmann	7f74f8f28a	x86 quirk: Fix polarity for IRQ0 pin2 override on SB800 systems On some SB800 systems polarity for IOAPIC pin2 is wrongly specified as low active by BIOS. This caused system hangs after resume from S3 when HPET was used in one-shot mode on such systems because a timer interrupt was missed (HPET signal is high active). For more details see: http://marc.info/?l=linux-kernel&m=129623757413868 Tested-by: Manoj Iyer <manoj.iyer@canonical.com> Tested-by: Andre Przywara <andre.przywara@amd.com> Signed-off-by: Andreas Herrmann <andreas.herrmann3@amd.com> Cc: Borislav Petkov <borislav.petkov@amd.com> Cc: stable@kernel.org # 37.x, 32.x LKML-Reference: <20110224145346.GD3658@alberich.amd.com> Signed-off-by: Ingo Molnar <mingo@elte.hu>	2011-02-24 20:30:21 +01:00
Rafael J. Wysocki	f1a2003e22	ACPI / PM: Merge do_suspend_lowlevel() into acpi_save_state_mem() The function do_suspend_lowlevel() is specific to x86 and defined in assembly code, so it should be called from the x86 low-level suspend code rather than from acpi_suspend_enter(). Merge do_suspend_lowlevel() into the x86's acpi_save_state_mem() and change the name of the latter to acpi_suspend_lowlevel(), so that the function's purpose is better reflected by its name. Signed-off-by: Rafael J. Wysocki <rjw@sisk.pl>	2011-02-24 19:58:54 +01:00
Rafael J. Wysocki	c41b93fb85	ACPI / PM: Drop acpi_restore_state_mem() The function acpi_restore_state_mem() has never been and most likely never will be used, so remove it. Signed-off-by: Rafael J. Wysocki <rjw@sisk.pl>	2011-02-24 19:58:54 +01:00
Yinghai Lu	d1b19426b0	x86: Rename e820_table_* to pgt_buf_* e820_table_{start\|end\|top}, which are used to buffer page table allocation during early boot, are now derived from memblock and don't have much to do with e820. Change the names so that they reflect what they're used for. This patch doesn't introduce any behavior change. -v2: Ingo found that earlier patch "x86: Use early pre-allocated page table buffer top-down" caused crash on 32bit and needed to be dropped. This patch was updated to reflect the change. -tj: Updated commit description. Signed-off-by: Yinghai Lu <yinghai@kernel.org> Signed-off-by: Tejun Heo <tj@kernel.org>	2011-02-24 14:52:18 +01:00
Sebastian Andrzej Siewior	4a66b1d95a	x86: dt: Fix OLPC=y/INTEL_CE=n build Both OLPC and CE4100 activate CONFIG_OF. OLPC uses PROMTREE while CE uses FLATTREE. Compiling for OLPC only breaks due to missing flat tree functions and variables. Use proper wrappers and provide an empty x86_flattree_get_config() inline so OF=y FLATTREE=n builds and works. [ tglx: Make it work with HPET_TIMER=n and make a function static ] Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de> Signed-off-by: Thomas Gleixner <tglx@linutronix.de>	2011-02-24 13:01:59 +01:00
Sebastian Andrzej Siewior	bcc7c1244f	x86: ioapic: Add OF bindings for IO_APIC ioapic_xlate provides a translation from the information in device tree to ioapic related informations. This includes - obtaining hw irq which is the vector number "=> pin number + gsi" - obtaining type (level/edge/..) - programming this information into ioapic ioapic_add_ofnode adds an irq_domain based on informations from the device tree. This information (irq_domain) is required in order to map a device to its proper interrupt controller. [ tglx: Adapted to the io_apic changes, which let us move that whole code to devicetree.c ] Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de> Signed-off-by: Dirk Brandewie <dirk.brandewie@gmail.com> Acked-by: Grant Likely <grant.likely@secretlab.ca> Cc: sodaville@linutronix.de Cc: devicetree-discuss@lists.ozlabs.org LKML-Reference: <1298405266-1624-10-git-send-email-bigeasy@linutronix.de> Signed-off-by: Thomas Gleixner <tglx@linutronix.de>	2011-02-23 22:27:54 +01:00
Sebastian Andrzej Siewior	96e0a0797e	x86: dtb: Add support for PCI devices backed by dtb nodes x86_of_pci_init() does two things: - it provides a generic irq enable and disable function. enable queries the device tree for the interrupt information, calls ->xlate on the irq host and updates the pci->irq information for the device. - it walks through PCI bus(es) in the device tree and adds its children (device) nodes to appropriate pci_dev nodes in kernel. So the dtb node information is available at probe time of the PCI device. Adding a PCI bus based on the information in the device tree is currently not supported. Right now direct access via ioports is used. Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de> Tested-by: Dirk Brandewie <dirk.brandewie@gmail.com> Acked-by: Grant Likely <grant.likely@secretlab.ca> Cc: sodaville@linutronix.de Cc: devicetree-discuss@lists.ozlabs.org LKML-Reference: <1298405266-1624-8-git-send-email-bigeasy@linutronix.de> Signed-off-by: Thomas Gleixner <tglx@linutronix.de>	2011-02-23 22:27:53 +01:00
Sebastian Andrzej Siewior	3879a6f329	x86: dtb: Add early parsing of IO_APIC APIC and IO_APIC have to be added to the system early because native_init_IRQ() requires it. In order to obtain the address of the ioapic the device tree has to be unflattened so of_address_to_resource() works. The device tree is relocated to ensure it is always covered by the kernel mapping. That way the boot loader does not have to make any assumptions about kernel's memory layout. Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de> Acked-by: Grant Likely <grant.likely@secretlab.ca> Cc: sodaville@linutronix.de Cc: devicetree-discuss@lists.ozlabs.org Cc: Dirk Brandewie <dirk.brandewie@gmail.com> LKML-Reference: <1298405266-1624-6-git-send-email-bigeasy@linutronix.de> Signed-off-by: Thomas Gleixner <tglx@linutronix.de>	2011-02-23 22:27:53 +01:00
Sebastian Andrzej Siewior	19c4f5f7f7	x86: dtb: Add irq domain abstraction The here introduced irq_domain abstraction represents a generic irq controller. It is a subset of powerpc's irq_host which is going to be renamed to irq_domain and then become generic. This implementation will be removed once it is generic. The xlate callback is resposible to parse irq informations like irq type and number and returns the hardware irq number which is reported by the hardware as active. Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de> Tested-by: Dirk Brandewie <dirk.brandewie@gmail.com> Acked-by: Grant Likely <grant.likely@secretlab.ca> Cc: sodaville@linutronix.de Cc: devicetree-discuss@lists.ozlabs.org LKML-Reference: <1298405266-1624-5-git-send-email-bigeasy@linutronix.de> Signed-off-by: Thomas Gleixner <tglx@linutronix.de>	2011-02-23 22:27:53 +01:00
Sebastian Andrzej Siewior	da6b737b9a	x86: Add device tree support This patch adds minimal support for device tree on x86. The device tree blob is passed to the kernel via setup_data which requires at least boot protocol 2.09. Memory size, restricted memory regions, boot arguments are gathered the traditional way so things like cmd_line are just here to let the code compile. The current plan is use the device tree as an extension and to gather information which can not be enumerated and would have to be hardcoded otherwise. This includes things like - which devices are on this I2C/SPI bus? - how are the interrupts wired to IO APIC? - where could my hpet be? Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de> Signed-off-by: Dirk Brandewie <dirk.brandewie@gmail.com> Acked-by: Grant Likely <grant.likely@secretlab.ca> Cc: sodaville@linutronix.de Cc: devicetree-discuss@lists.ozlabs.org LKML-Reference: <1298405266-1624-3-git-send-email-bigeasy@linutronix.de> Signed-off-by: Thomas Gleixner <tglx@linutronix.de>	2011-02-23 22:27:52 +01:00
Sebastian Andrzej Siewior	f1c2b35714	x86: e820: Remove conditional early mapping in parse_e820_ext This patch ensures that the memory passed from parse_setup_data() is large enough to cover the complete data structure. That means that the conditional mapping in parse_e820_ext() can go. While here, I also attempt not to map two pages if the address is not aligned to a page boundary. Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de> Signed-off-by: Dirk Brandewie <dirk.brandewie@gmail.com> Cc: sodaville@linutronix.de Cc: devicetree-discuss@lists.ozlabs.org LKML-Reference: <1298405266-1624-2-git-send-email-bigeasy@linutronix.de> Signed-off-by: Thomas Gleixner <tglx@linutronix.de>	2011-02-23 22:27:52 +01:00
Thomas Gleixner	cb4cfd568c	Merge branch 'x86/apic' into x86/platform Reason: Devicetree based ioapic setup depends on the apic changes. Signed-off-by: Thomas Gleixner <tglx@linutronix.de>	2011-02-23 20:01:01 +01:00
Thomas Gleixner	abb0052289	x86: ioapic: Move trigger defines to io_apic.h Required for devicetree based io_apic configuration. Signed-off-by: Thomas Gleixner <tglx@linutronix.de>	2011-02-23 19:58:09 +01:00
Thomas Gleixner	41098ffe05	x86: ioapic: Make a few functions static No users outside of io_apic.c. Mark bad_ioapic() __init while at it. Signed-off-by: Thomas Gleixner <tglx@linutronix.de>	2011-02-23 17:26:51 +01:00
Thomas Gleixner	ff973d041e	x86: ioapic: Add io_apic_setup_irq_pin() There are about four places in the ioapic code which do exactly the same setup sequence. Also the OF based ioapic setup needs that function to avoid putting the OF specific code into ioapic.c Signed-off-by: Thomas Gleixner <tglx@linutronix.de>	2011-02-23 17:26:49 +01:00
Thomas Gleixner	939d578ecc	x86: OLPC: Make OLPC=n build again Stupid me missed the functions called from setup.c. Add the stubs back for OLPC=n Signed-off-by: Thomas Gleixner <tglx@linutronix.de>	2011-02-23 11:54:02 +01:00
Henrik Kretzschmar	7d0f192613	x86: Add dummy functions for compiling without IOAPIC This patch adds IOAPIC dummy functions for compilation with local APIC, but without IOAPIC. The local variable ioapic_entries in enable_IR_x2apic() does not need initialization anymore, since the dummy returns NULL. Signed-off-by: Henrik Kretzschmar <henne@nachtwindheim.de> LKML-Reference: <1298385487-4708-4-git-send-email-henne@nachtwindheim.de> Signed-off-by: Ingo Molnar <mingo@elte.hu>	2011-02-23 11:38:46 +01:00
Henrik Kretzschmar	7167d08e78	x86: Rework arch_disable_smp_support() for x86 Currently arch_disable_smp_support() on x86 disables only the support for the IOAPIC and is also compiled in if SMP-support is not. Therefore this function is renamed to disable_ioapic_support(), which meets its purpose and is only compiled in the kernel when IOAPIC support is also. A new arch_disable_smp_support() is created in smpboot.c, which calls disable_ioapic_support() and gets only compiled in the kernel when SMP support is also. Signed-off-by: Henrik Kretzschmar <henne@nachtwindheim.de> LKML-Reference: <1298385487-4708-3-git-send-email-henne@nachtwindheim.de> Signed-off-by: Ingo Molnar <mingo@elte.hu>	2011-02-23 11:38:45 +01:00
Henrik Kretzschmar	b6a1432da8	x86: Add dummy mp_save_irq() This is a dummy function, used when no IOAPIC is compiled in. Signed-off-by: Henrik Kretzschmar <henne@nachtwindheim.de> LKML-Reference: <1298385487-4708-2-git-send-email-henne@nachtwindheim.de> Signed-off-by: Ingo Molnar <mingo@elte.hu>	2011-02-23 11:38:45 +01:00
Henrik Kretzschmar	4e034b2451	x86: Move ioapic_irq_destination_types to apicdef.h This enum is used by non IOAPIC code, so apicdef.h is the best place for it. Signed-off-by: Henrik Kretzschmar <henne@nachtwindheim.de> LKML-Reference: <1298385487-4708-1-git-send-email-henne@nachtwindheim.de> Signed-off-by: Ingo Molnar <mingo@elte.hu>	2011-02-23 11:38:44 +01:00
Thomas Gleixner	c2a941fadb	x86: OLPC: Remove extra OLPC_OPENFIRMWARE_DT indirection OLPC_OPENFIRMWARE_DT is just there to be selected by OLPC and selects OF_PROMTREE. So let OLPC select OF_PROMTREE and remove that extra config indirection. Fixup code and Makefile and use CONFIG_OF_PROMTREE instead. Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Cc: Andres Salomon <dilinger@queued.net>	2011-02-23 10:40:45 +01:00
Thomas Gleixner	dc3119e700	x86: OLPC: Cleanup config maze completely Neither CONFIG_OLPC_OPENFIRMWARE nor CONFIG_OLPC_OPENFIRMWARE_DT are really necessary. OLPC selects OLPC_OPENFIRMWARE unconditionally, so move the "select OF" part under OLPC config option and fixup the dependencies in Makefiles and code. Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Cc: Andres Salomon <dilinger@queued.net>	2011-02-23 10:40:45 +01:00
Thomas Gleixner	7acdbb3f35	Merge branch 'linus' into x86/platform Reason: Import mainline device tree changes on which further patches depend on or conflict. Trivial conflict in: drivers/spi/pxa2xx_spi_pci.c Signed-off-by: Thomas Gleixner <tglx@linutronix.de>	2011-02-23 09:21:41 +01:00
Thomas Gleixner	695884fb8a	Merge branch 'devicetree/for-x86' of git://git.secretlab.ca/git/linux-2.6 into x86/platform Reason: x86 devicetree support for ce4100 depends on those device tree changes scheduled for .39. Signed-off-by: Thomas Gleixner <tglx@linutronix.de>	2011-02-22 18:41:48 +01:00
Borislav Petkov	2b15cd96e5	x86, system.h: Drop unused __SAVE/__RESTORE macros Those are unused since at least the beginning of git history. Signed-off-by: Borislav Petkov <borislav.petkov@amd.com> LKML-Reference: <1298044056-31104-1-git-send-email-bp@amd64.org> Signed-off-by: Ingo Molnar <mingo@elte.hu>	2011-02-20 13:38:47 +01:00
Konrad Rzeszutek Wilk	cc0f89c4a4	pci/xen: Cleanup: convert int** to int[] Cleanup code. Cosmetic change to make the code look easier to read. Reviewed-by: Ian Campbell <Ian.Campbell@citrix.com> Signed-off-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>	2011-02-18 12:41:49 -05:00
Jan Beulich	02ca752e41	x86: Remove die_nmi() With no caller left, the function and the DIE_NMIWATCHDOG enumerator can both go away. Signed-off-by: Jan Beulich <jbeulich@novell.com> Cc: Peter Zijlstra <a.p.zijlstra@chello.nl> Cc: Don Zickus <dzickus@redhat.com> LKML-Reference: <4D5D521C0200007800032702@vpn.id2.novell.com> Signed-off-by: Ingo Molnar <mingo@elte.hu>	2011-02-18 08:54:05 +01:00
H. Peter Anvin	3d35ac346e	x86, reboot: Move the real-mode reboot code to an assembly file Move the real-mode reboot code out to an assembly file (reboot_32.S) which is allocated using the common lowmem trampoline allocator. Signed-off-by: H. Peter Anvin <hpa@linux.intel.com> LKML-Reference: <4D5DFBE4.7090104@intel.com> Cc: Stephen Rothwell <sfr@canb.auug.org.au> Cc: Rafael J. Wysocki <rjw@sisk.pl> Cc: Matthieu Castet <castet.matthieu@free.fr>	2011-02-17 21:05:34 -08:00
H. Peter Anvin	014eea518a	x86: Make the GDT_ENTRY() macro in <asm/segment.h> safe for assembly Make the GDT_ENTRY() macro in <asm/segment.h> safe for use in assembly code by guarding the ULL suffixes with _AC() macros. Signed-off-by: H. Peter Anvin <hpa@linux.intel.com> LKML-Reference: <4D5DFBE4.7090104@intel.com> Cc: Rafael J. Wysocki <rjw@sisk.pl> Cc: Matthieu Castet <castet.matthieu@free.fr> Cc: Stephen Rothwell <sfr@canb.auug.org.au>	2011-02-17 21:05:13 -08:00
H. Peter Anvin	d1ee433539	x86, trampoline: Use the unified trampoline setup for ACPI wakeup Use the unified trampoline allocation setup to allocate and install the ACPI wakeup code in low memory. Signed-off-by: H. Peter Anvin <hpa@linux.intel.com> LKML-Reference: <4D5DFBE4.7090104@intel.com> Cc: Rafael J. Wysocki <rjw@sisk.pl> Cc: Matthieu Castet <castet.matthieu@free.fr> Cc: Stephen Rothwell <sfr@canb.auug.org.au>	2011-02-17 21:05:06 -08:00
Len Brown	bfb53ccf1c	intel_idle: disable Atom/Lincroft HW C-state auto-demotion Just as we had to disable auto-demotion for NHM/WSM, we need to do the same for Atom (Lincroft version). In particular, auto-demotion will prevent Lincroft from entering the S0i3 idle power saving state. https://bugzilla.kernel.org/show_bug.cgi?id=25252 Signed-off-by: Len Brown <len.brown@intel.com>	2011-02-17 17:08:48 -05:00
Len Brown	14796fca2b	intel_idle: disable NHM/WSM HW C-state auto-demotion Hardware C-state auto-demotion is a mechanism where the HW overrides the OS C-state request, instead demoting to a shallower state, which is less expensive, but saves less power. Modern Linux should generally get exactly the states it requests. In particular, when a CPU is taken off-line, it must not be demoted, else it can prevent the entire package from reaching deep C-states. https://bugzilla.kernel.org/show_bug.cgi?id=25252 Signed-off-by: Len Brown <len.brown@intel.com>	2011-02-17 17:08:46 -05:00
Tejun Heo	e23bba6044	x86-64, NUMA: Unify emulated distance mapping NUMA emulation needs to update node distance information. It did it by remapping apicid to PXM mapping, even when amdtopology is being used. There is no reason to go through such convolution. The generic code has all the information necessary to transform the distance table to the emulated nid space. Implement generic distance table transformation in numa_emulation() and drop private implementations in srat_64 and amdtopology_64. This makes find_node_by_addr() and fake_physnodes() and related functions unnecessary, drop them. Signed-off-by: Tejun Heo <tj@kernel.org> Cc: Yinghai Lu <yinghai@kernel.org> Cc: Brian Gerst <brgerst@gmail.com> Cc: Cyrill Gorcunov <gorcunov@gmail.com> Cc: Shaohui Zheng <shaohui.zheng@intel.com> Cc: David Rientjes <rientjes@google.com> Cc: Ingo Molnar <mingo@elte.hu> Cc: H. Peter Anvin <hpa@linux.intel.com>	2011-02-16 17:11:10 +01:00
Tejun Heo	ac7136b611	x86-64, NUMA: Implement generic node distance handling Node distance either used direct node comparison, ACPI PXM comparison or ACPI SLIT table lookup. This patch implements generic node distance handling. NUMA init methods can call numa_set_distance() to set distance between nodes and the common __node_distance() implementation will report the set distance. Due to the way NUMA emulation is implemented, the generic node distance handling is used only when emulation is not used. Later patches will update NUMA emulation to use the generic distance mechanism. Signed-off-by: Tejun Heo <tj@kernel.org> Cc: Yinghai Lu <yinghai@kernel.org> Cc: Brian Gerst <brgerst@gmail.com> Cc: Cyrill Gorcunov <gorcunov@gmail.com> Cc: Shaohui Zheng <shaohui.zheng@intel.com> Cc: David Rientjes <rientjes@google.com> Cc: Ingo Molnar <mingo@elte.hu> Cc: H. Peter Anvin <hpa@linux.intel.com>	2011-02-16 17:11:09 +01:00
Tejun Heo	4697bdcc94	x86-64, NUMA: Kill mem_nodes_parsed With all memory configuration information now carried in numa_meminfo, there's no need to keep mem_nodes_parsed separate. Drop it and use numa_nodes_parsed for CPU / memory-less nodes. A new helper numa_nodemask_from_meminfo() is added to calculate memnode mask on the fly which is currently used to set node_possible_map. This simplifies NUMA init methods a bit and removes a source of possible inconsistencies. Signed-off-by: Tejun Heo <tj@kernel.org> Cc: Yinghai Lu <yinghai@kernel.org> Cc: Brian Gerst <brgerst@gmail.com> Cc: Cyrill Gorcunov <gorcunov@gmail.com> Cc: Shaohui Zheng <shaohui.zheng@intel.com> Cc: David Rientjes <rientjes@google.com> Cc: Ingo Molnar <mingo@elte.hu> Cc: H. Peter Anvin <hpa@linux.intel.com>	2011-02-16 17:11:09 +01:00
Tejun Heo	92d4a4371e	x86-64, NUMA: Rename cpu_nodes_parsed to numa_nodes_parsed It's no longer necessary to keep both cpu_nodes_parsed and mem_nodes_parsed. In preparation for merge, rename cpu_nodes_parsed to numa_nodes_parsed. Signed-off-by: Tejun Heo <tj@kernel.org> Cc: Yinghai Lu <yinghai@kernel.org> Cc: Brian Gerst <brgerst@gmail.com> Cc: Cyrill Gorcunov <gorcunov@gmail.com> Cc: Shaohui Zheng <shaohui.zheng@intel.com> Cc: David Rientjes <rientjes@google.com> Cc: Ingo Molnar <mingo@elte.hu> Cc: H. Peter Anvin <hpa@linux.intel.com>	2011-02-16 17:11:09 +01:00
Tejun Heo	91556237ec	x86-64, NUMA: Kill numa_nodes[] numa_nodes[] doesn't carry any information which isn't present in numa_meminfo. Each entry is simply min/max range of all the memblks for the node. This is not only redundant but also inaccurate when memblks for different nodes interleave - for example, find_node_by_addr() can return the wrong nodeid. Kill numa_nodes[] and always use numa_meminfo instead. * nodes_cover_memory() is renamed to numa_meminfo_cover_memory() and now operations on numa_meminfo and returns bool. * setup_node_bootmem() needs min/max range. Compute the range on the fly. setup_node_bootmem() invocation is restructured to use outer loop instead of hardcoding the double invocations. * find_node_by_addr() now operates on numa_meminfo. * setup_physnodes() builds physnodes[] from memblks. This will go away when emulation code is updated to use struct numa_meminfo. This patch also makes the following misc changes. * Clearing of nodes_add[] clearing is converted to memset(). * numa_add_memblk() in amd_numa_init() is moved down a bit for consistency. Signed-off-by: Tejun Heo <tj@kernel.org> Cc: Yinghai Lu <yinghai@kernel.org> Cc: Brian Gerst <brgerst@gmail.com> Cc: Cyrill Gorcunov <gorcunov@gmail.com> Cc: Shaohui Zheng <shaohui.zheng@intel.com> Cc: David Rientjes <rientjes@google.com> Cc: Ingo Molnar <mingo@elte.hu> Cc: H. Peter Anvin <hpa@linux.intel.com>	2011-02-16 17:11:09 +01:00
Tejun Heo	a844ef46fa	x86-64, NUMA: Add common find_node_by_addr() srat_64.c and amdtopology_64.c had their own versions of find_node_by_addr() which were basically the same. Add common one in numa_64.c and remove the duplicates. Signed-off-by: Tejun Heo <tj@kernel.org> Cc: Yinghai Lu <yinghai@kernel.org> Cc: Brian Gerst <brgerst@gmail.com> Cc: Cyrill Gorcunov <gorcunov@gmail.com> Cc: Shaohui Zheng <shaohui.zheng@intel.com> Cc: David Rientjes <rientjes@google.com> Cc: Ingo Molnar <mingo@elte.hu> Cc: H. Peter Anvin <hpa@linux.intel.com>	2011-02-16 17:11:09 +01:00
Tejun Heo	5d371b08fe	x86-64, NUMA: Kill {acpi\|amd\|dummy}_scan_nodes() They are empty now. Kill them. Signed-off-by: Tejun Heo <tj@kernel.org> Cc: Yinghai Lu <yinghai@kernel.org> Cc: Brian Gerst <brgerst@gmail.com> Cc: Cyrill Gorcunov <gorcunov@gmail.com> Cc: Shaohui Zheng <shaohui.zheng@intel.com> Cc: David Rientjes <rientjes@google.com> Cc: Ingo Molnar <mingo@elte.hu> Cc: H. Peter Anvin <hpa@linux.intel.com>	2011-02-16 17:11:08 +01:00
Tejun Heo	43a662f04f	x86-64, NUMA: Unify use of memblk in all init methods Make both amd and dummy use numa_add_memblk() to describe the detected memory blocks. This allows initmem_init() to call numa_register_memblk() regardless of init method in use. Drop custom memory registration codes from amd and dummy. After this change, memblk merge/cleanup in numa_register_memblks() is applied to all init methods. As this makes compute_hash_shift() and numa_register_memblks() used only inside numa_64.c, make them static. Signed-off-by: Tejun Heo <tj@kernel.org> Cc: Yinghai Lu <yinghai@kernel.org> Cc: Brian Gerst <brgerst@gmail.com> Cc: Cyrill Gorcunov <gorcunov@gmail.com> Cc: Shaohui Zheng <shaohui.zheng@intel.com> Cc: David Rientjes <rientjes@google.com> Cc: Ingo Molnar <mingo@elte.hu> Cc: H. Peter Anvin <hpa@linux.intel.com>	2011-02-16 17:11:08 +01:00
Tejun Heo	ef396ec96c	x86-64, NUMA: Factor out memblk handling into numa_{add\|register}_memblk() Factor out memblk handling from srat_64.c into two functions in numa_64.c. This patch doesn't introduce any behavior change. The next patch will make all init methods use these functions. - v2: Fixed build failure on 32bit due to misplaced NR_NODE_MEMBLKS. Reported by Ingo. Signed-off-by: Tejun Heo <tj@kernel.org> Cc: Yinghai Lu <yinghai@kernel.org> Cc: Brian Gerst <brgerst@gmail.com> Cc: Cyrill Gorcunov <gorcunov@gmail.com> Cc: Shaohui Zheng <shaohui.zheng@intel.com> Cc: David Rientjes <rientjes@google.com> Cc: Ingo Molnar <mingo@elte.hu> Cc: H. Peter Anvin <hpa@linux.intel.com>	2011-02-16 17:11:07 +01:00
Ingo Molnar	a3ec4a603f	Merge commit 'v2.6.38-rc5' into core/locking Merge reason: pick up upstream fixes. Signed-off-by: Ingo Molnar <mingo@elte.hu>	2011-02-16 13:33:41 +01:00
Robert Richter	4979d2729a	perf, x86: Add support for AMD family 15h core counters This patch adds support for AMD family 15h core counters. There are major changes compared to family 10h. First, there is a new perfctr msr range for up to 6 counters. Northbridge counters are separate now. This patch only adds support for core counters. Second, certain events may only be scheduled on certain counters. For this we need to extend the event scheduling and constraints. We use cpu feature flags to calculate family 15h msr address offsets. This way we later can implement a faster ALTERNATIVE() version for this. Signed-off-by: Robert Richter <robert.richter@amd.com> Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl> LKML-Reference: <20110215135210.GB5874@erda.amd.com> Signed-off-by: Ingo Molnar <mingo@elte.hu>	2011-02-16 13:30:53 +01:00
Cyrill Gorcunov	7d44ec193d	perf, x86: P4 PMU: Fix spurious NMI messages Several people have reported spurious unknown NMI messages on some P4 CPUs. This patch fixes it by checking for an overflow (negative counter values) directly, instead of relying on the P4_CCCR_OVF bit. Reported-by: George Spelvin <linux@horizon.com> Reported-by: Meelis Roos <mroos@linux.ee> Reported-by: Don Zickus <dzickus@redhat.com> Reported-by: Dave Airlie <airlied@gmail.com> Signed-off-by: Cyrill Gorcunov <gorcunov@openvz.org> Cc: Lin Ming <ming.m.lin@intel.com> Cc: Don Zickus <dzickus@redhat.com> Cc: Peter Zijlstra <a.p.zijlstra@chello.nl> LKML-Reference: <AANLkTinfuTfCck_FfaOHrDqQZZehtRzkBum4SpFoO=KJ@mail.gmail.com> Signed-off-by: Ingo Molnar <mingo@elte.hu>	2011-02-16 12:26:12 +01:00
Tejun Heo	1909554870	x86-64, NUMA: Kill {acpi\|amd}_get_nodes() With common numa_nodes[], common code in numa_64.c can access it directly. Copy directly and kill {acpi\|amd}_get_nodes(). Signed-off-by: Tejun Heo <tj@kernel.org> Cc: Yinghai Lu <yinghai@kernel.org> Cc: Brian Gerst <brgerst@gmail.com> Cc: Cyrill Gorcunov <gorcunov@gmail.com> Cc: Shaohui Zheng <shaohui.zheng@intel.com> Cc: David Rientjes <rientjes@google.com> Cc: Ingo Molnar <mingo@elte.hu> Cc: H. Peter Anvin <hpa@linux.intel.com>	2011-02-16 12:13:07 +01:00
Tejun Heo	206e42087a	x86-64, NUMA: Use common numa_nodes[] ACPI and amd are using separate nodes[] array. Add numa_nodes[] and use them in all NUMA init methods. cutoff_node() cleanup is moved from srat_64.c to numa_64.c and applied in initmem_init() regardless of init methods. Signed-off-by: Tejun Heo <tj@kernel.org> Cc: Yinghai Lu <yinghai@kernel.org> Cc: Brian Gerst <brgerst@gmail.com> Cc: Cyrill Gorcunov <gorcunov@gmail.com> Cc: Shaohui Zheng <shaohui.zheng@intel.com> Cc: David Rientjes <rientjes@google.com> Cc: Ingo Molnar <mingo@elte.hu> Cc: H. Peter Anvin <hpa@linux.intel.com>	2011-02-16 12:13:07 +01:00
Tejun Heo	ec8cf29b1d	x86-64, NUMA: Use common {cpu\|mem}_nodes_parsed ACPI and amd are using separate nodes_parsed masks. Add {cpu\|mem}_nodes_parsed and use them in all NUMA init methods. Initialization of the masks and building node_possible_map are now handled commonly by initmem_init(). dummy_numa_init() is updated to set node 0 on both masks. While at it, move the info messages from scan to init. Signed-off-by: Tejun Heo <tj@kernel.org> Cc: Yinghai Lu <yinghai@kernel.org> Cc: Brian Gerst <brgerst@gmail.com> Cc: Cyrill Gorcunov <gorcunov@gmail.com> Cc: Shaohui Zheng <shaohui.zheng@intel.com> Cc: David Rientjes <rientjes@google.com> Cc: Ingo Molnar <mingo@elte.hu> Cc: H. Peter Anvin <hpa@linux.intel.com>	2011-02-16 12:13:07 +01:00
Tejun Heo	d8fc3afc49	x86, NUMA: Move *_numa_init() invocations into initmem_init() There's no reason for these to live in setup_arch(). Move them inside initmem_init(). - v2: x86-32 initmem_init() weren't updated breaking 32bit builds. Fixed. Found by Ankita. Signed-off-by: Tejun Heo <tj@kernel.org> Cc: Ankita Garg <ankita@in.ibm.com> Cc: Yinghai Lu <yinghai@kernel.org> Cc: Brian Gerst <brgerst@gmail.com> Cc: Cyrill Gorcunov <gorcunov@gmail.com> Cc: Shaohui Zheng <shaohui.zheng@intel.com> Cc: David Rientjes <rientjes@google.com> Cc: Ingo Molnar <mingo@elte.hu> Cc: H. Peter Anvin <hpa@linux.intel.com>	2011-02-16 12:13:06 +01:00
Tejun Heo	a9aec56afa	x86-64, NUMA: Wrap acpi_numa_init() so that failure can be indicated by return value Because of the way ACPI tables are parsed, the generic acpi_numa_init() couldn't return failure when error was detected by arch hooks. Instead, the failure state was recorded and later arch dependent init hook - acpi_scan_nodes() - would fail. Wrap acpi_numa_init() with x86_acpi_numa_init() so that failure can be indicated as return value immediately. This is in preparation for further NUMA init cleanups. Signed-off-by: Tejun Heo <tj@kernel.org> Cc: Yinghai Lu <yinghai@kernel.org> Cc: Brian Gerst <brgerst@gmail.com> Cc: Cyrill Gorcunov <gorcunov@gmail.com> Cc: Shaohui Zheng <shaohui.zheng@intel.com> Cc: David Rientjes <rientjes@google.com> Cc: Ingo Molnar <mingo@elte.hu> Cc: H. Peter Anvin <hpa@linux.intel.com>	2011-02-16 12:13:06 +01:00
Tejun Heo	940fed2e79	x86-64, NUMA: Unify {acpi\|amd}_{numa_init\|scan_nodes}() arguments and return values The functions used during NUMA initialization - _numa_init() and _scan_nodes() - have different arguments and return values. Unify them such that they all take no argument and return 0 on success and -errno on failure. This is in preparation for further NUMA init cleanups. Signed-off-by: Tejun Heo <tj@kernel.org> Cc: Yinghai Lu <yinghai@kernel.org> Cc: Brian Gerst <brgerst@gmail.com> Cc: Cyrill Gorcunov <gorcunov@gmail.com> Cc: Shaohui Zheng <shaohui.zheng@intel.com> Cc: David Rientjes <rientjes@google.com> Cc: Ingo Molnar <mingo@elte.hu> Cc: H. Peter Anvin <hpa@linux.intel.com>	2011-02-16 12:13:06 +01:00
Tejun Heo	86ef4dbf1f	x86, NUMA: Drop @start/last_pfn from initmem_init() initmem_init() extensively accesses and modifies global data structures and the parameters aren't even followed depending on which path is being used. Drop @start/last_pfn and let it deal with @max_pfn directly. This is in preparation for further NUMA init cleanups. - v2: x86-32 initmem_init() weren't updated breaking 32bit builds. Fixed. Found by Yinghai. Signed-off-by: Tejun Heo <tj@kernel.org> Cc: Yinghai Lu <yinghai@kernel.org> Cc: Brian Gerst <brgerst@gmail.com> Cc: Cyrill Gorcunov <gorcunov@gmail.com> Cc: Shaohui Zheng <shaohui.zheng@intel.com> Cc: David Rientjes <rientjes@google.com> Cc: Ingo Molnar <mingo@elte.hu> Cc: H. Peter Anvin <hpa@linux.intel.com>	2011-02-16 12:13:06 +01:00
Ingo Molnar	275a88d3cf	Merge branch 'x86/amd-nb' into x86/mm Merge reason: consolidate it into the more generic x86/mm tree to prevent conflicts with ongoing NUMA work. Signed-off-by: Ingo Molnar <mingo@elte.hu>	2011-02-16 09:45:47 +01:00
Ingo Molnar	52b8b8d725	Merge branch 'x86/numa' into x86/mm Merge reason: consolidate it into the more generic x86/mm tree to prevent conflicts. Signed-off-by: Ingo Molnar <mingo@elte.hu>	2011-02-16 09:44:15 +01:00
Ingo Molnar	02ac81a812	Merge branch 'x86/bootmem' into x86/mm Merge reason: the topic is ready - consolidate it into the more generic x86/mm tree and prevent conflicts. Signed-off-by: Ingo Molnar <mingo@elte.hu>	2011-02-16 09:43:54 +01:00
Feng Tang	6b617e224d	x86/platform: Add a wallclock_init func to x86_init.timers ops Some wall clock devices use MMIO based HW register, this new function will give them a chance to do some initialization work before their get/set_time service get called, which is usually in early kernel boot phase. Signed-off-by: Feng Tang <feng.tang@intel.com> Signed-off-by: Jacob Pan <jacob.jun.pan@linux.intel.com> Signed-off-by: Alan Cox <alan@linux.intel.com> Signed-off-by: Thomas Gleixner <tglx@linutronix.de>	2011-02-14 18:20:43 +01:00
Ingo Molnar	b366801c95	Merge commit 'v2.6.38-rc4' into x86/numa Merge reason: Merge latest fixes before applying new patch. Signed-off-by: Ingo Molnar <mingo@elte.hu>	2011-02-14 13:28:31 +01:00
Ingo Molnar	91e04ec058	Merge commit 'v2.6.38-rc4' into x86/cpu Merge reason: pick up the latest fixes. Signed-off-by: Ingo Molnar <mingo@elte.hu>	2011-02-14 13:18:56 +01:00
Shaohua Li	70e4a36973	x86: Scale up the number of TLB invalidate vectors with NR_CPUs, up to 32 Make the maxium TLB invalidate vectors depend on NR_CPUS linearly, with a maximum of 32 vectors. We currently only have 8 vectors for TLB invalidation and that is clearly inadequate. If we have a lot of CPUs, the CPUs need share the 8 vectors and tlbstate_lock is used to protect them. flush_tlb_page() is heavily used in page reclaim, which will cause a lot of lock contention for tlbstate_lock. Andi Kleen suggested increasing the vectors number to 32, which should be good for current typical systems to reduce the tlbstate_lock contention. My test system has 4 sockets and 64G memory, and 64 CPUs. My workload creates 64 processes. Each process mmap reads a big empty sparse file. The total size of the files are 2*total_mem, so this will cause a lot of page reclaim. Below is the result I get from perf call-graph profiling: without the patch: ------------------ 24.25% usemem [kernel] [k] _raw_spin_lock \| --- _raw_spin_lock \| \|--42.15%-- native_flush_tlb_others with the patch: ------------------ 14.96% usemem [kernel] [k] _raw_spin_lock \| --- _raw_spin_lock \|--13.89%-- native_flush_tlb_others So this heavily reduces the tlbstate_lock contention. Suggested-by: Andi Kleen <andi@firstfloor.org> Signed-off-by: Shaohua Li <shaohua.li@intel.com> Cc: Eric Dumazet <eric.dumazet@gmail.com> Cc: Arnaldo Carvalho de Melo <acme@redhat.com> Cc: Frederic Weisbecker <fweisbec@gmail.com> Cc: Peter Zijlstra <a.p.zijlstra@chello.nl> LKML-Reference: <1295232727.1949.709.camel@sli10-conroe> Signed-off-by: Ingo Molnar <mingo@elte.hu>	2011-02-14 13:03:08 +01:00
Shaohua Li	3a09fb4570	x86: Allocate 32 tlb_invalidate_interrupt handler stubs Add up to 32 invalidate_interrupt handlers. How many handlers are added depends on NUM_INVALIDATE_TLB_VECTORS. So if NUM_INVALIDATE_TLB_VECTORS is smaller than 32, we reduce code size. Signed-off-by: Shaohua Li <shaohua.li@intel.com> Cc: Andi Kleen <andi@firstfloor.org> Cc: Eric Dumazet <eric.dumazet@gmail.com> LKML-Reference: <1295232725.1949.708.camel@sli10-conroe> Signed-off-by: Ingo Molnar <mingo@elte.hu>	2011-02-14 13:03:08 +01:00
Shaohua Li	60f6e65d78	x86: Cleanup vector usage Cleanup the vector usage and make them continuous if possible. Signed-off-by: Shaohua Li <shaohua.li@intel.com> Cc: Andi Kleen <andi@firstfloor.org> Cc: Eric Dumazet <eric.dumazet@gmail.com> LKML-Reference: <1295232722.1949.707.camel@sli10-conroe> Signed-off-by: Ingo Molnar <mingo@elte.hu>	2011-02-14 13:03:07 +01:00
Borislav Petkov	1c9d16e359	x86: Fix mwait_usable section mismatch We use it in non __cpuinit code now too so drop marker. Signed-off-by: Borislav Petkov <borislav.petkov@amd.com> LKML-Reference: <20110211171754.GA21047@aftab> Signed-off-by: Ingo Molnar <mingo@elte.hu>	2011-02-14 12:08:28 +01:00
Ingo Molnar	d2137d5af4	Merge branch 'linus' into x86/bootmem Conflicts: arch/x86/mm/numa_64.c Merge reason: fix the conflict, update to latest -rc and pick up this dependent fix from Yinghai: e6d2e2b2b1e1: memblock: don't adjust size in memblock_find_base() Signed-off-by: Ingo Molnar <mingo@elte.hu>	2011-02-14 11:55:18 +01:00
Jan Beulich	691269f0d9	x86: Adjust section placement in AMD northbridge related code amd_nb_misc_ids[] can live in .rodata, and enable_pci_io_ecs() can be moved into .cpuinit.text. Signed-off-by: Jan Beulich <jbeulich@novell.com> Cc: Hans Rosenfeld <hans.rosenfeld@amd.com> Cc: Andreas Herrmann <Andreas.Herrmann3@amd.com> Cc: Borislav Petkov <borislav.petkov@amd.com> LKML-Reference: <4D525DDD0200007800030F07@vpn.id2.novell.com> Signed-off-by: Ingo Molnar <mingo@elte.hu>	2011-02-10 13:32:52 +01:00
Jan Beulich	2fb270f321	x86: Fix section mismatch in LAPIC initialization Additionally doing things conditionally upon smp_processor_id() being zero is generally a bad idea, as this means CPU 0 cannot be offlined and brought back online later again. While there may be other places where this is done, I think adding more of those should be avoided so that some day SMP can really become "symmetrical". Signed-off-by: Jan Beulich <jbeulich@novell.com> Cc: Cyrill Gorcunov <gorcunov@gmail.com> LKML-Reference: <4D525C7E0200007800030EE1@vpn.id2.novell.com> Signed-off-by: Ingo Molnar <mingo@elte.hu>	2011-02-10 13:26:53 +01:00
Hans Rosenfeld	cabb5bd7ff	x86, amd: Support L3 Cache Partitioning on AMD family 0x15 CPUs L3 Cache Partitioning allows selecting which of the 4 L3 subcaches can be used for evictions by the L2 cache of each compute unit. By writing a 4-bit hexadecimal mask into the the sysfs file /sys/devices/system/cpu/cpuX/cache/index3/subcaches, the user can set the enabled subcaches for a CPU. The settings are directly read from and written to the hardware, so there is no way to have contradicting settings for two CPUs belonging to the same compute unit. Writing will always overwrite any previous setting for a compute unit. Signed-off-by: Hans Rosenfeld <hans.rosenfeld@amd.com> Cc: <Andreas.Herrmann3@amd.com> LKML-Reference: <1297098639-431383-1-git-send-email-hans.rosenfeld@amd.com> [ -v3: minor style fixes ] Signed-off-by: Ingo Molnar <mingo@elte.hu>	2011-02-07 19:16:22 +01:00
Linus Torvalds	07675f484b	Merge branch 'x86-fixes-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip * 'x86-fixes-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip: x86-32: Make sure the stack is set up before we use it x86, mtrr: Avoid MTRR reprogramming on BP during boot on UP platforms x86, nx: Don't force pages RW when setting NX bits	2011-02-06 12:03:10 -08:00
H. Peter Anvin	11d4c3f9b6	x86-32: Make sure the stack is set up before we use it Since checkin `ebba638ae7` we call verify_cpu even in 32-bit mode. Unfortunately, calling a function means using the stack, and the stack pointer was not initialized in the 32-bit setup code! This code initializes the stack pointer, and simplifies the interface slightly since it is easier to rely on just a pointer value rather than a descriptor; we need to have different values for the segment register anyway. This retains start_stack as a virtual address, even though a physical address would be more convenient for 32 bits; the 64-bit code wants the other way around... Reported-by: Matthieu Castet <castet.matthieu@free.fr> LKML-Reference: <4D41E86D.8060205@free.fr> Tested-by: Kees Cook <kees.cook@canonical.com> Signed-off-by: H. Peter Anvin <hpa@linux.intel.com>	2011-02-04 22:27:28 -08:00
Suresh Siddha	831d52bc15	x86, mm: avoid possible bogus tlb entries by clearing prev mm_cpumask after switching mm Clearing the cpu in prev's mm_cpumask early will avoid the flush tlb IPI's while the cr3 is still pointing to the prev mm. And this window can lead to the possibility of bogus TLB fills resulting in strange failures. One such problematic scenario is mentioned below. T1. CPU-1 is context switching from mm1 to mm2 context and got a NMI etc between the point of clearing the cpu from the mm_cpumask(mm1) and before reloading the cr3 with the new mm2. T2. CPU-2 is tearing down a specific vma for mm1 and will proceed with flushing the TLB for mm1. It doesn't send the flush TLB to CPU-1 as it doesn't see that cpu listed in the mm_cpumask(mm1). T3. After the TLB flush is complete, CPU-2 goes ahead and frees the page-table pages associated with the removed vma mapping. T4. CPU-2 now allocates those freed page-table pages for something else. T5. As the CR3 and TLB caches for mm1 is still active on CPU-1, CPU-1 can potentially speculate and walk through the page-table caches and can insert new TLB entries. As the page-table pages are already freed and being used on CPU-2, this page walk can potentially insert a bogus global TLB entry depending on the (random) contents of the page that is being used on CPU-2. T6. This bogus TLB entry being global will be active across future CR3 changes and can result in weird memory corruption etc. To avoid this issue, for the prev mm that is handing over the cpu to another mm, clear the cpu from the mm_cpumask(prev) after the cr3 is changed. Marking it for -stable, though we haven't seen any reported failure that can be attributed to this. Signed-off-by: Suresh Siddha <suresh.b.siddha@intel.com> Acked-by: Ingo Molnar <mingo@elte.hu> Cc: stable@kernel.org [v2.6.32+] Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2011-02-03 13:32:39 -08:00
Richard Cochran	ce26efdefa	x86: Add clock_adjtime for x86 This patch adds the clock_adjtime system call to the x86 architecture. Signed-off-by: Richard Cochran <richard.cochran@omicron.at> Acked-by: John Stultz <johnstul@us.ibm.com> LKML-Reference: <20110201134419.968905083@linutronix.de> Signed-off-by: Thomas Gleixner <tglx@linutronix.de>	2011-02-02 15:28:19 +01:00
Tejun Heo	eff9073790	x86: Rename incorrectly named parameter of numa_cpu_node() numa_cpu_node() prototype in numa_32.h has wrongly named parameter @apicid when it actually takes the CPU number. Change it to @cpu. Reported-by: Yinghai Lu <yinghai@kernel.org> Signed-off-by: Tejun Heo <tj@kernel.org> LKML-Reference: <20110131155905.GM7459@htj.dyndns.org> Signed-off-by: Ingo Molnar <mingo@elte.hu>	2011-01-31 18:22:25 +01:00
Thomas Gleixner	51563cd53c	Merge branch 'tip/rtmutex' of git://git.kernel.org/pub/scm/linux/kernel/git/rostedt/linux-2.6-trace into core/locking *git://git.kernel.org/pub/scm/linux/kernel/git/rostedt/linux-2.6-trace tip/rtmutex: rtmutex: Simplify PI algorithm and make highest prio task get lock	2011-01-31 15:09:14 +01:00
Tejun Heo	4e62445b90	x86: Fix build failure on X86_UP_APIC Commit `4c321ff8` (x86: Replace cpu_2_logical_apicid[] with early percpu variable) and following changes introduced and used x86_cpu_to_logical_apicid percpu variable. It was declared and defined inside CONFIG_SMP && CONFIG_X86_32 but if CONFIG_X86_UP_APIC is set UP configuration makes use of it and build fails. Fix it by declaring and defining it inside CONFIG_X86_LOCAL_APIC && CONFIG_X86_32. Signed-off-by: Tejun Heo <tj@kernel.org> Reported-by: Ingo Molnar <mingo@elte.hu> Cc: eric.dumazet@gmail.com Cc: yinghai@kernel.org Cc: brgerst@gmail.com Cc: gorcunov@gmail.com Cc: penberg@kernel.org Cc: shaohui.zheng@intel.com Cc: rientjes@google.com LKML-Reference: <20110128162248.GA25746@htj.dyndns.org> Signed-off-by: Ingo Molnar <mingo@elte.hu>	2011-01-28 17:24:49 +01:00
Tejun Heo	8db78cc4b4	x86: Unify NUMA initialization between 32 and 64bit Now that everything else is unified, NUMA initialization can be unified too. * numa_init_array() and init_cpu_to_node() are moved from numa_64 to numa. * numa_32::initmem_init() is updated to call numa_init_array() and setup_arch() to call init_cpu_to_node() on 32bit too. * x86_cpu_to_node_map is now initialized to NUMA_NO_NODE on 32bit too. This is safe now as numa_init_array() will initialize it early during boot. This makes NUMA mapping fully initialized before setup_per_cpu_areas() on 32bit too and thus makes the first percpu chunk which contains all the static variables and some of dynamic area allocated with NUMA affinity correctly considered. Signed-off-by: Tejun Heo <tj@kernel.org> Cc: yinghai@kernel.org Cc: brgerst@gmail.com Cc: gorcunov@gmail.com Cc: shaohui.zheng@intel.com Cc: rientjes@google.com LKML-Reference: <1295789862-25482-17-git-send-email-tj@kernel.org> Signed-off-by: Ingo Molnar <mingo@elte.hu> Reported-by: Eric Dumazet <eric.dumazet@gmail.com> Reviewed-by: Pekka Enberg <penberg@kernel.org>	2011-01-28 14:54:10 +01:00
Tejun Heo	de2d9445f1	x86: Unify node_to_cpumask_map handling between 32 and 64bit x86_32 has been managing node_to_cpumask_map explicitly from map_cpu_to_node() and friends in a rather ugly way. With previous changes, it's now possible to share the code with 64bit. * When CONFIG_NUMA_EMU is disabled, numa_add/remove_cpu() are implemented in numa.c and shared by 32 and 64bit. CONFIG_NUMA_EMU versions still live in numa_64.c. NUMA_EMU's dependency on 64bit is planned to be removed and the above should go away together. * identify_cpu() now calls numa_add_cpu() for 32bit too. This makes the explicit mask management from map_cpu_to_node() unnecessary. * The whole x86_32 specific map_cpu_to_node() chunk is no longer necessary. Dropped. Signed-off-by: Tejun Heo <tj@kernel.org> Reviewed-by: Pekka Enberg <penberg@kernel.org> Cc: eric.dumazet@gmail.com Cc: yinghai@kernel.org Cc: brgerst@gmail.com Cc: gorcunov@gmail.com Cc: shaohui.zheng@intel.com Cc: rientjes@google.com LKML-Reference: <1295789862-25482-16-git-send-email-tj@kernel.org> Signed-off-by: Ingo Molnar <mingo@elte.hu> Cc: David Rientjes <rientjes@google.com> Cc: Shaohui Zheng <shaohui.zheng@intel.com>	2011-01-28 14:54:10 +01:00
Tejun Heo	645a79195f	x86: Unify CPU -> NUMA node mapping between 32 and 64bit Unlike 64bit, 32bit has been using its own cpu_to_node_map[] for CPU -> NUMA node mapping. Replace it with early_percpu variable x86_cpu_to_node_map and share the mapping code with 64bit. * USE_PERCPU_NUMA_NODE_ID is now enabled for 32bit too. * x86_cpu_to_node_map and numa_set/clear_node() are moved from numa_64 to numa. For now, on 32bit, x86_cpu_to_node_map is initialized with 0 instead of NUMA_NO_NODE. This is to avoid introducing unexpected behavior change and will be updated once init path is unified. * srat_detect_node() is now enabled for x86_32 too. It calls numa_set_node() and initializes the mapping making explicit cpu_to_node_map[] updates from map/unmap_cpu_to_node() unnecessary. Signed-off-by: Tejun Heo <tj@kernel.org> Cc: eric.dumazet@gmail.com Cc: yinghai@kernel.org Cc: brgerst@gmail.com Cc: gorcunov@gmail.com Cc: penberg@kernel.org Cc: shaohui.zheng@intel.com Cc: rientjes@google.com LKML-Reference: <1295789862-25482-15-git-send-email-tj@kernel.org> Signed-off-by: Ingo Molnar <mingo@elte.hu> Cc: David Rientjes <rientjes@google.com>	2011-01-28 14:54:09 +01:00
Tejun Heo	bbc9e2f452	x86: Unify cpu/apicid <-> NUMA node mapping between 32 and 64bit The mapping between cpu/apicid and node is done via apicid_to_node[] on 64bit and apicid_2_node[] + apic->x86_32_numa_cpu_node() on 32bit. This difference makes it difficult to further unify 32 and 64bit NUMA handling. This patch unifies it by replacing both apicid_to_node[] and apicid_2_node[] with __apicid_to_node[] array, which is accessed by two accessors - set_apicid_to_node() and numa_cpu_node(). On 64bit, numa_cpu_node() always consults __apicid_to_node[] directly while 32bit goes through apic->numa_cpu_node() method to allow apic implementations to override it. srat_detect_node() for amd cpus contains workaround for broken NUMA configuration which assumes relationship between APIC ID, HT node ID and NUMA topology. Leave it to access __apicid_to_node[] directly as mapping through CPU might result in undesirable behavior change. The comment is reformatted and updated to note the ugliness. Signed-off-by: Tejun Heo <tj@kernel.org> Reviewed-by: Pekka Enberg <penberg@kernel.org> Cc: eric.dumazet@gmail.com Cc: yinghai@kernel.org Cc: brgerst@gmail.com Cc: gorcunov@gmail.com Cc: shaohui.zheng@intel.com Cc: rientjes@google.com LKML-Reference: <1295789862-25482-14-git-send-email-tj@kernel.org> Signed-off-by: Ingo Molnar <mingo@elte.hu> Cc: David Rientjes <rientjes@google.com>	2011-01-28 14:54:09 +01:00
Tejun Heo	89e5dc218e	x86: Replace apic->apicid_to_node() with ->x86_32_numa_cpu_node() apic->apicid_to_node() is 32bit specific apic operation which determines NUMA node for a CPU. Depending on the APIC implementation, it can be easier to determine NUMA node from either physical or logical apicid. Currently, ->apicid_to_node() takes @logical_apicid and calls hard_smp_processor_id() if the physical apicid is needed. This prevents NUMA mapping from being queried from a different CPU, which in turn makes it impossible to initialize NUMA mapping before SMP bringup. This patch replaces apic->apicid_to_node() with ->x86_32_numa_cpu_node() which takes @cpu, from which both logical and physical apicids can easily be determined. While at it, drop duplicate implementations from bigsmp_32 and summit_32, and use the default one. Signed-off-by: Tejun Heo <tj@kernel.org> Reviewed-by: Pekka Enberg <penberg@kernel.org> Cc: eric.dumazet@gmail.com Cc: yinghai@kernel.org Cc: brgerst@gmail.com Cc: gorcunov@gmail.com Cc: shaohui.zheng@intel.com Cc: rientjes@google.com LKML-Reference: <1295789862-25482-13-git-send-email-tj@kernel.org> Signed-off-by: Ingo Molnar <mingo@elte.hu>	2011-01-28 14:54:08 +01:00
Tejun Heo	acb8bc09c6	x86: Add apic->x86_32_early_logical_apicid() On x86_32, the mapping between cpu and logical apic ID differs depending on the specific apic implementation in use. The mapping is initialized while bringing up CPUs; however, this makes early inits ignore memory topology. Add a x86_32 specific apic->x86_32_early_logical_apicid() which is called early during boot to query the mapping. The mapping is later verified against the result of init_apic_ldr(). The method is allowed to return BAD_APICID if it can't be determined early. noop variant which always returns BAD_APICID is implemented and added to all x86_32 apic implementations. Signed-off-by: Tejun Heo <tj@kernel.org> Cc: eric.dumazet@gmail.com Cc: yinghai@kernel.org Cc: brgerst@gmail.com Cc: gorcunov@gmail.com Cc: penberg@kernel.org Cc: shaohui.zheng@intel.com Cc: rientjes@google.com LKML-Reference: <1295789862-25482-8-git-send-email-tj@kernel.org> Signed-off-by: Ingo Molnar <mingo@elte.hu>	2011-01-28 14:54:06 +01:00
Tejun Heo	7632611f53	x86: Kill apic->cpu_to_logical_apicid() After the previous patch, apic->cpu_to_logical_apicid() is no longer used. Kill it. For apic types with custom cpu_to_logical_apicid() which is also used for other purposes, remove the function and modify its users to do the mapping directly. #ifdef's on CONFIG_SMP in es7000_32 and summit_32 are ignored during conversion as they are not used for UP kernels. Signed-off-by: Tejun Heo <tj@kernel.org> Cc: eric.dumazet@gmail.com Cc: yinghai@kernel.org Cc: brgerst@gmail.com Cc: gorcunov@gmail.com Cc: penberg@kernel.org Cc: shaohui.zheng@intel.com Cc: rientjes@google.com LKML-Reference: <1295789862-25482-7-git-send-email-tj@kernel.org> Signed-off-by: Ingo Molnar <mingo@elte.hu>	2011-01-28 14:54:06 +01:00
Tejun Heo	4c321ff8a0	x86: Replace cpu_2_logical_apicid[] with early percpu variable Unlike x86_64, on x86_32, the mapping from cpu to logical apicid may vary depending on apic in use. cpu_2_logical_apicid[] array is used for this mapping. Replace it with early percpu variable x86_cpu_to_logical_apicid to make it better aligned with other mappings. Signed-off-by: Tejun Heo <tj@kernel.org> Cc: eric.dumazet@gmail.com Cc: yinghai@kernel.org Cc: brgerst@gmail.com Cc: gorcunov@gmail.com Cc: penberg@kernel.org Cc: shaohui.zheng@intel.com Cc: rientjes@google.com LKML-Reference: <1295789862-25482-5-git-send-email-tj@kernel.org> Signed-off-by: Ingo Molnar <mingo@elte.hu>	2011-01-28 14:54:05 +01:00
Tejun Heo	1245e1668c	x86: Make default_send_IPI_mask_sequence/allbutself_logical() 32bit only Both functions are used only in 32bit. Put them inside CONFIG_X86_32. This is to prepare for logical apicid handling update. - Cyrill Gorcunov spotted that I forgot to move declarations in ipi.h under CONFIG_X86_32. Fixed. Signed-off-by: Tejun Heo <tj@kernel.org> Reviewed-by: Pekka Enberg <penberg@kernel.org> Reviewed-by: Cyrill Gorcunov <gorcunov@gmail.com> Acked-by: Yinghai Lu <yinghai@kernel.org> Cc: eric.dumazet@gmail.com Cc: brgerst@gmail.com Cc: shaohui.zheng@intel.com Cc: rientjes@google.com LKML-Reference: <1295789862-25482-4-git-send-email-tj@kernel.org> Signed-off-by: Ingo Molnar <mingo@elte.hu>	2011-01-28 14:54:05 +01:00
Tejun Heo	b78aa66b1f	x86: Drop x86_32 MAX_APICID Commit `56d91f13` (x86, acpi: Add MAX_LOCAL_APIC for 32bit) added MAX_LOCAL_APIC for x86_32 but didn't replace MAX_APICID users with it. Convert MAX_APICID users to MAX_LOCAL_APIC and drop MAX_APICID. Signed-off-by: Tejun Heo <tj@kernel.org> Reviewed-by: Pekka Enberg <penberg@kernel.org> Acked-by: Yinghai Lu <yinghai@kernel.org> Cc: eric.dumazet@gmail.com Cc: yinghai@kernel.org Cc: brgerst@gmail.com Cc: gorcunov@gmail.com Cc: shaohui.zheng@intel.com Cc: rientjes@google.com LKML-Reference: <1295789862-25482-3-git-send-email-tj@kernel.org> Signed-off-by: Ingo Molnar <mingo@elte.hu>	2011-01-28 14:54:04 +01:00
Linus Torvalds	f7b548fa3d	Merge branch 'x86-fixes-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip * 'x86-fixes-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip: percpu, x86: Fix percpu_xchg_op() x86: Remove left over system_64.h x86-64: Don't use pointer to out-of-scope variable in dump_trace()	2011-01-28 06:43:41 +10:00
Thomas Gleixner	aac72277fd	rwsem: Move duplicate function prototypes to linux/rwsem.h All architecture specific rwsem headers carry the same function prototypes. Just x86 adds asmregparm, which is an empty define on all other architectures. S390 has a stale rwsem_downgrade_write() prototype. Remove the duplicates and add the prototypes to linux/rwsem.h Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Cc: Peter Zijlstra <peterz@infradead.org> Cc: David Howells <dhowells@redhat.com> Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org> Cc: Richard Henderson <rth@twiddle.net> Acked-by: Tony Luck <tony.luck@intel.com> Acked-by: Heiko Carstens <heiko.carstens@de.ibm.com> Cc: Paul Mundt <lethal@linux-sh.org> Acked-by: David Miller <davem@davemloft.net> Cc: Chris Zankel <chris@zankel.net> LKML-Reference: <20110126195833.970840140@linutronix.de> Signed-off-by: Thomas Gleixner <tglx@linutronix.de>	2011-01-27 12:30:39 +01:00
Thomas Gleixner	41e5887fa3	rwsem: Unify the duplicate rwsem_is_locked() inlines Instead of having the same implementation in each architecture, move it to linux/rwsem.h and remove the duplicates. It's unlikely that an arch will ever implement something different, but we can deal with that when it happens. Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Cc: Peter Zijlstra <peterz@infradead.org> Cc: David Howells <dhowells@redhat.com> Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org> Cc: Matt Turner <mattst88@gmail.com> Acked-by: Tony Luck <tony.luck@intel.com> Acked-by: Heiko Carstens <heiko.carstens@de.ibm.com> Cc: Paul Mundt <lethal@linux-sh.org> Acked-by: David Miller <davem@davemloft.net> Cc: Chris Zankel <chris@zankel.net> LKML-Reference: <20110126195833.876773757@linutronix.de> Signed-off-by: Thomas Gleixner <tglx@linutronix.de>	2011-01-27 12:30:39 +01:00
Thomas Gleixner	12249b3441	rwsem: Move duplicate init macros and functions to linux/rwsem.h The rwsem initializers and related macros and functions are mostly the same. Some of them lack the lockdep initializer, but having it in place does not matter for architectures which do not support lockdep. powerpc, sparc, x86: No functional change sh, s390: Removes the duplicate init_rwsem (inline and #define) alpha, ia64, xtensa: Use the lockdep capable init function in lib/rwsem.c which is just uninlining the init function for the LOCKDEP=n case Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Cc: Peter Zijlstra <peterz@infradead.org> Cc: David Howells <dhowells@redhat.com> Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org> Cc: Matt Turner <mattst88@gmail.com> Acked-by: Tony Luck <tony.luck@intel.com> Acked-by: Heiko Carstens <heiko.carstens@de.ibm.com> Cc: Paul Mundt <lethal@linux-sh.org> Acked-by: David Miller <davem@davemloft.net> Cc: Chris Zankel <chris@zankel.net> LKML-Reference: <20110126195833.771812729@linutronix.de>	2011-01-27 12:30:39 +01:00
Thomas Gleixner	1c8ed640d9	rwsem: Move duplicate struct rwsem declaration to linux/rwsem.h The difference between these declarations is the data type of the count member and the lack of lockdep in some architectures/ long is equivivalent to signed long and the #ifdef guarded dep_map member does not hurt anyone. Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Cc: Peter Zijlstra <peterz@infradead.org> Cc: David Howells <dhowells@redhat.com> Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org> Cc: Matt Turner <mattst88@gmail.com> Acked-by: Tony Luck <tony.luck@intel.com> Acked-by: Heiko Carstens <heiko.carstens@de.ibm.com> Cc: Paul Mundt <lethal@linux-sh.org> Acked-by: David Miller <davem@davemloft.net> Cc: Chris Zankel <chris@zankel.net> LKML-Reference: <20110126195833.679641914@linutronix.de> Signed-off-by: Thomas Gleixner <tglx@linutronix.de>	2011-01-27 12:30:39 +01:00
Thomas Gleixner	bde11efbc2	x86: Cleanup rwsem_count_t typedef Remove the typedef which has no real reason to be there. Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Cc: Peter Zijlstra <peterz@infradead.org> Cc: David Howells <dhowells@redhat.com> Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org> Cc: Matt Turner <mattst88@gmail.com> Cc: Tony Luck <tony.luck@intel.com> Cc: Heiko Carstens <heiko.carstens@de.ibm.com> Cc: Paul Mundt <lethal@linux-sh.org> Cc: David Miller <davem@davemloft.net> Cc: Chris Zankel <chris@zankel.net> LKML-Reference: <20110126195833.580335506@linutronix.de>	2011-01-27 12:30:38 +01:00
Thomas Gleixner	c16a87ce06	rwsem: Cleanup includes All rwsem implementations include the same headers. Include them from include/linux/rwsem.h Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Cc: Peter Zijlstra <peterz@infradead.org> Cc: David Howells <dhowells@redhat.com> Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org> Cc: Matt Turner <mattst88@gmail.com> Acked-by: Tony Luck <tony.luck@intel.com> Acked-by: Heiko Carstens <heiko.carstens@de.ibm.com> Cc: Paul Mundt <lethal@linux-sh.org> Acked-by: David Miller <davem@davemloft.net> Cc: Chris Zankel <chris@zankel.net> LKML-Reference: <20110126195833.483520950@linutronix.de>	2011-01-27 12:30:38 +01:00
Yinghai Lu	b3d7336db5	x86: Move llc_shared_map out of cpu_info cpu_info is already with per_cpu, We can take llc_shared_map out of cpu_info, and declare it as per_cpu variable directly. So later referencing could be simple and directly instead of diving to find cpu_info at first. Also could make smp_store_cpu_info() much simple to avoid to do save and restore trick. Signed-off-by: Yinghai Lu <yinghai@kernel.org> Cc: Hans Rosenfeld <hans.rosenfeld@amd.com> Cc: Alok N Kataria <akataria@vmware.com> Cc: Stephen Hemminger <shemminger@vyatta.com> Cc: Hans J. Koch <hjk@linutronix.de> Cc: Tejun Heo <tj@kernel.org> Cc: Borislav Petkov <borislav.petkov@amd.com> Cc: Andreas Herrmann <andreas.herrmann3@amd.com> Cc: Robert Richter <robert.richter@amd.com> Cc: Suresh Siddha <suresh.b.siddha@intel.com> LKML-Reference: <4D3A16E8.5020608@kernel.org> Signed-off-by: Ingo Molnar <mingo@elte.hu>	2011-01-26 08:44:49 +01:00
Hans Rosenfeld	41b2610c34	x86, amd: Extend AMD northbridge caching code to support "Link Control" devices "Link Control" devices (NB function 4) will be used by L3 cache partitioning on family 0x15. Signed-off-by: Hans Rosenfeld <hans.rosenfeld@amd.com> Cc: <andreas.herrmann3@amd.com> LKML-Reference: <1295881543-572552-4-git-send-email-hans.rosenfeld@amd.com> Signed-off-by: Ingo Molnar <mingo@elte.hu>	2011-01-26 08:28:23 +01:00
Eric Dumazet	889a7a6a5d	percpu, x86: Fix percpu_xchg_op() These recent percpu commits: 2485b6464cf8: x86,percpu: Move out of place 64 bit ops into X86_64 section 8270137a0d50: cpuops: Use cmpxchg for xchg to avoid lock semantics Caused this 'perf top' crash: Kernel panic - not syncing: Fatal exception in interrupt Pid: 0, comm: swapper Tainted: G D 2.6.38-rc2-00181-gef71723 #413 Call Trace: <IRQ> [<ffffffff810465b5>] ? panic ? kmsg_dump ? kmsg_dump ? oops_end ? no_context ? __bad_area_nosemaphore ? perf_output_begin ? bad_area_nosemaphore ? do_page_fault ? __task_pid_nr_ns ? perf_event_tid ? __perf_event_header__init_id ? validate_chain ? perf_output_sample ? trace_hardirqs_off ? page_fault ? irq_work_run ? update_process_times ? tick_sched_timer ? tick_sched_timer ? __run_hrtimer ? hrtimer_interrupt ? account_system_vtime ? smp_apic_timer_interrupt ? apic_timer_interrupt ... Looking at assembly code, I found: list = this_cpu_xchg(irq_work_list, NULL); gives this wrong code : (gcc-4.1.2 cross compiler) ffffffff810bc45e: mov %gs:0xead0,%rax cmpxchg %rax,%gs:0xead0 jne ffffffff810bc45e <irq_work_run+0x3e> test %rax,%rax je ffffffff810bc4aa <irq_work_run+0x8a> Tell gcc we dirty eax/rax register in percpu_xchg_op() Compiler must use another register to store pxo_new__ We also dont need to reload percpu value after a jump, since a 'failed' cmpxchg already updated eax/rax Wrong generated code was : xor %rax,%rax /* load 0 into %rax / 1: mov %gs:0xead0,%rax cmpxchg %rax,%gs:0xead0 jne 1b test %rax,%rax After patch : xor %rdx,%rdx / load 0 into %rdx */ mov %gs:0xead0,%rax 1: cmpxchg %rdx,%gs:0xead0 jne 1b: test %rax,%rax Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com> Cc: Arnaldo Carvalho de Melo <acme@redhat.com> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Frederic Weisbecker <fweisbec@gmail.com> Cc: Steven Rostedt <rostedt@goodmis.org> Cc: Peter Zijlstra <a.p.zijlstra@chello.nl> Cc: Christoph Lameter <cl@linux-foundation.org> Cc: Tejun Heo <tj@kernel.org> LKML-Reference: <1295973114.3588.312.camel@edumazet-laptop> Signed-off-by: Ingo Molnar <mingo@elte.hu>	2011-01-26 08:10:49 +01:00
Yinghai Lu	9a57c3e487	x86: Remove left over system_64.h Left-over from the x86 merge ... Signed-off-by: Yinghai Lu <yinghai@kernel.org> LKML-Reference: <4D3E23D1.7010405@kernel.org> Signed-off-by: Ingo Molnar <mingo@elte.hu>	2011-01-26 08:05:58 +01:00
Andrea Arcangeli	cacf061c5e	thp: fix PARAVIRT x86 32bit noPAE This fixes TRANSPARENT_HUGEPAGE=y with PARAVIRT=y and HIGHMEM64=n. The #ifdef that this patch removes was erratically introduced to fix a build error for noPAE (where pmd.pmd doesn't exist). So then the kernel built but it failed at runtime because set_pmd_at was a noop. This will correct it by enabling set_pmd_at for noPAE mode too. Signed-off-by: Andrea Arcangeli <aarcange@redhat.com> Reported-by: werner <w.landgraf@ru.ru> Reported-by: Minchan Kim <minchan.kim@gmail.com> Tested-by: Minchan Kim <minchan.kim@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2011-01-26 10:49:57 +10:00
Linus Torvalds	4398f31ca7	Merge branch 'x86-fixes-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip * 'x86-fixes-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip: x86: Fix jump label with RO/NX module protection crash x86, hotplug: Fix powersavings with offlined cores on AMD x86, mcheck, therm_throt.c: Export symbol platform_thermal_notify to allow coretemp to handler intr x86: Use asm-generic/cacheflush.h x86: Update CPU cache attributes table descriptors	2011-01-25 05:24:12 +10:00
matthieu castet	8969691343	x86: Fix jump label with RO/NX module protection crash If we use jump table in module init, there are marked as removed in __jump_table section after init is done. But we already applied ro permissions on the module, so we can't modify a read only section (crash in remove_jump_label_module_init). Make the __jump_table section rw. Signed-off-by: Matthieu CASTET <castet.matthieu@free.fr> Cc: Xiaotian Feng <xtfeng@gmail.com> Cc: Jason Baron <jbaron@redhat.com> Cc: Steven Rostedt <rostedt@goodmis.org> Cc: Frederic Weisbecker <fweisbec@gmail.com> Cc: Peter Zijlstra <a.p.zijlstra@chello.nl> Cc: Arjan van de Ven <arjan@linux.intel.com> Cc: Siarhei Liakh <sliakh.lkml@gmail.com> Cc: Xuxian Jiang <jiang@cs.ncsu.edu> Cc: James Morris <jmorris@namei.org> Cc: Rusty Russell <rusty@rustcorp.com.au> Cc: Dave Jones <davej@redhat.com> Cc: Kees Cook <kees.cook@canonical.com> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Andrew Morton <akpm@linux-foundation.org> LKML-Reference: <4D3C3F20.7030203@free.fr> Signed-off-by: Ingo Molnar <mingo@elte.hu>	2011-01-23 16:12:45 +01:00
Borislav Petkov	93789b32db	x86, hotplug: Fix powersavings with offlined cores on AMD `ea53069231` made a CPU use monitor/mwait when offline. This is not the optimal choice for AMD wrt to powersavings and we'd prefer our cores to halt (i.e. enter C1) instead. For this, the same selection whether to use monitor/mwait has to be used as when we select the idle routine for the machine. With this patch, offlining cores 1-5 on a X6 machine allows core0 to boost again. [ hpa: putting this in urgent since it is a (power) regression fix ] Reported-by: Andreas Herrmann <andreas.herrmann3@amd.com> Cc: stable@kernel.org # 37.x Cc: H. Peter Anvin <hpa@linux.intel.com> Cc: Arjan van de Ven <arjan@linux.intel.com> Cc: Len Brown <lenb@kernel.org> Cc: Venkatesh Pallipadi <venki@google.com> Cc: Peter Zijlstra <a.p.zijlstra@chello.hl> Signed-off-by: Borislav Petkov <borislav.petkov@amd.com> LKML-Reference: <1295534572-10730-1-git-send-email-bp@amd64.org> Signed-off-by: H. Peter Anvin <hpa@linux.intel.com>	2011-01-21 18:14:54 -08:00
Linus Torvalds	ebe0d80507	Merge branch 'fixes-2.6.38' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/percpu * 'fixes-2.6.38' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/percpu: x86,percpu: Move out of place 64 bit ops into X86_64 section	2011-01-21 13:43:21 -08:00
Akinobu Mita	cc67ba6352	x86: Use asm-generic/cacheflush.h The implementation of the cache flushing interfaces on the x86 is identical with the default implementation in asm-generic. Signed-off-by: Akinobu Mita <akinobu.mita@gmail.com> Cc: "H. Peter Anvin" <hpa@zytor.com> Cc: arnd@arndb.de LKML-Reference: <1295523136-4277-2-git-send-email-akinobu.mita@gmail.com> Signed-off-by: Ingo Molnar <mingo@elte.hu>	2011-01-21 14:11:12 +01:00
Jan Beulich	9032160275	x86: Unify "numa=" command line option handling In order to be able to suppress the use of SRAT tables that 32-bit Linux can't deal with (in one case known to lead to a non-bootable system, unless disabling ACPI altogether), move the "numa=" option handling to common code. Signed-off-by: Jan Beulich <jbeulich@novell.com> Reviewed-by: Thomas Renninger <trenn@suse.de> Cc: Tejun Heo <tj@kernel.org> Cc: Thomas Renninger <trenn@suse.de> LKML-Reference: <4D36B581020000780002D0FF@vpn.id2.novell.com> Signed-off-by: Ingo Molnar <mingo@elte.hu>	2011-01-19 10:25:18 +01:00
Linus Torvalds	dc8e7e3ec6	Merge branch 'idle-release' of git://git.kernel.org/pub/scm/linux/kernel/git/lenb/linux-idle-2.6 * 'idle-release' of git://git.kernel.org/pub/scm/linux/kernel/git/lenb/linux-idle-2.6: cpuidle/x86/perf: fix power:cpu_idle double end events and throw cpu_idle events from the cpuidle layer intel_idle: open broadcast clock event cpuidle: CPUIDLE_FLAG_CHECK_BM is omap3_idle specific cpuidle: CPUIDLE_FLAG_TLB_FLUSHED is specific to intel_idle cpuidle: delete unused CPUIDLE_FLAG_SHALLOW, BALANCED, DEEP definitions SH, cpuidle: delete use of NOP CPUIDLE_FLAGS_SHALLOW cpuidle: delete NOP CPUIDLE_FLAG_POLL ACPI: processor_idle: delete use of NOP CPUIDLE_FLAGs cpuidle: Rename X86 specific idle poll state[0] from C0 to POLL ACPI, intel_idle: Cleanup idle= internal variables cpuidle: Make cpuidle_enable_device() call poll_idle_init() intel_idle: update Sandy Bridge core C-state residency targets	2011-01-13 20:15:18 -08:00
Linus Torvalds	9c4bc1c2be	Merge branch 'stable/gntdev' of git://git.kernel.org/pub/scm/linux/kernel/git/konrad/xen * 'stable/gntdev' of git://git.kernel.org/pub/scm/linux/kernel/git/konrad/xen: xen/p2m: Fix module linking error. xen p2m: clear the old pte when adding a page to m2p_override xen gntdev: use gnttab_map_refs and gnttab_unmap_refs xen: introduce gnttab_map_refs and gnttab_unmap_refs xen p2m: transparently change the p2m mappings in the m2p override xen/gntdev: Fix circular locking dependency xen/gntdev: stop using "token" argument xen: gntdev: move use of GNTMAP_contains_pte next to the map_op xen: add m2p override mechanism xen: move p2m handling to separate file xen/gntdev: add VM_PFNMAP to vma xen/gntdev: allow usermode to map granted pages xen: define gnttab_set_map_op/unmap_op Fix up trivial conflict in drivers/xen/Kconfig	2011-01-13 18:46:48 -08:00
Andrea Arcangeli	8ee53820ed	thp: mmu_notifier_test_young For GRU and EPT, we need gup-fast to set referenced bit too (this is why it's correct to return 0 when shadow_access_mask is zero, it requires gup-fast to set the referenced bit). qemu-kvm access already sets the young bit in the pte if it isn't zero-copy, if it's zero copy or a shadow paging EPT minor fault we relay on gup-fast to signal the page is in use... We also need to check the young bits on the secondary pagetables for NPT and not nested shadow mmu as the data may never get accessed again by the primary pte. Without this closer accuracy, we'd have to remove the heuristic that avoids collapsing hugepages in hugepage virtual regions that have not even a single subpage in use. ->test_young is full backwards compatible with GRU and other usages that don't have young bits in pagetables set by the hardware and that should nuke the secondary mmu mappings when ->clear_flush_young runs just like EPT does. Removing the heuristic that checks the young bit in khugepaged/collapse_huge_page completely isn't so bad either probably but I thought it was worth it and this makes it reliable. Signed-off-by: Andrea Arcangeli <aarcange@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2011-01-13 17:32:46 -08:00
Andrea Arcangeli	4b7167b9ff	thp: don't allow transparent hugepage support without PSE Archs implementing Transparent Hugepage Support must implement a function called has_transparent_hugepage to be sure the virtual or physical CPU supports Transparent Hugepages. Signed-off-by: Andrea Arcangeli <aarcange@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2011-01-13 17:32:45 -08:00
Johannes Weiner	c489f1257b	thp: add pmd_modify Add pmd_modify() for use with mprotect() on huge pmds. Signed-off-by: Johannes Weiner <hannes@cmpxchg.org> Signed-off-by: Andrea Arcangeli <aarcange@redhat.com> Reviewed-by: Rik van Riel <riel@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2011-01-13 17:32:44 -08:00
Johannes Weiner	f2d6bfe9ff	thp: add x86 32bit support Add support for transparent hugepages to x86 32bit. Share the same VM_ bitflag for VM_MAPPED_COPY. mm/nommu.c will never support transparent hugepages. Signed-off-by: Johannes Weiner <hannes@cmpxchg.org> Signed-off-by: Andrea Arcangeli <aarcange@redhat.com> Reviewed-by: Rik van Riel <riel@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2011-01-13 17:32:44 -08:00
Andrea Arcangeli	71e3aac072	thp: transparent hugepage core Lately I've been working to make KVM use hugepages transparently without the usual restrictions of hugetlbfs. Some of the restrictions I'd like to see removed: 1) hugepages have to be swappable or the guest physical memory remains locked in RAM and can't be paged out to swap 2) if a hugepage allocation fails, regular pages should be allocated instead and mixed in the same vma without any failure and without userland noticing 3) if some task quits and more hugepages become available in the buddy, guest physical memory backed by regular pages should be relocated on hugepages automatically in regions under madvise(MADV_HUGEPAGE) (ideally event driven by waking up the kernel deamon if the order=HPAGE_PMD_SHIFT-PAGE_SHIFT list becomes not null) 4) avoidance of reservation and maximization of use of hugepages whenever possible. Reservation (needed to avoid runtime fatal faliures) may be ok for 1 machine with 1 database with 1 database cache with 1 database cache size known at boot time. It's definitely not feasible with a virtualization hypervisor usage like RHEV-H that runs an unknown number of virtual machines with an unknown size of each virtual machine with an unknown amount of pagecache that could be potentially useful in the host for guest not using O_DIRECT (aka cache=off). hugepages in the virtualization hypervisor (and also in the guest!) are much more important than in a regular host not using virtualization, becasue with NPT/EPT they decrease the tlb-miss cacheline accesses from 24 to 19 in case only the hypervisor uses transparent hugepages, and they decrease the tlb-miss cacheline accesses from 19 to 15 in case both the linux hypervisor and the linux guest both uses this patch (though the guest will limit the addition speedup to anonymous regions only for now...). Even more important is that the tlb miss handler is much slower on a NPT/EPT guest than for a regular shadow paging or no-virtualization scenario. So maximizing the amount of virtual memory cached by the TLB pays off significantly more with NPT/EPT than without (even if there would be no significant speedup in the tlb-miss runtime). The first (and more tedious) part of this work requires allowing the VM to handle anonymous hugepages mixed with regular pages transparently on regular anonymous vmas. This is what this patch tries to achieve in the least intrusive possible way. We want hugepages and hugetlb to be used in a way so that all applications can benefit without changes (as usual we leverage the KVM virtualization design: by improving the Linux VM at large, KVM gets the performance boost too). The most important design choice is: always fallback to 4k allocation if the hugepage allocation fails! This is the _very_ opposite of some large pagecache patches that failed with -EIO back then if a 64k (or similar) allocation failed... Second important decision (to reduce the impact of the feature on the existing pagetable handling code) is that at any time we can split an hugepage into 512 regular pages and it has to be done with an operation that can't fail. This way the reliability of the swapping isn't decreased (no need to allocate memory when we are short on memory to swap) and it's trivial to plug a split_huge_page* one-liner where needed without polluting the VM. Over time we can teach mprotect, mremap and friends to handle pmd_trans_huge natively without calling split_huge_page. The fact it can't fail isn't just for swap: if split_huge_page would return -ENOMEM (instead of the current void) we'd need to rollback the mprotect from the middle of it (ideally including undoing the split_vma) which would be a big change and in the very wrong direction (it'd likely be simpler not to call split_huge_page at all and to teach mprotect and friends to handle hugepages instead of rolling them back from the middle). In short the very value of split_huge_page is that it can't fail. The collapsing and madvise(MADV_HUGEPAGE) part will remain separated and incremental and it'll just be an "harmless" addition later if this initial part is agreed upon. It also should be noted that locking-wise replacing regular pages with hugepages is going to be very easy if compared to what I'm doing below in split_huge_page, as it will only happen when page_count(page) matches page_mapcount(page) if we can take the PG_lock and mmap_sem in write mode. collapse_huge_page will be a "best effort" that (unlike split_huge_page) can fail at the minimal sign of trouble and we can try again later. collapse_huge_page will be similar to how KSM works and the madvise(MADV_HUGEPAGE) will work similar to madvise(MADV_MERGEABLE). The default I like is that transparent hugepages are used at page fault time. This can be changed with /sys/kernel/mm/transparent_hugepage/enabled. The control knob can be set to three values "always", "madvise", "never" which mean respectively that hugepages are always used, or only inside madvise(MADV_HUGEPAGE) regions, or never used. /sys/kernel/mm/transparent_hugepage/defrag instead controls if the hugepage allocation should defrag memory aggressively "always", only inside "madvise" regions, or "never". The pmd_trans_splitting/pmd_trans_huge locking is very solid. The put_page (from get_user_page users that can't use mmu notifier like O_DIRECT) that runs against a __split_huge_page_refcount instead was a pain to serialize in a way that would result always in a coherent page count for both tail and head. I think my locking solution with a compound_lock taken only after the page_first is valid and is still a PageHead should be safe but it surely needs review from SMP race point of view. In short there is no current existing way to serialize the O_DIRECT final put_page against split_huge_page_refcount so I had to invent a new one (O_DIRECT loses knowledge on the mapping status by the time gup_fast returns so...). And I didn't want to impact all gup/gup_fast users for now, maybe if we change the gup interface substantially we can avoid this locking, I admit I didn't think too much about it because changing the gup unpinning interface would be invasive. If we ignored O_DIRECT we could stick to the existing compound refcounting code, by simply adding a get_user_pages_fast_flags(foll_flags) where KVM (and any other mmu notifier user) would call it without FOLL_GET (and if FOLL_GET isn't set we'd just BUG_ON if nobody registered itself in the current task mmu notifier list yet). But O_DIRECT is fundamental for decent performance of virtualized I/O on fast storage so we can't avoid it to solve the race of put_page against split_huge_page_refcount to achieve a complete hugepage feature for KVM. Swap and oom works fine (well just like with regular pages ;). MMU notifier is handled transparently too, with the exception of the young bit on the pmd, that didn't have a range check but I think KVM will be fine because the whole point of hugepages is that EPT/NPT will also use a huge pmd when they notice gup returns pages with PageCompound set, so they won't care of a range and there's just the pmd young bit to check in that case. NOTE: in some cases if the L2 cache is small, this may slowdown and waste memory during COWs because 4M of memory are accessed in a single fault instead of 8k (the payoff is that after COW the program can run faster). So we might want to switch the copy_huge_page (and clear_huge_page too) to not temporal stores. I also extensively researched ways to avoid this cache trashing with a full prefault logic that would cow in 8k/16k/32k/64k up to 1M (I can send those patches that fully implemented prefault) but I concluded they're not worth it and they add an huge additional complexity and they remove all tlb benefits until the full hugepage has been faulted in, to save a little bit of memory and some cache during app startup, but they still don't improve substantially the cache-trashing during startup if the prefault happens in >4k chunks. One reason is that those 4k pte entries copied are still mapped on a perfectly cache-colored hugepage, so the trashing is the worst one can generate in those copies (cow of 4k page copies aren't so well colored so they trashes less, but again this results in software running faster after the page fault). Those prefault patches allowed things like a pte where post-cow pages were local 4k regular anon pages and the not-yet-cowed pte entries were pointing in the middle of some hugepage mapped read-only. If it doesn't payoff substantially with todays hardware it will payoff even less in the future with larger l2 caches, and the prefault logic would blot the VM a lot. If one is emebdded transparent_hugepage can be disabled during boot with sysfs or with the boot commandline parameter transparent_hugepage=0 (or transparent_hugepage=2 to restrict hugepages inside madvise regions) that will ensure not a single hugepage is allocated at boot time. It is simple enough to just disable transparent hugepage globally and let transparent hugepages be allocated selectively by applications in the MADV_HUGEPAGE region (both at page fault time, and if enabled with the collapse_huge_page too through the kernel daemon). This patch supports only hugepages mapped in the pmd, archs that have smaller hugepages will not fit in this patch alone. Also some archs like power have certain tlb limits that prevents mixing different page size in the same regions so they will not fit in this framework that requires "graceful fallback" to basic PAGE_SIZE in case of physical memory fragmentation. hugetlbfs remains a perfect fit for those because its software limits happen to match the hardware limits. hugetlbfs also remains a perfect fit for hugepage sizes like 1GByte that cannot be hoped to be found not fragmented after a certain system uptime and that would be very expensive to defragment with relocation, so requiring reservation. hugetlbfs is the "reservation way", the point of transparent hugepages is not to have any reservation at all and maximizing the use of cache and hugepages at all times automatically. Some performance result: vmx andrea # LD_PRELOAD=/usr/lib64/libhugetlbfs.so HUGETLB_MORECORE=yes HUGETLB_PATH=/mnt/huge/ ./largep ages3 memset page fault 1566023 memset tlb miss 453854 memset second tlb miss 453321 random access tlb miss 41635 random access second tlb miss 41658 vmx andrea # LD_PRELOAD=/usr/lib64/libhugetlbfs.so HUGETLB_MORECORE=yes HUGETLB_PATH=/mnt/huge/ ./largepages3 memset page fault 1566471 memset tlb miss 453375 memset second tlb miss 453320 random access tlb miss 41636 random access second tlb miss 41637 vmx andrea # ./largepages3 memset page fault 1566642 memset tlb miss 453417 memset second tlb miss 453313 random access tlb miss 41630 random access second tlb miss 41647 vmx andrea # ./largepages3 memset page fault 1566872 memset tlb miss 453418 memset second tlb miss 453315 random access tlb miss 41618 random access second tlb miss 41659 vmx andrea # echo 0 > /proc/sys/vm/transparent_hugepage vmx andrea # ./largepages3 memset page fault 2182476 memset tlb miss 460305 memset second tlb miss 460179 random access tlb miss 44483 random access second tlb miss 44186 vmx andrea # ./largepages3 memset page fault 2182791 memset tlb miss 460742 memset second tlb miss 459962 random access tlb miss 43981 random access second tlb miss 43988 ============ #include <stdio.h> #include <stdlib.h> #include <string.h> #include <sys/time.h> #define SIZE (3UL102410241024) int main() { char p = malloc(SIZE), p2; struct timeval before, after; gettimeofday(&before, NULL); memset(p, 0, SIZE); gettimeofday(&after, NULL); printf("memset page fault %Lu\n", (after.tv_sec-before.tv_sec)1000000UL + after.tv_usec-before.tv_usec); gettimeofday(&before, NULL); memset(p, 0, SIZE); gettimeofday(&after, NULL); printf("memset tlb miss %Lu\n", (after.tv_sec-before.tv_sec)1000000UL + after.tv_usec-before.tv_usec); gettimeofday(&before, NULL); memset(p, 0, SIZE); gettimeofday(&after, NULL); printf("memset second tlb miss %Lu\n", (after.tv_sec-before.tv_sec)1000000UL + after.tv_usec-before.tv_usec); gettimeofday(&before, NULL); for (p2 = p; p2 < p+SIZE; p2 += 4096) p2 = 0; gettimeofday(&after, NULL); printf("random access tlb miss %Lu\n", (after.tv_sec-before.tv_sec)1000000UL + after.tv_usec-before.tv_usec); gettimeofday(&before, NULL); for (p2 = p; p2 < p+SIZE; p2 += 4096) p2 = 0; gettimeofday(&after, NULL); printf("random access second tlb miss %Lu\n", (after.tv_sec-before.tv_sec)*1000000UL + after.tv_usec-before.tv_usec); return 0; } ============ Signed-off-by: Andrea Arcangeli <aarcange@redhat.com> Acked-by: Rik van Riel <riel@redhat.com> Signed-off-by: Johannes Weiner <hannes@cmpxchg.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2011-01-13 17:32:42 -08:00
Andrea Arcangeli	db3eb96f4e	thp: add pmd mangling functions to x86 Add needed pmd mangling functions with symmetry with their pte counterparts. pmdp_splitting_flush() is the only new addition on the pmd_ methods and it's needed to serialize the VM against split_huge_page. It simply atomically sets the splitting bit in a similar way pmdp_clear_flush_young atomically clears the accessed bit. pmdp_splitting_flush() also has to flush the tlb to make it effective against gup_fast, but it wouldn't really require to flush the tlb too. Just the tlb flush is the simplest operation we can invoke to serialize pmdp_splitting_flush() against gup_fast. Signed-off-by: Andrea Arcangeli <aarcange@redhat.com> Acked-by: Rik van Riel <riel@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2011-01-13 17:32:40 -08:00
Andrea Arcangeli	5f6e8da70a	thp: special pmd_trans_* functions These returns 0 at compile time when the config option is disabled, to allow gcc to eliminate the transparent hugepage function calls at compile time without additional #ifdefs (only the export of those functions have to be visible to gcc but they won't be required at link time and huge_memory.o can be not built at all). _PAGE_BIT_UNUSED1 is never used for pmd, only on pte. Signed-off-by: Andrea Arcangeli <aarcange@redhat.com> Acked-by: Rik van Riel <riel@redhat.com> Acked-by: Mel Gorman <mel@csn.ul.ie> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2011-01-13 17:32:40 -08:00

... 4 5 6 7 8 ...

2992 Commits