since we now have 32-bit support for e820_register_active_regions(),
we can merge the parsing of the mem=/memmap= boot parameters.
Signed-off-by: Yinghai Lu <yhlu.kernel@gmail.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
This patch uses reserve_bootmem_generic() instead of reserve_bootmem()
to reserve the crashkernel memory on x86_64. That's necessary for NUMA
machines, see 00212fef814612245ed0261cbac8426d0c9a31a5:
[PATCH] Fix kdump Crash Kernel boot memory reservation for NUMA machines
This patch will fix a boot memory reservation bug that trashes memory on
the ES7000 when loading the kdump crash kernel.
The code in arch/x86_64/kernel/setup.c to reserve boot memory for the crash
kernel uses the non-numa aware "reserve_bootmem" function instead of the
NUMA aware "reserve_bootmem_generic". I checked to make sure that no other
function was using "reserve_bootmem" and found none, except the ones that
had NUMA ifdef'ed out.
I have tested this patch only on an ES7000 with NUMA on and off (numa=off)
in a single (non-NUMA) and multi-cell (NUMA) configurations.
Signed-off-by: Amul Shah <amul.shah@unisys.com>
Looks-good-to: Vivek Goyal <vgoyal@in.ibm.com>
Cc: Andi Kleen <ak@muc.de>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
The switch-back to reserve_bootmem() was accidentally introduced in
5c3391f9f7 when adding the BOOTMEM_EXCLUSIVE
parameter.
Signed-off-by: Bernhard Walle <bwalle@suse.de>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
This patch adds a 'flags' parameter to reserve_bootmem_generic() like it
already has been added in reserve_bootmem() with commit
72a7fe3967.
It also changes all users to use BOOTMEM_DEFAULT, which doesn't effectively
change the behaviour. Since the change is x86-specific, I don't think it's
necessary to add a new API for migration. There are only 4 users of that
function.
The change is necessary for the next patch, using reserve_bootmem_generic()
for crashkernel reservation.
Signed-off-by: Bernhard Walle <bwalle@suse.de>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
> That helped a lot, the system seems to work normally now.
>
> Here's the relevant snippet from dmesg:
>
> [ 0.108006] ..TIMER: vector=0x30 apic1=0 pin1=2 apic2=-1 pin2=-1
> [ 0.108006] ..MP-BIOS bug: 8254 timer not connected to IO-APIC
> [ 0.108006] ...trying to set up timer (IRQ0) through the 8259A ... <3>
> [ 0.108006] ..... (found apic 0 pin 2) ...<3> failed.
> [ 0.108006] ...trying to set up timer as Virtual Wire IRQ...<3> works.
>
> and the whole thing is at: http://www.sisk.pl/kernel/debug/20080618/dmesg-2.log
Hmm, that only proved the 8259A is indeed wired to the pin #2 of the I/O
APIC.
> I, personally, don't have any and AMD only has SB600 documentation on its
> web page (it's still marked as "AMD confidential" ;-)).
Well, the IC block is most likely the same as that's not rocket science
and once done there is no need to fiddle with that. That written, I am
afraid there is nothing useful about the IC in the document, except that
it's there and consists of an I/O APIC providing 24 inputs and the usual
pair of 8259A cores. Thanks for the reference anyway.
> There is an interrupt controller in there, but I'm not sure if there's any
> 8259A. The northbridge is on the CPU, actually.
I will praise the day someone ships an x86 machine without an 8259A core!
As expressed in another mail I suspect there may actually be a direct
route from the 8254 to INTIN0 in the southbridge -- this is what other
bootstrap logs seen in the Internet suggest. This would mean this
particular BIOS is buggy (is it the latest version?) and provides an
incorrect IRQ override in its ACPI tables, for example because the
responsible block has been blindly copied from a machine using a commoner
wiring. This could be moderately easily fixed up with a quirk based on
the PCI ID (after checking it again, we actually used to have a quirk for
ATI in this area, but the way it was done suggests the issue was not
understood well enough).
Could you please remove the hack sent yesterday and test the patch
provided below? I do hope it builds, but I have no immediate means to
check it. Please report the output. The intent is to test INTIN0
directly before testing INTIN2 through the 8259A. Thanks.
Aside of that, what I have gathered from your reports (please correct me
if I have got it wrong) is that when the through-8259A mode is used, then
after a while 8254 timer interrupts stop arriving. What's interesting,
the "Virtual Wire IRQ" seems to work for you correctly (that's quite an
odd setup where a local APIC input is used in the native mode -- please
post /proc/interrupts for confirmation), which in turn implies the master
8259A drives its INT output as we expect. Why would the I/O APIC input
have problems then? Hmm...
[ mingo@elte.hu: revert the "x86: fix IO APIC breakage on HP nx6325"
version. ]
Signed-off-by: Ingo Molnar <mingo@elte.hu>
On Thu, 19 Jun 2008, Rafael J. Wysocki wrote:
> > With such a configuration the "x86: I/O APIC: timer through 8259A
> > second-chance" patch should not matter, because the only change it
> > introduces is an attempt to try the same I/O APIC pin again, but with the
> > IRQ0 line of the master 8259A enabled. That's not a terribly unusual
> > configuration and nothing should get confused in the system.
>
> But it _does_ get confused, really.
Something certainly gets confused, but so far I am not sure which bit
exactly it is, are you?
> > Barring the unlikely possibility of the 8259A actually being wired to
> > INTIN2 of the I/O APIC I can see two possible explanations:
> >
> > 1. The 8259A interrupt actually escapes to the CPU somehow and is handled
> > as an ExtINTA interrupt. This would make the code in check_timer()
> > decide it has found a working configuration, while actually it has been
> > fooled.
[...]
> Here you go:
>
> [ 0.108006] ..TIMER: vector=0x30 apic1=0 pin1=2 apic2=-1 pin2=-1
> [ 0.108006] ..MP-BIOS bug: 8254 timer not connected to IO-APIC
> [ 0.108006] ...trying to set up timer (IRQ0) through the 8259A ... <3>
> [ 0.108006] ..... (found apic 0 pin 2) ...<3> works.
>
> The full dmesg is at: http://www.sisk.pl/kernel/debug/20080618/dmesg-1.log
Thanks. In this case I suspect the case #1 quoted above happens, that is
the 8259A manages to deliver its interrupt somehow. Note at this stage it
is meant to be in the AEOI mode, so it can happily resubmit the interrupt
indefinitely with no additional handling as long as it receives INTA
cycles.
Can you please try the patch below on top of "x86: I/O APIC: timer
through 8259A second-chance" to see whether my hypothesis is true? It
modifies the through-8259A setup path so that the APIC input gets masked,
but the 8259A has the timer interrupt still enabled. Let me know how the
timer interrupt is routed in this case.
Bisected-by: "Rafael J. Wysocki" <rjw@sisk.pl>
Tested-by: "Rafael J. Wysocki" <rjw@sisk.pl>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
If configured to use the I/O APIC, the NMI watchdog is deemed to fail if
the chip has been deactivated as a result of "nosmp". Downgrade to the
local APIC watchdog similarly to what is done for the UP case.
Signed-off-by: Maciej W. Rozycki <macro@linux-mips.org>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
For the UP case the NMI watchdog downgrade is done consistently in
APIC_init_uniprocessor() now. Remove redundant code used only when
BIOS-disabled local APIC is activated.
Signed-off-by: Maciej W. Rozycki <macro@linux-mips.org>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
If configured to use the I/O APIC, the NMI watchdog is deemed to fail if
the chip will not be used in the UP configuration, because "noapic" has
been specified or the chip is simply not there. Downgrade to the local
APIC watchdog to rectify.
The new #ifdef is ugly, I know. A proper solution is to provide suitable
definitions of smp_found_config, etc. for !CONFIG_X86_IO_APIC in a header.
Likewise the whole if () condition should be moved to a static inline
function. Such clean-ups are beyond the scope of this change and can be
done once the whole issue of the timer has been sorted out.
Signed-off-by: Maciej W. Rozycki <macro@linux-mips.org>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
nmi_watchdog=1 hangs on 64-bit:
[ 0.250000] Detected 12.564 MHz APIC timer.
[ 0.254178] APIC timer registered as dummy, due to nmi_watchdog=1!
[ 0.260366] Testing NMI watchdog ... <4>WARNING: CPU#0: NMI appears to be stuck (0->0)!
[ ... ]
[ 0.470003] calling genl_init+0x0/0xd0
[ hard hang ]
bisected it down to:
git-bisect start
git-bisect good 1beee8dc8c
git-bisect bad 11582ece0aaa2d0f94f345c08a4ab9997078a083
git-bisect bad 5479c623bb44089844022c03d4c0eb16d5b7a15f
git-bisect bad cfb4c7fabeb499e1c29f9d1878968e37a938e28a
git-bisect good 246dd412d3
git-bisect bad 3f8237eaff7dc1e35fa791dae095574fd974e671
git-bisect good 90e23b13ab849e2a11f00c655eb3a2011b4623be
git-bisect bad 833526a34eeefc117df3191a594c3c3a4f15a9ac
git-bisect good 791b93d3dfaf16c23e978bec0cc0a3dd9d855d63
git-bisect bad 65767c64068f2c93e56a1accfed5c78230ac12d7
git-bisect bad 2abc5c05dd82c188e3bdf6641a274f013348d14b
git-bisect bad 317e1f2597ffb4d4db940577bbe56dc6e881ef07
| 317e1f2597ffb4d4db940577bbe56dc6e881ef07 is first bad commit
| commit 317e1f2597ffb4d4db940577bbe56dc6e881ef07
| Author: Maciej W. Rozycki <macro@linux-mips.org>
| Date: Wed May 21 22:10:22 2008 +0100
| x86: I/O APIC: clean up the 8259A on a NMI watchdog failure
the problem is that in the dummy-lapic branch we rely on the i8259A
but if the NMI watchdog fails we turn off IRQ 0 - which doesnt work
too well ;-)
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Not sure but maybe it is better to use NMI_DISABLED,
will take a look. But for now this patch is not change
anything in logic so it will not hurt/broke the kernel.
For most cases nmi_watchdog assignment is by one of NMI_*
macro so I think there it make sense too.
Signed-off-by: Cyrill Gorcunov <gorcunov@gmail.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
arch/x86/kernel/io_apic_64.c: In function 'check_timer':
arch/x86/kernel/io_apic_64.c:1688: error: 'vector' undeclared (first use in this function)
arch/x86/kernel/io_apic_64.c:1688: error: (Each undeclared identifier is reported only once
arch/x86/kernel/io_apic_64.c:1688: error: for each function it appears in.)
Some systems incorrectly report the ExtINTA pin of the I/O APIC as the
genuine target of the timer interrupt. Here is a change that copies timer
pin information found to the other pin if one has been found only. This
way both a direct and a through-8259A route is tested with the pin letting
these problematic systems work well enough. If no timer pin information
has been found for the I/O APIC, then local APIC variations are tried
only, similarly to what is done without the change (except without the
misleading messages).
Obviously if we try the first-chance path without being told by the BIOS
to do so, we should not complain either, so do not print the message in
this case.
The 64-bit variation should be updated with a call to
replace_pin_at_irq() which can be done with the upcoming merge. Since
add_pin_to_irq() is now always called in the first-chance path, the
condition to require it in the second-chance path no longer happens.
Signed-off-by: Maciej W. Rozycki <macro@linux-mips.org>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Keep the timer interrupt line masked when reconfiguring its interrupt
redirection entry in the I/O APIC.
Signed-off-by: Maciej W. Rozycki <macro@linux-mips.org>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Unmask the timer interrupt line set up in the through-8259A mode
explicitly after setup_timer_IRQ0_pin() has set up the I/O APIC interrupt
redirection entry to let the two operations be unbound from each other.
Signed-off-by: Maciej W. Rozycki <macro@linux-mips.org>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Rename setup_ExtINT_IRQ0_pin() to setup_timer_IRQ0_pin() to better
reflect the upcoming role of a function setting up a (semi-)arbitrary I/O
APIC pin appropriately for the 8254 timer. By "appropriate" the following
settings are meant: edge-triggered, active-high, all the other settings
per-architecture. Adjust comments to reflect code appropriately. No
functional changes.
Signed-off-by: Maciej W. Rozycki <macro@linux-mips.org>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
The LINT0 line of the local APIC is masked in the LVT0 entry in
check_timer() before this function is ever called. Removed the
redundant unmasking for better control.
Signed-off-by: Maciej W. Rozycki <macro@linux-mips.org>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
For a better control the masking and unmasking of the timer interrupt
line in the 8259A operating in the 'Virtual Wire' mode has been moved out
of setup_ExtINT_IRQ0_pin() now, so remove the redundant calls from the
function.
Signed-off-by: Maciej W. Rozycki <macro@linux-mips.org>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
When the through-8259A mode is used for the timer, the call to
set_irq_handler() will register a NULL handler name, resulting in
"IO-APIC-<NULL>" reported. Fix by calling ioapic_register_intr() as done
for all the other I/O APIC interrupts.
The 64-bit variation calls set_irq_chip_and_handler_name() here
needlessly and should get fixed with the upcoming merge.
Signed-off-by: Maciej W. Rozycki <macro@linux-mips.org>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
The local APIC interrupt handler gets registered with
set_irq_chip_and_handler_name(), which results in
"local-APIC-edge-fasteoi" reported as the name of the handler. Fix by
removing the type of the handler left over from before the generic
handlers were introduced.
The 64-bit variation should get fixed with the upcoming merge.
NB It should really use the "edge" handler and not the "fasteoi" one,
but that's a separate issue.
Signed-off-by: Maciej W. Rozycki <macro@linux-mips.org>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
There is no point in keeping the 8259A enabled if the I/O APIC NMI
watchdog has failed and the 8259A is not used to pass through regular
timer interrupts. This fixes problems with some systems where some logic
gets confused.
Signed-off-by: Maciej W. Rozycki <macro@linux-mips.org>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
If configured to use the I/O APIC, the NMI watchdog is deemed to fail if
the chip has been deactivated as a result of "nosmp". Downgrade to the
local APIC watchdog similarly to what is done for the UP case.
Signed-off-by: Maciej W. Rozycki <macro@linux-mips.org>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
The local APIC is no longer forced off when "nosmp" has been specified.
Correct the message printed.
Signed-off-by: Maciej W. Rozycki <macro@linux-mips.org>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Disable the 8259A acting in the "virtual wire" mode to keep the interrupt
line inactive while fiddling with local APIC interrupt vector registers
associated with its destination inputs. To be on the safe side,
especially concerning flipping the trigger mode.
Signed-off-by: Maciej W. Rozycki <macro@linux-mips.org>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Disable the 8259A when routing of the timer interrupt through the chip to
the local APIC of the primary processor has failed.
Signed-off-by: Maciej W. Rozycki <macro@linux-mips.org>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Remove the "disable_8254_timer" and "enable_8254_timer" kernel
parameters. Now that AEOI acknowledgements are no longer needed for
correct timer operation, the 8259A can be kept disabled unconditionally
unless interrupts, either timer or watchdog ones, are actually passed
through it.
Signed-off-by: Maciej W. Rozycki <macro@linux-mips.org>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
The code that used to be in do_slow_gettimeoffset() that relied on the
IRR bit of the master 8259A PIC for IRQ0 to check the state of the output
timer 0 of the PIT is no longer there. As a result, there is no need to
use the POLL command to acknowledge the timer interrupt in the "8259A
Virtual Wire", except for the NMI watchdog when the i82489DX APIC is used
(this is because this particular APIC treats NMIs as level-triggered and
keeping the input asserted would keep motherboard NMI sources held off for
too long). Remove the unneeded bits and adjust comments accordingly.
Signed-off-by: Maciej W. Rozycki <macro@linux-mips.org>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Use PAGE_OFFSET macro instead of using 0xffff810000000000UL directly.
Signed-off-by: Jiri Slaby <jirislaby@gmail.com>
Cc: Arjan van de Ven <arjan@linux.intel.com>
Cc: Andi Kleen <andi@firstfloor.org>
Cc: hpa@zytor.com
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
1) Remove __meminit from update_pages_count. It is used inside
split_pages()
2) Make the code depend on PROC_FS. Doing statistics for nothing is
useless and not adding useless code is nice to the Linux tiny folks.
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Add information about the mapping state of the direct mapping to
/proc/meminfo. I chose /proc/meminfo because that is where all the other
memory statistics are too and it is a generally useful metric even
outside debugging situations. A lot of split kernel pages means the
kernel will run slower.
This way we can see how many large pages are really used for it and how
many are split.
Useful for general insight into the kernel.
v2: Add hotplug locking to 64bit to plug a very obscure theoretical race.
32bit doesn't need it because it doesn't support hotadd for lowmem.
Fix some typos
v3: Rename dpages_cnt
Add CONFIG ifdef for count update as requested by tglx
Expand description
v4: Fix stupid bugs added in v3
Move update_page_count to pageattr.c
Signed-off-by: Andi Kleen <andi@firstfloor.org>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
v2: fix early_panic on this config:
http://redhat.com/~mingo/misc/config-Thu_Jun_19_14_22_37_CEST_2008.bad
reason : struct cpu_vendor_dev size is 16, need to make table to be 16
byte alignment
also print out the cpu supported...
Signed-off-by: Yinghai Lu <yhlu.kernel@gmail.com>
Cc: Dave Jones <davej@redhat.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
"Form follows function". Code is now where it belongs to.
Signed-off-by: Robert Richter <robert.richter@amd.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
The name fits better since this is code not only for K8.
Signed-off-by: Robert Richter <robert.richter@amd.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
On Tue, 17 Jun 2008, Rafael J. Wysocki wrote:
>
> BTW, with the C1E patches reverted I don't get the
> WARNING: at /home/rafael/src/linux-next/kernel/smp.c:215 smp_call_function_single+0x3d/0xa2
> in the log. Thomas?
The BROADCAST_FORCE notification uses smp_function_call and therefor
must be run with interrupts enabled.
While at it, add a comment for the BROADCAST_EXIT notifier as well.
Reported-and-bisected-by: Rafael J. Wysocki <rjw@sisk.pl>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
C1E on AMD machines is like C3 but without control from the OS. Up to
now we disabled the local apic timer for those machines as it stops
when the CPU goes into C1E. This excludes those machines from high
resolution timers / dynamic ticks, which hurts especially X2 based
laptops.
The current boot time C1E detection has another, more serious flaw
as well: some BIOSes do not enable C1E until the ACPI processor module
is loaded. This causes systems to stop working after that point.
To work nicely with C1E enabled machines we use a separate idle
function, which checks on idle entry whether C1E was enabled in the
Interrupt Pending Message MSR. This allows us to do timer broadcasting
for C1E and covers the late enablement of C1E as well.
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Since the trampoline code is now used for ACPI resume from suspend to RAM,
the trampoline page tables have to be fixed up during boot not only on SMP
systems, but also on UP systems that use the trampoline.
Reference: http://bugzilla.kernel.org/show_bug.cgi?id=10923
Reported-by: Dionisus Torimens <djtm@gmx.net>
Signed-off-by: Rafael J. Wysocki <rjw@sisk.pl>
Cc: Andi Kleen <andi@firstfloor.org>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: pm list <linux-pm@lists.linux-foundation.org>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Some Dell laptops enter resume with apparent garbage in the segment
descriptor registers (almost certainly the result of a botched
transition from protected to real mode.) The only way to clean that
up is to enter protected mode ourselves and clean out the descriptor
registers.
This fixes resume on Dell XPS M1210 and Dell D620.
Reference: http://bugzilla.kernel.org/show_bug.cgi?id=10927
Signed-off-by: H. Peter Anvin <hpa@zytor.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Pavel Machek <pavel@ucw.cz>
Cc: pm list <linux-pm@lists.linux-foundation.org>
Cc: Len Brown <lenb@kernel.org>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Tested-by: Kirill A. Shutemov <kirill@shutemov.name>
Signed-off-by: Rafael J. Wysocki <rjw@sisk.pl>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
This code removes a leftover from the iommu_enable function. The ctrl variable
is assigned but never used.
Signed-off-by: Joerg Roedel <joerg.roedel@amd.com>
Cc: iommu@lists.linux-foundation.org
Cc: bhavna.sarathy@amd.com
Cc: robert.richter@amd.com
Cc: Joerg Roedel <joerg.roedel@amd.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
This patch adds a check if the early detect code has found AMD IOMMU hardware
descriptions and does not try to initialize hardware if the check failed.
Signed-off-by: Joerg Roedel <joerg.roedel@amd.com>
Cc: iommu@lists.linux-foundation.org
Cc: bhavna.sarathy@amd.com
Cc: robert.richter@amd.com
Cc: Joerg Roedel <joerg.roedel@amd.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
This patch removes the amd_iommu=off kernel parameter and honors the generic
iommu=off parameter for the same purpose.
Signed-off-by: Joerg Roedel <joerg.roedel@amd.com>
Cc: iommu@lists.linux-foundation.org
Cc: bhavna.sarathy@amd.com
Cc: robert.richter@amd.com
Cc: Joerg Roedel <joerg.roedel@amd.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
This patch changes the domain TLB flushing behavior of the driver. When there
is more than one page to flush it flushes the whole domain TLB instead of every
single page. So we send only a single command to the IOMMU in every case which
is faster to execute.
Signed-off-by: Joerg Roedel <joerg.roedel@amd.com>
Cc: iommu@lists.linux-foundation.org
Cc: bhavna.sarathy@amd.com
Cc: robert.richter@amd.com
Cc: Joerg Roedel <joerg.roedel@amd.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
The set_bit_string call in the address allocator is not necessary because its
already called in iommu_area_alloc().
Signed-off-by: Joerg Roedel <joerg.roedel@amd.com>
Cc: iommu@lists.linux-foundation.org
Cc: bhavna.sarathy@amd.com
Cc: robert.richter@amd.com
Cc: Joerg Roedel <joerg.roedel@amd.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
This patch replaces the short description text for AMD IOMMU in Kconfig with a
more verbose one.
Signed-off-by: Joerg Roedel <joerg.roedel@amd.com>
Cc: iommu@lists.linux-foundation.org
Cc: bhavna.sarathy@amd.com
Cc: robert.richter@amd.com
Cc: Joerg Roedel <joerg.roedel@amd.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
When converting the page number in a pte/pmd/pud/pgd between
machine and pseudo-physical addresses, the converted result was
being truncated at 32-bits. This caused failures on machines
with more than 4G of physical memory.
Signed-off-by: Jeremy Fitzhardinge <jeremy.fitzhardinge@citrix.com>
Cc: "Christopher S. Aker" <caker@theshore.net>
Cc: Ian Campbell <Ian.Campbell@eu.citrix.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Remove the #ifdef conditional because this comparison is already done in
user_mode_vm().
Signed-off-by: Gustavo F. Padovan <gustavo@las.ic.unicamp.br>
Cc: akpm@osdl.org
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Quirks getting ignored was a bug. Below patch fixes the bug, until
we have the dynamic banks support.
Sysfs choice configuration should not have any issues with the earlier patch
as we look for NR_SYSFS_BANKS in do_machine_check().
Signed-off-by: Venkatesh Pallipadi <venkatesh.pallipadi@intel.com>
Cc: Andi Kleen <andi@firstfloor.org>
Cc: Max Asbock <masbock@us.ibm.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
This patch allows the disabling of decompression messages during
x86 bootup.
Signed-off-by: Ben Collins <ben.collins@canonical.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
First announce ourself, then start working. Currently this module reports
itself when all is completed which is not most modules do. Plus some
cosmetic/whitespace cleanups.
Signed-off-by: Ben Castricum <lk0806@bencastricum.nl>
Cc: trivial@kernel.org
Signed-off-by: Ingo Molnar <mingo@elte.hu>
fix this warning:
arch/x86/mm/init_64.c: In function 'early_memtest':
arch/x86/mm/init_64.c:524: warning: passing argument 2 of 'find_e820_area_size' from incompatible pointer type
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Fedora reports that mem_init()'s zap_low_mappings(), extended to SMP in
61165d7a03 x86: fix app crashes after SMP
resume causes 32-bit Intel Mac machines to reboot very early when
booting with EFI.
The EFI code appears to manage low mappings for itself when needed; but
like many before it, confuses PSE with PAE. So it has only been mapping
half the space it needed when PSE but not PAE. This remained unnoticed
until we moved the SMP zap_low_mappings() before
efi_enter_virtual_mode(). Presumably could have been noticed years ago
if anyone ran a UP kernel on such machines?
Reported-by: Peter Jones <pjones@redhat.com>
Signed-off-by: Hugh Dickins <hugh@veritas.com>
Cc: Peter Jones <pjones@redhat.com>
Cc: Glauber Costa <gcosta@redhat.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Tested-by: Peter Jones <pjones@redhat.com>
'man 3 printf' tells me that %p should be printed as if by %#x, but
this is not true for the kernel, which does not use the '0x' prefix
for the %p conversion specifier.
A small cast to (void *) is also prettier than #ifdef/#else/#endif.
Signed-off-by: Vegard Nossum <vegard.nossum@gmail.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
ptrace has always returned only -EIO for all failures to access
registers. The user_regset calls are allowed to return a more
meaningful variety of errors. The REGSET_XFP calls use -ENODEV
for !cpu_has_fxsr hardware. Make ptrace return the traditional
-EIO instead of the error code from the user_regset call.
Signed-off-by: Roland McGrath <roland@redhat.com>
Cc: stable@kernel.org
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Jeremy Fitzhardinge wrote:
>
> Maybe it really does require the far jump immediately after setting PE
> in cr0...
>
> Hm, I don't remember this paragraph being in vol 3a, section 8.9.1
> before. Is it a recent addition?
>
> Random failures can occur if other instructions exist between steps
> 3 and 4 above. Failures will be readily seen in some situations,
> such as when instructions that reference memory are inserted between
> steps 3 and 4 while in system management mode.
>
I don't remember that, either.
Signed-off-by: Ingo Molnar <mingo@elte.hu>
commit 4323838215
x86: change size of node ids from u8 to s16
set the range for NODES_SHIFT to 1..15.
The possible range is 1..9
Fixes Bugzilla #10726
Reported-by: Dave Jones <davej@codemonkey.org.uk>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Impact: build failure in maximal configurations
The 32-bit x86 relocatable kernel requires an auxilliary host program
to process the relocations. This program had a hard-coded arbitrary
limit of a 100 ELF sections. Instead of a hard-coded limit, allocate
the structures dynamically.
Signed-off-by: H. Peter Anvin <hpa@zytor.com>
Acked-by: Vivek Goyal <vgoyal@redhat.com>
This patch disables suspend/resume on machines with AMD IOMMU enabled. Real
suspend/resume support for AMD IOMMU is currently being worked on. Until this
is ready it will be disabled to avoid data corruption when the IOMMU is not
properly re-enabled at resume. The patch is based on a similar patch for the
GART driver written by Pavel Machek.
The overall driver merged into tip/master is tested with parallel disk and
network loads and showed no problems in a test running for 3 days.
Signed-off-by: Joerg Roedel <joerg.roedel@amd.com>
Cc: iommu@lists.linux-foundation.org
Cc: bhavna.sarathy@amd.com
Signed-off-by: Ingo Molnar <mingo@elte.hu>
* 'x86-fixes-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip:
ptrace GET/SET FPXREGS broken
x86: fix cpu hotplug crash
x86: section/warning fixes
x86: shift bits the right way in native_read_tscp
When I update kernel 2.6.25 from 2.6.24, gdb does not work.
On 2.6.25, ptrace(PTRACE_GETFPXREGS, ...) returns ENODEV.
But 2.6.24 kernel's ptrace() returns EIO.
It is issue of compatibility.
I attached test program as pt.c and patch for fix it.
#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>
#include <signal.h>
#include <errno.h>
#include <sys/ptrace.h>
#include <sys/types.h>
struct user_fxsr_struct {
unsigned short cwd;
unsigned short swd;
unsigned short twd;
unsigned short fop;
long fip;
long fcs;
long foo;
long fos;
long mxcsr;
long reserved;
long st_space[32]; /* 8*16 bytes for each FP-reg = 128 bytes */
long xmm_space[32]; /* 8*16 bytes for each XMM-reg = 128 bytes */
long padding[56];
};
int main(void)
{
pid_t pid;
pid = fork();
switch(pid){
case -1:/* error */
break;
case 0:/* child */
child();
break;
default:
parent(pid);
break;
}
return 0;
}
int child(void)
{
ptrace(PTRACE_TRACEME);
kill(getpid(), SIGSTOP);
sleep(10);
return 0;
}
int parent(pid_t pid)
{
int ret;
struct user_fxsr_struct fpxregs;
ret = ptrace(PTRACE_GETFPXREGS, pid, 0, &fpxregs);
if(ret < 0){
printf("%d: %s.\n", errno, strerror(errno));
}
kill(pid, SIGCONT);
wait(pid);
return 0;
}
/* in the kerel, at kernel/i387.c get_fpxregs() */
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Vegard Nossum reported crashes during cpu hotplug tests:
http://marc.info/?l=linux-kernel&m=121413950227884&w=4
In function _cpu_up, the panic happens when calling
__raw_notifier_call_chain at the second time. Kernel doesn't panic when
calling it at the first time. If just say because of nr_cpu_ids, that's
not right.
By checking the source code, I found that function do_boot_cpu is the culprit.
Consider below call chain:
_cpu_up=>__cpu_up=>smp_ops.cpu_up=>native_cpu_up=>do_boot_cpu.
So do_boot_cpu is called in the end. In do_boot_cpu, if
boot_error==true, cpu_clear(cpu, cpu_possible_map) is executed. So later
on, when _cpu_up calls __raw_notifier_call_chain at the second time to
report CPU_UP_CANCELED, because this cpu is already cleared from
cpu_possible_map, get_cpu_sysdev returns NULL.
Many resources are related to cpu_possible_map, so it's better not to
change it.
Below patch against 2.6.26-rc7 fixes it by removing the bit clearing in
cpu_possible_map.
Signed-off-by: Zhang Yanmin <yanmin_zhang@linux.intel.com>
Tested-by: Vegard Nossum <vegard.nossum@gmail.com>
Acked-by: Rusty Russell <rusty@rustcorp.com.au>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Apparently, DOS and possibly other legacy operating systems issued a
null command to the keyboard controller after toggling A20,
specifically "pulse output pins" with no output pins specified. This
was presumably done for synchronization reasons. This has made it
into at least the UHCI spec, and it has been found to cause
compatibility problems when "legacy USB" is enabled (which it almost
always is) to not have this byte sent.
It is *NOT* clear if any of these compatibility problems has any
effect on Linux. However, for maximum compatibility, issue this null
command after togging A20 through the KBC.
Signed-off-by: H. Peter Anvin <hpa@zytor.com>
From the code:
"B stepping K8s sometimes report an truncated RIP for IRET exceptions
returning to compat mode. Check for these here too."
The code then proceeds to truncate the upper 32 bits of %rbp. This means
that when do_page_fault() is finally called, its prologue,
do_page_fault:
push %rbp
movl %rsp, %rbp
will put the truncated base pointer on the stack. This means that the
stack tracer will not be able to follow the base-pointer changes and
will see all subsequent stack frames as unreliable.
This patch changes the code to use a different register (%rcx) for the
checking and leaves %rbp untouched.
Signed-off-by: Vegard Nossum <vegard.nossum@gmail.com>
Signed-off-by: Pekka Enberg <penberg@cs.helsinki.fi>
Acked-by: Arjan van de Ven <arjan@linux.intel.com>
Cc: Andi Kleen <andi@firstfloor.org>
Cc: Pekka Enberg <penberg@cs.helsinki.fi>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
fix typo causing:
arch/x86/kernel/built-in.o: In function `__unmap_single':
amd_iommu.c:(.text+0x17771): undefined reference to `iommu_area_free'
arch/x86/kernel/built-in.o: In function `__map_single':
amd_iommu.c:(.text+0x1797a): undefined reference to `iommu_area_alloc'
amd_iommu.c:(.text+0x179a2): undefined reference to `iommu_area_alloc'
Signed-off-by: Ingo Molnar <mingo@elte.hu>
fix:
arch/x86/kernel/amd_iommu.c: In function ‘amd_iommu_init_dma_ops':
arch/x86/kernel/amd_iommu.c:940: error: lvalue required as left operand of assignment
arch/x86/kernel/amd_iommu.c:941: error: lvalue required as left operand of assignment
due to !CONFIG_GART_IOMMU.
Signed-off-by: Ingo Molnar <mingo@elte.hu>
fix:
arch/x86/kernel/amd_iommu_init.c:247: warning: 'struct acpi_table_header' declared inside parameter list
arch/x86/kernel/amd_iommu_init.c:247: warning: its scope is only this definition or declaration, which is probably not what you want
arch/x86/kernel/amd_iommu_init.c: In function 'find_last_devid_acpi':
arch/x86/kernel/amd_iommu_init.c:257: error: dereferencing pointer to incomplete type
arch/x86/kernel/amd_iommu_init.c:265: error: dereferencing pointer to incomplete type
the AMD IOMMU code depends on ACPI facilities.
Signed-off-by: Ingo Molnar <mingo@elte.hu>
WARNING: arch/x86/mm/built-in.o(.text+0x3a1): Section mismatch in
reference from the function set_pte_phys() to the function
.init.text:spp_getpage()
The function set_pte_phys() references
the function __init spp_getpage().
This is often because set_pte_phys lacks a __init
annotation or the annotation of spp_getpage is wrong.
arch/x86/mm/init_64.c: In function 'early_memtest':
arch/x86/mm/init_64.c:520: warning: passing argument 2 of
'find_e820_area_size' from incompatible pointer type
Signed-off-by: Daniel J Blueman <daniel.blueman@gmail.com>
Cc: "Linus Torvalds" <torvalds@linux-foundation.org>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Some Xen hypercalls accept an array of operations to work on. In
general this is because its more efficient for the hypercall to the
work all at once rather than as separate hypercalls (even batched as a
multicall).
This patch adds a mechanism (xen_mc_extend_args()) to allocate more
argument space to the last-issued multicall, in order to extend its
argument list.
The user of this mechanism is xen/mmu.c, which uses it to extend the
args array of mmu_update. This is particularly valuable when doing
the update for a large mprotect, which goes via
ptep_modify_prot_commit(), but it also manages to batch updates to
pgd/pmds as well.
Signed-off-by: Jeremy Fitzhardinge <jeremy.fitzhardinge@citrix.com>
Acked-by: Linus Torvalds <torvalds@linux-foundation.org>
Acked-by: Hugh Dickins <hugh@veritas.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Xen has a pte update function which will update a pte while preserving
its accessed and dirty bits. This means that ptep_modify_prot_start() can be
implemented as a simple read of the pte value. The hardware may
update the pte in the meantime, but ptep_modify_prot_commit() updates it while
preserving any changes that may have happened in the meantime.
The updates in ptep_modify_prot_commit() are batched if we're currently in lazy
mmu mode.
The mmu_update hypercall can take a batch of updates to perform, but
this code doesn't make particular use of that feature, in favour of
using generic multicall batching to get them all into the hypervisor.
The net effect of this is that each mprotect pte update turns from two
expensive trap-and-emulate faults into they hypervisor into a single
hypercall whose cost is amortized in a batched multicall.
Signed-off-by: Jeremy Fitzhardinge <jeremy.fitzhardinge@citrix.com>
Acked-by: Linus Torvalds <torvalds@linux-foundation.org>
Acked-by: Hugh Dickins <hugh@veritas.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
This patch adds paravirt-ops hooks in pv_mmu_ops for ptep_modify_prot_start and
ptep_modify_prot_commit. This allows the hypervisor-specific backends to
implement these in some more efficient way.
Signed-off-by: Jeremy Fitzhardinge <jeremy.fitzhardinge@citrix.com>
Acked-by: Linus Torvalds <torvalds@linux-foundation.org>
Acked-by: Hugh Dickins <hugh@veritas.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
* 'kvm-updates-2.6.26' of git://git.kernel.org/pub/scm/linux/kernel/git/avi/kvm:
KVM: Remove now unused structs from kvm_para.h
x86: KVM guest: Use the paravirt clocksource structs and functions
KVM: Make kvm host use the paravirt clocksource structs
x86: Make xen use the paravirt clocksource structs and functions
x86: Add structs and functions for paravirt clocksource
KVM: VMX: Fix host msr corruption with preemption enabled
KVM: ioapic: fix lost interrupt when changing a device's irq
KVM: MMU: Fix oops on guest userspace access to guest pagetable
KVM: MMU: large page update_pte issue with non-PAE 32-bit guests (resend)
KVM: MMU: Fix rmap_write_protect() hugepage iteration bug
KVM: close timer injection race window in __vcpu_run
KVM: Fix race between timer migration and vcpu migration
This patch updates the kvm host code to use the pvclock structs
and functions, thereby making it compatible with Xen.
The patch also fixes an initialization bug: on SMP systems the
per-cpu has two different locations early at boot and after CPU
bringup. kvmclock must take that in account when registering the
physical address within the host.
Signed-off-by: Gerd Hoffmann <kraxel@redhat.com>
Signed-off-by: Avi Kivity <avi@qumranet.com>
This patch updates the kvm host code to use the pvclock structs.
It also makes the paravirt clock compatible with Xen.
Signed-off-by: Gerd Hoffmann <kraxel@redhat.com>
Signed-off-by: Avi Kivity <avi@qumranet.com>
This patch updates the xen guest to use the pvclock structs
and helper functions.
Signed-off-by: Gerd Hoffmann <kraxel@redhat.com>
Acked-by: Jeremy Fitzhardinge <jeremy.fitzhardinge@citrix.com>
Signed-off-by: Avi Kivity <avi@qumranet.com>
This patch adds structs for the paravirt clocksource ABI
used by both xen and kvm (pvclock-abi.h).
It also adds some helper functions to read system time and
wall clock time from a paravirtual clocksource (pvclock.[ch]).
They are based on the xen code. They are enabled using
CONFIG_PARAVIRT_CLOCK.
Subsequent patches of this series will put the code in use.
Signed-off-by: Gerd Hoffmann <kraxel@redhat.com>
Acked-by: Jeremy Fitzhardinge <jeremy.fitzhardinge@citrix.com>
Signed-off-by: Avi Kivity <avi@qumranet.com>
Non-PAE operation has been deprecated in Xen for a while, and is
rarely tested or used. xen-unstable has now officially dropped
non-PAE support. Since Xen/pvops' non-PAE support has also been
broken for a while, we may as well completely drop it altogether.
Signed-off-by: Jeremy Fitzhardinge <jeremy.fitzhardinge@citrix.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
As suggested by Ingo, remove all references to tsc from init/calibrate.c
TSC is x86 specific, and using tsc in variable names in a generic file should
be avoided. lpj_tsc is now called lpj_fine, since it is related to fine tuning
of lpj value. Also tsc_rate_* is called timer_rate_*
Signed-off-by: Alok N Kataria <akataria@vmware.com>
Cc: Arjan van de Ven <arjan@infradead.org>
Cc: Daniel Hecht <dhecht@vmware.com>
Cc: Tim Mann <mann@vmware.com>
Cc: Zach Amsden <zach@vmware.com>
Cc: Sahil Rihan <srihan@vmware.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
fix:
BUG: using smp_processor_id() in preemptible [00000000] code: oprofiled/27301
caller is nmi_shutdown+0x11/0x60
Pid: 27301, comm: oprofiled Not tainted 2.6.26-rc7 #25
[<c028a90d>] debug_smp_processor_id+0xbd/0xc0
[<c045fba1>] nmi_shutdown+0x11/0x60
[<c045dd4a>] oprofile_shutdown+0x2a/0x60
Note that we don't need this for the other functions, since they are all
called with on_each_cpu() (which disables preemption for us anyway).
Signed-off-by: Vegard Nossum <vegard.nossum@gmail.com>
Cc: Philippe Elie <phil.el@wanadoo.fr>
Cc: oprofile-list@lists.sf.net
Cc: Johannes Weiner <hannes@saeurebad.de>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
... to strip down loop body in reserve_memtype.
Signed-off-by: Andreas Herrmann <andreas.herrmann3@amd.com>
Cc: Venkatesh Pallipadi <venkatesh.pallipadi@intel.com>
Cc: Suresh B Siddha <suresh.b.siddha@intel.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
... and move last debug message out of locked section.
Signed-off-by: Andreas Herrmann <andreas.herrmann3@amd.com>
Cc: Venkatesh Pallipadi <venkatesh.pallipadi@intel.com>
Cc: Suresh B Siddha <suresh.b.siddha@intel.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
- add BUG statement to catch invalid start and end parameters
- No need to track the actual type in both req_type and actual_type --
keep req_type unchanged.
- removed (IMHO) superfluous comments
Signed-off-by: Andreas Herrmann <andreas.herrmann3@amd.com>
Cc: Venkatesh Pallipadi <venkatesh.pallipadi@intel.com>
Cc: Suresh B Siddha <suresh.b.siddha@intel.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
- parse/ml => entry (within list_for_each and friends)
- new_entry => new
- ret_type => new_type (to avoid confusion with req_type)
(... to make it more readable...)
Signed-off-by: Andreas Herrmann <andreas.herrmann3@amd.com>
Cc: Venkatesh Pallipadi <venkatesh.pallipadi@intel.com>
Cc: Suresh B Siddha <suresh.b.siddha@intel.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Switching msrs can occur either synchronously as a result of calls to
the msr management functions (usually in response to the guest touching
virtualized msrs), or asynchronously when preempting a kvm thread that has
guest state loaded. If we're unlucky enough to have the two at the same
time, host msrs are corrupted and the machine goes kaput on the next syscall.
Most easily triggered by Windows Server 2008, as it does a lot of msr
switching during bootup.
Signed-off-by: Avi Kivity <avi@qumranet.com>
KVM has a heuristic to unshadow guest pagetables when userspace accesses
them, on the assumption that most guests do not allow userspace to access
pagetables directly. Unfortunately, in addition to unshadowing the pagetables,
it also oopses.
This never triggers on ordinary guests since sane OSes will clear the
pagetables before assigning them to userspace, which will trigger the flood
heuristic, unshadowing the pagetables before the first userspace access. One
particular guest, though (Xenner) will run the kernel in userspace, triggering
the oops. Since the heuristic is incorrect in this case, we can simply
remove it.
Signed-off-by: Avi Kivity <avi@qumranet.com>
kvm_mmu_pte_write() does not handle 32-bit non-PAE large page backed
guests properly. It will instantiate two 2MB sptes pointing to the same
physical 2MB page when a guest large pte update is trapped.
Instead of duplicating code to handle this, disallow directory level
updates to happen through kvm_mmu_pte_write(), so the two 2MB sptes
emulating one guest 4MB pte can be correctly created by the page fault
handling path.
Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>
Signed-off-by: Avi Kivity <avi@qumranet.com>
rmap_next() does not work correctly after rmap_remove(), as it expects
the rmap chains not to change during iteration. Fix (for now) by restarting
iteration from the beginning.
Signed-off-by: Avi Kivity <avi@qumranet.com>
If a timer fires after kvm_inject_pending_timer_irqs() but before
local_irq_disable() the code will enter guest mode and only inject such
timer interrupt the next time an unrelated event causes an exit.
It would be simpler if the timer->pending irq conversion could be done
with IRQ's disabled, so that the above problem cannot happen.
For now introduce a new vcpu requests bit to cancel guest entry.
Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>
Signed-off-by: Avi Kivity <avi@qumranet.com>
A guest vcpu instance can be scheduled to a different physical CPU
between the test for KVM_REQ_MIGRATE_TIMER and local_irq_disable().
If that happens, the timer will only be migrated to the current pCPU on
the next exit, meaning that guest LAPIC timer event can be delayed until
a host interrupt is triggered.
Fix it by cancelling guest entry if any vcpu request is pending. This
has the side effect of nicely consolidating vcpu->requests checks.
Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>
Signed-off-by: Avi Kivity <avi@qumranet.com>
fix:
arch/x86/kernel/tsc_32.c: In function ‘tsc_init':
arch/x86/kernel/tsc_32.c:421: error: ‘lpj_tsc' undeclared (first use in this function)
arch/x86/kernel/tsc_32.c:421: error: (Each undeclared identifier is reported only once
arch/x86/kernel/tsc_32.c:421: error: for each function it appears in.)
Signed-off-by: Ingo Molnar <mingo@elte.hu>
On the x86 platform we can use the value of tsc_khz computed during tsc
calibration to calculate the loops_per_jiffy value. Its very important
to keep the error in lpj values to minimum as any error in that may
result in kernel panic in check_timer. In virtualization environment, On
a highly overloaded host the guest delay calibration may sometimes
result in errors beyond the ~50% that timer_irq_works can handle,
resulting in the guest panicking.
Does some formating changes to lpj_setup code to now have a single
printk to print the bogomips value.
We do this only for the boot processor because the AP's can have
different base frequencies or the BIOS might boot a AP at a different
frequency.
Signed-off-by: Alok N Kataria <akataria@vmware.com>
Cc: Arjan van de Ven <arjan@infradead.org>
Cc: Daniel Hecht <dhecht@vmware.com>
Cc: Tim Mann <mann@vmware.com>
Cc: Zach Amsden <zach@vmware.com>
Cc: Sahil Rihan <srihan@vmware.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Kevin Winchester reported a GART related direct rendering failure against
linux-next-20080611, which shows up via these log entries:
PCI: Using ACPI for IRQ routing
PCI: Cannot allocate resource region 0 of device 0000:00:00.0
agpgart: Detected AGP bridge 0
agpgart: Aperture conflicts with PCI mapping.
agpgart: Aperture from AGP @ e0000000 size 128 MB
agpgart: Aperture conflicts with PCI mapping.
agpgart: No usable aperture found.
agpgart: Consider rebooting with iommu=memaper=2 to get a good aperture.
instead of the expected:
PCI: Using ACPI for IRQ routing
agpgart: Detected AGP bridge 0
agpgart: Aperture from AGP @ e0000000 size 128 MB
Kevin bisected it down to this change in tip/x86/gart:
"x86: checking aperture size order".
agp check is using request_mem_region(), and could fail if e820 is reserved...
change it back to e820_any_mapped().
Reported-and-bisected-by: "Kevin Winchester" <kjwinchester@gmail.com>
Signed-off-by: Yinghai Lu <yhlu.kernel@gmail.com>
Tested-by: Kevin Winchester <kjwinchester@gmail.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
fix build failure:
arch/x86/mm/pgtable.c:280: warning: ‘enum fixed_addresses’ declared inside parameter list
arch/x86/mm/pgtable.c:280: warning: its scope is only this definition or declaration, which is probably not what you want
arch/x86/mm/pgtable.c:280: error: parameter 1 (‘idx’) has incomplete type
Signed-off-by: Ingo Molnar <mingo@elte.hu>
In both cases, I went with the 32-bit behaviour.
Signed-off-by: Jeremy Fitzhardinge <jeremy.fitzhardinge@citrix.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Because NX is now enforced properly, we must put the hypercall page
into the .text segment so that it is executable.
Signed-off-by: Jeremy Fitzhardinge <jeremy.fitzhardinge@citrix.com>
Cc: Stable Kernel <stable@kernel.org>
Cc: the arch/x86 maintainers <x86@kernel.org>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
[ Stable: this isn't a bugfix in itself, but it's a pre-requiste
for "xen: don't drop NX bit" ]
Signed-off-by: Jeremy Fitzhardinge <jeremy.fitzhardinge@citrix.com>
Cc: Stable Kernel <stable@kernel.org>
Cc: the arch/x86 maintainers <x86@kernel.org>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Because NX is now enforced properly, we must put the hypercall page
into the .text segment so that it is executable.
Signed-off-by: Jeremy Fitzhardinge <jeremy.fitzhardinge@citrix.com>
Cc: Stable Kernel <stable@kernel.org>
Cc: the arch/x86 maintainers <x86@kernel.org>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
[ Stable: this isn't a bugfix in itself, but it's a pre-requiste
for "xen: don't drop NX bit" ]
Signed-off-by: Jeremy Fitzhardinge <jeremy.fitzhardinge@citrix.com>
Cc: Stable Kernel <stable@kernel.org>
Cc: the arch/x86 maintainers <x86@kernel.org>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
General Software writes their own VSA2 module for their version
of the Geode BIOS, which returns a different ID then the standard
VSA2. This was causing the framebuffer driver to break for most
GSW boards.
Signed-off-by: Jordan Crouse <jordan.crouse@amd.com>
Cc: tglx@linutronix.de
Cc: linux-geode@lists.infradead.org
Signed-off-by: Ingo Molnar <mingo@elte.hu>
This patch uses the BOOTMEM_EXCLUSIVE for crashkernel reservation also for
i386 and prints a error message on failure.
The patch is still for 2.6.26 since it is only bug fixing. The unification
of reserve_crashkernel() between i386 and x86_64 should be done for 2.6.27.
Signed-off-by: Bernhard Walle <bwalle@suse.de>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Cc: <stable@kernel.org>
Booting 2.6.26-rc6 on my 486 DX/4 fails with a "BUG: Int 6"
(invalid opcode) and a kernel halt immediately after the
kernel has been uncompressed. The BUG shows EIP pointing
to an rdtsc instruction in native_read_tsc(), invoked from
native_sched_clock().
(This error occurs so early that not even the serial console
can capture it.)
A bisection showed that this bug first occurs in 2.6.26-rc3-git7,
via commit 9ccc906c97e34fd91dc6aaf5b69b52d824386910:
>x86: distangle user disabled TSC from unstable
>
>tsc_enabled is set to 0 from the command line switch "notsc" and from
>the mark_tsc_unstable code. Seperate those functionalities and replace
>tsc_enable with tsc_disable. This makes also the native_sched_clock()
>decision when to use TSC understandable.
>
>Preparatory patch to solve the sched_clock() issue on 32 bit.
>
>Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
The core reason for this bug is that native_sched_clock() gets
called before tsc_init().
Before the commit above, tsc_32.c used a "tsc_enabled" variable
which defaulted to 0 == disabled, and which only got enabled late
in tsc_init(). Thus early calls to native_sched_clock() would skip
the TSC and use jiffies instead.
After the commit above, tsc_32.c uses a "tsc_disabled" variable
which defaults to 0, meaning that the TSC is Ok to use. Early calls
to native_sched_clock() now erroneously try to use the TSC on
!cpu_has_tsc processors, leading to invalid opcode exceptions.
My proposed fix is to initialise tsc_disabled to a "soft disabled"
state distinct from the hard disabled state set up by the "notsc"
kernel option. This fixes the native_sched_clock() problem. It also
allows tsc_init() to be simplified: instead of setting tsc_disabled = 1
on every error return, we just set tsc_disabled = 0 once when all
checks have succeeded.
I've verified that this lets my 486 boot again. I've also verified
that a Core2 machine still uses the TSC as clocksource after the patch.
Signed-off-by: Mikael Pettersson <mikpe@it.uu.se>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Patrick McHardy reported a crash:
> > I get this oops once a day, its apparently triggered by something
> > run by cron, but the process is a different one each time.
> >
> > Kernel is -git from yesterday shortly before the -rc6 release
> > (last commit is the usb-2.6 merge, the x86 patches are missing),
> > .config is attached.
> >
> > I'll retry with current -git, but the patches that have gone in
> > since I last updated don't look related.
> >
> > [62060.043009] BUG: unable to handle kernel NULL pointer dereference at
> > 000001ff
> > [62060.043009] IP: [<c0102a9b>] __switch_to+0x2f/0x118
> > [62060.043009] *pde = 00000000
> > [62060.043009] Oops: 0002 [#1] PREEMPT
Vegard Nossum analyzed it:
> This decodes to
>
> 0: 0f ae 00 fxsave (%eax)
>
> so it's related to the floating-point context. This is the exact
> location of the crash:
>
> $ addr2line -e arch/x86/kernel/process_32.o -i ab0
> include/asm/i387.h:232
> include/asm/i387.h:262
> arch/x86/kernel/process_32.c:595
>
> ...so it looks like prev_task->thread.xstate->fxsave has become NULL.
> Or maybe it never had any other value.
Somehow (as described below) TS_USEDFPU is set but the fpu is not
allocated or freed.
Another possible FPU pre-emption issue with the sleazy FPU optimization
which was benign before but not so anymore, with the dynamic FPU allocation
patch.
New task is getting exec'd and it is prempted at the below point.
flush_thread() {
...
/*
* Forget coprocessor state..
*/
clear_fpu(tsk);
<----- Preemption point
clear_used_math();
...
}
Now when it context switches in again, as the used_math() is still set
and fpu_counter can be > 5, we will do a math_state_restore() which sets
the task's TS_USEDFPU. After it continues from the above preemption point
it does clear_used_math() and much later free_thread_xstate().
Now, at the next context switch, it is quite possible that xstate is
null, used_math() is not set and TS_USEDFPU is still set. This will
trigger unlazy_fpu() causing kernel oops.
Fix this by clearing tsk's fpu_counter before clearing task's fpu.
Reported-by: Patrick McHardy <kaber@trash.net>
Signed-off-by: Suresh Siddha <suresh.b.siddha@intel.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Clean up over-complications in pat_x_mtrr_type().
And if reserve_memtype() ignores stray req_type bits when
pat_enabled, it's better to mask them off when not also.
Signed-off-by: Hugh Dickins <hugh@veritas.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
attached is a no-brainer that makes kernel correctly report
NR_BANKS for MCE. We are right now limited to NR_BANKS==6, but the
error message will use the available number of banks instead of the
defined maximum.
For a Nehalem based system it will print:
"MCE: warning: using only 9 banks"
while the correct message would be
"MCE: warning: using only 6 banks"
Signed-off-by: Pavel Machek <pavel@suse.cz>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Most users by far do not care about the exact return value (they only
really care about whether the copy succeeded in its entirety or not),
but a few special core routines actually care deeply about exactly how
many bytes were copied from user space.
And the unrolled versions of the x86-64 user copy routines would
sometimes report that it had copied more bytes than it actually had.
Very few uses actually have partial copies to begin with, but to make
this bug even harder to trigger, most x86 CPU's use the "rep string"
instructions for normal user copies, and that version didn't have this
issue.
To make it even harder to hit, the one user of this that really cared
about the return value (and used the uncached version of the copy that
doesn't use the "rep string" instructions) was the generic write
routine, which pre-populated its source, once more hiding the problem by
avoiding the exception case that triggers the bug.
In other words, very special thanks to Bron Gondwana who not only
triggered this, but created a test-program to show it, and bisected the
behavior down to commit 08291429cf ("mm:
fix pagecache write deadlocks") which changed the access pattern just
enough that you can now trigger it with 'writev()' with multiple
iovec's.
That commit itself was not the cause of the bug, it just allowed all the
stars to align just right that you could trigger the problem.
[ Side note: this is just the minimal fix to make the copy routines
(with __copy_from_user_inatomic_nocache as the particular version that
was involved in showing this) have the right return values.
We really should improve on the exceptional case further - to make the
copy do a byte-accurate copy up to the exact page limit that causes it
to fail. As it is, the callers have to do extra work to handle the
limit case gracefully. ]
Reported-by: Bron Gondwana <brong@fastmail.fm>
Cc: Nick Piggin <npiggin@suse.de>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Andi Kleen <andi@firstfloor.org>
Cc: Al Viro <viro@ZenIV.linux.org.uk>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
(which didn't have this problem), and since
most users that do the carethis was very hard to trigger, but
when trying to understand how Bogomips are implemented I have found a
bug in arch/i386/lib/delay.c file, delay_loop function.
The function fails for loops > 2^31+1. It because SF is set when dec
returns numbers > 2^31.
The fix is to use jnz instruction instead of jns (and add one decl
instruction to the end to have exactly the same number of loops as in
original version).
Martin Mares observed:
> It is a long time since I have hacked that file, but you should definitely
> make sure that the function is never called with a zero argument. In such
> case, the original version made just a single pass, but your version
> makes 2^32 of them.
fixed that.
Signed-off-by: Ingo Molnar <mingo@elte.hu>
* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/jbarnes/pci-2.6:
PCI: fixup write combine comment in pci_mmap_resource
x86: PAT export resource_wc in pci sysfs
x86, pci-dma.c: don't always add __GFP_NORETRY to gfp
suspend-vs-iommu: prevent suspend if we could not resume
x86: pci-dma.c: use __GFP_NO_OOM instead of __GFP_NORETRY
pci, x86: add workaround for bug in ASUS A7V600 BIOS (rev 1005)
PCI: use dev_to_node in pci_call_probe
PCI: Correct last two HP entries in the bfsort whitelist
Recently (around 2.6.25) I've noticed that RTC no longer works for me. It
turned out this is because I use pnpacpi=off kernel option to work around
the parport_pc bugs. I always did so, but RTC used to work fine in the
past, and now it have regressed.
The patch fixes the problem by creating the platform device for the RTC
when PNP is disabled. This may also help running the PNP-enabled kernel
on an older PCs.
Signed-off-by: Stas Sergeev <stsp@aknet.ru>
Cc: David Brownell <david-b@pacbell.net>
Cc: Ingo Molnar <mingo@elte.hu>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Bjorn Helgaas <bjorn.helgaas@hp.com>
Cc: Adam Belay <ambx1@neo.rr.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Changed the call to find_e820_area_size to pass u64 instead of unsigned long.
Signed-off-by: Kevin Winchester <kjwinchester@gmail.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Alessandro Suardi reported:
> Recently upgraded my FC6 desktop to Fedora 9; with the
> latest nautilus RPM updates my VNC session went nuts
> with nautilus pegging the CPU for everything that breathed.
>
> I now reverted to an earlier nautilus package, but during
> the peak CPU period my kernel spat this:
>
> [314185.623294] ------------[ cut here ]------------
> [314185.623414] WARNING: at kernel/lockdep.c:2658 check_flags+0x4c/0x128()
> [314185.623514] Modules linked in: iptable_filter ip_tables x_tables
> sunrpc ipv6 fuse snd_via82xx snd_ac97_codec ac97_bus snd_mpu401_uart
> snd_rawmidi via686a hwmon parport_pc sg parport uhci_hcd ehci_hcd
> [314185.623924] Pid: 12314, comm: nautilus Not tainted 2.6.26-rc5-git2 #4
> [314185.624021] [<c0115b95>] warn_on_slowpath+0x41/0x7b
> [314185.624021] [<c010de70>] ? do_page_fault+0x2c1/0x5fd
> [314185.624021] [<c0128396>] ? up_read+0x16/0x28
> [314185.624021] [<c010de70>] ? do_page_fault+0x2c1/0x5fd
> [314185.624021] [<c012fa33>] ? __lock_acquire+0xbb4/0xbc3
> [314185.624021] [<c012d0a0>] check_flags+0x4c/0x128
> [314185.624021] [<c012fa73>] lock_acquire+0x31/0x7d
> [314185.624021] [<c0128cf6>] __atomic_notifier_call_chain+0x30/0x80
> [314185.624021] [<c0128cc6>] ? __atomic_notifier_call_chain+0x0/0x80
> [314185.624021] [<c0128d52>] atomic_notifier_call_chain+0xc/0xe
> [314185.624021] [<c0128d81>] notify_die+0x2d/0x2f
> [314185.624021] [<c01043b0>] do_int3+0x1f/0x4d
> [314185.624021] [<c02f2d3b>] int3+0x27/0x2c
> [314185.624021] =======================
> [314185.624021] ---[ end trace 1923f65a2d7bb246 ]---
> [314185.624021] possible reason: unannotated irqs-off.
> [314185.624021] irq event stamp: 488879
> [314185.624021] hardirqs last enabled at (488879): [<c0102d67>]
> restore_nocheck+0x12/0x15
> [314185.624021] hardirqs last disabled at (488878): [<c0102dca>]
> work_resched+0x19/0x30
> [314185.624021] softirqs last enabled at (488876): [<c011a1ba>]
> __do_softirq+0xa6/0xac
> [314185.624021] softirqs last disabled at (488865): [<c010476e>]
> do_softirq+0x57/0xa6
>
> I didn't seem to find it with some googling, so here it is.
>
> I was incidentally ltracing that process to try and find out
> what was gulping down that much CPU (sorry, no idea
> whether ltrace and the WARNING happened at the same
> time or which came first) and:
Yeah, this is extremely likely to be the source of the warning.
The warning should be harmless, however.
> Box is my trusty noname K7-800, 512MB RAM; if there's
> anything else useful I might be able to provide, just ask.
It would be interesting to see where the int3 comes from. Too bad,
lockdep doesn't provide the register dump. The stacktrace also doesn't
go further than the int3(), I wonder if this int3 came from userspace?
The ltrace readme says "software breakpoints, like gdb", so I guess
this is the case. Yep, seems like it.
This looks relevant:
| commit fb1dac909d
| Author: Peter Zijlstra <a.p.zijlstra@chello.nl>
| Date: Wed Jan 16 09:51:59 2008 +0100
|
| lockdep: more hardirq annotations for notify_die()
I'm attaching a similarly-looking patch for this case (DO_VM86_ERROR),
though I suspect it might be missing for the other cases
(DO_ERROR/DO_ERROR_INFO) as well.
Reported-by: Alessandro Suardi <alessandro.suardi@gmail.com>
Signed-off-by: Vegard Nossum <vegard.nossum@gmail.com>
Acked-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Fix an incompatible pointer type warning on x86_64 compilations.
early_memtest() is passing a u64* to find_e820_area_size() which is expecting
an unsigned long. Change t_start and t_size to unsigned long as those are
also 64-bit types on x88_64.
Signed-off-by: David Howells <dhowells@redhat.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>