If a bio is discarded or extends past the end of the device,
no attempt is made to queue it.
Originally the trace was called just before make_request:
[PATCH] Block queue IO tracing support (blktrace) as of 2006-03-23
2056a782f8e7e65fd4bfd027506b4ce1c5e9ccd4
And then 2 patches added some checks between them:
[PATCH] md: check bio address after mapping through partitions
5ddfe9691c91a244e8d1be597b6428fcefd58103,
[BLOCK] Don't allow empty barriers to be passed down to
queues that don't grok them
51fd77bd9f512ab6cc9df0733ba1caaab89eb957
This breaks the original goal.
Let's trace the queueing attempt only when it actually happens.
Signed-off-by: Minchan Kim <minchan.kim@gmail.com>
Acked-by: Wu Fengguang <fengguang.wu@intel.com>
Cc: Li Zefan <lizf@cn.fujitsu.com>
Signed-off-by: Jens Axboe <jens.axboe@oracle.com>
Test results here look good, and on big OLTP runs it has also been
shown to significantly increase cycles attributed to the database and
give a performance boost.
Signed-off-by: Jens Axboe <jens.axboe@oracle.com>
By using cfq_log_cfqq() instead of cfq_log(), the blktrace tools can
show the process ID when cfq dispatches a request.
Signed-off-by: Shan Wei <shanwei@cn.fujitsu.com>
Signed-off-by: Jens Axboe <jens.axboe@oracle.com>
commit 22bece00dc
(cciss: fix regression firmware not displayed in procfs)
added a small memory leak in cciss_init_one()
Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
Signed-off-by: Jens Axboe <jens.axboe@oracle.com>
Splice should update the modification and access times on regular
files just like read and write. Not updating mtime will confuse
backup tools, etc...
This patch only adds the time updates for regular files. For pipes
and other special files that splice touches, the need for updating the
times is less clear. Let's discuss and fix that separately.
Signed-off-by: Miklos Szeredi <mszeredi@suse.cz>
Signed-off-by: Jens Axboe <jens.axboe@oracle.com>
It's not currently used, as pointed out by
Gui Jianfeng <guijianfeng@cn.fujitsu.com>. We already check the
wait_request flag to give an idling queue priority access to request
allocation, so we don't need this extra flag.
Signed-off-by: Jens Axboe <jens.axboe@oracle.com>
This borrows some code from NAPI and implements a polled completion
mode for block devices. The idea is the same as NAPI - instead of
doing the command completion when the irq occurs, schedule a dedicated
softirq in the hopes that we will complete more IO when the iopoll
handler is invoked. Devices have a budget of commands assigned, and will
stay in polled mode as long as they continue to consume their budget
from the iopoll softirq handler. If they do not, the device is set back
to interrupt completion mode.
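
The budget rule described above can be modelled in user space. What
follows is only a sketch: struct poll_dev, model_irq() and model_poll()
are hypothetical names made up for illustration and are not the
blk-iopoll interface added by this patch.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical userspace model; not the kernel's blk-iopoll interface. */
struct poll_dev {
	int pending;	/* completions waiting to be reaped */
	bool polled;	/* true: polled mode, false: interrupt mode */
};

/* Interrupt arrives: switch to polled mode instead of completing here. */
void model_irq(struct poll_dev *dev)
{
	dev->polled = true;
}

/*
 * One pass of the poll handler: complete at most 'budget' commands.
 * If fewer than 'budget' were completed, the device did not consume
 * its budget and falls back to interrupt mode.
 * Returns the number of commands completed.
 */
int model_poll(struct poll_dev *dev, int budget)
{
	int done;

	assert(budget > 0);
	done = dev->pending < budget ? dev->pending : budget;
	dev->pending -= done;
	if (done < budget)
		dev->polled = false;
	return done;
}
```

With 10 pending completions and a budget of 4, the model stays in
polled mode for two passes and returns to interrupt mode on the third,
which only finds 2 commands to complete.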
This patch holds the core bits for blk-iopoll, device driver support
sold separately.
Signed-off-by: Jens Axboe <jens.axboe@oracle.com>
Instead of just checking whether this device uses block layer
tagging, we can improve the detection by looking at the maximum
queue depth it has reached. If that crosses 4, then deem it a
queuing device.
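
As a rough illustration of the detection rule, here is a userspace
sketch. The struct, the function names and the link to the plugging
decision are assumptions made for the example; only the threshold of 4
comes from the description above.

```c
#include <assert.h>
#include <stdbool.h>

#define QUEUING_DEPTH_THRESHOLD 4	/* from the text: depth must cross 4 */

/* Hypothetical model of a request queue; field names are illustrative. */
struct model_queue {
	int in_flight;	/* requests currently issued to the device */
	bool queuing;	/* sticky: set once the depth has crossed 4 */
};

/* Account for one more request handed to the device. */
void model_issue(struct model_queue *q)
{
	q->in_flight++;
	if (!q->queuing && q->in_flight > QUEUING_DEPTH_THRESHOLD)
		q->queuing = true;
}

/* Account for one request completed by the device. */
void model_complete(struct model_queue *q)
{
	assert(q->in_flight > 0);
	q->in_flight--;
}

/* Assumption: plugging is skipped once the device has shown queuing. */
bool model_should_plug(const struct model_queue *q)
{
	return !q->queuing;
}
```

The flag is deliberately sticky in this sketch: a device that reached a
depth of 5 once stays classified as a queuing device even after it
drains.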
This is important on high IOPS devices, since plugging hurts
the performance there (it can be as much as 10-15% of the sys
time).
Signed-off-by: Jens Axboe <jens.axboe@oracle.com>
Get rid of any functions that test for these bits and make callers
use bio_rw_flagged() directly. Then it is at least directly apparent
what variable and flag they check.
Signed-off-by: Jens Axboe <jens.axboe@oracle.com>
Whenever a block device changes its read-only attribute,
notify userspace about it.
Signed-off-by: Hannes Reinecke <hare@suse.de>
Signed-off-by: Nikanth Karthikesan <knikanth@suse.de>
Signed-off-by: Jens Axboe <jens.axboe@oracle.com>
o Get rid of busy_rt_queues infrastructure. Looks like it is redundant.
o Once an RT queue gets a request, it will immediately preempt any BE or
IDLE queue. Otherwise the queue is put on the service tree and the
scheduler will select it before any BE or IDLE queue anyway. Hence
there appears to be no need to keep track of how many busy RT queues
are currently on the service tree.
Signed-off-by: Vivek Goyal <vgoyal@redhat.com>
Signed-off-by: Jens Axboe <jens.axboe@oracle.com>
To lessen the impact of async IO on sync IO, let the device drain of
any async IO in progress when switching to a sync cfqq that has idling
enabled.
Signed-off-by: Jens Axboe <jens.axboe@oracle.com>
Update scsi_io_completion() such that it only fails requests up to the
next error boundary and retries the remainder. This enables the block
layer to merge requests with different failfast settings and still
behave correctly on errors. Allow merging of requests with different
failfast settings.
As SCSI is currently the only subsystem which follows failfast status,
there's no need to worry about other block drivers for now.
Signed-off-by: Tejun Heo <tj@kernel.org>
Cc: Niel Lambrechts <niel.lambrechts@gmail.com>
Cc: James Bottomley <James.Bottomley@HansenPartnership.com>
Signed-off-by: Jens Axboe <jens.axboe@oracle.com>
Failfast has characteristics distinct from other attributes. When
issuing, executing and successfully completing requests, failfast
doesn't make any difference. It only affects how a request is handled
on failure. Allowing requests with different failfast settings to be
merged causes normal IOs to fail prematurely, while not allowing it
has performance penalties, as failfast is used for read aheads which
are likely to be located near in-flight or to-be-issued normal IOs.
This patch introduces the concept of 'mixed merge'. A request is a
mixed merge if it is a merge of segments which require different
handling on failure. Currently the only mixable attributes are
failfast ones (or lack thereof).
When a bio with different failfast settings is added to an existing
request or requests of different failfast settings are merged, the
merged request is marked mixed. Each bio carries failfast settings
and the request always tracks the failfast state of the first bio.
When the request fails, blk_rq_err_bytes() can be used to determine
how many bytes can be safely failed without crossing into an area
which requires further retries.
This allows request merging regardless of failfast settings while
keeping the failure handling correct.
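
The error-boundary rule can be sketched in user space as follows. This
is a simplified model, not the kernel implementation: the structures,
flag values and model_err_bytes() are hypothetical stand-ins for the
request, the bio and blk_rq_err_bytes().

```c
#include <assert.h>
#include <stddef.h>

/* Illustrative failfast bits; the values are assumptions. */
#define FF_DEV		(1u << 0)
#define FF_TRANSPORT	(1u << 1)
#define FF_DRIVER	(1u << 2)
#define FF_MASK		(FF_DEV | FF_TRANSPORT | FF_DRIVER)

struct model_bio {
	unsigned int flags;		/* failfast settings of this bio */
	unsigned int size;		/* bytes covered by this bio */
	struct model_bio *next;
};

struct model_request {
	unsigned int flags;		/* failfast state of the first bio */
	int mixed;			/* non-zero for a mixed merge */
	struct model_bio *bio;
};

/*
 * Count the bytes from the start of the request that may be failed
 * right away: stop at the first bio that lacks one of the failfast
 * bits the request started with.
 */
unsigned int model_err_bytes(const struct model_request *rq)
{
	unsigned int ff = rq->flags & FF_MASK;
	unsigned int bytes = 0;
	const struct model_bio *bio;

	assert(rq != NULL);
	for (bio = rq->bio; bio; bio = bio->next) {
		if (rq->mixed && (bio->flags & ff) != ff)
			break;
		bytes += bio->size;
	}
	return bytes;
}
```

For a request built from two failfast bios of 4096 bytes followed by a
normal bio of 8192 bytes, the model reports 8192 failable bytes, so the
normal part is left for retry.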
This patch only implements mixed merge but doesn't enable it. The
next one will update SCSI to make use of mixed merge.
Signed-off-by: Tejun Heo <tj@kernel.org>
Cc: Niel Lambrechts <niel.lambrechts@gmail.com>
Signed-off-by: Jens Axboe <jens.axboe@oracle.com>
bio and request use the same set of failfast bits. This patch makes
the following changes to simplify things.
* enumify BIO_RW* bits and reorder bits such that BIO_RW_FAILFAST_*
bits coincide with __REQ_FAILFAST_* bits.
* The above pushes BIO_RW_AHEAD out of sync with __REQ_FAILFAST_DEV
but the matching is useless anyway. init_request_from_bio() is
responsible for setting FAILFAST bits on FS requests and non-FS
requests never use BIO_RW_AHEAD. Drop the code and comment from
blk_rq_bio_prep().
* Define REQ_FAILFAST_MASK which is OR of all FAILFAST bits and
simplify FAILFAST flags handling in init_request_from_bio().
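
To illustrate why coinciding bit positions simplify the flag handling,
here is a userspace sketch. The enum names and bit positions are
invented for the example, and treating read-ahead as fully failfast is
an assumption of the model rather than a quote of
init_request_from_bio().

```c
#include <assert.h>

/* Illustrative bit numbers; the real ones live in the kernel headers. */
enum model_bio_bits {
	MB_RW,
	MB_FAILFAST_DEV,
	MB_FAILFAST_TRANSPORT,
	MB_FAILFAST_DRIVER,
	MB_AHEAD,
};

enum model_req_bits {
	MR_RW,
	MR_FAILFAST_DEV,
	MR_FAILFAST_TRANSPORT,
	MR_FAILFAST_DRIVER,
};

/* The single-mask copy below relies on the bit numbers coinciding. */
_Static_assert((int)MB_FAILFAST_DEV == (int)MR_FAILFAST_DEV, "DEV");
_Static_assert((int)MB_FAILFAST_TRANSPORT == (int)MR_FAILFAST_TRANSPORT,
	       "TRANSPORT");
_Static_assert((int)MB_FAILFAST_DRIVER == (int)MR_FAILFAST_DRIVER, "DRIVER");

/* OR of all failfast bits. */
#define MR_FAILFAST_MASK ((1u << MR_FAILFAST_DEV) | \
			  (1u << MR_FAILFAST_TRANSPORT) | \
			  (1u << MR_FAILFAST_DRIVER))

/* Derive the failfast flags of a request from the flags of its bio. */
unsigned int model_req_failfast(unsigned int bio_flags)
{
	/* Assumption: read-ahead is treated as fully failfast. */
	if (bio_flags & (1u << MB_AHEAD))
		return MR_FAILFAST_MASK;
	/* Bits coincide, so one mask copies them across. */
	return bio_flags & MR_FAILFAST_MASK;
}
```

Without matching positions the copy would need one test and one set
per failfast bit.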
Signed-off-by: Tejun Heo <tj@kernel.org>
Signed-off-by: Jens Axboe <jens.axboe@oracle.com>
Based on a suggestion from Jaswinder, clarify what the user would need
to do to avoid this error message from kmemleak.
Reported-by: Jaswinder Singh Rajput <jaswinder@kernel.org>
Signed-off-by: Catalin Marinas <catalin.marinas@arm.com>
For messages from the tape core that is shared between the 3590 and 34xx
tape disciplines, we want to have the "tape" prefix instead of "tape_3590"
or "tape_34xx". In order to fix this, we now use the pr_xxx printk macros.
Signed-off-by: Michael Holzheu <holzheu@linux.vnet.ibm.com>
Signed-off-by: Martin Schwidefsky <schwidefsky@de.ibm.com>
smp_cpu_not_running() and cpu_stopped() do the same thing.
Remove one and also get rid of the last hard_smp_processor_id() leftover.
Signed-off-by: Heiko Carstens <heiko.carstens@de.ibm.com>
Signed-off-by: Martin Schwidefsky <schwidefsky@de.ibm.com>
Merge cpuid.h header file into cpu.h.
While at it convert from typedef to struct declaration and also
convert cio code to use proper lowcore structure instead of casts.
Signed-off-by: Heiko Carstens <heiko.carstens@de.ibm.com>
Signed-off-by: Martin Schwidefsky <schwidefsky@de.ibm.com>
Saves us more than 65k pointless IPIs.
Signed-off-by: Heiko Carstens <heiko.carstens@de.ibm.com>
Signed-off-by: Martin Schwidefsky <schwidefsky@de.ibm.com>
Currently, when a tape device is set online and no cartridge is loaded, we
get the messages "The tape cartridge has been successfully unloaded" and
"Determining the size of the recorded area". These messages are not correct.
To fix this, we now print the "cartridge loaded/unloaded" messages only
when the load/unload event really occurs. In addition to that, the message
"Determining the size of the recorded area" is printed only if a cartridge
is loaded.
Signed-off-by: Michael Holzheu <holzheu@linux.vnet.ibm.com>
Signed-off-by: Martin Schwidefsky <schwidefsky@de.ibm.com>
Get rid of the PAGE_STATES config option and enable guest page hinting
by default.
It can be disabled by specifying "cmma=off" at the command line.
Signed-off-by: Heiko Carstens <heiko.carstens@de.ibm.com>
Signed-off-by: Martin Schwidefsky <schwidefsky@de.ibm.com>
Using "%s" in sprintf event functions is dangerous. This patch adds a short
description for this issue to the s390 debug feature documentation.
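
The hazard can be shown with a small userspace model. It assumes that
the event stores only the pointer passed for "%s" and that formatting
happens later, when the log is read. The names below are hypothetical
and are not the s390 debug feature API.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/* Hypothetical log entry: keeps the format and the raw argument only. */
struct model_event {
	const char *fmt;
	const char *arg;	/* pointer is kept, the string is not copied */
};

/* Record an event; nothing is formatted at this point. */
void model_log(struct model_event *ev, const char *fmt, const char *arg)
{
	assert(ev != NULL && fmt != NULL);
	ev->fmt = fmt;
	ev->arg = arg;
}

/* Formatting happens later, when the log is read. */
int model_render(const struct model_event *ev, char *out, size_t len)
{
	return snprintf(out, len, ev->fmt, ev->arg);
}
```

If the caller overwrites its buffer between model_log() and
model_render(), the rendered text shows the new contents. If the buffer
had been freed instead, the read would touch invalid memory. The
string must therefore stay valid for as long as the log entry exists.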
Signed-off-by: Michael Holzheu <holzheu@linux.vnet.ibm.com>
Signed-off-by: Martin Schwidefsky <schwidefsky@de.ibm.com>
"lockdep: Fix backtraces" reveals a bug in early setup code: when
lockdep tries to save a stack backtrace before setup_arch has been
called, the lowcore pointer to the current thread info isn't
initialized yet.
However, our save stack backtrace code relies on it. If the pointer
isn't initialized, the saved backtrace will have zero entries.
lockdep, however, relies (correctly) on the fact that this cannot
happen.
A write access to some random memory region is the result.
Fix this by initializing the thread info pointer early.
Signed-off-by: Heiko Carstens <heiko.carstens@de.ibm.com>
Signed-off-by: Martin Schwidefsky <schwidefsky@de.ibm.com>
Suzuki Poulose reported the following recursive locking bug on s390:
Here is the stack trace : (see Appendix I for more info)
[<0000000000406ed6>] _spin_lock+0x52/0x94
[<0000000000103bde>] crst_table_free+0x14e/0x1a4
[<00000000001ba684>] __pmd_alloc+0x114/0x1ec
[<00000000001be8d0>] handle_mm_fault+0x2cc/0xb80
[<0000000000407d62>] do_dat_exception+0x2b6/0x3a0
[<0000000000114f8c>] sysc_return+0x0/0x8
[<00000200001642b2>] 0x200001642b2
The page_table_lock is already acquired in __pmd_alloc (mm/memory.c) and
it tries to populate the pud/pgd with a new pmd allocated. If another
thread populates it before we get a chance, we free the pmd using
pmd_free().
On s390x, pmd_free() (and even pud_free()) is #defined to crst_table_free(),
which acquires the page_table_lock to protect the crst_table index updates.
Hence this ends up in recursive locking of the page_table_lock.
The solution suggested by Dave Hansen is to use a new spin lock in the mmu
context to protect the access to the crst_list and the pgtable_list.
Reported-by: Suzuki Poulose <suzuki@in.ibm.com>
Cc: Dave Hansen <dave@linux.vnet.ibm.com>
Signed-off-by: Martin Schwidefsky <schwidefsky@de.ibm.com>
Use a console_initcall() to initialize the s390 virtio console and
clean up s390 console initialization in setup.c.
Signed-off-by: Hendrik Brueckner <brueckner@linux.vnet.ibm.com>
Tested-by: Christian Borntraeger <borntraeger@de.ibm.com>
Signed-off-by: Martin Schwidefsky <schwidefsky@de.ibm.com>
For printing unsigned integers hypfs uses "%d" in snprintf(). This is wrong.
With this patch "%u" is used instead.
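
A minimal userspace sketch of the difference, using a hypothetical
helper rather than the hypfs code itself:

```c
#include <assert.h>
#include <stdio.h>

/* Hypothetical helper, not the hypfs code: format an unsigned value. */
int model_format_unsigned(char *buf, size_t len, unsigned int value)
{
	assert(buf != NULL && len > 0);
	/*
	 * "%u" prints the full unsigned range. With "%d" the argument
	 * type would not match the conversion, and values above INT_MAX
	 * typically show up as negative numbers.
	 */
	return snprintf(buf, len, "%u", value);
}
```

A value such as 3000000000 does not fit into a 32-bit signed int, so
only the "%u" conversion prints it as written.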
Signed-off-by: Michael Holzheu <holzheu@linux.vnet.ibm.com>
Signed-off-by: Martin Schwidefsky <schwidefsky@de.ibm.com>
If a named saved system (NSS) cannot be defined or saved, print out an
error message with the return code of the underlying z/VM CP command.
Signed-off-by: Hendrik Brueckner <brueckner@linux.vnet.ibm.com>
Signed-off-by: Martin Schwidefsky <schwidefsky@de.ibm.com>
If dev_set_name fails during scanning the AP bus, the reserved memory
has to be freed.
Signed-off-by: Felix Beck <felix.beck@de.ibm.com>
Signed-off-by: Martin Schwidefsky <schwidefsky@de.ibm.com>
Currently, checksums for the XPRAM partitions are created and stored
during the suspend process. During the resume process, it is checked
whether the checksums are still the same. If this is not the case, a
kernel panic is triggered. Unfortunately this prevents XPRAM from being
used as suspend device, because in this case the memory image is written
to XPRAM after the checksum has been created, and therefore the contents
of the suspend partition are changed. In order to allow XPRAM to be used
as suspend device, this patch removes the checksum validation.
Signed-off-by: Michael Holzheu <holzheu@linux.vnet.ibm.com>
Signed-off-by: Martin Schwidefsky <schwidefsky@de.ibm.com>
The vmur class is allocated after the CCW driver is registered
and it is destroyed before the CCW driver is unregistered.
This is not the correct sequence, because the vmur class can be used
via driver core callbacks that are triggered during the CCW driver
deregistration. For example:
1. vmur device is online
2. vmur module is unloaded
This leads to the following function call stack:
<4> [<0000000000387286>] device_destroy+0x36/0x5c
<4> [<000003e000209714>] ur_set_offline_force+0x9c/0x10c [vmur]
<4> [<000003e00020a928>] ur_remove+0x64/0xbc [vmur]
<4> [<00000000003e4d2e>] ccw_device_remove+0x42/0x1ac
<4> [<000000000038a1aa>] __device_release_driver+0x9a/0xe4
<4> [<000000000038a2da>] driver_detach+0xe6/0xec
<4> [<0000000000388ee4>] bus_remove_driver+0xc0/0x108
<4> [<000003e00020ad5a>] ur_exit+0x52/0x84 [vmur]
In device_destroy() the vmur class is used. Since it is already freed,
this can lead to a kernel panic.
To fix the problem, the vmur class has to be allocated before the CCW
driver is registered and destroyed after the CCW driver has been unregistered.
Signed-off-by: Michael Holzheu <holzheu@linux.vnet.ibm.com>
Signed-off-by: Martin Schwidefsky <schwidefsky@de.ibm.com>
Local variable 'qname' in the function hypfs_create_file() really is not
used for any purpose.
Signed-off-by: Vitaliy Gusev <vgusev@openvz.org>
Cc: Michael Holzheu <holzheu@de.ibm.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Martin Schwidefsky <schwidefsky@de.ibm.com>
No need to define an irq_cpustat_t type if __ARCH_IRQ_STAT is defined.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Martin Schwidefsky <schwidefsky@de.ibm.com>
Eliminate the local variable machine_flags and always change machine
flags directly in the lowcore.
This avoids confusion about when and why the two variables have to be
synchronized.
Signed-off-by: Heiko Carstens <heiko.carstens@de.ibm.com>
Signed-off-by: Martin Schwidefsky <schwidefsky@de.ibm.com>
Note that this patch moves .data.init_task inside _edata. In
addition, the alignment of .init.ramfs changes: it is now PAGE_ALIGNED
and __initramfs_end is arbitrarily aligned; previously it was
only aligned to a 0x100-byte boundary, and always ended on an even
byte.
This change results in fewer output sections and in some data being
reordered, but should have no functional effect.
Signed-off-by: Nelson Elhage <nelhage@ksplice.com>
Signed-off-by: Tim Abbott <tabbott@ksplice.com>
Cc: Martin Schwidefsky <schwidefsky@de.ibm.com>
Cc: Heiko Carstens <heiko.carstens@de.ibm.com>
Cc: linux-s390@vger.kernel.org
Signed-off-by: Martin Schwidefsky <schwidefsky@de.ibm.com>
.data.page_aligned should not need a separate output section, so as
part of this cleanup I moved it into the .data output section in the
linker scripts in order to eliminate unnecessary references to the
section name.
Remove the reference to .data.idt, since nothing is put into the
.data.idt section on the s390 architecture. It looks like Cyrill
Gorcunov posted a patch to remove the .data.idt code on s390
previously:
<http://lkml.indiana.edu/hypermail/linux/kernel/0802.2/2536.html>
CCing him and the people who acked that patch in case there's a reason
it wasn't applied.
Signed-off-by: Tim Abbott <tabbott@ksplice.com>
Cc: Martin Schwidefsky <schwidefsky@de.ibm.com>
Acked-by: Cyrill Gorcunov <gorcunov@openvz.org>
Cc: Sam Ravnborg <sam@ravnborg.org>
Signed-off-by: Martin Schwidefsky <schwidefsky@de.ibm.com>
The sysc_restore_trace_psw and io_restore_trace_psw storage locations
are created in the .text section. When creating and IPLing from a named
saved system (NSS), writing to these locations causes a protection exception
(because the .text section is mapped as shared read-only in the NSS).
To permit write access, move the storage locations into the .data section.
The problem occurs only when CONFIG_TRACE_IRQFLAGS is set.
The git commit that introduced these variables is:
411788ea7f
Signed-off-by: Hendrik Brueckner <brueckner@linux.vnet.ibm.com>
Signed-off-by: Martin Schwidefsky <schwidefsky@de.ibm.com>
If the CP SET LOADDEV on the 3215 console has been used to specify
SCPdata, all data is converted to upper case letters.
When scpdata contains upper case letters only, convert all letters
to lower case.
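
The conversion rule can be sketched in user space as follows.
model_fold_scpdata() is a hypothetical helper written for illustration,
not the routine added by this patch.

```c
#include <assert.h>
#include <ctype.h>
#include <stddef.h>

/*
 * Hypothetical helper, not the kernel routine: if 'buf' contains no
 * lower case letters, convert every letter in it to lower case.
 */
void model_fold_scpdata(char *buf, size_t len)
{
	size_t i;

	assert(buf != NULL);
	for (i = 0; i < len; i++)
		if (islower((unsigned char)buf[i]))
			return;		/* mixed case: keep as entered */

	for (i = 0; i < len; i++)
		buf[i] = (char)tolower((unsigned char)buf[i]);
}
```

Data that already contains a lower case letter is left untouched, on
the assumption that it did not pass through the upper case conversion
of the 3215 console.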
Signed-off-by: Hendrik Brueckner <brueckner@linux.vnet.ibm.com>
Signed-off-by: Martin Schwidefsky <schwidefsky@de.ibm.com>