We've been relying on the block layer to assume rq->errors being set
translates into -EIO. I noticed in testing that sometimes this isn't
true, and really there's not much of a reason to have a counter instead
of just using -EIO. So set it properly so we don't leak random numbers
to unsuspecting victims.
Signed-off-by: Josef Bacik <jbacik@fb.com>
Signed-off-by: Jens Axboe <axboe@fb.com>
We can submit IO in a processes context, which means there can be
pending signals. This isn't a fatal error for NBD, but it does require
some finesse. If the signal happens before we transmit anything then we
are ok, just requeue the request and carry on. However if we've done a
partial transmit we can't allow anything else to be transmitted on this
socket until we transmit the remaining part of the request. Deal with
this by keeping track of how much we've sent for the current request,
and if we get an ERESTARTSYS during any part of our transmission save
the state of that request and requeue the IO. If anybody tries to
submit a request that isn't our pending request then requeue that
request until we are able to service the one that is pending.
Signed-off-by: Josef Bacik <jbacik@fb.com>
Signed-off-by: Jens Axboe <axboe@fb.com>
Use setup_timer() instead of init_timer() to simplify the code.
Signed-off-by: Geliang Tang <geliangtang@gmail.com>
Signed-off-by: Jens Axboe <axboe@fb.com>
Use setup_timer() instead of init_timer() to simplify the code.
Signed-off-by: Geliang Tang <geliangtang@gmail.com>
Signed-off-by: Jens Axboe <axboe@fb.com>
jewel and later on the server side and all stable kernels, a fixup for
-rc1 CRUSH changes and two usability enhancements: osd_request_timeout
option and supported_features bus attribute.
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v2
iQEcBAABCAAGBQJYwsEIAAoJEEp/3jgCEfOL34sH+wbYyT6uXQ3hlIoRt2FQNh5b
F6qmvH4jYRI+YyjJHgE7lLEv7cq/PESPej2hrw9U7GAso0KEsazOv+qpj4AcW+u1
arXYTIQQa2w9sCuj7/BrbEzDtnNOVnGyD3Ng0wAfvbxg/37xzqumkbccuWJm6GdH
Vjk31G4ZmaOOr38jeo0AkYWgs7kgfthLMFo73TgHTBBO9fkQQQL1xZH5D/Irzf8P
1ytfVyGeTl8D3szdkkOnc4eUFMwJ35wqesL+gAsQntx1/wDnGqa2IabXRs4oqr8F
oT88LXSP8w2PaFKI1FrwOuMov6ngg38tir2SMxGDIQ6TdxtK8lW37Cx3eHavqtE=
=f4Bs
-----END PGP SIGNATURE-----
Merge tag 'ceph-for-4.11-rc2' of git://github.com/ceph/ceph-client
Pull ceph fixes from Ilya Dryomov:
- a fix for the recently discovered misdirected requests bug present in
jewel and later on the server side and all stable kernels
- a fixup for -rc1 CRUSH changes
- two usability enhancements: osd_request_timeout option and
supported_features bus attribute.
* tag 'ceph-for-4.11-rc2' of git://github.com/ceph/ceph-client:
libceph: osd_request_timeout option
rbd: supported_features bus attribute
libceph: don't set weight to IN when OSD is destroyed
libceph: fix crush_decode() for older maps
Merge fixes from Andrew Morton:
"26 fixes"
* emailed patches from Andrew Morton <akpm@linux-foundation.org>: (26 commits)
userfaultfd: remove wrong comment from userfaultfd_ctx_get()
fat: fix using uninitialized fields of fat_inode/fsinfo_inode
sh: cayman: IDE support fix
kasan: fix races in quarantine_remove_cache()
kasan: resched in quarantine_remove_cache()
mm: do not call mem_cgroup_free() from within mem_cgroup_alloc()
thp: fix another corner case of munlock() vs. THPs
rmap: fix NULL-pointer dereference on THP munlocking
mm/memblock.c: fix memblock_next_valid_pfn()
userfaultfd: selftest: vm: allow to build in vm/ directory
userfaultfd: non-cooperative: userfaultfd_remove revalidate vma in MADV_DONTNEED
userfaultfd: non-cooperative: fix fork fctx->new memleak
mm/cgroup: avoid panic when init with low memory
drivers/md/bcache/util.h: remove duplicate inclusion of blkdev.h
mm/vmstats: add thp_split_pud event for clarity
include/linux/fs.h: fix unsigned enum warning with gcc-4.2
userfaultfd: non-cooperative: release all ctx in dup_userfaultfd_complete
userfaultfd: non-cooperative: robustness check
userfaultfd: non-cooperative: rollback userfaultfd_exit
x86, mm: unify exit paths in gup_pte_range()
...
Fix typos and add the following to the scripts/spelling.txt:
overide||override
While we are here, fix the doubled "address" in the touched line
Documentation/devicetree/bindings/regulator/ti-abb-regulator.txt.
Also, fix the comment block style in the touched hunks in
drivers/media/dvb-frontends/drx39xyj/drx_driver.h.
Link: http://lkml.kernel.org/r/1481573103-11329-21-git-send-email-yamada.masahiro@socionext.com
Signed-off-by: Masahiro Yamada <yamada.masahiro@socionext.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
zram can handle at most SECTORS_PER_PAGE sectors in a bio's bvec. When using
the NVMe over Fabrics loopback target which potentially sends a huge bulk of
pages attached to the bio's bvec this results in a kernel panic because of
array out of bounds accesses in zram_decompress_page().
Signed-off-by: Johannes Thumshirn <jthumshirn@suse.de>
Reviewed-by: Hannes Reinecke <hare@suse.com>
Reviewed-by: Sergey Senozhatsky <sergey.senozhatsky@gmail.com>
Signed-off-by: Jens Axboe <axboe@fb.com>
... so that userspace can generate meaningful error messages and spell
out unsupported features that need to be disabled.
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
Reviewed-by: Sage Weil <sage@redhat.com>
This is the set of stuff that didn't quite make the initial pull and a
set of fixes for stuff which did. The new stuff is basically lpfc
(nvme), qedi and aacraid. The fixes cover a lot of previously
submitted stuff, the most important of which probably covers some of
the failing irq vectors allocation and other fallout from having the
SCSI command allocated as part of the block allocation functions.
Signed-off-by: James E.J. Bottomley <jejb@linux.vnet.ibm.com>
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v2
iQIcBAABAgAGBQJYug/3AAoJEAVr7HOZEZN4bRwP/RvAW/NWlvs812gUeHNahUHx
HF3EtY5YHneZev1L4fa1Myd8cNcB3gic53e1WvQhN5f+LczeI7iXHH8xIQK/B4GT
OI5W7ZCGlhIamgsU8tUFGrtIOM6N/LIMZPMsaE62dcev0jC0Db4p/5yv5bUg1yuL
JOLWqIaTYTx7b5FwR2cXKv7KGYpXonvcoVUDV6s8aMYToJhJiQ6JtWmjaAA+/OJ7
HTcqNX/KSWQBG3ERjE3A/rGBFxPR9KPq8kpqVJlw39KEN9tMjXb0QAGvAeauNye2
xaIjV41tXH1Ys3aoh6ZU+TpzN3HLlH/gegRe8671fUwqu/RnSzFC8Eidoddzgqne
z5BF0Dy36FA5ol2HTFEwS5WM1gRM+z6YGnbhpE/XfyTL2sLHRynSXKxb02IdjRNi
gCdV89AL8xdfPPFJjsx4hGWUJhalW/RwuPayAupePKGQHePabHa19McI2ssg0WBg
OC6uuEk7vT95xuTZVJlrwXTJPnc9dxWJwzh21oEnvMfWM0VMW06EKT4AcX8FOsjL
RswiYp9wl8JDmfNxyJnQJj0+QXeIE/gh35vBQCGsLVKY4+xzbi3yZfb1sZB0XOum
yLK2LFQ+Ltk1TaPPSZsgXQZfodot0dFJPTnWr+whOfY1kepNlFRwQHltXnCWzgRs
QG//WHsk0u1Mj2gehwmQ
=e+6v
-----END PGP SIGNATURE-----
Merge tag 'scsi-misc' of git://git.kernel.org/pub/scm/linux/kernel/git/jejb/scsi
Pull more SCSI updates from James Bottomley:
"This is the set of stuff that didn't quite make the initial pull and a
set of fixes for stuff which did.
The new stuff is basically lpfc (nvme), qedi and aacraid. The fixes
cover a lot of previously submitted stuff, the most important of which
probably covers some of the failing irq vectors allocation and other
fallout from having the SCSI command allocated as part of the block
allocation functions"
* tag 'scsi-misc' of git://git.kernel.org/pub/scm/linux/kernel/git/jejb/scsi: (59 commits)
scsi: qedi: Fix memory leak in tmf response processing.
scsi: aacraid: remove redundant zero check on ret
scsi: lpfc: use proper format string for dma_addr_t
scsi: lpfc: use div_u64 for 64-bit division
scsi: mac_scsi: Fix MAC_SCSI=m option when SCSI=m
scsi: cciss: correct check map error.
scsi: qla2xxx: fix spelling mistake: "seperator" -> "separator"
scsi: aacraid: Fixed expander hotplug for SMART family
scsi: mpt3sas: switch to pci_alloc_irq_vectors
scsi: qedf: fixup compilation warning about atomic_t usage
scsi: remove scsi_execute_req_flags
scsi: merge __scsi_execute into scsi_execute
scsi: simplify scsi_execute_req_flags
scsi: make the sense header argument to scsi_test_unit_ready mandatory
scsi: sd: improve TUR handling in sd_check_events
scsi: always zero sshdr in scsi_normalize_sense
scsi: scsi_dh_emc: return success in clariion_std_inquiry()
scsi: fix memory leak of sdpk on when gd fails to allocate
scsi: sd: make sd_devt_release() static
scsi: qedf: Add QLogic FastLinQ offload FCoE driver framework.
...
Pull vfs 'statx()' update from Al Viro.
This adds the new extended stat() interface that internally subsumes our
previous stat interfaces, and allows user mode to specify in more detail
what kind of information it wants.
It also allows for some explicit synchronization information to be
passed to the filesystem, which can be relevant for network filesystems:
is the cached value ok, or do you need open/close consistency, or what?
From David Howells.
Andreas Dilger points out that the first version of the extended statx
interface was posted June 29, 2010:
https://www.spinics.net/lists/linux-fsdevel/msg33831.html
* 'rebased-statx' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs:
statx: Add a system call to make enhanced file info available
Pull block layer fixes from Jens Axboe:
"A collection of fixes for this merge window, either fixes for existing
issues, or parts that were waiting for acks to come in. This pull
request contains:
- Allocation of nvme queues on the right node from Shaohua.
This was ready long before the merge window, but waiting on an ack
from Bjorn on the PCI bit. Now that we have that, the three patches
can go in.
- Two fixes for blk-mq-sched with nvmeof, which uses hctx specific
request allocations. This caused an oops. One part from Sagi, one
part from Omar.
- A loop partition scan deadlock fix from Omar, fixing a regression
in this merge window.
- A three-patch series from Keith, closing up a hole on clearing out
requests on shutdown/resume.
- A stable fix for nbd from Josef, fixing a leak of sockets.
- Two fixes for a regression in this window from Jan, fixing a
problem with one of his earlier patches dealing with queue vs bdi
life times.
- A fix for a regression with virtio-blk, causing an IO stall if
scheduling is used. From me.
- A fix for an io context lock ordering problem. From me"
* 'for-linus' of git://git.kernel.dk/linux-block:
block: Move bdi_unregister() to del_gendisk()
blk-mq: ensure that bd->last is always set correctly
block: don't call ioc_exit_icq() with the queue lock held for blk-mq
block: Initialize bd_bdi on inode initialization
loop: fix LO_FLAGS_PARTSCAN hang
nvme: Complete all stuck requests
blk-mq: Provide freeze queue timeout
blk-mq: Export blk_mq_freeze_queue_wait
nbd: stop leaking sockets
blk-mq: move update of tags->rqs to __blk_mq_alloc_request()
blk-mq: kill blk_mq_set_alloc_data()
blk-mq: make blk_mq_alloc_request_hctx() allocate a scheduler request
blk-mq-sched: Allocate sched reserved tags as specified in the original queue tagset
nvme: allocate nvme_queue in correct node
PCI: add an API to get node from vector
blk-mq: allocate blk_mq_tags and requests in correct node
Pull sched.h split-up from Ingo Molnar:
"The point of these changes is to significantly reduce the
<linux/sched.h> header footprint, to speed up the kernel build and to
have a cleaner header structure.
After these changes the new <linux/sched.h>'s typical preprocessed
size goes down from a previous ~0.68 MB (~22K lines) to ~0.45 MB (~15K
lines), which is around 40% faster to build on typical configs.
Not much changed from the last version (-v2) posted three weeks ago: I
eliminated quirks, backmerged fixes plus I rebased it to an upstream
SHA1 from yesterday that includes most changes queued up in -next plus
all sched.h changes that were pending from Andrew.
I've re-tested the series both on x86 and on cross-arch defconfigs,
and did a bisectability test at a number of random points.
I tried to test as many build configurations as possible, but some
build breakage is probably still left - but it should be mostly
limited to architectures that have no cross-compiler binaries
available on kernel.org, and non-default configurations"
* 'WIP.sched-core-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (146 commits)
sched/headers: Clean up <linux/sched.h>
sched/headers: Remove #ifdefs from <linux/sched.h>
sched/headers: Remove the <linux/topology.h> include from <linux/sched.h>
sched/headers, hrtimer: Remove the <linux/wait.h> include from <linux/hrtimer.h>
sched/headers, x86/apic: Remove the <linux/pm.h> header inclusion from <asm/apic.h>
sched/headers, timers: Remove the <linux/sysctl.h> include from <linux/timer.h>
sched/headers: Remove <linux/magic.h> from <linux/sched/task_stack.h>
sched/headers: Remove <linux/sched.h> from <linux/sched/init.h>
sched/core: Remove unused prefetch_stack()
sched/headers: Remove <linux/rculist.h> from <linux/sched.h>
sched/headers: Remove the 'init_pid_ns' prototype from <linux/sched.h>
sched/headers: Remove <linux/signal.h> from <linux/sched.h>
sched/headers: Remove <linux/rwsem.h> from <linux/sched.h>
sched/headers: Remove the runqueue_is_locked() prototype
sched/headers: Remove <linux/sched.h> from <linux/sched/hotplug.h>
sched/headers: Remove <linux/sched.h> from <linux/sched/debug.h>
sched/headers: Remove <linux/sched.h> from <linux/sched/nohz.h>
sched/headers: Remove <linux/sched.h> from <linux/sched/stat.h>
sched/headers: Remove the <linux/gfp.h> include from <linux/sched.h>
sched/headers: Remove <linux/rtmutex.h> from <linux/sched.h>
...
Add a system call to make extended file information available, including
file creation and some attribute flags where available through the
underlying filesystem.
The getattr inode operation is altered to take two additional arguments: a
u32 request_mask and an unsigned int flags that indicate the
synchronisation mode. This change is propagated to the vfs_getattr*()
function.
Functions like vfs_stat() are now inline wrappers around new functions
vfs_statx() and vfs_statx_fd() to reduce stack usage.
========
OVERVIEW
========
The idea was initially proposed as a set of xattrs that could be retrieved
with getxattr(), but the general preference proved to be for a new syscall
with an extended stat structure.
A number of requests were gathered for features to be included. The
following have been included:
(1) Make the fields a consistent size on all arches and make them large.
(2) Spare space, request flags and information flags are provided for
future expansion.
(3) Better support for the y2038 problem [Arnd Bergmann] (tv_sec is an
__s64).
(4) Creation time: The SMB protocol carries the creation time, which could
be exported by Samba, which will in turn help CIFS make use of
FS-Cache as that can be used for coherency data (stx_btime).
This is also specified in NFSv4 as a recommended attribute and could
be exported by NFSD [Steve French].
(5) Lightweight stat: Ask for just those details of interest, and allow a
netfs (such as NFS) to approximate anything not of interest, possibly
without going to the server [Trond Myklebust, Ulrich Drepper, Andreas
Dilger] (AT_STATX_DONT_SYNC).
(6) Heavyweight stat: Force a netfs to go to the server, even if it thinks
its cached attributes are up to date [Trond Myklebust]
(AT_STATX_FORCE_SYNC).
And the following have been left out for future extension:
(7) Data version number: Could be used by userspace NFS servers [Aneesh
Kumar].
Can also be used to modify fill_post_wcc() in NFSD which retrieves
i_version directly, but has just called vfs_getattr(). It could get
it from the kstat struct if it used vfs_xgetattr() instead.
(There's disagreement on the exact semantics of a single field, since
not all filesystems do this the same way).
(8) BSD stat compatibility: Including more fields from the BSD stat such
as creation time (st_btime) and inode generation number (st_gen)
[Jeremy Allison, Bernd Schubert].
(9) Inode generation number: Useful for FUSE and userspace NFS servers
[Bernd Schubert].
(This was asked for but later deemed unnecessary with the
open-by-handle capability available and caused disagreement as to
whether it's a security hole or not).
(10) Extra coherency data may be useful in making backups [Andreas Dilger].
(No particular data were offered, but things like last backup
timestamp, the data version number and the DOS archive bit would come
into this category).
(11) Allow the filesystem to indicate what it can/cannot provide: A
filesystem can now say it doesn't support a standard stat feature if
that isn't available, so if, for instance, inode numbers or UIDs don't
exist or are fabricated locally...
(This requires a separate system call - I have an fsinfo() call idea
for this).
(12) Store a 16-byte volume ID in the superblock that can be returned in
struct xstat [Steve French].
(Deferred to fsinfo).
(13) Include granularity fields in the time data to indicate the
granularity of each of the times (NFSv4 time_delta) [Steve French].
(Deferred to fsinfo).
(14) FS_IOC_GETFLAGS value. These could be translated to BSD's st_flags.
Note that the Linux IOC flags are a mess and filesystems such as Ext4
define flags that aren't in linux/fs.h, so translation in the kernel
may be a necessity (or, possibly, we provide the filesystem type too).
(Some attributes are made available in stx_attributes, but the general
feeling was that the IOC flags were to ext[234]-specific and shouldn't
be exposed through statx this way).
(15) Mask of features available on file (eg: ACLs, seclabel) [Brad Boyer,
Michael Kerrisk].
(Deferred, probably to fsinfo. Finding out if there's an ACL or
seclabal might require extra filesystem operations).
(16) Femtosecond-resolution timestamps [Dave Chinner].
(A __reserved field has been left in the statx_timestamp struct for
this - if there proves to be a need).
(17) A set multiple attributes syscall to go with this.
===============
NEW SYSTEM CALL
===============
The new system call is:
int ret = statx(int dfd,
const char *filename,
unsigned int flags,
unsigned int mask,
struct statx *buffer);
The dfd, filename and flags parameters indicate the file to query, in a
similar way to fstatat(). There is no equivalent of lstat() as that can be
emulated with statx() by passing AT_SYMLINK_NOFOLLOW in flags. There is
also no equivalent of fstat() as that can be emulated by passing a NULL
filename to statx() with the fd of interest in dfd.
Whether or not statx() synchronises the attributes with the backing store
can be controlled by OR'ing a value into the flags argument (this typically
only affects network filesystems):
(1) AT_STATX_SYNC_AS_STAT tells statx() to behave as stat() does in this
respect.
(2) AT_STATX_FORCE_SYNC will require a network filesystem to synchronise
its attributes with the server - which might require data writeback to
occur to get the timestamps correct.
(3) AT_STATX_DONT_SYNC will suppress synchronisation with the server in a
network filesystem. The resulting values should be considered
approximate.
mask is a bitmask indicating the fields in struct statx that are of
interest to the caller. The user should set this to STATX_BASIC_STATS to
get the basic set returned by stat(). It should be noted that asking for
more information may entail extra I/O operations.
buffer points to the destination for the data. This must be 256 bytes in
size.
======================
MAIN ATTRIBUTES RECORD
======================
The following structures are defined in which to return the main attribute
set:
struct statx_timestamp {
__s64 tv_sec;
__s32 tv_nsec;
__s32 __reserved;
};
struct statx {
__u32 stx_mask;
__u32 stx_blksize;
__u64 stx_attributes;
__u32 stx_nlink;
__u32 stx_uid;
__u32 stx_gid;
__u16 stx_mode;
__u16 __spare0[1];
__u64 stx_ino;
__u64 stx_size;
__u64 stx_blocks;
__u64 __spare1[1];
struct statx_timestamp stx_atime;
struct statx_timestamp stx_btime;
struct statx_timestamp stx_ctime;
struct statx_timestamp stx_mtime;
__u32 stx_rdev_major;
__u32 stx_rdev_minor;
__u32 stx_dev_major;
__u32 stx_dev_minor;
__u64 __spare2[14];
};
The defined bits in request_mask and stx_mask are:
STATX_TYPE Want/got stx_mode & S_IFMT
STATX_MODE Want/got stx_mode & ~S_IFMT
STATX_NLINK Want/got stx_nlink
STATX_UID Want/got stx_uid
STATX_GID Want/got stx_gid
STATX_ATIME Want/got stx_atime{,_ns}
STATX_MTIME Want/got stx_mtime{,_ns}
STATX_CTIME Want/got stx_ctime{,_ns}
STATX_INO Want/got stx_ino
STATX_SIZE Want/got stx_size
STATX_BLOCKS Want/got stx_blocks
STATX_BASIC_STATS [The stuff in the normal stat struct]
STATX_BTIME Want/got stx_btime{,_ns}
STATX_ALL [All currently available stuff]
stx_btime is the file creation time, stx_mask is a bitmask indicating the
data provided and __spares*[] are where as-yet undefined fields can be
placed.
Time fields are structures with separate seconds and nanoseconds fields
plus a reserved field in case we want to add even finer resolution. Note
that times will be negative if before 1970; in such a case, the nanosecond
fields will also be negative if not zero.
The bits defined in the stx_attributes field convey information about a
file, how it is accessed, where it is and what it does. The following
attributes map to FS_*_FL flags and are the same numerical value:
STATX_ATTR_COMPRESSED File is compressed by the fs
STATX_ATTR_IMMUTABLE File is marked immutable
STATX_ATTR_APPEND File is append-only
STATX_ATTR_NODUMP File is not to be dumped
STATX_ATTR_ENCRYPTED File requires key to decrypt in fs
Within the kernel, the supported flags are listed by:
KSTAT_ATTR_FS_IOC_FLAGS
[Are any other IOC flags of sufficient general interest to be exposed
through this interface?]
New flags include:
STATX_ATTR_AUTOMOUNT Object is an automount trigger
These are for the use of GUI tools that might want to mark files specially,
depending on what they are.
Fields in struct statx come in a number of classes:
(0) stx_dev_*, stx_blksize.
These are local system information and are always available.
(1) stx_mode, stx_nlinks, stx_uid, stx_gid, stx_[amc]time, stx_ino,
stx_size, stx_blocks.
These will be returned whether the caller asks for them or not. The
corresponding bits in stx_mask will be set to indicate whether they
actually have valid values.
If the caller didn't ask for them, then they may be approximated. For
example, NFS won't waste any time updating them from the server,
unless as a byproduct of updating something requested.
If the values don't actually exist for the underlying object (such as
UID or GID on a DOS file), then the bit won't be set in the stx_mask,
even if the caller asked for the value. In such a case, the returned
value will be a fabrication.
Note that there are instances where the type might not be valid, for
instance Windows reparse points.
(2) stx_rdev_*.
This will be set only if stx_mode indicates we're looking at a
blockdev or a chardev, otherwise will be 0.
(3) stx_btime.
Similar to (1), except this will be set to 0 if it doesn't exist.
=======
TESTING
=======
The following test program can be used to test the statx system call:
samples/statx/test-statx.c
Just compile and run, passing it paths to the files you want to examine.
The file is built automatically if CONFIG_SAMPLES is enabled.
Here's some example output. Firstly, an NFS directory that crosses to
another FSID. Note that the AUTOMOUNT attribute is set because transiting
this directory will cause d_automount to be invoked by the VFS.
[root@andromeda ~]# /tmp/test-statx -A /warthog/data
statx(/warthog/data) = 0
results=7ff
Size: 4096 Blocks: 8 IO Block: 1048576 directory
Device: 00:26 Inode: 1703937 Links: 125
Access: (3777/drwxrwxrwx) Uid: 0 Gid: 4041
Access: 2016-11-24 09:02:12.219699527+0000
Modify: 2016-11-17 10:44:36.225653653+0000
Change: 2016-11-17 10:44:36.225653653+0000
Attributes: 0000000000001000 (-------- -------- -------- -------- -------- -------- ---m---- --------)
Secondly, the result of automounting on that directory.
[root@andromeda ~]# /tmp/test-statx /warthog/data
statx(/warthog/data) = 0
results=7ff
Size: 4096 Blocks: 8 IO Block: 1048576 directory
Device: 00:27 Inode: 2 Links: 125
Access: (3777/drwxrwxrwx) Uid: 0 Gid: 4041
Access: 2016-11-24 09:02:12.219699527+0000
Modify: 2016-11-17 10:44:36.225653653+0000
Change: 2016-11-17 10:44:36.225653653+0000
Signed-off-by: David Howells <dhowells@redhat.com>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
Pull vfs pile two from Al Viro:
- orangefs fix
- series of fs/namei.c cleanups from me
- VFS stuff coming from overlayfs tree
* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs:
orangefs: Use RCU for destroy_inode
vfs: use helper for calling f_op->fsync()
mm: use helper for calling f_op->mmap()
vfs: use helpers for calling f_op->{read,write}_iter()
vfs: pass type instead of fn to do_{loop,iter}_readv_writev()
vfs: extract common parts of {compat_,}do_readv_writev()
vfs: wrap write f_ops with file_{start,end}_write()
vfs: deny copy_file_range() for non regular files
vfs: deny fallocate() on directory
vfs: create vfs helper vfs_tmpfile()
namei.c: split unlazy_walk()
namei.c: fold the check for DCACHE_OP_REVALIDATE into d_revalidate()
lookup_fast(): clean up the logics around the fallback to non-rcu mode
namei: fold unlazy_link() into its sole caller
Pull vfs sendmsg updates from Al Viro:
"More sendmsg work.
This is a fairly separate isolated stuff (there's a continuation
around lustre, but that one was too late to soak in -next), thus the
separate pull request"
* 'work.sendmsg' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs:
ncpfs: switch to sock_sendmsg()
ncpfs: don't mess with manually advancing iovec on send
ncpfs: sendmsg does *not* bugger iovec these days
ceph_tcp_sendpage(): use ITER_BVEC sendmsg
afs_send_pages(): use ITER_BVEC
rds: remove dead code
ceph: switch to sock_recvmsg()
usbip_recv(): switch to sock_recvmsg()
iscsi_target: deal with short writes on the tx side
[nbd] pass iov_iter to nbd_xmit()
[nbd] switch sock_xmit() to sock_{send,recv}msg()
[drbd] use sock_sendmsg()
Looks like a quiet cycle for vhost/virtio, just a couple of minor
tweaks. Most notable is automatic interrupt affinity for blk and scsi.
Hopefully other devices are not far behind.
Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
-----BEGIN PGP SIGNATURE-----
iQEcBAABAgAGBQJYt1rRAAoJECgfDbjSjVRpEZsIALSHevdXWtRHBZUb0ZkqPLQb
/x2Vn49CcALS1p7iSuP9L027MPeaLKyr0NBT9hptBChp/4b9lnZWyyAo6vYQrzfx
Ia/hLBYsK4ml6lEwbyfLwqkF2cmYCrZhBSVAILifn84lTPoN7CT0PlYDfA+OCaNR
geo75qF8KR+AUO0aqchwMRL3RV3OxZKxQr2AR6LttCuhiBgnV3Xqxffg/M3x6ONM
0ffFFdodm6slem3hIEiGUMwKj4NKQhcOleV+y0fVBzWfLQG9210pZbQyRBRikIL0
7IsaarpaUr7OrLAZFMGF6nJnyRAaRrt6WknTHZkyvyggrePrGcmGgPm4jrODwY4=
=2zwv
-----END PGP SIGNATURE-----
Merge tag 'for_linus' of git://git.kernel.org/pub/scm/linux/kernel/git/mst/vhost
Pull vhost updates from Michael Tsirkin:
"virtio, vhost: optimizations, fixes
Looks like a quiet cycle for vhost/virtio, just a couple of minor
tweaks. Most notable is automatic interrupt affinity for blk and scsi.
Hopefully other devices are not far behind"
* tag 'for_linus' of git://git.kernel.org/pub/scm/linux/kernel/git/mst/vhost:
virtio-console: avoid DMA from stack
vhost: introduce O(1) vq metadata cache
virtio_scsi: use virtio IRQ affinity
virtio_blk: use virtio IRQ affinity
blk-mq: provide a default queue mapping for virtio device
virtio: provide a method to get the IRQ affinity mask for a virtqueue
virtio: allow drivers to request IRQ affinity when creating VQs
virtio_pci: simplify MSI-X setup
virtio_pci: don't duplicate the msix_enable flag in struct pci_dev
virtio_pci: use shared interrupts for virtqueues
virtio_pci: remove struct virtio_pci_vq_info
vhost: try avoiding avail index access when getting descriptor
virtio_mmio: expose header to userspace
loop_reread_partitions() needs to do I/O, but we just froze the queue,
so we end up waiting forever. This can easily be reproduced with losetup
-P. Fix it by moving the reread to after we unfreeze the queue.
Fixes: ecdd09597a ("block/loop: fix race between I/O and set_status")
Reported-by: Tejun Heo <tj@kernel.org>
Cc: stable@vger.kernel.org
Signed-off-by: Omar Sandoval <osandov@fb.com>
Reviewed-by: Ming Lei <tom.leiming@gmail.com>
Signed-off-by: Jens Axboe <axboe@fb.com>
This was introduced in the multi-connection patch, we've been leaking
socket's ever since.
Fixes: 9561a7a ("nbd: add multi-connection support")
cc: stable@vger.kernel.org
Signed-off-by: Josef Bacik <jbacik@fb.com>
Signed-off-by: Jens Axboe <axboe@fb.com>
Fix up affected files that include this signal functionality via sched.h.
Acked-by: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Mike Galbraith <efault@gmx.de>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: linux-kernel@vger.kernel.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
We are going to split <linux/sched/signal.h> out of <linux/sched.h>, which
will have to be picked up from other headers and a couple of .c files.
Create a trivial placeholder <linux/sched/signal.h> file that just
maps to <linux/sched.h> to make this patch obviously correct and
bisectable.
Include the new header in the files that are going to need it.
Acked-by: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Mike Galbraith <efault@gmx.de>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: linux-kernel@vger.kernel.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
We are going to move scheduler ABI details to <uapi/linux/sched/types.h>,
which will be used from a number of .c files.
Create empty placeholder header that maps to <linux/types.h>.
Include the new header in the files that are going to need it.
Acked-by: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Mike Galbraith <efault@gmx.de>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: linux-kernel@vger.kernel.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Pull IDR rewrite from Matthew Wilcox:
"The most significant part of the following is the patch to rewrite the
IDR & IDA to be clients of the radix tree. But there's much more,
including an enhancement of the IDA to be significantly more space
efficient, an IDR & IDA test suite, some improvements to the IDR API
(and driver changes to take advantage of those improvements), several
improvements to the radix tree test suite and RCU annotations.
The IDR & IDA rewrite had a good spin in linux-next and Andrew's tree
for most of the last cycle. Coupled with the IDR test suite, I feel
pretty confident that any remaining bugs are quite hard to hit. 0-day
did a great job of watching my git tree and pointing out problems; as
it hit them, I added new test-cases to be sure not to be caught the
same way twice"
Willy goes on to expand a bit on the IDR rewrite rationale:
"The radix tree and the IDR use very similar data structures.
Merging the two codebases lets us share the memory allocation pools,
and results in a net deletion of 500 lines of code. It also opens up
the possibility of exposing more of the features of the radix tree to
users of the IDR (and I have some interesting patches along those
lines waiting for 4.12)
It also shrinks the size of the 'struct idr' from 40 bytes to 24 which
will shrink a fair few data structures that embed an IDR"
* 'idr-4.11' of git://git.infradead.org/users/willy/linux-dax: (32 commits)
radix tree test suite: Add config option for map shift
idr: Add missing __rcu annotations
radix-tree: Fix __rcu annotations
radix-tree: Add rcu_dereference and rcu_assign_pointer calls
radix tree test suite: Run iteration tests for longer
radix tree test suite: Fix split/join memory leaks
radix tree test suite: Fix leaks in regression2.c
radix tree test suite: Fix leaky tests
radix tree test suite: Enable address sanitizer
radix_tree_iter_resume: Fix out of bounds error
radix-tree: Store a pointer to the root in each node
radix-tree: Chain preallocated nodes through ->parent
radix tree test suite: Dial down verbosity with -v
radix tree test suite: Introduce kmalloc_verbose
idr: Return the deleted entry from idr_remove
radix tree test suite: Build separate binaries for some tests
ida: Use exceptional entries for small IDAs
ida: Move ida_bitmap to a percpu variable
Reimplement IDR and IDA using the radix tree
radix-tree: Add radix_tree_iter_delete
...
- support for rbd data-pool feature, which enables rbd images on
erasure-coded pools (myself). CEPH_PG_MAX_SIZE has been bumped to
allow erasure-coded profiles with k+m up to 32.
- a patch for ceph_d_revalidate() performance regression introduced in
4.9, along with some cleanups in the area (Jeff Layton)
- a set of fixes for unsafe ->d_parent accesses in CephFS (Jeff Layton)
- buffered reads are now processed in rsize windows instead of rasize
windows (Andreas Gerstmayr). The new default for rsize mount option
is 64M.
- ack vs commit distinction is gone, greatly simplifying ->fsync() and
MOSDOpReply handling code (myself)
Also a few filesystem bug fixes from Zheng, a CRUSH sync up (CRUSH
computations are still serialized though) and several minor fixes and
cleanups all over.
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v2
iQEcBAABCAAGBQJYtY0rAAoJEEp/3jgCEfOLQioH/36QKsalquY1FCdJnJve9qj0
q19OohamIedhv76AYvXhJzBBHlVwerjicE51/bSzuUhxV+ApdATrPPcLC22oLd3i
h0R9NAUMYjiris1yN/Z9JRiPCSdsxvHuRycsUMRSRbxZhnyP9XdTxFD1A+fLfisU
Z4osyTzadabVL5Um9maRBbAtXCWh3d9JZzPa5xIvWTEO4CWWk87GtEIIQDcgx+Y6
8ZSMmrVFDNtskUp9js+LnFYW7/xBsEXyqgsqKaecf5uQqwu1WKRXSKtv9PUmGAIb
HBrlUdV1PQaCzTYtaoztJshNdYcphM5L7gePzxRG0nXrTNsq8J5eCzI8en5qS8w=
=CPL/
-----END PGP SIGNATURE-----
Merge tag 'ceph-for-4.11-rc1' of git://github.com/ceph/ceph-client
Pull ceph updates from Ilya Dryomov:
"This time around we have:
- support for rbd data-pool feature, which enables rbd images on
erasure-coded pools (myself). CEPH_PG_MAX_SIZE has been bumped to
allow erasure-coded profiles with k+m up to 32.
- a patch for ceph_d_revalidate() performance regression introduced
in 4.9, along with some cleanups in the area (Jeff Layton)
- a set of fixes for unsafe ->d_parent accesses in CephFS (Jeff
Layton)
- buffered reads are now processed in rsize windows instead of rasize
windows (Andreas Gerstmayr). The new default for rsize mount option
is 64M.
- ack vs commit distinction is gone, greatly simplifying ->fsync()
and MOSDOpReply handling code (myself)
... also a few filesystem bug fixes from Zheng, a CRUSH sync up (CRUSH
computations are still serialized though) and several minor fixes and
cleanups all over"
* tag 'ceph-for-4.11-rc1' of git://github.com/ceph/ceph-client: (52 commits)
libceph, rbd, ceph: WRITE | ONDISK -> WRITE
libceph: get rid of ack vs commit
ceph: remove special ack vs commit behavior
ceph: tidy some white space in get_nonsnap_parent()
crush: fix dprintk compilation
crush: do is_out test only if we do not collide
ceph: remove req from unsafe list when unregistering it
rbd: constify device_type structure
rbd: kill obj_request->object_name and rbd_segment_name_cache
rbd: store and use obj_request->object_no
rbd: RBD_V{1,2}_DATA_FORMAT macros
rbd: factor out __rbd_osd_req_create()
rbd: set offset and length outside of rbd_obj_request_create()
rbd: support for data-pool feature
rbd: introduce rbd_init_layout()
rbd: use rbd_obj_bytes() more
rbd: remove now unused rbd_obj_request_wait() and helpers
rbd: switch rbd_obj_method_sync() to ceph_osdc_call()
libceph: pass reply buffer length through ceph_osdc_call()
rbd: do away with obj_request in rbd_obj_read_sync()
...
Fix typos and add the following to the scripts/spelling.txt:
algined||aligned
While we are here, fix the "appplication" in the touched line in
drivers/block/loop.c. Also, fix the "may not naturally ..." to
"may not be naturally ..." in the touched line in mm/page_alloc.
Link: http://lkml.kernel.org/r/1481573103-11329-9-git-send-email-yamada.masahiro@socionext.com
Signed-off-by: Masahiro Yamada <yamada.masahiro@socionext.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Use automatic IRQ affinity assignment in the virtio layer if available,
and build the blk-mq queues based on it.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
Add a struct irq_affinity pointer to the find_vqs methods, which if set
is used to tell the PCI layer to create the MSI-X vectors for our I/O
virtqueues with the proper affinity from the start. Compared to after
the fact affinity hints this gives us an instantly working setup and
allows to allocate the irq descritors node-local and avoid interconnect
traffic. Last but not least this will allow blk-mq queues are created
based on the interrupt affinity for storage drivers.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Jason Wang <jasowang@redhat.com>
Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
Merge more updates from Andrew Morton:
- almost all of the rest of MM
- misc bits
- KASAN updates
- procfs
- lib/ updates
- checkpatch updates
* emailed patches from Andrew Morton <akpm@linux-foundation.org>: (124 commits)
checkpatch: remove false unbalanced braces warning
checkpatch: notice unbalanced else braces in a patch
checkpatch: add another old address for the FSF
checkpatch: update $logFunctions
checkpatch: warn on logging continuations
checkpatch: warn on embedded function names
lib/lz4: remove back-compat wrappers
fs/pstore: fs/squashfs: change usage of LZ4 to work with new LZ4 version
crypto: change LZ4 modules to work with new LZ4 module version
lib/decompress_unlz4: change module to work with new LZ4 module version
lib: update LZ4 compressor module
lib/test_sort.c: make it explicitly non-modular
lib: add CONFIG_TEST_SORT to enable self-test of sort()
rbtree: use designated initializers
linux/kernel.h: fix DIV_ROUND_CLOSEST to support negative divisors
lib/find_bit.c: micro-optimise find_next_*_bit
lib: add module support to atomic64 tests
lib: add module support to glob tests
lib: add module support to crc32 tests
kernel/ksysfs.c: add __ro_after_init to bin_attribute structure
...
The idea is that without doing more calculations we extend zero pages to
same element pages for zram. zero page is special case of same element
page with zero element.
1. the test is done under android 7.0
2. startup too many applications circularly
3. sample the zero pages, same pages (none-zero element)
and total pages in function page_zero_filled
the result is listed as below:
ZERO SAME TOTAL
36214 17842 598196
ZERO/TOTAL SAME/TOTAL (ZERO+SAME)/TOTAL ZERO/SAME
AVERAGE 0.060631909 0.024990816 0.085622726 2.663825038
STDEV 0.00674612 0.005887625 0.009707034 2.115881328
MAX 0.069698422 0.030046087 0.094975336 7.56043956
MIN 0.03959586 0.007332205 0.056055193 1.928985507
from the above data, the benefit is about 2.5% and up to 3% of total
swapout pages.
The defect of the patch is that when we recovery a page from non-zero
element the operations are low efficient for partial read.
This patch extends zero_page to same_page so if there is any user to
have monitored zero_pages, he will be surprised if the number is
increased but it's not harmful, I believe.
[minchan@kernel.org: do not free same element pages in zram_meta_free]
Link: http://lkml.kernel.org/r/20170207065741.GA2567@bbox
Link: http://lkml.kernel.org/r/1483692145-75357-1-git-send-email-zhouxianrong@huawei.com
Link: http://lkml.kernel.org/r/1486307804-27903-1-git-send-email-minchan@kernel.org
Signed-off-by: zhouxianrong <zhouxianrong@huawei.com>
Signed-off-by: Minchan Kim <minchan@kernel.org>
Cc: Sergey Senozhatsky <sergey.senozhatsky@gmail.com>
Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
zram_reset_device() waits for ongoing writepage pages to be completed by
zram->refcount logic. However, it's pointless because before the reset,
we prevent further opening of zram by zram->claim and flush all of
pending IO by fsync_bdev so there should be no pending IO at the
zram_reset_device().
So let's remove that code which is even broken due to the lack of
wake_up elsewhere.
Link: http://lkml.kernel.org/r/1485145031-11661-1-git-send-email-minchan@kernel.org
Signed-off-by: Minchan Kim <minchan@kernel.org>
Reviewed-by: Sergey Senozhatsky <sergey.senozhatsky@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Pull sparc updates from David Miller:
1) Support multiple huge page sizes, from Nitin Gupta.
2) Improve boot time on large memory configurations, from Pavel
Tatashin.
3) Make BRK handling more consistent and documented, from Vijay Kumar.
* git://git.kernel.org/pub/scm/linux/kernel/git/davem/sparc:
sparc64: Fix build error in flush_tsb_user_page
sparc64: memblock resizes are not handled properly
sparc64: use latency groups to improve add_node_ranges speed
sparc64: Add 64K page size support
sparc64: Multi-page size support
Documentation/sparc: Steps for sending break on sunhv console
sparc64: Send break twice from console to return to boot prom
sparc64: Migrate hvcons irq to panicked cpu
sparc64: Set cpu state to offline when stopped
sunvdc: Add support for setting physical sector size
sparc64: fix for user probes in high memory
sparc: topology_64.h: Fix condition for including cpudata.h
sparc32: mm: srmmu: add __ro_after_init to sparc32_cachetlb_ops structures
Pull block updates and fixes from Jens Axboe:
- NVMe updates and fixes that missed the first pull request. This
includes bug fixes, and support for autonomous power management.
- Fix from Christoph for missing clear of the request payload, causing
a problem with (at least) the storvsc driver.
- Further fixes for the queue/bdi life time issues from Jan.
- The Kconfig mq scheduler update from me.
- Fixing a use-after-free in dm-rq, spotted by Bart, introduced in this
merge window.
- Three fixes for nbd from Josef.
- Bug fix from Omar, fixing a bug in sas transport code that oopses
when bsg ioctls were used. From Omar.
- Improvements to the queue restart and tag wait from from Omar.
- Set of fixes for the sed/opal code from Scott.
- Three trivial patches to cciss from Tobin
* 'for-linus' of git://git.kernel.dk/linux-block: (41 commits)
dm-rq: don't dereference request payload after ending request
blk-mq-sched: separate mark hctx and queue restart operations
blk-mq: use sbq wait queues instead of restart for driver tags
block/sed-opal: Propagate original error message to userland.
nvme/pci: re-check security protocol support after reset
block/sed-opal: Introduce free_opal_dev to free the structure and clean up state
nvme: detect NVMe controller in recent MacBooks
nvme-rdma: add support for host_traddr
nvmet-rdma: Fix error handling
nvmet-rdma: use nvme cm status helper
nvme-rdma: move nvme cm status helper to .h file
nvme-fc: don't bother to validate ioccsz and iorcsz
nvme/pci: No special case for queue busy on IO
nvme/core: Fix race kicking freed request_queue
nvme/pci: Disable on removal when disconnected
nvme: Enable autonomous power state transitions
nvme: Add a quirk mechanism that uses identify_ctrl
nvme: make nvmf_register_transport require a create_ctrl callback
nvme: Use CNS as 8-bit field and avoid endianness conversion
nvme: add semicolon in nvme_command setting
...
CEPH_OSD_FLAG_ONDISK is set in account_request().
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
Reviewed-by: Jeff Layton <jlayton@redhat.com>
Reviewed-by: Sage Weil <sage@redhat.com>
Remove device driver failed to check map error messages
Reported-by: Johnny Bieren <jbieren@redhat.com>
Tested-by: Johnny Bieren <jbieren@redhat.com>
Reviewed-by: Scott Teel <scott.teel@microsemi.com>
Signed-off-by: Don Brace <don.brace@microsemi.com>
Signed-off-by: Martin K. Petersen <martin.petersen@oracle.com>
Physical sector size is supported in v1.2 of the vDisk protocol and
should be set if available. If protocol version 1.2 is used and the
physical disk size is unavailable, then the disk is considered busy.
Signed-off-by: Liam R. Howlett <Liam.Howlett@Oracle.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
We had a deprecated_attr_warn() warning for 2 years and now the time has
come and we finally can do the cleanup.
The plan was as follows:
: per-stat sysfs attributes are considered to be deprecated.
: The basic strategy is:
: -- the existing RW nodes will be downgraded to WO nodes (in linux 4.11)
: -- deprecated RO sysfs nodes will eventually be removed (in linux 4.11)
:
: The list of deprecated attributes can be found here:
: Documentation/ABI/obsolete/sysfs-block-zram
:
: Basically, every attribute that has its own read accessible sysfs
: node (e.g. num_reads) *AND* is accessible via one of the stat files
: (zram<id>/stat or zram<id>/io_stat or zram<id>/mm_stat) is considered
: to be deprecated.
The patch also removes `obsolete/sysfs-block-zram', clean ups
`testing/sysfs-block-zram' and tweaks zram.txt files.
Link: http://lkml.kernel.org/r/20170118035838.11090-1-sergey.senozhatsky@gmail.com
Signed-off-by: Sergey Senozhatsky <sergey.senozhatsky@gmail.com>
Acked-by: Minchan Kim <minchan@kernel.org>
Cc: Jonathan Corbet <corbet@lwn.net>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Coccinelle emits a warning about casting the return value of
kmalloc(). Coccinelle suggests removing the cast as do
kerneljanitors.
Remove cast from kmalloc() call.
Signed-off-by: Tobin C. Harding <me@tobin.cc>
Acked-by: Don Brace <don.brace@microsemi.com>
Signed-off-by: Jens Axboe <axboe@fb.com>
Checkpatch emits ERROR:OPEN_BRACE: that open brace { should be on the
previous line.
Move open brace to new line. Also add space after if/switch statement
since we introduce more checkpatch errors if not fixed at the same
time.
Signed-off-by: Tobin C. Harding <me@tobin.cc>
Acked-by: Don Brace <don.brace@microsemi.com>
Signed-off-by: Jens Axboe <axboe@fb.com>
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v2
iQEcBAABAgAGBQJYrElpAAoJELDendYovxMvNFQH/RJU7lwDSf7rF7ZzFGvdcsfi
T4DDuowkYoJm2+GypoRVzZZ0lxJlxr0mNKPvgGDvuTogMY7pvjAf6B7/xCvTFsNU
UoO2I7ljgXxCXFRiXH50nAjS7PC2PFW3Qx+8XPIWeZmnUPeJi4Q43fiSloUt+a6l
JgS/autOCflGasR5MihCZXkvdVF81K6GuEd3hCh9GKZ/8RiwNPaY50vHnMv/hfqq
SJNRKOTSRXioYlTohLnjuPWHDMayRJEO48IXl3c7aNxDTkHjn78yaoPhJ7m+M0Bq
s2GyaQA4tCwADhP+tilmI5H1vFpt6w9x7O0dgWiSm7TB91lwfZOen2WhTPPp6L8=
=gktV
-----END PGP SIGNATURE-----
Merge tag 'for-linus-4.11-rc0-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/xen/tip
Pull xen updates from Juergen Gross:
"Xen features and fixes:
- a series from Boris Ostrovsky adding support for booting Linux as
Xen PVH guest
- a series from Juergen Gross streamlining the xenbus driver
- a series from Paul Durrant adding support for the new device model
hypercall
- several small corrections"
* tag 'for-linus-4.11-rc0-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/xen/tip:
xen/privcmd: add IOCTL_PRIVCMD_RESTRICT
xen/privcmd: Add IOCTL_PRIVCMD_DM_OP
xen/privcmd: return -ENOTTY for unimplemented IOCTLs
xen: optimize xenbus driver for multiple concurrent xenstore accesses
xen: modify xenstore watch event interface
xen: clean up xenbus internal headers
xenbus: Neaten xenbus_va_dev_error
xen/pvh: Use Xen's emergency_restart op for PVH guests
xen/pvh: Enable CPU hotplug
xen/pvh: PVH guests always have PV devices
xen/pvh: Initialize grant table for PVH guests
xen/pvh: Make sure we don't use ACPI_IRQ_MODEL_PIC for SCI
xen/pvh: Bootstrap PVH guest
xen/pvh: Import PVH-related Xen public interfaces
xen/x86: Remove PVH support
x86/boot/32: Convert the 32-bit pgtable setup code from assembly to C
xen/manage: correct return value check on xenbus_scanf()
x86/xen: Fix APIC id mismatch warning on Intel
xen/netback: set default upper limit of tx/rx queues to 8
xen/netfront: set default upper limit of tx/rx queues to 8
If we fail to register the blockdev we need to make sure to destroy the
recv workqueue.
Signed-off-by: Josef Bacik <jbacik@fb.com>
Signed-off-by: Jens Axboe <axboe@fb.com>
We noticed when trying to do O_DIRECT to an export on the server side
that we were getting requests smaller than the 4k sectorsize of the
device. This is because the client isn't setting the logical and
physical blocksizes properly for the underlying device. Fix this up by
setting the queue blocksizes and then calling bd_set_size.
Signed-off-by: Josef Bacik <jbacik@fb.com>
Signed-off-by: Jens Axboe <axboe@fb.com>
Break the ioctl handling out into helper functions, some of these things
are getting pretty big and unwieldy.
Signed-off-by: Josef Bacik <jbacik@fb.com>
Signed-off-by: Jens Axboe <axboe@fb.com>
This update includes the usual round of major driver updates (ncr5380,
ufs, lpfc, be2iscsi, hisi_sas, storvsc, cxlflash, aacraid,
megaraid_sas, ). There's also an assortment of minor fixes and the
major update of switching a bunch of drivers to pci_alloc_irq_vectors
from Christoph.
Signed-off-by: James E.J. Bottomley <jejb@linux.vnet.ibm.com>
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v2
iQIcBAABAgAGBQJYq5adAAoJEAVr7HOZEZN4bjUP/Atk7CSZVnC75pcYmncbEGCx
ysOlEHK4uW2HhiAYk3PlYMk+pKrMHet2zsbbM9PHJfopdOHZ7Sq1+UZZVeqE1Zun
8pe0NhON+fZx7XAnevdEvnSSULQZ+AGfjZO72iUwkJiN3ozYaFtCITOyn49l4GpR
ra9emskBh7CQOFW2voGn1AKeDijPYGx3+TO4AUrWjVMiByR06gb1bmImx+ljiUrs
jzRJPfrt90ORcTdpMateyN2EXxudcASMhX03SJ6fRI84hPAhMCROMbTv8RnzOTE4
DPbnvbYUowlHt43iUhJHSwGdkRRaRBnkzQENBp1fNrNzZgF6vB7+kShxbonrYB2p
gC4ewaJr0BNj+HsUnvTpe3WseiPOcfsnBsKilPLKBlm2dCKEXqFox/dj/T1uexxg
HoyFrl3u8fyEqVHrzRS4M9t/njWh0NFmXxb0wBdj+lkVFTRErGSKQ8SfOqshuSGs
P8NN88jy8vC7uqgzKBJ+UH3ehzn3qfBxasFHIC/e2awY9FqKjHGTxKMmSVpjXVxy
wCvE2FQ3k/qEj2XSM6f7/NGytlSOlju5q1rFtHPW2M+TFSh0LJWCnmVjR/Zle9em
pBWmtIgCv8W5b41zL2H94nLWAZbfdrrNU/XnX88l47LKnmorte/PGhpxu36NEsMS
VCgreQmFMdMRY+WzDWl1
=cBQx
-----END PGP SIGNATURE-----
Merge tag 'scsi-misc' of git://git.kernel.org/pub/scm/linux/kernel/git/jejb/scsi
Pull SCSI updates from James Bottomley:
"This update includes the usual round of major driver updates (ncr5380,
ufs, lpfc, be2iscsi, hisi_sas, storvsc, cxlflash, aacraid,
megaraid_sas, ...).
There's also an assortment of minor fixes and the major update of
switching a bunch of drivers to pci_alloc_irq_vectors from Christoph"
* tag 'scsi-misc' of git://git.kernel.org/pub/scm/linux/kernel/git/jejb/scsi: (188 commits)
scsi: megaraid_sas: handle dma_addr_t right on 32-bit
scsi: megaraid_sas: array overflow in megasas_dump_frame()
scsi: snic: switch to pci_irq_alloc_vectors
scsi: megaraid_sas: driver version upgrade
scsi: megaraid_sas: Change RAID_1_10_RMW_CMDS to RAID_1_PEER_CMDS and set value to 2
scsi: megaraid_sas: Indentation and smatch warning fixes
scsi: megaraid_sas: Cleanup VD_EXT_DEBUG and SPAN_DEBUG related debug prints
scsi: megaraid_sas: Increase internal command pool
scsi: megaraid_sas: Use synchronize_irq to wait for IRQs to complete
scsi: megaraid_sas: Bail out the driver load if ld_list_query fails
scsi: megaraid_sas: Change build_mpt_mfi_pass_thru to return void
scsi: megaraid_sas: During OCR, if get_ctrl_info fails do not continue with OCR
scsi: megaraid_sas: Do not set fp_possible if TM capable for non-RW syspdIO, change fp_possible to bool
scsi: megaraid_sas: Remove unused pd_index from megasas_build_ld_nonrw_fusion
scsi: megaraid_sas: megasas_return_cmd does not memset IO frame to zero
scsi: megaraid_sas: max_fw_cmds are decremented twice, remove duplicate
scsi: megaraid_sas: update can_queue only if the new value is less
scsi: megaraid_sas: Change max_cmd from u32 to u16 in all functions
scsi: megaraid_sas: set pd_after_lb from MR_BuildRaidContext and initialize pDevHandle to MR_DEVHANDLE_INVALID
scsi: megaraid_sas: latest controller OCR capability from FW before sending shutdown DCMD
...
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1
iQIcBAABCAAGBQJYqeb8AAoJEPfTWPspceCmB3UP/3UtcPrzEm8w2cxB9MaWhZN3
J+jiwlO4vaqhm2HVzQtoJqfaqRlud/iDx5cIXE2S7FnIM54ZKs3CANbKu8X+b1zm
eJije3zMI8A8qyftigbz6a/Y2kWE4ZqFEc9WU5CWawfTl3ImCVUi8+F5X0wOLU/h
r50zAQOEyURH4G5usNl9q0olF6FonJ82AcYm1iJ0QP2wYWZRJauC0rRn8IT93tyK
bZPHnGKdkd7km8yi3zr2GNWOfuZZuA0HWAaF4qfrHPZQ883gITFAUIlFb1f+2TNl
DkQzRrBB2wPWPnlbfb9KejMkvL94hflzsLb5rHt835DyVXFRyjxsgyAI8A+LPGSz
vqZ3rsbWj6H4F9z2CkZ+T+AP/ZSWDNjwc0RXPm9HYdR5CDeTxIUVvnFQ44YNsmTv
Xd5BKrUJ2oKegAxQG6zcuFx23p8JzhT70l+mNrMdtyeKnDD9FRdDvhKG9AHeTipn
o/DnGivhS3UMQoQ7D68KOO+kuhLDeo7my5XGsnjzMO/iHqg++7IP2HyYYs/Ba4qZ
cYaCtSDQW71Zt0vsqa6dvPuXBveu4h8Qh8R7uAGjSGS9IAFFb4Cab2tiUdISE6PE
YnMWzY+G6pT8imlLVOL5/QFuo2Q4pUsaL0AHpXMCN9TZnQtbqXa8eqwnKnQ0m2KN
7ut0IYYEPaYUX5xFn1K6
=z7AL
-----END PGP SIGNATURE-----
Merge tag 'for-4.11/linus-merge-signed' of git://git.kernel.dk/linux-block
Pull block layer updates from Jens Axboe:
- blk-mq scheduling framework from me and Omar, with a port of the
deadline scheduler for this framework. A port of BFQ from Paolo is in
the works, and should be ready for 4.12.
- Various fixups and improvements to the above scheduling framework
from Omar, Paolo, Bart, me, others.
- Cleanup of the exported sysfs blk-mq data into debugfs, from Omar.
This allows us to export more information that helps debug hangs or
performance issues, without cluttering or abusing the sysfs API.
- Fixes for the sbitmap code, the scalable bitmap code that was
migrated from blk-mq, from Omar.
- Removal of the BLOCK_PC support in struct request, and refactoring of
carrying SCSI payloads in the block layer. This cleans up the code
nicely, and enables us to kill the SCSI specific parts of struct
request, shrinking it down nicely. From Christoph mainly, with help
from Hannes.
- Support for ranged discard requests and discard merging, also from
Christoph.
- Support for OPAL in the block layer, and for NVMe as well. Mainly
from Scott Bauer, with fixes/updates from various others folks.
- Error code fixup for gdrom from Christophe.
- cciss pci irq allocation cleanup from Christoph.
- Making the cdrom device operations read only, from Kees Cook.
- Fixes for duplicate bdi registrations and bdi/queue life time
problems from Jan and Dan.
- Set of fixes and updates for lightnvm, from Matias and Javier.
- A few fixes for nbd from Josef, using idr to name devices and a
workqueue deadlock fix on receive. Also marks Josef as the current
maintainer of nbd.
- Fix from Josef, overwriting queue settings when the number of
hardware queues is updated for a blk-mq device.
- NVMe fix from Keith, ensuring that we don't repeatedly mark and IO
aborted, if we didn't end up aborting it.
- SG gap merging fix from Ming Lei for block.
- Loop fix also from Ming, fixing a race and crash between setting loop
status and IO.
- Two block race fixes from Tahsin, fixing request list iteration and
fixing a race between device registration and udev device add
notifiations.
- Double free fix from cgroup writeback, from Tejun.
- Another double free fix in blkcg, from Hou Tao.
- Partition overflow fix for EFI from Alden Tondettar.
* tag 'for-4.11/linus-merge-signed' of git://git.kernel.dk/linux-block: (156 commits)
nvme: Check for Security send/recv support before issuing commands.
block/sed-opal: allocate struct opal_dev dynamically
block/sed-opal: tone down not supported warnings
block: don't defer flushes on blk-mq + scheduling
blk-mq-sched: ask scheduler for work, if we failed dispatching leftovers
blk-mq: don't special case flush inserts for blk-mq-sched
blk-mq-sched: don't add flushes to the head of requeue queue
blk-mq: have blk_mq_dispatch_rq_list() return if we queued IO or not
block: do not allow updates through sysfs until registration completes
lightnvm: set default lun range when no luns are specified
lightnvm: fix off-by-one error on target initialization
Maintainers: Modify SED list from nvme to block
Move stack parameters for sed_ioctl to prevent oversized stack with CONFIG_KASAN
uapi: sed-opal fix IOW for activate lsp to use correct struct
cdrom: Make device operations read-only
elevator: fix loading wrong elevator type for blk-mq devices
cciss: switch to pci_irq_alloc_vectors
block/loop: fix race between I/O and set_status
blk-mq-sched: don't hold queue_lock when calling exit_icq
block: set make_request_fn manually in blk_mq_update_nr_hw_queues
...
Pull locking updates from Ingo Molnar:
"The main changes in this cycle were:
- Implement wraparound-safe refcount_t and kref_t types based on
generic atomic primitives (Peter Zijlstra)
- Improve and fix the ww_mutex code (Nicolai Hähnle)
- Add self-tests to the ww_mutex code (Chris Wilson)
- Optimize percpu-rwsems with the 'rcuwait' mechanism (Davidlohr
Bueso)
- Micro-optimize the current-task logic all around the core kernel
(Davidlohr Bueso)
- Tidy up after recent optimizations: remove stale code and APIs,
clean up the code (Waiman Long)
- ... plus misc fixes, updates and cleanups"
* 'locking-core-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (50 commits)
fork: Fix task_struct alignment
locking/spinlock/debug: Remove spinlock lockup detection code
lockdep: Fix incorrect condition to print bug msgs for MAX_LOCKDEP_CHAIN_HLOCKS
lkdtm: Convert to refcount_t testing
kref: Implement 'struct kref' using refcount_t
refcount_t: Introduce a special purpose refcount type
sched/wake_q: Clarify queue reinit comment
sched/wait, rcuwait: Fix typo in comment
locking/mutex: Fix lockdep_assert_held() fail
locking/rtmutex: Flip unlikely() branch to likely() in __rt_mutex_slowlock()
locking/rwsem: Reinit wake_q after use
locking/rwsem: Remove unnecessary atomic_long_t casts
jump_labels: Move header guard #endif down where it belongs
locking/atomic, kref: Implement kref_put_lock()
locking/ww_mutex: Turn off __must_check for now
locking/atomic, kref: Avoid more abuse
locking/atomic, kref: Use kref_get_unless_zero() more
locking/atomic, kref: Kill kref_sub()
locking/atomic, kref: Add kref_read()
locking/atomic, kref: Add KREF_INIT()
...
Declare device_type structure as const as it is only stored in the
type field of a device structure. This field is of type const, so add
const to the declaration of device_type structure.
File size before:
text data bss dec hex filename
61546 11610 208 73364 11e94 drivers/block/rbd.o
File size after:
text data bss dec hex filename
61610 11578 208 73396 11eb4 drivers/block/rbd.o
Signed-off-by: Bhumika Goyal <bhumirks@gmail.com>
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
object_no can be trivially formatted into an object name. We already
store object names in OSD requests with special care to avoid dynamic
allocations for short names. Storing a name in obj_request, obtained
as below (!), is a waste and will be removed in the next commit.
name = kmem_cache_alloc(rbd_segment_name_cache, ...);
snprintf(name, ...);
obj_request->object_name = kstrdup(name);
kmem_cache_free(rbd_segment_name_cache, name);
...
ceph_oid_aprintf(..., "%s", obj_request->object_name);
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
Reviewed-by: Jason Dillaman <dillaman@redhat.com>
... and also fix up the comment -- format 1 data objects have always
been 12 hex digits long.
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
Reviewed-by: Jason Dillaman <dillaman@redhat.com>
Factor OSD request allocation and initialization code out into
__rbd_osd_req_create().
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
Reviewed-by: Jason Dillaman <dillaman@redhat.com>
The allocation doesn't depend on offset and length. Both offset and
length can be changed after obj_request is allocated, too.
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
Reviewed-by: Jason Dillaman <dillaman@redhat.com>
Add support for RBD_FEATURE_DATA_POOL feature. rbd_dev->layout.pool_id
now stores the data pool id.
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
Reviewed-by: Jason Dillaman <dillaman@redhat.com>
Rather than initializing layout fields with some made up values in
__rbd_dev_create(), move the initialization into rbd_init_layout() and
call it after the header is actually populated.
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
Reviewed-by: Jason Dillaman <dillaman@redhat.com>
Returning u64 doesn't make sense: max header->obj_order is 25 and
ceph_file_layout::object_size is u32.
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
Reviewed-by: Jason Dillaman <dillaman@redhat.com>
As explained in the previous commit, rbd_obj_request machinery (and
rbd_osd_req_create() in particular) shouldn't be used for working with
metadata objects.
Switch to the recently added ceph_osdc_call(). It assumes single pages
for outbound and inbound buffers, but that's OK - none of the callers
need more than that. These pages need to be allocated (messenger is in
dire need of proper iterator interface!), but we are swapping for
pages[] and pagelist allocations in the existing code.
Kill class_name argument - all rbd methods are under "rbd".
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
Reviewed-by: Jason Dillaman <dillaman@redhat.com>
rbd_obj_request machinery is completely unnecessary here; all that's
being done is fetching a metadata object - no striping, cloning, etc.
More importantly, rbd_osd_req_create() grabs pool id from layout and
that is becoming a data pool id.
Kill offset argument - all metadata objects are small and read in full.
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
Reviewed-by: Jason Dillaman <dillaman@redhat.com>
No reason to delay it until image_id is known. This will be required
by some rbd_obj_method_sync() callers, after rbd_obj_method_sync() is
changed to take oloc.
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
Reviewed-by: Jason Dillaman <dillaman@redhat.com>
Image format 1 is deprecated and format 2 doesn't have these. Also,
__rbd_dev_create() takes care of zeroing (or otherwise initializing)
format 2 specific fields.
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
Reviewed-by: Jason Dillaman <dillaman@redhat.com>
Since function tables are a common target for attackers, it's best to keep
them in read-only memory. As such, this makes the CDROM device ops tables
const. This drops additionally n_minors, since it isn't used meaningfully,
and sets the only user of cdrom_dummy_generic_packet explicitly so the
variables can all be const.
Inspired by similar changes in grsecurity/PaX.
Signed-off-by: Kees Cook <keescook@chromium.org>
Acked-by: David S. Miller <davem@davemloft.net>
Signed-off-by: Jens Axboe <axboe@fb.com>
It is a relatively common idiom (8 instances) to first look up an IDR
entry, and then remove it from the tree if it is found, possibly doing
further operations upon the entry afterwards. If we change idr_remove()
to return the removed object, all of these users can save themselves a
walk of the IDR tree.
Signed-off-by: Matthew Wilcox <mawilcox@microsoft.com>
Simple cleanup to use the new APIs.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Acked-by: Don Brace <don.brace@microsemi.com>
Tested-by: Don Brace <don.brace@microsemi.com>
Signed-off-by: Jens Axboe <axboe@fb.com>
Today a Xenstore watch event is delivered via a callback function
declared as:
void (*callback)(struct xenbus_watch *,
const char **vec, unsigned int len);
As all watch events only ever come with two parameters (path and token)
changing the prototype to:
void (*callback)(struct xenbus_watch *,
const char *path, const char *token);
is the natural thing to do.
Apply this change and adapt all users.
Cc: konrad.wilk@oracle.com
Cc: roger.pau@citrix.com
Cc: wei.liu2@citrix.com
Cc: paul.durrant@citrix.com
Cc: netdev@vger.kernel.org
Signed-off-by: Juergen Gross <jgross@suse.com>
Reviewed-by: Paul Durrant <paul.durrant@citrix.com>
Reviewed-by: Wei Liu <wei.liu2@citrix.com>
Reviewed-by: Roger Pau Monné <roger.pau@citrix.com>
Reviewed-by: Boris Ostrovsky <boris.ostrovsky@oracle.com>
Signed-off-by: Boris Ostrovsky <boris.ostrovsky@oracle.com>
A previous commit made the bdi embedded in the request queue
a pointer, but neglected to fixup zram. Fix it up.
Fixes: dc3b17cc8b ("block: Use pointer to backing_dev_info from request_queue")
Reported-by: Bart Van Assche <bart.vanassche@sandisk.com>
Signed-off-by: Jens Axboe <axboe@fb.com>
We will want to have struct backing_dev_info allocated separately from
struct request_queue. As the first step add pointer to backing_dev_info
to request_queue and convert all users touching it. No functional
changes in this patch.
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jan Kara <jack@suse.cz>
Signed-off-by: Jens Axboe <axboe@fb.com>
To prepare for dynamically adding new nbd devices to the system switch
from using an array for the nbd devices and instead use an idr. This
copies what loop does for keeping track of its devices.
Signed-off-by: Josef Bacik <jbacik@fb.com>
Signed-off-by: Jens Axboe <axboe@fb.com>
Since we are in the memory reclaim path we need our recv work to be on a
workqueue that has WQ_MEM_RECLAIM set so we can avoid deadlocks. Also
set WQ_HIGHPRI since we are in the completion path for IO.
Signed-off-by: Josef Bacik <jbacik@fb.com>
Signed-off-by: Jens Axboe <axboe@fb.com>
Instead of keeping two levels of indirection for requests types, fold it
all into the operations. The little caveat here is that previously
cmd_type only applied to struct request, while the request and bio op
fields were set to plain REQ_OP_READ/WRITE even for passthrough
operations.
Instead this patch adds new REQ_OP_* for SCSI passthrough and driver
private requests, althought it has to add two for each so that we
can communicate the data in/out nature of the request.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@fb.com>
This can be used to check for fs vs non-fs requests and basically
removes all knowledge of BLOCK_PC specific from the block layer,
as well as preparing for removing the cmd_type field in struct request.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@fb.com>
This is where we do the rest of the request handling, which will
become much simpler soon, too.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@fb.com>
Disconnects don't use block layer requests these days, so all handling
of private requests is dead code.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@fb.com>
The SCSI passthrough idea was a a bad idea to start with (guess who came
up with it?), and has been removed from the virtio 1.O spec, and is not
enabled by defauly by any host I know of. Add a separate config option
for it so that we don't need to enable it for most setups. That way
any bugs related to it (like the one recently fixed for vmapped stacks)
do not affect other users, and the size of the virtblk_req structure
also shrinks significantly.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@fb.com>
We can simply use blk_mq_rq_from_pdu to get back at the request at
I/O completion time.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@fb.com>
We only need this code to support scsi, ide, cciss and virtio. And at
least for virtio it's a deprecated feature to start with.
This should shrink the kernel size for embedded device that only use,
say eMMC a bit.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@fb.com>
This way there is no need to drag in a dependency on the
BLOCK_PC code, which is going to become optional.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@fb.com>
When the lightnvm core had the "gennvm" layer between the device and the
target, there was a need for the core to be able to figure out which
target it should send an end_io callback to. Leading to a "double"
end_io, first for the media manager instance, and then for the target
instance. Now that core and gennvm is merged, there is no longer a need
for this, and a single end_io callback will do.
Signed-off-by: Matias Bjørling <matias@cnexlabs.com>
Signed-off-by: Jens Axboe <axboe@fb.com>
The number of configuration groups has been limited to one in current
code, even if there is support for up to four. With the introduction
of the open-channel SSD 1.3 specification, only a single
group is exposed onwards. Reflect this in the nvm_id structure.
Signed-off-by: Matias Bjørling <matias@cnexlabs.com>
Signed-off-by: Jens Axboe <axboe@fb.com>
And require all drivers that want to support BLOCK_PC to allocate it
as the first thing of their private data. To support this the legacy
IDE and BSG code is switched to set cmd_size on their queues to let
the block layer allocate the additional space.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@fb.com>
Making use of "max_indirect_segments=" has issues:
- blkfront_setup_indirect() may end up with zero psegs when PAGE_SIZE
is sufficiently much larger than XEN_PAGE_SIZE
- the variable driven by the command line option
(xen_blkif_max_segments) has a somewhat different purpose, and hence
should namely never end up being zero
- as long as the specified value is lower than the legacy default,
we better don't use indirect segments at all (or we'd in fact lower
throughput)
Signed-off-by: Jan Beulich <jbeulich@suse.com>
Signed-off-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
Don't truncate the "feature-persistent" value read from xenstore: Any
non-zero value is supposed to enable the feature, just like is already
being done for feature_secdiscard.
Just like the other feature_* fields, feature_flush and feature_fua are
boolean flags, and hence fit well into a single bit.
Keep all bit fields together to limit gaps.
Signed-off-by: Jan Beulich <jbeulich@suse.com>
Signed-off-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
A user noticed that write performance was horrible over loopback and we
traced it to an inversion of when we need to set MSG_MORE. It should be
set when we have more bvec's to send, not when we are on the last bvec.
This patch made the test go from 20 iops to 78k iops.
Signed-off-by: Josef Bacik <jbacik@fb.com>
Fixes: 429a787be6 ("nbd: fix use-after-free of rq/bio in the xmit path")
Signed-off-by: Jens Axboe <axboe@fb.com>
Pull block fixes from Jens Axboe:
- the virtio_blk stack DMA corruption fix from Christoph, fixing and
issue with VMAP stacks.
- O_DIRECT blkbits calculation fix from Chandan.
- discard regression fix from Christoph.
- queue init error handling fixes for nbd and virtio_blk, from Omar and
Jeff.
- two small nvme fixes, from Christoph and Guilherme.
- rename of blk_queue_zone_size and bdev_zone_size to _sectors instead,
to more closely follow what we do in other places in the block layer.
This interface is new for this series, so let's get the naming right
before releasing a kernel with this feature. From Damien.
* 'for-linus' of git://git.kernel.dk/linux-block:
block: don't try to discard from __blkdev_issue_zeroout
sd: remove __data_len hack for WRITE SAME
nvme: use blk_rq_payload_bytes
scsi: use blk_rq_payload_bytes
block: add blk_rq_payload_bytes
block: Rename blk_queue_zone_size and bdev_zone_size
nvme: apply DELAY_BEFORE_CHK_RDY quirk at probe time too
nvme-rdma: fix nvme_rdma_queue_is_ready
virtio_blk: fix panic in initialization error path
nbd: blk_mq_init_queue returns an error code on failure, not NULL
virtio_blk: avoid DMA to stack for the sense buffer
do_direct_IO: Use inode->i_blkbits to compute block count to be cleaned
By general sentiment kref_sub() is a bad interface, make it go away.
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: linux-kernel@vger.kernel.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Since we need to change the implementation, stop exposing internals.
Provide kref_read() to read the current reference count; typically
used for debug messages.
Kills two anti-patterns:
atomic_read(&kref->refcount)
kref->refcount.counter
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: linux-kernel@vger.kernel.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Since we need to change the implementation, stop exposing internals.
Provide KREF_INIT() to allow static initialization of struct kref.
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: linux-kernel@vger.kernel.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
The raw_cmd_copyin() function does a kmalloc() with GFP_USER, although the
allocated structure is obviously not mapped to userspace, just copied from/to.
In this case GFP_KERNEL is more appropriate, so let's use it, although in the
current implementation this does not manifest as any error.
Reported-by: Matthew Wilcox <mawilcox@linuxonhyperv.com>
Signed-off-by: Vlastimil Babka <vbabka@suse.cz>
Signed-off-by: Jiri Kosina <jkosina@suse.cz>
zram has used per-cpu stream feature from v4.7. It aims for increasing
cache hit ratio of scratch buffer for compressing. Downside of that
approach is that zram should ask memory space for compressed page in
per-cpu context which requires stricted gfp flag which could be failed.
If so, it retries to allocate memory space out of per-cpu context so it
could get memory this time and compress the data again, copies it to the
memory space.
In this scenario, zram assumes the data should never be changed but it is
not true without stable page support. So, If the data is changed under
us, zram can make buffer overrun so that zsmalloc free object chain is
broken so system goes crash like below
https://bugzilla.suse.com/show_bug.cgi?id=997574
This patch adds BDI_CAP_STABLE_WRITES to zram for declaring "I am block
device needing *stable write*".
Fixes: da9556a236 ("zram: user per-cpu compression streams")
Link: http://lkml.kernel.org/r/1482366980-3782-4-git-send-email-minchan@kernel.org
Signed-off-by: Minchan Kim <minchan@kernel.org>
Reviewed-by: Sergey Senozhatsky <sergey.senozhatsky@gmail.com>
Cc: Takashi Iwai <tiwai@suse.de>
Cc: Hyeoncheol Lee <cheol.lee@lge.com>
Cc: <yjay.kim@lge.com>
Cc: Sangseok Lee <sangseok.lee@lge.com>
Cc: Hugh Dickins <hughd@google.com>
Cc: Darrick J. Wong <darrick.wong@oracle.com>
Cc: <stable@vger.kernel.org> [4.7+]
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Commit b4c5c60920 ("zram: avoid lockdep splat by revalidate_disk")
moved revalidate_disk call out of init_lock to avoid lockdep
false-positive splat. However, commit 08eee69fcf ("zram: remove
init_lock in zram_make_request") removed init_lock in IO path so there
is no worry about lockdep splat. So, let's restore it.
This patch is needed to set BDI_CAP_STABLE_WRITES atomically in next
patch.
Fixes: da9556a236 ("zram: user per-cpu compression streams")
Link: http://lkml.kernel.org/r/1482366980-3782-3-git-send-email-minchan@kernel.org
Signed-off-by: Minchan Kim <minchan@kernel.org>
Reviewed-by: Sergey Senozhatsky <sergey.senozhatsky@gmail.com>
Cc: Takashi Iwai <tiwai@suse.de>
Cc: Hyeoncheol Lee <cheol.lee@lge.com>
Cc: <yjay.kim@lge.com>
Cc: Sangseok Lee <sangseok.lee@lge.com>
Cc: Hugh Dickins <hughd@google.com>
Cc: Darrick J. Wong <darrick.wong@oracle.com>
Cc: <stable@vger.kernel.org> [4.7+]
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
If blk_mq_init_queue() returns an error, it gets assigned to
vblk->disk->queue. Then, when we call put_disk(), we end up calling
blk_put_queue() with the ERR_PTR, causing a bad dereference. Fix it by
only assigning to vblk->disk->queue on success.
Signed-off-by: Omar Sandoval <osandov@fb.com>
Reviewed-by: Jeff Moyer <jmoyer@redhat.com>
Signed-off-by: Jens Axboe <axboe@fb.com>
Additionally, don't assign directly to disk->queue, otherwise
blk_put_queue (called via put_disk) will choke (panic) on the errno
stored there.
Bug found by code inspection after Omar found a similar issue in
virtio_blk. Compile-tested only.
Signed-off-by: Jeff Moyer <jmoyer@redhat.com>
Reviewed-by: Omar Sandoval <osandov@fb.com>
Reviewed-by: Josef Bacik <jbacik@fb.com>
Signed-off-by: Jens Axboe <axboe@fb.com>
Most users of BLOCK_PC requests allocate the sense buffer on the stack,
so to avoid DMA to the stack copy them to a field in the heap allocated
virtblk_req structure. Without that any attempt at SCSI passthrough I/O,
including the SG_IO ioctl from userspace will crash the kernel. Note that
this includes running tools like hdparm even when the host does not have
SCSI passthrough enabled.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Cc: stable@vger.kernel.org # v4.9+
Signed-off-by: Jens Axboe <axboe@fb.com>
Prepare to mark sensitive kernel structures for randomization by making
sure they're using designated initializers. These were identified during
allyesconfig builds of x86, arm, and arm64, with most initializer fixes
extracted from grsecurity.
Signed-off-by: Kees Cook <keescook@chromium.org>
Signed-off-by: Martin K. Petersen <martin.petersen@oracle.com>
ktime_set(S,N) was required for the timespec storage type and is still
useful for situations where a Seconds and Nanoseconds part of a time value
needs to be converted. For anything where the Seconds argument is 0, this
is pointless and can be replaced with a simple assignment.
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Cc: Peter Zijlstra <peterz@infradead.org>
This was entirely automated, using the script by Al:
PATT='^[[:blank:]]*#[[:blank:]]*include[[:blank:]]*<asm/uaccess.h>'
sed -i -e "s!$PATT!#include <linux/uaccess.h>!" \
$(git grep -l "$PATT"|grep -v ^include/linux/uaccess.h)
to do the replacement at the end of the merge window.
Requested-by: Al Viro <viro@zeniv.linux.org.uk>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
- a large rework of cephx auth code to cope with CONFIG_VMAP_STACK
(myself). Also fixed a deadlock caused by a bogus allocation on the
writeback path and authorize reply verification.
- a fix for long stalls during fsync (Jeff Layton). The client now
has a way to force the MDS log flush, leading to ~100x speedups in
some synthetic tests.
- a new [no]require_active_mds mount option (Zheng Yan). On mount, we
will now check whether any of the MDSes are available and bail rather
than block if none are. This check can be avoided by specifying the
"no" option.
- a couple of MDS cap handling fixes and a few assorted patches
throughout.
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v2
iQEcBAABCAAGBQJYVByGAAoJEEp/3jgCEfOLBqkH/A7nVf7ObSDYmLuYgg1gJ8zq
4zDDE42S4yZwayAVpn3UjbfPuez5J44lsdXitExdfiHOdIQZDa/WqAbSqQ48HCSg
7sG6ecRWg3G5zG0psPZnB+S5wGMvsLXmj2hvzV1lt2t0lI5bDLSlNRSnElbhilD/
8Z7+Ni2go8DMC9o49SJU32lBW7IByKl4p4flveItgwUvGkIFNd8OT3CyPBUqonQs
lRCeImRYU8Jghb+ifnRxWSbuDf7pZAPc9kL0vibpUUT/1bH6iHsedKp37WQKqc/w
KDSNnKiZcz0gY/hJeLqE3ymCIKO6SU+JkMQSaYNTouLO5fQsRr8/uWQXSe6S5oc=
=ypWx
-----END PGP SIGNATURE-----
Merge tag 'ceph-for-4.10-rc1' of git://github.com/ceph/ceph-client
Pull ceph updates from Ilya Dryomov:
"A varied set of changes:
- a large rework of cephx auth code to cope with CONFIG_VMAP_STACK
(myself). Also fixed a deadlock caused by a bogus allocation on the
writeback path and authorize reply verification.
- a fix for long stalls during fsync (Jeff Layton). The client now
has a way to force the MDS log flush, leading to ~100x speedups in
some synthetic tests.
- a new [no]require_active_mds mount option (Zheng Yan).
On mount, we will now check whether any of the MDSes are available
and bail rather than block if none are. This check can be avoided
by specifying the "no" option.
- a couple of MDS cap handling fixes and a few assorted patches
throughout"
* tag 'ceph-for-4.10-rc1' of git://github.com/ceph/ceph-client: (32 commits)
libceph: remove now unused finish_request() wrapper
libceph: always signal completion when done
ceph: avoid creating orphan object when checking pool permission
ceph: properly set issue_seq for cap release
ceph: add flags parameter to send_cap_msg
ceph: update cap message struct version to 10
ceph: define new argument structure for send_cap_msg
ceph: move xattr initialzation before the encoding past the ceph_mds_caps
ceph: fix minor typo in unsafe_request_wait
ceph: record truncate size/seq for snap data writeback
ceph: check availability of mds cluster on mount
ceph: fix splice read for no Fc capability case
ceph: try getting buffer capability for readahead/fadvise
ceph: fix scheduler warning due to nested blocking
ceph: fix printing wrong return variable in ceph_direct_read_write()
crush: include mapper.h in mapper.c
rbd: silence bogus -Wmaybe-uninitialized warning
libceph: no need to drop con->mutex for ->get_authorizer()
libceph: drop len argument of *verify_authorizer_reply()
libceph: verify authorize reply on connect
...
This update includes the usual round of major driver updates (ncr5380,
lpfc, hisi_sas, megaraid_sas, ufs, ibmvscsis, mpt3sas). There's also
an assortment of minor fixes, mostly in error legs or other not very
user visible stuff. The major change is the pci_alloc_irq_vectors
replacement for the old pci_msix_.. calls; this effectively makes IRQ
mapping generic for the drivers and allows blk_mq to use the
information.
Signed-off-by: James E.J. Bottomley <jejb@linux.vnet.ibm.com>
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v2
iQIcBAABAgAGBQJYUJOvAAoJEAVr7HOZEZN42B0P/1lj1W2N7y0LOAbR2MasyQvT
fMD/SSip/v+R/zJiTv+5M/IDQT6ez62JnQGWyX3HZTob9VEfoqagbUuHH6y+fmib
tHyqiYc9FC/NZSRX/0ib+rpnSVhC/YRSVV7RrAqilbpOKAzeU25FlN/vbz+Nv/XL
ReVBl+2nGjJtHyWqUN45Zuf74c0zXOWPPUD0hRaNclK5CfZv5wDYupzHzTNSQTkj
nWvwPYT0OgSMNe7mR+IDFyOe3UQ/OYyeJB0yBNqO63IiaUabT8/hgrWR9qJAvWh8
LiH+iSQ69+sDUnvWvFjuls/GzcFuuTljwJbm+FyTsmNHONPVY8JRCLNq7CNDJ6Vx
HwpNuJdTSJpne4lAVBGPwgjs+GhlMvUP/xYVLWAXdaBxU9XGePrwqQDcFu1Rbx3P
yfMiVaY1+e45OEjLRCbDAwFnMPevC3kyymIvSsTySJxhTbYrOsyrrWt5kwWsvE3r
SKANsub+xUnpCkyg57nXRQStJSCiSfGIDsydKmMX+pf1SR4k6gCUQZlcchUX0uOa
dcY6re0c7EJIQQiT7qeGP5TRBblxARocCA/Igx6b5U5HmuQ48tDFlMCps7/TE84V
JBsBnmkXcEi/ALShL/Tui+3YKA1DfOtEnwHtXx/9Ecx/nxP2Sjr9LJwCKiONv8NY
RgLpGfccrix34lQumOk5
=sPXh
-----END PGP SIGNATURE-----
Merge tag 'scsi-misc' of git://git.kernel.org/pub/scm/linux/kernel/git/jejb/scsi
Pull SCSI updates from James Bottomley:
"This update includes the usual round of major driver updates (ncr5380,
lpfc, hisi_sas, megaraid_sas, ufs, ibmvscsis, mpt3sas).
There's also an assortment of minor fixes, mostly in error legs or
other not very user visible stuff. The major change is the
pci_alloc_irq_vectors replacement for the old pci_msix_.. calls; this
effectively makes IRQ mapping generic for the drivers and allows
blk_mq to use the information"
* tag 'scsi-misc' of git://git.kernel.org/pub/scm/linux/kernel/git/jejb/scsi: (256 commits)
scsi: qla4xxx: switch to pci_alloc_irq_vectors
scsi: hisi_sas: support deferred probe for v2 hw
scsi: megaraid_sas: switch to pci_alloc_irq_vectors
scsi: scsi_devinfo: remove synchronous ALUA for NETAPP devices
scsi: be2iscsi: set errno on error path
scsi: be2iscsi: set errno on error path
scsi: hpsa: fallback to use legacy REPORT PHYS command
scsi: scsi_dh_alua: Fix RCU annotations
scsi: hpsa: use %phN for short hex dumps
scsi: hisi_sas: fix free'ing in probe and remove
scsi: isci: switch to pci_alloc_irq_vectors
scsi: ipr: Fix runaway IRQs when falling back from MSI to LSI
scsi: dpt_i2o: double free on error path
scsi: cxlflash: Migrate scsi command pointer to AFU command
scsi: cxlflash: Migrate IOARRIN specific routines to function pointers
scsi: cxlflash: Cleanup queuecommand()
scsi: cxlflash: Cleanup send_tmf()
scsi: cxlflash: Remove AFU command lock
scsi: cxlflash: Wait for active AFU commands to timeout upon tear down
scsi: cxlflash: Remove private command pool
...
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v2
iQEcBAABAgAGBQJYT5HMAAoJELDendYovxMvhNQH/1g/3ahM4JKN8Z0SbjKBEdQm
yj2xOj6cE3l6wMSUblKjZD2DLLhpmcHT/E97Xro/lZQEfQJoMXXWWDFowMU/P1LA
mJxb7Fzq5Wr+6eGSAlIQB270MrpNi/luf+CWHMwVA3V7R3KRXwonOdGQSkISIzCd
tgIydEA3a9r2+HgeIBpZFZ4GcSrJQU75krMyl2tjD1C+jeYVd+zdoj2OnDsZQDZQ
hDWApMpNbpSBAn7JtSSdXWSTBsGH0lUECebeYPhPQ2sX2P6Y8+UCGwA7i6FFdbTa
agXfVSdRz8dCe3k19VcKDAw6nK9BTTMnEeEHmkmygIh6wuHPP44CzigTXIbJoXI=
=zjfm
-----END PGP SIGNATURE-----
Merge tag 'for-linus-4.10-rc0-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/xen/tip
Pull xen updates from Juergen Gross:
"Xen features and fixes for 4.10
These are some fixes, a move of some arm related headers to share them
between arm and arm64 and a series introducing a helper to make code
more readable.
The most notable change is David stepping down as maintainer of the
Xen hypervisor interface. This results in me sending you the pull
requests for Xen related code from now on"
* tag 'for-linus-4.10-rc0-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/xen/tip: (29 commits)
xen/balloon: Only mark a page as managed when it is released
xenbus: fix deadlock on writes to /proc/xen/xenbus
xen/scsifront: don't request a slot on the ring until request is ready
xen/x86: Increase xen_e820_map to E820_X_MAX possible entries
x86: Make E820_X_MAX unconditionally larger than E820MAX
xen/pci: Bubble up error and fix description.
xen: xenbus: set error code on failure
xen: set error code on failures
arm/xen: Use alloc_percpu rather than __alloc_percpu
arm/arm64: xen: Move shared architecture headers to include/xen/arm
xen/events: use xen_vcpu_id mapping for EVTCHNOP_status
xen/gntdev: Use VM_MIXEDMAP instead of VM_IO to avoid NUMA balancing
xen-scsifront: Add a missing call to kfree
MAINTAINERS: update XEN HYPERVISOR INTERFACE
xenfs: Use proc_create_mount_point() to create /proc/xen
xen-platform: use builtin_pci_driver
xen-netback: fix error handling output
xen: make use of xenbus_read_unsigned() in xenbus
xen: make use of xenbus_read_unsigned() in xen-pciback
xen: make use of xenbus_read_unsigned() in xen-fbfront
...
Pull block layer updates from Jens Axboe:
"This is the main block pull request this series. Contrary to previous
release, I've kept the core and driver changes in the same branch. We
always ended up having dependencies between the two for obvious
reasons, so makes more sense to keep them together. That said, I'll
probably try and keep more topical branches going forward, especially
for cycles that end up being as busy as this one.
The major parts of this pull request is:
- Improved support for O_DIRECT on block devices, with a small
private implementation instead of using the pig that is
fs/direct-io.c. From Christoph.
- Request completion tracking in a scalable fashion. This is utilized
by two components in this pull, the new hybrid polling and the
writeback queue throttling code.
- Improved support for polling with O_DIRECT, adding a hybrid mode
that combines pure polling with an initial sleep. From me.
- Support for automatic throttling of writeback queues on the block
side. This uses feedback from the device completion latencies to
scale the queue on the block side up or down. From me.
- Support from SMR drives in the block layer and for SD. From Hannes
and Shaun.
- Multi-connection support for nbd. From Josef.
- Cleanup of request and bio flags, so we have a clear split between
which are bio (or rq) private, and which ones are shared. From
Christoph.
- A set of patches from Bart, that improve how we handle queue
stopping and starting in blk-mq.
- Support for WRITE_ZEROES from Chaitanya.
- Lightnvm updates from Javier/Matias.
- Supoort for FC for the nvme-over-fabrics code. From James Smart.
- A bunch of fixes from a whole slew of people, too many to name
here"
* 'for-4.10/block' of git://git.kernel.dk/linux-block: (182 commits)
blk-stat: fix a few cases of missing batch flushing
blk-flush: run the queue when inserting blk-mq flush
elevator: make the rqhash helpers exported
blk-mq: abstract out blk_mq_dispatch_rq_list() helper
blk-mq: add blk_mq_start_stopped_hw_queue()
block: improve handling of the magic discard payload
blk-wbt: don't throttle discard or write zeroes
nbd: use dev_err_ratelimited in io path
nbd: reset the setup task for NBD_CLEAR_SOCK
nvme-fabrics: Add FC LLDD loopback driver to test FC-NVME
nvme-fabrics: Add target support for FC transport
nvme-fabrics: Add host support for FC transport
nvme-fabrics: Add FC transport LLDD api definitions
nvme-fabrics: Add FC transport FC-NVME definitions
nvme-fabrics: Add FC transport error codes to nvme.h
Add type 0x28 NVME type code to scsi fc headers
nvme-fabrics: patch target code in prep for FC transport support
nvme-fabrics: set sqe.command_id in core not transports
parser: add u64 number parser
nvme-rdma: align to generic ib_event logging helper
...
Pull smp hotplug updates from Thomas Gleixner:
"This is the final round of converting the notifier mess to the state
machine. The removal of the notifiers and the related infrastructure
will happen around rc1, as there are conversions outstanding in other
trees.
The whole exercise removed about 2000 lines of code in total and in
course of the conversion several dozen bugs got fixed. The new
mechanism allows to test almost every hotplug step standalone, so
usage sites can exercise all transitions extensively.
There is more room for improvement, like integrating all the
pointlessly different architecture mechanisms of synchronizing,
setting cpus online etc into the core code"
* 'smp-hotplug-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (60 commits)
tracing/rb: Init the CPU mask on allocation
soc/fsl/qbman: Convert to hotplug state machine
soc/fsl/qbman: Convert to hotplug state machine
zram: Convert to hotplug state machine
KVM/PPC/Book3S HV: Convert to hotplug state machine
arm64/cpuinfo: Convert to hotplug state machine
arm64/cpuinfo: Make hotplug notifier symmetric
mm/compaction: Convert to hotplug state machine
iommu/vt-d: Convert to hotplug state machine
mm/zswap: Convert pool to hotplug state machine
mm/zswap: Convert dst-mem to hotplug state machine
mm/zsmalloc: Convert to hotplug state machine
mm/vmstat: Convert to hotplug state machine
mm/vmstat: Avoid on each online CPU loops
mm/vmstat: Drop get_online_cpus() from init_cpu_node_state/vmstat_cpu_dead()
tracing/rb: Convert to hotplug state machine
oprofile/nmi timer: Convert to hotplug state machine
net/iucv: Use explicit clean up labels in iucv_init()
x86/pci/amd-bus: Convert to hotplug state machine
x86/oprofile/nmi: Convert to hotplug state machine
...
drivers/block/rbd.c: In function ‘rbd_watch_cb’:
drivers/block/rbd.c:3690:5: error: ‘struct_v’ may be used uninitialized in this function [-Werror=maybe-uninitialized]
drivers/block/rbd.c:3759:5: note: ‘struct_v’ was declared here
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
While doing stress tests we noticed that we'd get a lot of dmesg spam if
we suddenly disconnected the nbd device out of band. Rate limit the
messages in the io path in order to deal with this.
Signed-off-by: Josef Bacik <jbacik@fb.com>
Signed-off-by: Jens Axboe <axboe@fb.com>
If an app exits before running NBD_DO_IT but after adding sockets we can
end up not being allowed to do a new nbd device. Fix this by making
NBD_CLEAR_SOCK reset the setup_task.
Signed-off-by: Josef Bacik <jbacik@fb.com>
Signed-off-by: Jens Axboe <axboe@fb.com>
zram hot_add sysfs attribute is a very 'special' attribute - reading
from it creates a new uninitialized zram device. This file, by a
mistake, can be read by a 'normal' user at the moment, while only root
must be able to create a new zram device, therefore hot_add attribute
must have S_IRUSR mode, not S_IRUGO.
[akpm@linux-foundation.org: s/sence/sense/, reflow comment to use 80 cols]
Fixes: 6566d1a32b ("zram: add dynamic device add/remove functionality")
Link: http://lkml.kernel.org/r/20161205155845.20129-1-sergey.senozhatsky@gmail.com
Signed-off-by: Sergey Senozhatsky <sergey.senozhatsky@gmail.com>
Reported-by: Steven Allen <steven@stebalien.com>
Acked-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Cc: Minchan Kim <minchan@kernel.org>
Cc: <stable@vger.kernel.org> [4.2+]
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
We have this:
ERROR: "__aeabi_ldivmod" [drivers/block/nbd.ko] undefined!
ERROR: "__divdi3" [drivers/block/nbd.ko] undefined!
nbd.c:(.text+0x247c72): undefined reference to `__divdi3'
due to a recent commit, that did 64-bit division. Use the proper
divider function so that 32-bit compiles don't break.
Fixes: ef77b51524 ("nbd: use loff_t for blocksize and nbd_set_size args")
Signed-off-by: Jens Axboe <axboe@fb.com>
If we have large devices (say like the 40t drive I was trying to test with) we
will end up overflowing the int arguments to nbd_set_size and not get the right
size for our device. Fix this by using loff_t everywhere so I don't have to
think about this again. Thanks,
Signed-off-by: Josef Bacik <jbacik@fb.com>
Signed-off-by: Jens Axboe <axboe@fb.com>
Install the callbacks via the state machine with multi instance support and let
the core invoke the callbacks on the already online CPUs.
[bigeasy: wire up the multi instance stuff]
Signed-off-by: Anna-Maria Gleixner <anna-maria@linutronix.de>
Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Cc: Sergey Senozhatsky <sergey.senozhatsky.work@gmail.com>
Cc: Minchan Kim <minchan@kernel.org>
Cc: rt@linutronix.de
Cc: Nitin Gupta <ngupta@vflare.org>
Link: http://lkml.kernel.org/r/20161126231350.10321-19-bigeasy@linutronix.de
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Fix bug https://bugzilla.kernel.org/show_bug.cgi?id=188531. In function
mtip_block_initialize(), variable rv takes the return value, and its
value should be negative on errors. rv is initialized as 0 and is not
reset when the call to ida_pre_get() fails. So 0 may be returned.
The return value 0 indicates that there is no error, which may be
inconsistent with the execution status. This patch fixes the bug by
explicitly assigning -ENOMEM to rv on the branch that ida_pre_get()
fails.
Signed-off-by: Pan Bian <bianpan2016@163.com>
Signed-off-by: Jens Axboe <axboe@fb.com>
The zram hot removal code calls idr_remove() even when zram_remove()
returns an error (typically -EBUSY). This results in a leftover at the
device release, eventually leading to a crash when the module is
reloaded.
As described in the bug report below, the following procedure would
cause an Oops with zram:
- provision three zram devices via modprobe zram num_devices=3
- configure a size for each device
+ echo "1G" > /sys/block/$zram_name/disksize
- mkfs and mount zram0 only
- attempt to hot remove all three devices
+ echo 2 > /sys/class/zram-control/hot_remove
+ echo 1 > /sys/class/zram-control/hot_remove
+ echo 0 > /sys/class/zram-control/hot_remove
- zram0 removal fails with EBUSY, as expected
- unmount zram0
- try zram0 hot remove again
+ echo 0 > /sys/class/zram-control/hot_remove
- fails with ENODEV (unexpected)
- unload zram kernel module
+ completes successfully
- zram0 device node still exists
- attempt to mount /dev/zram0
+ mount command is killed
+ following BUG is encountered
BUG: unable to handle kernel paging request at ffffffffa0002ba0
IP: get_disk+0x16/0x50
Oops: 0000 [#1] SMP
CPU: 0 PID: 252 Comm: mount Not tainted 4.9.0-rc6 #176
Call Trace:
exact_lock+0xc/0x20
kobj_lookup+0xdc/0x160
get_gendisk+0x2f/0x110
__blkdev_get+0x10c/0x3c0
blkdev_get+0x19d/0x2e0
blkdev_open+0x56/0x70
do_dentry_open.isra.19+0x1ff/0x310
vfs_open+0x43/0x60
path_openat+0x2c9/0xf30
do_filp_open+0x79/0xd0
do_sys_open+0x114/0x1e0
SyS_open+0x19/0x20
entry_SYSCALL_64_fastpath+0x13/0x94
This patch adds the proper error check in hot_remove_store() not to call
idr_remove() unconditionally.
Fixes: 17ec4cd985 ("zram: don't call idr_remove() from zram_remove()")
Bugzilla: https://bugzilla.opensuse.org/show_bug.cgi?id=1010970
Link: http://lkml.kernel.org/r/20161121132140.12683-1-tiwai@suse.de
Signed-off-by: Takashi Iwai <tiwai@suse.de>
Reviewed-by: David Disseldorp <ddiss@suse.de>
Reported-by: David Disseldorp <ddiss@suse.de>
Tested-by: David Disseldorp <ddiss@suse.de>
Acked-by: Minchan Kim <minchan@kernel.org>
Acked-by: Sergey Senozhatsky <sergey.senozhatsky@gmail.com>
Cc: <stable@vger.kernel.org> [4.4+]
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Multiple paths don't set it properly, ensure that we do.
Fixes: 9561a7ade0 ("nbd: add multi-connection support")
Signed-off-by: Jens Axboe <axboe@fb.com>
NBD can become contended on its single connection. We have to serialize all
writes and we can only process one read response at a time. Fix this by
allowing userspace to provide multiple connections to a single nbd device. This
coupled with block-mq drastically increases performance in multi-process cases.
Thanks,
Signed-off-by: Josef Bacik <jbacik@fb.com>
Signed-off-by: Jens Axboe <axboe@fb.com>
For a non-cloned bio, bio_add_page() only returns failure when
the io vec table is full, but in that case, bio->bi_vcnt can't
be zero at all.
So remove the impossible failure handling.
Reviewed-by: Christoph Hellwig <hch@lst.de>
Acked-by: Lars Ellenberg <lars.ellenberg@linbit.com>
Signed-off-by: Ming Lei <tom.leiming@gmail.com>
Signed-off-by: Jens Axboe <axboe@fb.com>
Some drivers often use external bvec table, so introduce
this helper for this case. It is always safe to access the
bio->bi_io_vec in this way for this case.
After converting to this usage, it will becomes a bit easier
to evaluate the remaining direct access to bio->bi_io_vec,
so it can help to prepare for the following multipage bvec
support.
Signed-off-by: Ming Lei <tom.leiming@gmail.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Fixed up the new O_DIRECT cases.
Signed-off-by: Jens Axboe <axboe@fb.com>
This driver is both orphaned, and not really useful anymore. Mark
it as such, and remove it in a future kernel after a release or
two.
Signed-off-by: Jens Axboe <axboe@fb.com>
For writes, we can get a completion in while we're still iterating
the request and bio chain. If that happens, we're reading freed
memory and we can crash.
Break out after the last segment and avoid having the iterator
read freed memory.
Reviewed-by: Josef Bacik <jbacik@fb.com>
Signed-off-by: Jens Axboe <axboe@fb.com>
If CONFIG_NVM is disabled, loading null_block module with use_lightnvm=1
fails. But there are no messages and documents related to the failure.
Add the appropriate error message.
Signed-off-by: Yasuaki Ishimatsu <isimatu.yasuaki@jp.fujitsu.com>
Massaged the text a bit.
Signed-off-by: Jens Axboe <axboe@fb.com>
->queue_rq() should return one of the BLK_MQ_RQ_QUEUE_* constants, not
an errno.
f4aa4c7bba ("block: loop: convert to per-device workqueue")
Signed-off-by: Omar Sandoval <osandov@fb.com>
Cc: stable@vger.kernel.org
Signed-off-by: Jens Axboe <axboe@fb.com>
aoeblk contains some mysterious code, that wants to elevate the bio
vec page counts while it's under IO. That is not needed, it's
fragile, and it's causing kernel oopses for some.
Reported-by: Tested-by: Don Koch <kochd@us.ibm.com>
Tested-by: Tested-by: Don Koch <kochd@us.ibm.com>
Signed-off-by: Jens Axboe <axboe@fb.com>
There is no need to call kfree() if memdup_user() fails, as no memory
was allocated and the error in the error-valued pointer should be returned.
Signed-off-by: Sachin Shukla <sachin.s5@samsung.com>
Signed-off-by: Jens Axboe <axboe@fb.com>
Building with W=1 shows a harmless warning for the skd driver:
drivers/block/skd_main.c:2959:1: error: ‘static’ is not at beginning of declaration [-Werror=old-style-declaration]
This changes the prototype to the expected formatting.
Signed-off-by: Arnd Bergmann <arnd@arndb.de>
Signed-off-by: Jens Axboe <axboe@fb.com>
As reported by gcc -Wmaybe-uninitialized, the cleanup path for
skd_acquire_msix tries to free the already allocated msi-x vectors
in reverse order, but the index variable may not have been
used yet:
drivers/block/skd_main.c: In function ‘skd_acquire_irq’:
drivers/block/skd_main.c:3890:8: error: ‘i’ may be used uninitialized in this function [-Werror=maybe-uninitialized]
This changes the failure path to skip releasing the interrupts
if we have not started requesting them yet.
Fixes: 180b0ae77d ("skd: use pci_alloc_irq_vectors")
Signed-off-by: Arnd Bergmann <arnd@arndb.de>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@fb.com>
Don't pass a size larger than iov_len to kernel_sendmsg().
Otherwise it will cause a NULL pointer deref when kernel_sendmsg()
returns with rv < size.
DRBD as external module has been around in the kernel 2.4 days already.
We used to be compatible to 2.4 and very early 2.6 kernels,
we used to use
rv = sock_sendmsg(sock, &msg, iov.iov_len);
then later changed to
rv = kernel_sendmsg(sock, &msg, &iov, 1, size);
when we should have used
rv = kernel_sendmsg(sock, &msg, &iov, 1, iov.iov_len);
tcp_sendmsg() used to totally ignore the size parameter.
57be5bd ip: convert tcp_sendmsg() to iov_iter primitives
changes that, and exposes our long standing error.
Even with this error exposed, to trigger the bug, we would need to have
an environment (config or otherwise) causing us to not use sendpage()
for larger transfers, a failing connection, and have it fail "just at the
right time". Apparently that was unlikely enough for most, so this went
unnoticed for years.
Still, it is known to trigger at least some of these,
and suspected for the others:
[0] http://lists.linbit.com/pipermail/drbd-user/2016-July/023112.html
[1] http://lists.linbit.com/pipermail/drbd-dev/2016-March/003362.html
[2] https://forums.grsecurity.net/viewtopic.php?f=3&t=4546
[3] https://ubuntuforums.org/showthread.php?t=2336150
[4] http://e2.howsolveproblem.com/i/1175162/
This should go into 4.9,
and into all stable branches since and including v4.0,
which is the first to contain the exposing change.
It is correct for all stable branches older than that as well
(which contain the DRBD driver; which is 2.6.33 and up).
It requires a small "conflict" resolution for v4.4 and earlier, with v4.5
we dropped the comment block immediately preceding the kernel_sendmsg().
Fixes: b411b3637f ("The DRBD driver")
Cc: <stable@vger.kernel.org> # 2.6.33.x-
Cc: viro@zeniv.linux.org.uk
Cc: christoph.lechleitner@iteg.at
Cc: wolfgang.glas@iteg.at
Reported-by: Christoph Lechleitner <christoph.lechleitner@iteg.at>
Tested-by: Christoph Lechleitner <christoph.lechleitner@iteg.at>
Signed-off-by: Richard Weinberger <richard@nod.at>
[changed oneliner to be "obvious" without context; more verbose message]
Signed-off-by: Lars Ellenberg <lars.ellenberg@linbit.com>
Signed-off-by: Jens Axboe <axboe@fb.com>
For small buffers we may use %*ph[N] specifier, for the bigger blocks
print_hex_dump() call.
Cc: Don Brace <don.brace@microsemi.com>
Cc: esc.storagedev@microsemi.com
Signed-off-by: Andy Shevchenko <andriy.shevchenko@linux.intel.com>
Acked-by: Don Brace <don.brace@microsemi.com>
Signed-off-by: Martin K. Petersen <martin.petersen@oracle.com>
Replace custom approach by %*ph specifier to dump small buffers in hex format.
Unfortunately we can't use print_hex_dump_bytes() here since tha gap is
present, though one familiar with the code may change this.
Signed-off-by: Andy Shevchenko <andriy.shevchenko@linux.intel.com>
Signed-off-by: Jens Axboe <axboe@fb.com>
Switch the skd driver to use pci_alloc_irq_vectors. We need to two calls to
pci_alloc_irq_vectors as skd only supports multiple MSI-X vectors, but not
multiple MSI vectors.
Otherwise this cleans up a lot of cruft and allows to a lot more common code.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@fb.com>
Hi Peter, hi Jens,
I've been looking over the multi page bio vec work again recently, and
one of the stumbling blocks is raw biovec access in the pktcdvd.
The first issue is that it directly sets up the page and offset pointers
in the biovec just before calling bio_add_page. As bio_add_page already
does the setup it's trivial to just switch it to stack variables for the
arguments.
The second issue is the copy code in pkt_make_local_copy, which
effectively is an opencoded version of bio_copy_data except that it
skips pages that already are the same in the ѕource and destination.
But we look at the only calleer we just set up the bio using bio_add_page
to point exactly to the page array that pkt_make_local_copy compares,
so the pages will always be the same and we can just remove this function.
Note that all this is done based on code inspection, I don't have any
packet writing hardware myself.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@fb.com>
Use xenbus_read_unsigned() instead of xenbus_scanf() when possible.
This requires to change the type of some reads from int to unsigned,
but these cases have been wrong before: negative values are not allowed
for the modified cases.
Cc: konrad.wilk@oracle.com
Cc: roger.pau@citrix.com
Signed-off-by: Juergen Gross <jgross@suse.com>
Acked-by: David Vrabel <david.vrabel@citrix.com>
Use xenbus_read_unsigned() instead of xenbus_scanf() when possible.
This requires to change the type of one read from int to unsigned,
but this case has been wrong before: negative values are not allowed
for the modified case.
Cc: konrad.wilk@oracle.com
Cc: roger.pau@citrix.com
Signed-off-by: Juergen Gross <jgross@suse.com>
Acked-by: David Vrabel <david.vrabel@citrix.com>
'blk_mq_alloc_request()' returns an error pointer in case of error, not
NULL. So test it with IS_ERR.
Fixes: fd8383fd88 ("nbd: convert to blkmq")
Signed-off-by: Christophe JAILLET <christophe.jaillet@wanadoo.fr>
Signed-off-by: Jens Axboe <axboe@fb.com>
Most blk_mq_requeue_request() and blk_mq_add_to_requeue_list() calls
are followed by kicking the requeue list. Hence add an argument to
these two functions that allows to kick the requeue list. This was
proposed by Christoph Hellwig.
Signed-off-by: Bart Van Assche <bart.vanassche@sandisk.com>
Reviewed-by: Johannes Thumshirn <jthumshirn@suse.de>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Cc: Hannes Reinecke <hare@suse.com>
Reviewed-by: Sagi Grimberg <sagi@grimberg.me>
Signed-off-by: Jens Axboe <axboe@fb.com>
Since blk_mq_requeue_work() starts stopped queues and since
execution of this function can be scheduled after a queue has
been stopped it is not possible to stop queues without using
an additional state variable to track whether or not the queue
has been stopped. Hence modify blk_mq_requeue_work() such that it
does not start stopped queues. My conclusion after a review of
the blk_mq_stop_hw_queues() and blk_mq_{delay_,}kick_requeue_list()
callers is as follows:
* In the dm driver starting and stopping queues should only happen
if __dm_suspend() or __dm_resume() is called and not if the
requeue list is processed.
* In the SCSI core queue stopping and starting should only be
performed by the scsi_internal_device_block() and
scsi_internal_device_unblock() functions but not by any other
function. Although the blk_mq_stop_hw_queue() call in
scsi_queue_rq() may help to reduce CPU load if a LLD queue is
full, figuring out whether or not a queue should be restarted
when requeueing a command would require to introduce additional
locking in scsi_mq_requeue_cmd() to avoid a race with
scsi_internal_device_block(). Avoid this complexity by removing
the blk_mq_stop_hw_queue() call from scsi_queue_rq().
* In the NVMe core only the functions that call
blk_mq_start_stopped_hw_queues() explicitly should start stopped
queues.
* A blk_mq_start_stopped_hwqueues() call must be added in the
xen-blkfront driver in its blkif_recover() function.
Signed-off-by: Bart Van Assche <bart.vanassche@sandisk.com>
Cc: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
Cc: Roger Pau Monné <roger.pau@citrix.com>
Cc: Mike Snitzer <snitzer@redhat.com>
Cc: James Bottomley <jejb@linux.vnet.ibm.com>
Cc: Martin K. Petersen <martin.petersen@oracle.com>
Reviewed-by: Sagi Grimberg <sagi@grimberg.me>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@fb.com>
Remove the WRITE_* and READ_SYNC wrappers, and just use the flags
directly. Where applicable this also drops usage of the
bio_set_op_attrs wrapper.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@fb.com>
Noidle should be the default for writes as seen by all the compounds
definitions in fs.h using it. In fact only direct I/O really should
be using NODILE, so turn the whole flag around to get the defaults
right, which will make our life much easier especially onces the
WRITE_* defines go away.
This assumes all the existing "raw" users of REQ_SYNC for writes
want noidle behavior, which seems to be spot on from a quick audit.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@fb.com>
The local variable "err" will be set to an appropriate value
by a following statement.
Thus omit the explicit initialisation at the beginning.
Signed-off-by: Markus Elfring <elfring@users.sourceforge.net>
Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
Multiplications for the size determination of memory allocations
indicated that array data structures should be processed.
Thus use the corresponding function "kmalloc_array".
This issue was detected by using the Coccinelle software.
Signed-off-by: Markus Elfring <elfring@users.sourceforge.net>
Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
A lot of the REQ_* flags are only used on struct requests, and only of
use to the block layer and a few drivers that dig into struct request
internals.
This patch adds a new req_flags_t rq_flags field to struct request for
them, and thus dramatically shrinks the number of common requests. It
also removes the unfortunate situation where we have to fit the fields
from the same enum into 32 bits for struct bio and 64 bits for
struct request.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Shaun Tancheff <shaun.tancheff@seagate.com>
Signed-off-by: Jens Axboe <axboe@fb.com>
It makes the message hard to interpret correctly if a base 10 number is
prefixed by 0x. So change to a hex number.
Link: http://lkml.kernel.org/r/20161026125658.25728-3-u.kleine-koenig@pengutronix.de
Signed-off-by: Uwe Kleine-König <u.kleine-koenig@pengutronix.de>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Discontinue having the brd driver destructively free all pages in the
ramdisk in response to the BLKFLSBUF ioctl. Doing so allows a BLKFLSBUF
ioctl issued to a logical partition to destroy pages of the parent brd
device (and all other partitions of that brd device).
This change breaks compatibility - but in this case the compatibility
breaks more than it helps.
Reported-by: Mikulas Patocka <mpatocka@redhat.com>
Suggested-by: Christoph Hellwig <hch@infradead.org>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
Signed-off-by: Jens Axboe <axboe@fb.com>
Currently rd_size was int which lead to overflow and bogus device size
once the requested ramdisk size was 1 TB or more. Although these days
ramdisks with 1 TB size are mostly a mistake, the days when they are
useful are not far.
Reported-by: Bart Van Assche <bart.vanassche@sandisk.com>
Signed-off-by: Jan Kara <jack@suse.cz>
Signed-off-by: Jens Axboe <axboe@fb.com>
Commit 0eadf37afc ("nbd: allow block mq to deal with timeouts")
changed normal usage of nbd->sock_lock to use spin_lock/spin_unlock
rather than the *_irq variants, but it missed this unlock in an
error path.
Found by Coverity, CID 1373871.
Signed-off-by: John W. Linville <linville@tuxdriver.com>
Cc: Josef Bacik <jbacik@fb.com>
Cc: Jens Axboe <axboe@fb.com>
Cc: Markus Pargmann <mpa@pengutronix.de>
Fixes: 0eadf37afc ("nbd: allow block mq to deal with timeouts")
Signed-off-by: Jens Axboe <axboe@fb.com>
If the header object gets deleted (perhaps along with the entire pool),
there is no point in attempting to reregister the watch. Treat this
the same as blacklisting: fail all pending and new I/Os requiring the
lock.
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
-EBLACKLISTED from __rbd_register_watch() means that our ceph_client
got blacklisted - we won't be able to restore the watch and reacquire
the lock. Wake up and fail all outstanding requests waiting for the
lock and arrange for all new requests that require the lock to fail
immediately.
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
Tested-by: Mike Christie <mchristi@redhat.com>
A good practice is to prefix the names of functions by the name
of the subsystem.
The kthread worker API is a mix of classic kthreads and workqueues. Each
worker has a dedicated kthread. It runs a generic function that process
queued works. It is implemented as part of the kthread subsystem.
This patch renames the existing kthread worker API to use
the corresponding name from the workqueues API prefixed by
kthread_:
__init_kthread_worker() -> __kthread_init_worker()
init_kthread_worker() -> kthread_init_worker()
init_kthread_work() -> kthread_init_work()
insert_kthread_work() -> kthread_insert_work()
queue_kthread_work() -> kthread_queue_work()
flush_kthread_work() -> kthread_flush_work()
flush_kthread_worker() -> kthread_flush_worker()
Note that the names of DEFINE_KTHREAD_WORK*() macros stay
as they are. It is common that the "DEFINE_" prefix has
precedence over the subsystem names.
Note that INIT() macros and init() functions use different
naming scheme. There is no good solution. There are several
reasons for this solution:
+ "init" in the function names stands for the verb "initialize"
aka "initialize worker". While "INIT" in the macro names
stands for the noun "INITIALIZER" aka "worker initializer".
+ INIT() macros are used only in DEFINE() macros
+ init() functions are used close to the other kthread()
functions. It looks much better if all the functions
use the same scheme.
+ There will be also kthread_destroy_worker() that will
be used close to kthread_cancel_work(). It is related
to the init() function. Again it looks better if all
functions use the same naming scheme.
+ there are several precedents for such init() function
names, e.g. amd_iommu_init_device(), free_area_init_node(),
jump_label_init_type(), regmap_init_mmio_clk(),
+ It is not an argument but it was inconsistent even before.
[arnd@arndb.de: fix linux-next merge conflict]
Link: http://lkml.kernel.org/r/20160908135724.1311726-1-arnd@arndb.de
Link: http://lkml.kernel.org/r/1470754545-17632-3-git-send-email-pmladek@suse.com
Suggested-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Petr Mladek <pmladek@suse.com>
Cc: Oleg Nesterov <oleg@redhat.com>
Cc: Tejun Heo <tj@kernel.org>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: "Paul E. McKenney" <paulmck@linux.vnet.ibm.com>
Cc: Josh Triplett <josh@joshtriplett.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Jiri Kosina <jkosina@suse.cz>
Cc: Borislav Petkov <bp@suse.de>
Cc: Michal Hocko <mhocko@suse.cz>
Cc: Vlastimil Babka <vbabka@suse.cz>
Signed-off-by: Arnd Bergmann <arnd@arndb.de>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
with maintenance operations offloaded to userspace (Douglas Fuller,
Mike Christie and myself). Another block device bullet is a series
fixing up layering error paths (myself).
On the filesystem side, we've got patches that improve our handling of
buffered vs dio write races (Neil Brown) and a few assorted fixes from
Zheng. Also included a couple of random cleanups and a minor CRUSH
update.
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v2
iQEcBAABCAAGBQJX+PjZAAoJEEp/3jgCEfOLVuoH/RwtFLIb6/KZUYtBOrVVrTun
kReRlfq2xKYrGGtyQEqSuz7fBdwT1LVCVcL8kC4GFD4R67o+tNMAr6PfM/7pZABj
HRoRLgSZ9FLw4W5n0VpBIznih75QUbCdXiTCtH9eorMHU5q1YpTvVHHlF9W9Pm2I
eNGnBWpGyHVeiK66mpUCH+EQKQ4GkAVD9rneTNqLHgq2yotHkVl1j258+DL6JRGs
OBoh3RmNQaGOAS37Lss8erCSusAGEcAeGV6ubuK2lFUKyR41EkD3I0xkhNSPe+CD
RifFcpVziIeTu//cLgl0nnHGtmUytD7HgJubaPthArKIOen9ZDAfEkgI0o+JI2A=
=45O7
-----END PGP SIGNATURE-----
Merge tag 'ceph-for-4.9-rc1' of git://github.com/ceph/ceph-client
Pull Ceph updates from Ilya Dryomov:
"The big ticket item here is support for rbd exclusive-lock feature,
with maintenance operations offloaded to userspace (Douglas Fuller,
Mike Christie and myself). Another block device bullet is a series
fixing up layering error paths (myself).
On the filesystem side, we've got patches that improve our handling of
buffered vs dio write races (Neil Brown) and a few assorted fixes from
Zheng. Also included a couple of random cleanups and a minor CRUSH
update"
* tag 'ceph-for-4.9-rc1' of git://github.com/ceph/ceph-client: (39 commits)
crush: remove redundant local variable
crush: don't normalize input of crush_ln iteratively
libceph: ceph_build_auth() doesn't need ceph_auth_build_hello()
libceph: use CEPH_AUTH_UNKNOWN in ceph_auth_build_hello()
ceph: fix description for rsize and rasize mount options
rbd: use kmalloc_array() in rbd_header_from_disk()
ceph: use list_move instead of list_del/list_add
ceph: handle CEPH_SESSION_REJECT message
ceph: avoid accessing / when mounting a subpath
ceph: fix mandatory flock check
ceph: remove warning when ceph_releasepage() is called on dirty page
ceph: ignore error from invalidate_inode_pages2_range() in direct write
ceph: fix error handling of start_read()
rbd: add rbd_obj_request_error() helper
rbd: img_data requests don't own their page array
rbd: don't call rbd_osd_req_format_read() for !img_data requests
rbd: rework rbd_img_obj_exists_submit() error paths
rbd: don't crash or leak on errors in rbd_img_obj_parent_read_full_callback()
rbd: move bumping img_request refcount into rbd_obj_request_submit()
rbd: mark the original request as done if stat request fails
...
Pull blk-mq irq/cpu mapping updates from Jens Axboe:
"This is the block-irq topic branch for 4.9-rc. It's mostly from
Christoph, and it allows drivers to specify their own mappings, and
more importantly, to share the blk-mq mappings with the IRQ affinity
mappings. It's a good step towards making this work better out of the
box"
* 'for-4.9/block-irq' of git://git.kernel.dk/linux-block:
blk_mq: linux/blk-mq.h does not include all the headers it depends on
blk-mq: kill unused blk_mq_create_mq_map()
blk-mq: get rid of the cpumask in struct blk_mq_tags
nvme: remove the post_scan callout
nvme: switch to use pci_alloc_irq_vectors
blk-mq: provide a default queue mapping for PCI device
blk-mq: allow the driver to pass in a queue mapping
blk-mq: remove ->map_queue
blk-mq: only allocate a single mq_map per tag_set
blk-mq: don't redistribute hardware queues on a CPU hotplug event
* A multiplication for the size determination of a memory allocation
indicated that an array data structure should be processed.
Thus use the corresponding function "kmalloc_array".
This issue was detected by using the Coccinelle software.
* Delete the local variable "size" which became unnecessary with
this refactoring.
Signed-off-by: Markus Elfring <elfring@users.sourceforge.net>
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
Pull setting an error and marking a request done code into a new
helper. obj_request_img_data_test() check isn't strictly needed right
now, but makes it applicable to !img_data requests and a bit safer.
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
Move the check into rbd_obj_request_destroy() to avoid use-after-free
on errors in rbd_img_request_fill(..., OBJ_REQUEST_PAGES, ...), where
pages, owned by the caller, gets freed in rbd_img_request_fill().
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
Reviewed-by: Alex Elder <elder@linaro.org>
Reviewed-by: David Disseldorp <ddiss@suse.de>
Accessing obj_request->img_request union field is only valid for object
requests associated with an image (i.e. if obj_request_img_data_test()
returns true). rbd_osd_req_format_read() used to do more, but now it
just sets osd_req->snap_id. Standalone and stat object requests always
go to the HEAD revision and are fine with CEPH_NOSNAP set by libceph,
so get around the invalid union field use by simply not calling
rbd_osd_req_format_read() in those places.
Reported-by: David Disseldorp <ddiss@suse.de>
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
Reviewed-by: Alex Elder <elder@linaro.org>
Reviewed-by: David Disseldorp <ddiss@suse.de>
- don't put obj_request before rbd_obj_request_get() if
rbd_obj_request_create() fails
- don't leak pages if rbd_obj_request_create() fails
- don't leak stat_request if rbd_osd_req_create() fails
Reported-by: David Disseldorp <ddiss@suse.de>
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
Reviewed-by: Alex Elder <elder@linaro.org>
Reviewed-by: David Disseldorp <ddiss@suse.de>
- fix parent_length == img_request->xferred assert to not fire on
copyup read failures
- don't leak pages if copyup read fails or we can't allocate a new osd
request
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
Reviewed-by: Alex Elder <elder@linaro.org>
Reviewed-by: David Disseldorp <ddiss@suse.de>
Commit 0f2d5be792 ("rbd: use reference counts for image requests")
added rbd_img_request_get(), which rbd_img_request_fill() calls for
each obj_request added to img_request. It was an urgent band-aid for
the uglyness that is rbd_img_obj_callback() and none of the error paths
were updated.
Given that this img_request reference is meant to represent an
obj_request that hasn't passed through rbd_img_obj_callback() yet,
proper cleanup in appropriate destructors is a challenge. However,
noting that if we don't get a chance to call rbd_obj_request_complete(),
there is not going to be a call to rbd_img_obj_callback(), we can move
rbd_img_request_get() into rbd_obj_request_submit() and fixup the two
places that call rbd_obj_request_complete() directly and not through
rbd_obj_request_submit() to temporarily bump img_request, so that
rbd_img_obj_callback() can put as usual.
This takes care of img_request leaks on errors on the submit side.
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
Reviewed-by: Alex Elder <elder@linaro.org>
If stat request fails with something other than -ENOENT (which just
means that we need to copyup), the original object request is never
marked as done and therefore never completed. Fix this by moving the
mark done + complete snippet from rbd_img_obj_parent_read_full() into
rbd_img_obj_exists_callback(). The former remains covered, as the
latter is its only caller (through rbd_img_obj_request_submit()).
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
Reviewed-by: Alex Elder <elder@linaro.org>
Reviewed-by: David Disseldorp <ddiss@suse.de>
Assert once in rbd_img_obj_request_submit().
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
Reviewed-by: Alex Elder <elder@linaro.org>
Reviewed-by: David Disseldorp <ddiss@suse.de>
Add a per-device option to acquire exclusive lock on reads (in addition
to writes and discards). The use case is iSCSI, where it will be used
to prevent execution of stale writes after the implicit failover.
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
Tested-by: Mike Christie <mchristi@redhat.com>
We take a mutex when sending commands and send stuff over the network, we need
to have queue_rq called asynchronously.
Signed-off-by: Josef Bacik <jbacik@fb.com>
Fixes: fd8383fd88 ("nbd: convert to blkmq")
Signed-off-by: Jens Axboe <axboe@fb.com>
LightNVM compatible device drivers does not have a method to expose
LightNVM specific sysfs entries.
To enable LightNVM sysfs entries to be exposed, lightnvm device
drivers require a struct device to attach it to. To allow both the
actual device driver and lightnvm sysfs entries to coexist, the device
driver tracks the lifetime of the nvm_dev structure.
This patch refactors NVMe and null_blk to handle the lifetime of struct
nvm_dev, which eliminates the need for struct gendisk when a lightnvm
compatible device is provided.
Signed-off-by: Matias Bjørling <m@bjorling.me>
Signed-off-by: Jens Axboe <axboe@fb.com>
With LightNVM enabled devices, the gendisk structure is not exposed
to the user. This hides the device driver specific sysfs entries, and
prevents binding of LightNVM geometry information to the device.
Refactor the device registration process, so that gendisk and
non-gendisk devices are easily managed.
Signed-off-by: Matias Bjørling <m@bjorling.me>
Signed-off-by: Jens Axboe <axboe@fb.com>
All drivers use the default, so provide an inline version of it. If we
ever need other queue mapping we can add an optional method back,
although supporting will also require major changes to the queue setup
code.
This provides better code generation, and better debugability as well.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Keith Busch <keith.busch@intel.com>
Signed-off-by: Jens Axboe <axboe@fb.com>
Instead of rolling our own timer, just utilize the blk mq req timeout and do the
disconnect if any of our commands timeout.
Signed-off-by: Josef Bacik <jbacik@fb.com>
Signed-off-by: Jens Axboe <axboe@fb.com>
In preparation for some future changes, change a few of the state bools over to
normal bits to set/clear properly.
Signed-off-by: Josef Bacik <jbacik@fb.com>
Signed-off-by: Jens Axboe <axboe@fb.com>
We hit a warning when shutting down the nbd connection because we have irq's
disabled. We don't really need to do the shutdown under the lock, just clear
the nbd->sock. So do the shutdown outside of the irq. This gets rid of the
warning.
Signed-off-by: Josef Bacik <jbacik@fb.com>
Signed-off-by: Jens Axboe <axboe@fb.com>
This moves NBD over to using blkmq, which allows us to get rid of the NBD
wide queue lock and the async submit kthread. We will start with 1 hw
queue for now, but I plan to add multiple tcp connection support in the
future and we'll fix how we set the hwqueue's.
Signed-off-by: Josef Bacik <jbacik@fb.com>
Signed-off-by: Jens Axboe <axboe@fb.com>
We get 1 warning when biuld kernel with W=1:
drivers/block/mtip32xx/mtip32xx.c:3689:6: warning: no previous prototype for
'mtip_block_release' [-Wmissing-prototypes]
In fact, this function is only used in the file in which it is declared
and don't need a declaration, but can be made static.
so this patch marks it 'static'.
Signed-off-by: Baoyou Xie <baoyou.xie@linaro.org>
Signed-off-by: Jens Axboe <axboe@fb.com>
This adds a force close option, so we can force the unmapping
of a rbd device that is open. If a path/device is blacklisted, apps
like multipathd can map a new device and then unmap the old one.
The unmapping cleanup would then be handled by the generic hotunplug
code paths in multipahd like is done for iSCSI, FC/FCOE, SAS, etc.
Signed-off-by: Mike Christie <mchristi@redhat.com>
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
Export the info used to setup the rbd image, so it can be used to remap
the image.
Signed-off-by: Mike Christie <mchristi@redhat.com>
[idryomov@gmail.com: do_rbd_add() EH]
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
Export snap id in sysfs, so tools like multipathd can use it in a uuid.
Signed-off-by: Mike Christie <mchristi@redhat.com>
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
Export the cluster fsid, so tools like udev and multipath-tools can use
it for part of the uuid.
Signed-off-by: Mike Christie <mchristi@redhat.com>
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
Export client addr/nonce, so userspace can check if a image is being
blacklisted.
Signed-off-by: Mike Christie <mchristi@redhat.com>
[idryomov@gmail.com: ceph_client_addr(), endianess fix]
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
With exclusive-lock added and more to come, print features into dmesg.
Change capacity to decimal while at it.
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
Reviewed-by: Mike Christie <mchristi@redhat.com>
Add basic support for RBD_FEATURE_EXCLUSIVE_LOCK feature. Maintenance
operations (resize, snapshot create, etc) are offloaded to librbd via
returning -EOPNOTSUPP - librbd should request the lock and execute the
operation.
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
Reviewed-by: Mike Christie <mchristi@redhat.com>
Tested-by: Mike Christie <mchristi@redhat.com>
Revamp watch code to support retrying watch re-registration:
- add rbd_dev->watch_state for more robust errcb handling
- store watch cookie separately to avoid dereferencing watch_handle
which is set to NULL on unwatch
- move re-register code into a delayed work and retry re-registration
every second, unless the client is blacklisted
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
Reviewed-by: Mike Christie <mchristi@redhat.com>
Tested-by: Mike Christie <mchristi@redhat.com>
This is going to be used for re-registering watch requests and
exclusive-lock tasks: acquire/request lock, notify-acquired, release
lock, notify-released. Some refactoring in the map/unmap paths was
necessary to give this workqueue a meaningful name: "rbdX-tasks".
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
Reviewed-by: Mike Christie <mchristi@redhat.com>
It's gid / global_id in other places.
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
Reviewed-by: Mike Christie <mchristi@redhat.com>
Reviewed-by: Alex Elder <elder@linaro.org>
Current code forgets to free resources in the failure path of
xlvbd_alloc_gendisk(), this patch fix it.
Signed-off-by: Bob Liu <bob.liu@oracle.com>
Signed-off-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
blk_mq_update_nr_hw_queues() reset all queue limits to default which it's
not as xen-blkfront expected, introducing blkif_set_queue_limits() to reset
limits with initial correct values.
Signed-off-by: Bob Liu <bob.liu@oracle.com>
Acked-by: Roger Pau Monné <roger.pau@citrix.com>
Signed-off-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
Two places didn't get updated when 64KB page granularity was introduced,
this patch fix them.
Signed-off-by: Bob Liu <bob.liu@oracle.com>
Acked-by: Roger Pau Monné <roger.pau@citrix.com>
Signed-off-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
- Misc fixes and cleanups all over the place.
Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1
iQEcBAABAgAGBQJXq0ruAAoJECgfDbjSjVRp5P8H/2OlDJdSS1l+TwOXbY95ntQ1
vxUX4vGCX5IujC+Rbt7sQV2prE3b6IktFNagpbRoWn21JkpoDMvPtYJrn5BhLtoh
fvDkZE6Wo3QztFSjaUBZWEABBt03KPX0yrAIZplu8ne/Z8KAT3zK57BPnKfmxwv+
dpxt+1wlnqAvYsoUUQZBFT4Gmk2oDiTofiIbQq7W9W/fooznLtLB+ArYtdfNJizC
JnI/vJuWceEXfjT26HexCRhA2OZskrA4ZadDhOjAqkTPN5DHfweLDuHh7IsVfDd1
wXqjc4ks3cYG0CloJ2qY2K7RpDOFIxIizixeDIuAbn9aX4sPOYYfqRm+4iRwmqQ=
=9aUO
-----END PGP SIGNATURE-----
Merge tag 'for_linus' of git://git.kernel.org/pub/scm/linux/kernel/git/mst/vhost
Pull virtio/vhost fixes and cleanups from Michael Tsirkin:
"Misc fixes and cleanups all over the place"
* tag 'for_linus' of git://git.kernel.org/pub/scm/linux/kernel/git/mst/vhost:
virtio/s390: deprecate old transport
virtio/s390: keep early_put_chars
virtio_blk: Fix a slient kernel panic
virtio-vsock: fix include guard typo
vhost/vsock: fix vhost virtio_vsock_pkt use-after-free
9p/trans_virtio: use kvfree() for iov_iter_get_pages_alloc()
virtio: fix error handling for debug builds
virtio: fix memory leak in virtqueue_add()
ceph_file_layout::pool_id is now s64. rbd_add_get_pool_id() and
ceph_pg_poolid_by_name() both return an int, so it's bogus anyway.
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
Reviewed-by: Alex Elder <elder@linaro.org>
We do a lot of memory allocation in function init_vq, and don't handle
the allocation failure properly. Then this function will return 0,
although initialization fails due to lacking memory. At that moment,
kernel will panic in guest machine, if virtio is used to drive disk.
To fix this bug, we should take care of allocation failure, and return
correct value to let caller know what happen.
Tested-by: Chao Fan <fanc.fnst@cn.fujitsu.com>
Signed-off-by: Minfei Huang <mnghuan@gmail.com>
Signed-off-by: Minfei Huang <minfei.hmf@alibaba-inc.com>
Reviewed-by: Cornelia Huck <cornelia.huck@de.ibm.com>
Reviewed-by: Stefan Hajnoczi <stefanha@redhat.com>
Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
Purely cosmetic at this point, as rbd doesn't use RADOS namespaces and
hence rbd_dev->header_oloc->pool_ns is always NULL.
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
Since commit 63a4cc2486, bio->bi_rw contains flags in the lower
portion and the op code in the higher portions. This means that
old code that relies on manually setting bi_rw is most likely
going to be broken. Instead of letting that brokeness linger,
rename the member, to force old and out-of-tree code to break
at compile time instead of at runtime.
No intended functional changes in this commit.
Signed-off-by: Jens Axboe <axboe@fb.com>
Commit abf545484d changed it from an 'rw' flags type to the
newer ops based interface, but now we're effectively leaking
some bdev internals to the rest of the kernel. Since we only
care about whether it's a read or a write at that level, just
pass in a bool 'is_write' parameter instead.
Then we can also move op_is_write() and friends back under
CONFIG_BLOCK protection.
Reviewed-by: Mike Christie <mchristi@redhat.com>
Signed-off-by: Jens Axboe <axboe@fb.com>
The rw_page users were not converted to use bio/req ops. As a result
bdev_write_page is not passing down REQ_OP_WRITE and the IOs will
be sent down as reads.
Signed-off-by: Mike Christie <mchristi@redhat.com>
Fixes: 4e1b2d52a8 ("block, fs, drivers: remove REQ_OP compat defs and related code")
Modified by me to:
1) Drop op_flags passing into ->rw_page(), as we don't use it.
2) Make op_is_write() and friends safe to use for !CONFIG_BLOCK
Signed-off-by: Jens Axboe <axboe@fb.com>
Fix a fat-fingered conversion to the req_op accessors, and also
use a switch statement to make it more obvious what is being checked.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reported-by: Dave Chinner <david@fromorbit.com>
Fixes: c2df40 ("drivers: use req op accessor");
Reviewed-by: Ming Lei <ming.lei@canonical.com>
Signed-off-by: Jens Axboe <axboe@fb.com>
Quentin ran into this bug:
WARNING: CPU: 64 PID: 10085 at fs/sysfs/dir.c:31 sysfs_warn_dup+0x65/0x80
sysfs: cannot create duplicate filename '/devices/virtual/block/nbd3/pid'
Modules linked in: nbd
CPU: 64 PID: 10085 Comm: qemu-nbd Tainted: G D 4.6.0+ #7
0000000000000000 ffff8820330bba68 ffffffff814b8791 ffff8820330bbac8
0000000000000000 ffff8820330bbab8 ffffffff810d04ab ffff8820330bbaa8
0000001f00000296 0000000000017681 ffff8810380bf000 ffffffffa0001790
Call Trace:
[<ffffffff814b8791>] dump_stack+0x4d/0x6c
[<ffffffff810d04ab>] __warn+0xdb/0x100
[<ffffffff810d0574>] warn_slowpath_fmt+0x44/0x50
[<ffffffff81218c65>] sysfs_warn_dup+0x65/0x80
[<ffffffff81218a02>] sysfs_add_file_mode_ns+0x172/0x180
[<ffffffff81218a35>] sysfs_create_file_ns+0x25/0x30
[<ffffffff81594a76>] device_create_file+0x36/0x90
[<ffffffffa0000e8d>] __nbd_ioctl+0x32d/0x9b0 [nbd]
[<ffffffff814cc8e8>] ? find_next_bit+0x18/0x20
[<ffffffff810f7c29>] ? select_idle_sibling+0xe9/0x120
[<ffffffff810f6cd7>] ? __enqueue_entity+0x67/0x70
[<ffffffff810f9bf0>] ? enqueue_task_fair+0x630/0xe20
[<ffffffff810efa76>] ? resched_curr+0x36/0x70
[<ffffffff810f0078>] ? check_preempt_curr+0x78/0x90
[<ffffffff810f00a2>] ? ttwu_do_wakeup+0x12/0x80
[<ffffffff810f01b1>] ? ttwu_do_activate.constprop.86+0x61/0x70
[<ffffffff810f0c15>] ? try_to_wake_up+0x185/0x2d0
[<ffffffff810f0d6d>] ? default_wake_function+0xd/0x10
[<ffffffff81105471>] ? autoremove_wake_function+0x11/0x40
[<ffffffffa0001577>] nbd_ioctl+0x67/0x94 [nbd]
[<ffffffff814ac0fd>] blkdev_ioctl+0x14d/0x940
[<ffffffff811b0da2>] ? put_pipe_info+0x22/0x60
[<ffffffff811d96cc>] block_ioctl+0x3c/0x40
[<ffffffff811ba08d>] do_vfs_ioctl+0x8d/0x5e0
[<ffffffff811aa329>] ? ____fput+0x9/0x10
[<ffffffff810e9092>] ? task_work_run+0x72/0x90
[<ffffffff811ba627>] SyS_ioctl+0x47/0x80
[<ffffffff8185f5df>] entry_SYSCALL_64_fastpath+0x17/0x93
---[ end trace 7899b295e4f850c8 ]---
It seems fairly obvious that device_create_file() is not being protected
from being run concurrently on the same nbd.
Quentin found the following relevant commits:
1a2ad21 nbd: add locking to nbd_ioctl
90b8f28 [PATCH] end of methods switch: remove the old ones
d4430d6 [PATCH] beginning of methods conversion
08f8585 [PATCH] move block_device_operations to blkdev.h
It would seem that the race was introduced in the process of moving nbd
from BKL to unlocked ioctls.
By setting nbd->task_recv while the mutex is held, we can prevent other
processes from running concurrently (since nbd->task_recv is also checked
while the mutex is held).
Reported-and-tested-by: Quentin Casasnovas <quentin.casasnovas@oracle.com>
Cc: Markus Pargmann <mpa@pengutronix.de>
Cc: Paul Clements <paul.clements@steeleye.com>
Cc: Pavel Machek <pavel@suse.cz>
Cc: Jens Axboe <axboe@fb.com>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Signed-off-by: Vegard Nossum <vegard.nossum@oracle.com>
Signed-off-by: Jens Axboe <axboe@fb.com>
Commit 09954bad4 ("floppy: refactor open() flags handling"), as a
side-effect, causes open(/dev/fdX, O_ACCMODE) to fail. It turns out that
this is being used setfdprm userspace for ioctl-only open().
Reintroduce back the original behavior wrt !(FMODE_READ|FMODE_WRITE)
modes, while still keeping the original O_NDELAY bug fixed.
Cc: stable@vger.kernel.org # v4.5+
Reported-by: Wim Osterholt <wim@djo.tudelft.nl>
Tested-by: Wim Osterholt <wim@djo.tudelft.nl>
Signed-off-by: Jiri Kosina <jkosina@suse.cz>
Signed-off-by: Jens Axboe <axboe@fb.com>
Merge yet more updates from Andrew Morton:
- the rest of ocfs2
- various hotfixes, mainly MM
- quite a bit of misc stuff - drivers, fork, exec, signals, etc.
- printk updates
- firmware
- checkpatch
- nilfs2
- more kexec stuff than usual
- rapidio updates
- w1 things
* emailed patches from Andrew Morton <akpm@linux-foundation.org>: (111 commits)
ipc: delete "nr_ipc_ns"
kcov: allow more fine-grained coverage instrumentation
init/Kconfig: add clarification for out-of-tree modules
config: add android config fragments
init/Kconfig: ban CONFIG_LOCALVERSION_AUTO with allmodconfig
relay: add global mode support for buffer-only channels
init: allow blacklisting of module_init functions
w1:omap_hdq: fix regression
w1: add helper macro module_w1_family
w1: remove need for ida and use PLATFORM_DEVID_AUTO
rapidio/switches: add driver for IDT gen3 switches
powerpc/fsl_rio: apply changes for RIO spec rev 3
rapidio: modify for rev.3 specification changes
rapidio: change inbound window size type to u64
rapidio/idt_gen2: fix locking warning
rapidio: fix error handling in mbox request/release functions
rapidio/tsi721_dma: advance queue processing from transfer submit call
rapidio/tsi721: add messaging mbox selector parameter
rapidio/tsi721: add PCIe MRRS override parameter
rapidio/tsi721_dma: add channel mask and queue size parameters
...
* RADOS namespace support in libceph and CephFS (Zheng Yan and myself).
The stopgaps added in 4.5 to deny access to inodes in namespaces are
removed and CEPH_FEATURE_FS_FILE_LAYOUT_V2 feature bit is now fully
supported.
* A large rework of the MDS cap flushing code (Zheng Yan).
* Handle some of ->d_revalidate() in RCU mode (Jeff Layton). We were
overly pessimistic before, bailing at the first sight of LOOKUP_RCU.
On top of that we've got a few CephFS bug fixes, a couple of cleanups
and Arnd's workaround for a weird genksyms issue.
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v2
iQEcBAABCAAGBQJXoKLJAAoJEEp/3jgCEfOLDTUIAIcctpKUiNBokc95mQaXYl34
j7lPIaD0/Ur7JPt4nMdtlywYJYSVV2c+SglHztj/+fv0G4bWbLVEFRruh9SwKIci
PzttcmycIAqSn1f5gBZwyQbGuffd/F0EnBj7fFjcukt01i3s1ZQ7t4XtLGtAV0Ts
aIfFtx9SqWig57Z1OZqNgnhnOoh6IqNbic3FL5Hvdl5N5pFbBcQho6Vzoa5O1osH
URG6RmCcO4nykfSoxiivE7UZ+CImsXHkRD7rupBuIjqjZ8wvmZqQF5qxnkb9Dw2F
IkNhrHkTSIiv4EsNPLAETTnFSozrL1nEykKr2FBW+ti8nxNcav+8FgVapqLvFIw=
=gQ0/
-----END PGP SIGNATURE-----
Merge tag 'ceph-for-4.8-rc1' of git://github.com/ceph/ceph-client
Pull Ceph updates from Ilya Dryomov:
"The highlights are:
- RADOS namespace support in libceph and CephFS (Zheng Yan and
myself). The stopgaps added in 4.5 to deny access to inodes in
namespaces are removed and CEPH_FEATURE_FS_FILE_LAYOUT_V2 feature
bit is now fully supported
- A large rework of the MDS cap flushing code (Zheng Yan)
- Handle some of ->d_revalidate() in RCU mode (Jeff Layton). We were
overly pessimistic before, bailing at the first sight of LOOKUP_RCU
On top of that we've got a few CephFS bug fixes, a couple of cleanups
and Arnd's workaround for a weird genksyms issue"
* tag 'ceph-for-4.8-rc1' of git://github.com/ceph/ceph-client: (34 commits)
ceph: fix symbol versioning for ceph_monc_do_statfs
ceph: Correctly return NXIO errors from ceph_llseek
ceph: Mark the file cache as unreclaimable
ceph: optimize cap flush waiting
ceph: cleanup ceph_flush_snaps()
ceph: kick cap flushes before sending other cap message
ceph: introduce an inode flag to indicates if snapflush is needed
ceph: avoid sending duplicated cap flush message
ceph: unify cap flush and snapcap flush
ceph: use list instead of rbtree to track cap flushes
ceph: update types of some local varibles
ceph: include 'follows' of pending snapflush in cap reconnect message
ceph: update cap reconnect message to version 3
ceph: mount non-default filesystem by name
libceph: fsmap.user subscription support
ceph: handle LOOKUP_RCU in ceph_d_revalidate
ceph: allow dentry_lease_is_valid to work under RCU walk
ceph: clear d_fsinfo pointer under d_lock
ceph: remove ceph_mdsc_lease_release
ceph: don't use ->d_time
...
kernel.h header doesn't directly use dynamic debug, instead we can
include it in module.c (which used it via kernel.h). printk.h only uses
it if CONFIG_DYNAMIC_DEBUG is on, changing the inclusion to only happen
in that case.
Link: http://lkml.kernel.org/r/1468429793-16917-1-git-send-email-luisbg@osg.samsung.com
[luisbg@osg.samsung.com: include dynamic_debug.h in drb_int.h]
Link: http://lkml.kernel.org/r/1468447828-18558-2-git-send-email-luisbg@osg.samsung.com
Signed-off-by: Luis de Bethencourt <luisbg@osg.samsung.com>
Cc: Rusty Russell <rusty@rustcorp.com.au>
Cc: Hidehiro Kawai <hidehiro.kawai.ez@hitachi.com>
Cc: Borislav Petkov <bp@suse.de>
Cc: Michal Nazarewicz <mina86@mina86.com>
Cc: Rasmus Villemoes <linux@rasmusvillemoes.dk>
Cc: Joe Perches <joe@perches.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
1/ Replace pcommit with ADR / directed-flushing:
The pcommit instruction, which has not shipped on any product, is
deprecated. Instead, the requirement is that platforms implement either
ADR, or provide one or more flush addresses per nvdimm. ADR
(Asynchronous DRAM Refresh) flushes data in posted write buffers to the
memory controller on a power-fail event. Flush addresses are defined in
ACPI 6.x as an NVDIMM Firmware Interface Table (NFIT) sub-structure:
"Flush Hint Address Structure". A flush hint is an mmio address that
when written and fenced assures that all previous posted writes
targeting a given dimm have been flushed to media.
2/ On-demand ARS (address range scrub):
Linux uses the results of the ACPI ARS commands to track bad blocks
in pmem devices. When latent errors are detected we re-scrub the media
to refresh the bad block list, userspace can also request a re-scrub at
any time.
3/ Support for the Microsoft DSM (device specific method) command format.
4/ Support for EDK2/OVMF virtual disk device memory ranges.
5/ Various fixes and cleanups across the subsystem.
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1
iQIcBAABAgAGBQJXmXBsAAoJEB7SkWpmfYgCEwwP/1IOt9ocP+iHLMDH9KE7VaTZ
NmUDR+Zy6g5cRQM7SgcuU5BXUcx+OsSrSrUTVF1cW994o9Gbz1mFotkv0ZAsPcYY
ZVRQxo2oqHrssyOcg+PsgKWiXn68rJOCgmpEyzaJywl5qTMst7pzsT1s1f7rSh6h
trCf4VaJJwxZR8fARGtlHUnnhPe2Orp99EZRKEWprAsIv2kPuWpPHSjRjuEgN1JG
KW8AYwWqFTtiLRUk86I4KBB0wcDrfctsjgN9Ogd6+aHyQBRnVSr2U+vDCFkC8KLu
qiDCpYp+yyxBjclnljz7tRRT3GtzfCUWd4v2KVWqgg2IaobUc0Lbukp/rmikUXQP
WLikT2OCQ994eFK5OX3Q3cIU/4j459TQnof8q14yVSpjAKrNUXVSR5puN7Hxa+V7
41wKrAsnsyY1oq+Yd/rMR8VfH7PHx3bFkrmRCGZCufLX1UQm4aYj+sWagDKiV3yA
DiudghbOnhfurfGsnXUVw7y7GKs+gNWNBmB6ndAD6ZEHmKoGUhAEbJDLCc3DnANl
b/2mv1MIdIcC1DlCmnbbcn6fv6bICe/r8poK3VrCK3UgOq/EOvKIWl7giP+k1JuC
6DdVYhlNYIVFXUNSLFAwz8OkLu8byx7WDm36iEqrKHtPw+8qa/2bWVgOU6OBgpjV
cN3edFVIdxvZeMgM5Ubq
=xCBG
-----END PGP SIGNATURE-----
Merge tag 'libnvdimm-for-4.8' of git://git.kernel.org/pub/scm/linux/kernel/git/nvdimm/nvdimm
Pull libnvdimm updates from Dan Williams:
- Replace pcommit with ADR / directed-flushing.
The pcommit instruction, which has not shipped on any product, is
deprecated. Instead, the requirement is that platforms implement
either ADR, or provide one or more flush addresses per nvdimm.
ADR (Asynchronous DRAM Refresh) flushes data in posted write buffers
to the memory controller on a power-fail event.
Flush addresses are defined in ACPI 6.x as an NVDIMM Firmware
Interface Table (NFIT) sub-structure: "Flush Hint Address Structure".
A flush hint is an mmio address that when written and fenced assures
that all previous posted writes targeting a given dimm have been
flushed to media.
- On-demand ARS (address range scrub).
Linux uses the results of the ACPI ARS commands to track bad blocks
in pmem devices. When latent errors are detected we re-scrub the
media to refresh the bad block list, userspace can also request a
re-scrub at any time.
- Support for the Microsoft DSM (device specific method) command
format.
- Support for EDK2/OVMF virtual disk device memory ranges.
- Various fixes and cleanups across the subsystem.
* tag 'libnvdimm-for-4.8' of git://git.kernel.org/pub/scm/linux/kernel/git/nvdimm/nvdimm: (41 commits)
libnvdimm-btt: Delete an unnecessary check before the function call "__nd_device_register"
nfit: do an ARS scrub on hitting a latent media error
nfit: move to nfit/ sub-directory
nfit, libnvdimm: allow an ARS scrub to be triggered on demand
libnvdimm: register nvdimm_bus devices with an nd_bus driver
pmem: clarify a debug print in pmem_clear_poison
x86/insn: remove pcommit
Revert "KVM: x86: add pcommit support"
nfit, tools/testing/nvdimm/: unify shutdown paths
libnvdimm: move ->module to struct nvdimm_bus_descriptor
nfit: cleanup acpi_nfit_init calling convention
nfit: fix _FIT evaluation memory leak + use after free
tools/testing/nvdimm: add manufacturing_{date|location} dimm properties
tools/testing/nvdimm: add virtual ramdisk range
acpi, nfit: treat virtual ramdisk SPA as pmem region
pmem: kill __pmem address space
pmem: kill wmb_pmem()
libnvdimm, pmem: use nvdimm_flush() for namespace I/O writes
fs/dax: remove wmb_pmem()
libnvdimm, pmem: flush posted-write queues on shutdown
...
Pull vfs updates from Al Viro:
"Assorted cleanups and fixes.
Probably the most interesting part long-term is ->d_init() - that will
have a bunch of followups in (at least) ceph and lustre, but we'll
need to sort the barrier-related rules before it can get used for
really non-trivial stuff.
Another fun thing is the merge of ->d_iput() callers (dentry_iput()
and dentry_unlink_inode()) and a bunch of ->d_compare() ones (all
except the one in __d_lookup_lru())"
* 'work.misc' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs: (26 commits)
fs/dcache.c: avoid soft-lockup in dput()
vfs: new d_init method
vfs: Update lookup_dcache() comment
bdev: get rid of ->bd_inodes
Remove last traces of ->sync_page
new helper: d_same_name()
dentry_cmp(): use lockless_dereference() instead of smp_read_barrier_depends()
vfs: clean up documentation
vfs: document ->d_real()
vfs: merge .d_select_inode() into .d_real()
unify dentry_iput() and dentry_unlink_inode()
binfmt_misc: ->s_root is not going anywhere
drop redundant ->owner initializations
ufs: get rid of redundant checks
orangefs: constify inode_operations
missed comment updates from ->direct_IO() prototype change
file_inode(f)->i_mapping is f->f_mapping
trim fsnotify hooks a bit
9p: new helper - v9fs_parent_fid()
debugfs: ->d_parent is never NULL or negative
...
Add pool namesapce pointer to struct ceph_file_layout and struct
ceph_object_locator. Pool namespace is used by when mapping object
to PG, it's also used when composing OSD request.
The namespace pointer in struct ceph_file_layout is RCU protected.
So libceph can read namespace without taking lock.
Signed-off-by: Yan, Zheng <zyan@redhat.com>
[idryomov@gmail.com: ceph_oloc_destroy(), misc minor changes]
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
Define new ceph_file_layout structure and rename old ceph_file_layout
to ceph_file_layout_legacy. This is preparation for adding namespace
to ceph_file_layout structure.
Signed-off-by: Yan, Zheng <zyan@redhat.com>
- ACPI support for guests on ARM platforms.
- Generic steal time support for arm and x86.
- Support cases where kernel cpu is not Xen VCPU number (e.g., if
in-guest kexec is used).
- Use the system workqueue instead of a custom workqueue in various
places.
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1
iQEcBAABAgAGBQJXmLlrAAoJEFxbo/MsZsTRvRQH/1wOMF8BmlbZfR7H3qwDfjst
ApNifCiZE08xDtWBlwUaBFAQxyflQS9BBiNZDVK0sysIdXeOdpWV7V0ZjRoLL+xr
czsaaGXDcmXxJxApoMDVuT7FeP6rEk6LVAYRoHpVjJjMZGW3BbX1vZaMW4DXl2WM
9YNaF2Lj+rpc1f8iG31nUxwkpmcXFog6ct4tu7HiyCFT3hDkHt/a4ghuBdQItCkd
vqBa1pTpcGtQBhSmWzlylN/PV2+NKcRd+kGiwd09/O/rNzogTMCTTWeHKAtMpPYb
Cu6oSqJtlK5o0vtr0qyLSWEGIoyjE2gE92s0wN3iCzFY1PldqdsxUO622nIj+6o=
=G6q3
-----END PGP SIGNATURE-----
Merge tag 'for-linus-4.8-rc0-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/xen/tip
Pull xen updates from David Vrabel:
"Features and fixes for 4.8-rc0:
- ACPI support for guests on ARM platforms.
- Generic steal time support for arm and x86.
- Support cases where kernel cpu is not Xen VCPU number (e.g., if
in-guest kexec is used).
- Use the system workqueue instead of a custom workqueue in various
places"
* tag 'for-linus-4.8-rc0-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/xen/tip: (47 commits)
xen: add static initialization of steal_clock op to xen_time_ops
xen/pvhvm: run xen_vcpu_setup() for the boot CPU
xen/evtchn: use xen_vcpu_id mapping
xen/events: fifo: use xen_vcpu_id mapping
xen/events: use xen_vcpu_id mapping in events_base
x86/xen: use xen_vcpu_id mapping when pointing vcpu_info to shared_info
x86/xen: use xen_vcpu_id mapping for HYPERVISOR_vcpu_op
xen: introduce xen_vcpu_id mapping
x86/acpi: store ACPI ids from MADT for future usage
x86/xen: update cpuid.h from Xen-4.7
xen/evtchn: add IOCTL_EVTCHN_RESTRICT
xen-blkback: really don't leak mode property
xen-blkback: constify instance of "struct attribute_group"
xen-blkfront: prefer xenbus_scanf() over xenbus_gather()
xen-blkback: prefer xenbus_scanf() over xenbus_gather()
xen: support runqueue steal time on xen
arm/xen: add support for vm_assist hypercall
xen: update xen headers
xen-pciback: drop superfluous variables
xen-pciback: short-circuit read path used for merging write values
...
Merge updates from Andrew Morton:
- a few misc bits
- ocfs2
- most(?) of MM
* emailed patches from Andrew Morton <akpm@linux-foundation.org>: (125 commits)
thp: fix comments of __pmd_trans_huge_lock()
cgroup: remove unnecessary 0 check from css_from_id()
cgroup: fix idr leak for the first cgroup root
mm: memcontrol: fix documentation for compound parameter
mm: memcontrol: remove BUG_ON in uncharge_list
mm: fix build warnings in <linux/compaction.h>
mm, thp: convert from optimistic swapin collapsing to conservative
mm, thp: fix comment inconsistency for swapin readahead functions
thp: update Documentation/{vm/transhuge,filesystems/proc}.txt
shmem: split huge pages beyond i_size under memory pressure
thp: introduce CONFIG_TRANSPARENT_HUGE_PAGECACHE
khugepaged: add support of collapse for tmpfs/shmem pages
shmem: make shmem_inode_info::lock irq-safe
khugepaged: move up_read(mmap_sem) out of khugepaged_alloc_page()
thp: extract khugepaged from mm/huge_memory.c
shmem, thp: respect MADV_{NO,}HUGEPAGE for file mappings
shmem: add huge pages support
shmem: get_unmapped_area align huge page
shmem: prepare huge= mount option and sysfs knob
mm, rmap: account shmem thp pages
...
We now allocate streams from CPU_UP hot-plug path, there are no
context-dependent stream allocations anymore and we can schedule from
zcomp_strm_alloc(). Use GFP_KERNEL directly and drop a gfp_t parameter.
Link: http://lkml.kernel.org/r/20160531122017.2878-9-sergey.senozhatsky@gmail.com
Signed-off-by: Sergey Senozhatsky <sergey.senozhatsky@gmail.com>
Acked-by: Minchan Kim <minchan@kernel.org>
Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Add "deflate", "lz4hc", "842" algorithms to the list of known
compression backends. The real availability of those algorithms,
however, depends on the corresponding CONFIG_CRYPTO_FOO config options.
[sergey.senozhatsky@gmail.com: zram-add-more-compression-algorithms-v3]
Link: http://lkml.kernel.org/r/20160604024902.11778-7-sergey.senozhatsky@gmail.com
Link: http://lkml.kernel.org/r/20160531122017.2878-8-sergey.senozhatsky@gmail.com
Signed-off-by: Sergey Senozhatsky <sergey.senozhatsky@gmail.com>
Acked-by: Minchan Kim <minchan@kernel.org>
Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
There is no way to get a string with all the crypto comp algorithms
supported by the crypto comp engine, so we need to maintain our own
backends list. At the same time we additionally need to use
crypto_has_comp() to make sure that the user has requested a compression
algorithm that is recognized by the crypto comp engine. Relying on
/proc/crypto is not an options here, because it does not show
not-yet-inserted compression modules.
Example:
modprobe zram
cat /proc/crypto | grep -i lz4
modprobe lz4
cat /proc/crypto | grep -i lz4
name : lz4
driver : lz4-generic
module : lz4
So the user can't tell exactly if the lz4 is really supported from
/proc/crypto output, unless someone or something has loaded it.
This patch also adds crypto_has_comp() to zcomp_available_show(). We
store all the compression algorithms names in zcomp's `backends' array,
regardless the CONFIG_CRYPTO_FOO configuration, but show only those that
are also supported by crypto engine. This helps user to know the exact
list of compression algorithms that can be used.
Example:
module lz4 is not loaded yet, but is supported by the crypto
engine. /proc/crypto has no information on this module, while
zram's `comp_algorithm' lists it:
cat /proc/crypto | grep -i lz4
cat /sys/block/zram0/comp_algorithm
[lzo] lz4 deflate lz4hc 842
We still use the `backends' array to determine if the requested
compression backend is known to crypto api. This array, however, may not
contain some entries, therefore as the last step we call crypto_has_comp()
function which attempts to insmod the requested compression algorithm to
determine if crypto api supports it. The advantage of this method is that
now we permit the usage of out-of-tree crypto compression modules
(implementing S/W or H/W compression).
[sergey.senozhatsky@gmail.com: zram-use-crypto-api-to-check-alg-availability-v3]
Link: http://lkml.kernel.org/r/20160604024902.11778-4-sergey.senozhatsky@gmail.com
Link: http://lkml.kernel.org/r/20160531122017.2878-5-sergey.senozhatsky@gmail.com
Signed-off-by: Sergey Senozhatsky <sergey.senozhatsky@gmail.com>
Acked-by: Minchan Kim <minchan@kernel.org>
Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
Signed-off-by: Sergey Senozhatsky <sergey.senozhatsky@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
This has started as a 'add zlib support' work, but after some thinking I
saw no blockers for a bigger change -- a switch to crypto API.
We don't have an idle zstreams list anymore and our write path now works
absolutely differently, preventing preemption during compression. This
removes possibilities of read paths preempting writes at wrong places
and opens the door for a move from custom LZO/LZ4 compression backends
implementation to a more generic one, using crypto compress API.
This patch set also eliminates the need of a new context-less crypto API
interface, which was quite hard to sell, so we can move along faster.
benchmarks:
(x86_64, 4GB, zram-perf script)
perf reported run-time fio (max jobs=3). I performed fio test with the
increasing number of parallel jobs (max to 3) on a 3G zram device, using
`static' data and the following crypto comp algorithms:
842, deflate, lz4, lz4hc, lzo
the output was:
- test running time (which can tell us what algorithms performs faster)
and
- zram mm_stat (which tells the compressed memory size, max used memory, etc).
It's just for information. for example, LZ4HC has twice the running
time of LZO, but the compressed memory size is: 23592960 vs 34603008
bytes.
test-fio-zram-842
197.907655282 seconds time elapsed
201.623142884 seconds time elapsed
226.854291345 seconds time elapsed
test-fio-zram-DEFLATE
253.259516155 seconds time elapsed
258.148563401 seconds time elapsed
290.251909365 seconds time elapsed
test-fio-zram-LZ4
27.022598717 seconds time elapsed
29.580522717 seconds time elapsed
33.293463430 seconds time elapsed
test-fio-zram-LZ4HC
56.393954615 seconds time elapsed
74.904659747 seconds time elapsed
101.940998564 seconds time elapsed
test-fio-zram-LZO
28.155948075 seconds time elapsed
30.390036330 seconds time elapsed
34.455773159 seconds time elapsed
zram mm_stat-s (max fio jobs=3)
test-fio-zram-842
mm_stat (jobs1): 3221225472 673185792 690266112 0 690266112 0 0
mm_stat (jobs2): 3221225472 673185792 690266112 0 690266112 0 0
mm_stat (jobs3): 3221225472 673185792 690266112 0 690266112 0 0
test-fio-zram-DEFLATE
mm_stat (jobs1): 3221225472 24379392 37761024 0 37761024 0 0
mm_stat (jobs2): 3221225472 24379392 37761024 0 37761024 0 0
mm_stat (jobs3): 3221225472 24379392 37761024 0 37761024 0 0
test-fio-zram-LZ4
mm_stat (jobs1): 3221225472 23592960 37761024 0 37761024 0 0
mm_stat (jobs2): 3221225472 23592960 37761024 0 37761024 0 0
mm_stat (jobs3): 3221225472 23592960 37761024 0 37761024 0 0
test-fio-zram-LZ4HC
mm_stat (jobs1): 3221225472 23592960 37761024 0 37761024 0 0
mm_stat (jobs2): 3221225472 23592960 37761024 0 37761024 0 0
mm_stat (jobs3): 3221225472 23592960 37761024 0 37761024 0 0
test-fio-zram-LZO
mm_stat (jobs1): 3221225472 34603008 50335744 0 50335744 0 0
mm_stat (jobs2): 3221225472 34603008 50335744 0 50335744 0 0
mm_stat (jobs3): 3221225472 34603008 50335744 0 50339840 0 0
This patch (of 8):
We don't perform any zstream idle list lookup anymore, so
zcomp_strm_find()/zcomp_strm_release() names are not representative.
Rename to zcomp_stream_get()/zcomp_stream_put().
Link: http://lkml.kernel.org/r/20160531122017.2878-2-sergey.senozhatsky@gmail.com
Signed-off-by: Sergey Senozhatsky <sergey.senozhatsky@gmail.com>
Acked-by: Minchan Kim <minchan@kernel.org>
Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Pull block driver updates from Jens Axboe:
"This branch also contains core changes. I've come to the conclusion
that from 4.9 and forward, I'll be doing just a single branch. We
often have dependencies between core and drivers, and it's hard to
always split them up appropriately without pulling core into drivers
when that happens.
That said, this contains:
- separate secure erase type for the core block layer, from
Christoph.
- set of discard fixes, from Christoph.
- bio shrinking fixes from Christoph, as a followup up to the
op/flags change in the core branch.
- map and append request fixes from Christoph.
- NVMeF (NVMe over Fabrics) code from Christoph. This is pretty
exciting!
- nvme-loop fixes from Arnd.
- removal of ->driverfs_dev from Dan, after providing a
device_add_disk() helper.
- bcache fixes from Bhaktipriya and Yijing.
- cdrom subchannel read fix from Vchannaiah.
- set of lightnvm updates from Wenwei, Matias, Johannes, and Javier.
- set of drbd updates and fixes from Fabian, Lars, and Philipp.
- mg_disk error path fix from Bart.
- user notification for failed device add for loop, from Minfei.
- NVMe in general:
+ NVMe delay quirk from Guilherme.
+ SR-IOV support and command retry limits from Keith.
+ fix for memory-less NUMA node from Masayoshi.
+ use UINT_MAX for discard sectors, from Minfei.
+ cancel IO fixes from Ming.
+ don't allocate unused major, from Neil.
+ error code fixup from Dan.
+ use constants for PSDT/FUSE from James.
+ variable init fix from Jay.
+ fabrics fixes from Ming, Sagi, and Wei.
+ various fixes"
* 'for-4.8/drivers' of git://git.kernel.dk/linux-block: (115 commits)
nvme/pci: Provide SR-IOV support
nvme: initialize variable before logical OR'ing it
block: unexport various bio mapping helpers
scsi/osd: open code blk_make_request
target: stop using blk_make_request
block: simplify and export blk_rq_append_bio
block: ensure bios return from blk_get_request are properly initialized
virtio_blk: use blk_rq_map_kern
memstick: don't allow REQ_TYPE_BLOCK_PC requests
block: shrink bio size again
block: simplify and cleanup bvec pool handling
block: get rid of bio_rw and READA
block: don't ignore -EOPNOTSUPP blkdev_issue_write_same
block: introduce BLKDEV_DISCARD_ZERO to fix zeroout
NVMe: don't allocate unused nvme_major
nvme: avoid crashes when node 0 is memoryless node.
nvme: Limit command retries
loop: Make user notify for adding loop device failed
nvme-loop: fix nvme-loop Kconfig dependencies
nvmet: fix return value check in nvmet_subsys_alloc()
...
Pull core block updates from Jens Axboe:
- the big change is the cleanup from Mike Christie, cleaning up our
uses of command types and modified flags. This is what will throw
some merge conflicts
- regression fix for the above for btrfs, from Vincent
- following up to the above, better packing of struct request from
Christoph
- a 2038 fix for blktrace from Arnd
- a few trivial/spelling fixes from Bart Van Assche
- a front merge check fix from Damien, which could cause issues on
SMR drives
- Atari partition fix from Gabriel
- convert cfq to highres timers, since jiffies isn't granular enough
for some devices these days. From Jan and Jeff
- CFQ priority boost fix idle classes, from me
- cleanup series from Ming, improving our bio/bvec iteration
- a direct issue fix for blk-mq from Omar
- fix for plug merging not involving the IO scheduler, like we do for
other types of merges. From Tahsin
- expose DAX type internally and through sysfs. From Toshi and Yigal
* 'for-4.8/core' of git://git.kernel.dk/linux-block: (76 commits)
block: Fix front merge check
block: do not merge requests without consulting with io scheduler
block: Fix spelling in a source code comment
block: expose QUEUE_FLAG_DAX in sysfs
block: add QUEUE_FLAG_DAX for devices to advertise their DAX support
Btrfs: fix comparison in __btrfs_map_block()
block: atari: Return early for unsupported sector size
Doc: block: Fix a typo in queue-sysfs.txt
cfq-iosched: Charge at least 1 jiffie instead of 1 ns
cfq-iosched: Fix regression in bonnie++ rewrite performance
cfq-iosched: Convert slice_resid from u64 to s64
block: Convert fifo_time from ulong to u64
blktrace: avoid using timespec
block/blk-cgroup.c: Declare local symbols static
block/bio-integrity.c: Add #include "blk.h"
block/partition-generic.c: Remove a set-but-not-used variable
block: bio: kill BIO_MAX_SIZE
cfq-iosched: temporarily boost queue priority for idle classes
block: drbd: avoid to use BIO_MAX_SIZE
block: bio: remove BIO_MAX_SECTORS
...
Commit 9d092603cc ("xen-blkback: do not leak mode property") left one
path unfixed; correct this.
Acked-by: Jens Axboe <axboe@kernel.dk>
Acked-by: Roger Pau Monné <roger.pau@citrix.com>
Signed-off-by: Jan Beulich <jbeulich@suse.com>
Signed-off-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
The functions these get passed to have been taking pointers to const
since at least 2.6.16.
Acked-by: Jens Axboe <axboe@kernel.dk>
Acked-by: Roger Pau Monné <roger.pau@citrix.com>
Signed-off-by: Jan Beulich <jbeulich@suse.com>
Signed-off-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
... for single items being collected: It is more typesafe (as the
compiler can check format string and to-be-written-to variable match)
and requires one less parameter to be passed.
Acked-by: Roger Pau Monné <roger.pau@citrix.com>
Acked-by: Jens Axboe <axboe@kernel.dk>
Signed-off-by: Jan Beulich <jbeulich@suse.com>
Signed-off-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
... for single items being collected: It is more typesafe (as the
compiler can check format string and to-be-written-to variable match)
and requires one less parameter to be passed.
Signed-off-by: Jan Beulich <jbeulich@suse.com>
Signed-off-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
Acked-by: Roger Pau Monné <roger.pau@citrix.com>
Acked-by: Jens Axboe <axboe@kernel.dk>
Currently, presence of direct_access() in block_device_operations
indicates support of DAX on its block device. Because
block_device_operations is instantiated with 'const', this DAX
capablity may not be enabled conditinally.
In preparation for supporting DAX to device-mapper devices, add
QUEUE_FLAG_DAX to request_queue flags to advertise their DAX
support. This will allow to set the DAX capability based on how
mapped device is composed.
Signed-off-by: Toshi Kani <toshi.kani@hpe.com>
Acked-by: Dan Williams <dan.j.williams@intel.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
Cc: Jens Axboe <axboe@kernel.dk>
Cc: Ross Zwisler <ross.zwisler@linux.intel.com>
Cc: Martin Schwidefsky <schwidefsky@de.ibm.com>
Cc: Heiko Carstens <heiko.carstens@de.ibm.com>
Cc: <linux-s390@vger.kernel.org>
Signed-off-by: Jens Axboe <axboe@fb.com>
blk_get_request is used for BLOCK_PC and similar passthrough requests.
Currently we always need to call blk_rq_set_block_pc or an open coded
version of it to allow appending bios using the request mapping helpers
later on, which is a somewhat awkward API. Instead move the
initialization part of blk_rq_set_block_pc into blk_get_request, so that
we always have a safe to use request.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@fb.com>
Similar to how SCSI and NVMe prepare passthrough requests. This avoids
poking into request internals too much.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@fb.com>
These two are confusing leftover of the old world order, combining
values of the REQ_OP_ and REQ_ namespaces. For callers that don't
special case we mostly just replace bi_rw with bio_data_dir or
op_is_write, except for the few cases where a switch over the REQ_OP_
values makes more sense. Any check for READA is replaced with an
explicit check for REQ_RAHEAD. Also remove the READA alias for
REQ_RAHEAD.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Johannes Thumshirn <jthumshirn@suse.de>
Reviewed-by: Mike Christie <mchristi@redhat.com>
Signed-off-by: Jens Axboe <axboe@fb.com>
The __pmem address space was meant to annotate codepaths that touch
persistent memory and need to coordinate a call to wmb_pmem(). Now that
wmb_pmem() is gone, there is little need to keep this annotation.
Cc: Christoph Hellwig <hch@lst.de>
Cc: Ross Zwisler <ross.zwisler@linux.intel.com>
Signed-off-by: Dan Williams <dan.j.williams@intel.com>
There is no error number returned if loop driver fails in function
alloc_disk to add new loop device. Add a correct error number to make
user notify in this case.
Signed-off-by: Minfei Huang <mnghuan@gmail.com>
Reviewed-by: Ming Lei <tom.leiming@gmail.com>
Reviewed-by: Jeff Moyer <jmoyer@redhat.com>
Signed-off-by: Jens Axboe <axboe@fb.com>
Dan writes:
"The removal of ->driverfs_dev in favor of just passing the parent
device in as a parameter to add_disk(). See below, it has received a
"Reviewed-by" from Christoph, Bart, and Johannes.
It is also a pre-requisite for Fam Zheng's work to cleanup gendisk
uevents vs attribute visibility [1]. We would extend device_add_disk()
to take an attribute_group list.
This is based off a branch of block.git/for-4.8/drivers and has
received a positive build success notification from the kbuild robot
across several configs.
[1]: "gendisk: Generate uevent after attribute available"
http://marc.info/?l=linux-virtualization&m=146725201522201&w=2"
Pull block IO fixes from Jens Axboe:
"Three small fixes that have been queued up and tested for this series:
- A bug fix for xen-blkfront from Bob Liu, fixing an issue with
incomplete requests during migration.
- A fix for an ancient issue in retrieving the IO priority of a
different PID than self, preventing that task from going away while
we access it. From Omar.
- A writeback fix from Tahsin, fixing a case where we'd call ihold()
with a zero ref count inode"
* 'for-linus' of git://git.kernel.dk/linux-block:
block: fix use-after-free in sys_ioprio_get()
writeback: inode cgroup wb switch should not call ihold()
xen-blkfront: save uncompleted reqs in blkfront_resume()
Uncompleted reqs used to be 'saved and resubmitted' in blkfront_recover() during
migration, but that's too late after multi-queue was introduced.
After a migrate to another host (which may not have multiqueue support), the
number of rings (block hardware queues) may be changed and the ring and shadow
structure will also be reallocated.
The blkfront_recover() then can't 'save and resubmit' the real
uncompleted reqs because shadow structure have been reallocated.
This patch fixes this issue by moving the 'save' logic out of
blkfront_recover() to earlier place in blkfront_resume().
The 'resubmit' is not changed and still in blkfront_recover().
Signed-off-by: Bob Liu <bob.liu@oracle.com>
Signed-off-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
Cc: stable@vger.kernel.org
MG_DISK_MAJ is defined as 0 so dynamic block major number
allocation is used by the driver and the assigned major
number is stored in host->major. This patch fixes error
path in mg_probe() to use host->major instead of using
MG_DISK_MAJ.
Cc: unsik Kim <donari75@gmail.com>
Signed-off-by: Bartlomiej Zolnierkiewicz <b.zolnierkie@samsung.com>
Signed-off-by: Jens Axboe <axboe@fb.com>
For block drivers that specify a parent device, convert them to use
device_add_disk().
This conversion was done with the following semantic patch:
@@
struct gendisk *disk;
expression E;
@@
- disk->driverfs_dev = E;
...
- add_disk(disk);
+ device_add_disk(E, disk);
@@
struct gendisk *disk;
expression E1, E2;
@@
- disk->driverfs_dev = E1;
...
E2 = disk;
...
- add_disk(E2);
+ device_add_disk(E1, E2);
...plus some manual fixups for a few missed conversions.
Cc: Jens Axboe <axboe@fb.com>
Cc: Keith Busch <keith.busch@intel.com>
Cc: Michael S. Tsirkin <mst@redhat.com>
Cc: David Woodhouse <dwmw2@infradead.org>
Cc: David S. Miller <davem@davemloft.net>
Cc: James Bottomley <James.Bottomley@hansenpartnership.com>
Cc: Ross Zwisler <ross.zwisler@linux.intel.com>
Cc: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
Cc: Martin K. Petersen <martin.petersen@oracle.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Johannes Thumshirn <jthumshirn@suse.de>
Signed-off-by: Dan Williams <dan.j.williams@intel.com>
This is the third version of the patchset previously sent [1]. I have
basically only rebased it on top of 4.7-rc1 tree and dropped "dm: get
rid of superfluous gfp flags" which went through dm tree. I am sending
it now because it is tree wide and chances for conflicts are reduced
considerably when we want to target rc2. I plan to send the next step
and rename the flag and move to a better semantic later during this
release cycle so we will have a new semantic ready for 4.8 merge window
hopefully.
Motivation:
While working on something unrelated I've checked the current usage of
__GFP_REPEAT in the tree. It seems that a majority of the usage is and
always has been bogus because __GFP_REPEAT has always been about costly
high order allocations while we are using it for order-0 or very small
orders very often. It seems that a big pile of them is just a
copy&paste when a code has been adopted from one arch to another.
I think it makes some sense to get rid of them because they are just
making the semantic more unclear. Please note that GFP_REPEAT is
documented as
* __GFP_REPEAT: Try hard to allocate the memory, but the allocation attempt
* _might_ fail. This depends upon the particular VM implementation.
while !costly requests have basically nofail semantic. So one could
reasonably expect that order-0 request with __GFP_REPEAT will not loop
for ever. This is not implemented right now though.
I would like to move on with __GFP_REPEAT and define a better semantic
for it.
$ git grep __GFP_REPEAT origin/master | wc -l
111
$ git grep __GFP_REPEAT | wc -l
36
So we are down to the third after this patch series. The remaining
places really seem to be relying on __GFP_REPEAT due to large allocation
requests. This still needs some double checking which I will do later
after all the simple ones are sorted out.
I am touching a lot of arch specific code here and I hope I got it right
but as a matter of fact I even didn't compile test for some archs as I
do not have cross compiler for them. Patches should be quite trivial to
review for stupid compile mistakes though. The tricky parts are usually
hidden by macro definitions and thats where I would appreciate help from
arch maintainers.
[1] http://lkml.kernel.org/r/1461849846-27209-1-git-send-email-mhocko@kernel.org
This patch (of 19):
__GFP_REPEAT has a rather weak semantic but since it has been introduced
around 2.6.12 it has been ignored for low order allocations. Yet we
have the full kernel tree with its usage for apparently order-0
allocations. This is really confusing because __GFP_REPEAT is
explicitly documented to allow allocation failures which is a weaker
semantic than the current order-0 has (basically nofail).
Let's simply drop __GFP_REPEAT from those places. This would allow to
identify place which really need allocator to retry harder and formulate
a more specific semantic for what the flag is supposed to do actually.
Link: http://lkml.kernel.org/r/1464599699-30131-2-git-send-email-mhocko@kernel.org
Signed-off-by: Michal Hocko <mhocko@suse.com>
Cc: "David S. Miller" <davem@davemloft.net>
Cc: "H. Peter Anvin" <hpa@zytor.com>
Cc: "James E.J. Bottomley" <jejb@parisc-linux.org>
Cc: "Theodore Ts'o" <tytso@mit.edu>
Cc: Andy Lutomirski <luto@kernel.org>
Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Cc: Catalin Marinas <catalin.marinas@arm.com>
Cc: Chen Liqin <liqin.linux@gmail.com>
Cc: Chris Metcalf <cmetcalf@mellanox.com> [for tile]
Cc: Guan Xuetao <gxt@mprc.pku.edu.cn>
Cc: Heiko Carstens <heiko.carstens@de.ibm.com>
Cc: Helge Deller <deller@gmx.de>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Jan Kara <jack@suse.cz>
Cc: John Crispin <blogic@openwrt.org>
Cc: Lennox Wu <lennox.wu@gmail.com>
Cc: Ley Foon Tan <lftan@altera.com>
Cc: Martin Schwidefsky <schwidefsky@de.ibm.com>
Cc: Matt Fleming <matt@codeblueprint.co.uk>
Cc: Ralf Baechle <ralf@linux-mips.org>
Cc: Rich Felker <dalias@libc.org>
Cc: Russell King <linux@arm.linux.org.uk>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Vineet Gupta <vgupta@synopsys.com>
Cc: Will Deacon <will.deacon@arm.com>
Cc: Yoshinori Sato <ysato@users.sourceforge.jp>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
crypto_alloc_hash returns an ERR_PTR(), not NULL.
Also reset peer_integrity_tfm to NULL, to not call crypto_free_hash()
on an errno in the cleanup path.
Reported-by: Insu Yun <wuninsu@gmail.com>
Signed-off-by: Philipp Reisner <philipp.reisner@linbit.com>
Signed-off-by: Lars Ellenberg <lars.ellenberg@linbit.com>
Signed-off-by: Jens Axboe <axboe@fb.com>
For larger devices, the array of bitmap page pointers can grow very
large (8000 pointers per TB of storage).
For each activity log transaction, we need to flush the associated
bitmap pages to stable storage. Currently, we just "mark" the respective
pages while setting up the transaction, then tell the bitmap code to
write out all marked pages, but skip unchanged pages.
But one such transaction can affect only a small number of bitmap pages,
there is no need to scan the full array of several (ten-)thousand
page pointers to find the few marked ones.
Instead, remember the index numbers of the few affected pages,
and later only re-check those to skip duplicates and unchanged ones.
Signed-off-by: Philipp Reisner <philipp.reisner@linbit.com>
Signed-off-by: Lars Ellenberg <lars.ellenberg@linbit.com>
Signed-off-by: Jens Axboe <axboe@fb.com>
Also skip the message unless bitmap IO took longer than 5 ms.
Signed-off-by: Philipp Reisner <philipp.reisner@linbit.com>
Signed-off-by: Lars Ellenberg <lars.ellenberg@linbit.com>
Signed-off-by: Jens Axboe <axboe@fb.com>
This should silence a warning about an empty statement. Thanks to Fabian
Frederick <fabf@skynet.be> who sent a patch I modified to be smaller and
avoids an additional indent level.
Signed-off-by: Roland Kammerer <roland.kammerer@linbit.com>
Signed-off-by: Philipp Reisner <philipp.reisner@linbit.com>
Signed-off-by: Jens Axboe <axboe@fb.com>
This contains various cosmetic fixes ranging from simple typos to
const-ifying, and using booleans properly.
Original commit messages from Fabian's patch set:
drbd: debugfs: constify drbd_version_fops
drbd: use seq_put instead of seq_print where possible
drbd: include linux/uaccess.h instead of asm/uaccess.h
drbd: use const char * const for drbd strings
drbd: kerneldoc warning fix in w_e_end_data_req()
drbd: use unsigned for one bit fields
drbd: use bool for peer is_ states
drbd: fix typo
drbd: use | for bitmask combination
drbd: use true/false for bool
drbd: fix drbd_bm_init() comments
drbd: introduce peer state union
drbd: fix maybe_pull_ahead() locking comments
drbd: use bool for growing
drbd: remove redundant declarations
drbd: replace if/BUG by BUG_ON
Signed-off-by: Fabian Frederick <fabf@skynet.be>
Signed-off-by: Roland Kammerer <roland.kammerer@linbit.com>
Signed-off-by: Jens Axboe <axboe@fb.com>
Scenario, starting with normal operation
Connected Primary/Secondary UpToDate/UpToDate
NetworkFailure Primary/Unknown UpToDate/DUnknown (frozen)
... more failures happen, secondary loses it's disk,
but eventually is able to re-establish the replication link ...
Connected Primary/Secondary UpToDate/Diskless (resumed; needs to bump uuid!)
We used to just resume/resent suspended requests,
without bumping the UUID.
Which will lead to problems later, when we want to re-attach the disk on
the peer, without first disconnecting, or if we experience additional
failures, because we now have diverging data without being able to
recognize it.
Make sure we also bump the current data generation UUID,
if we notice "peer disk unknown" -> "peer disk known bad".
Signed-off-by: Philipp Reisner <philipp.reisner@linbit.com>
Signed-off-by: Lars Ellenberg <lars.ellenberg@linbit.com>
Signed-off-by: Jens Axboe <axboe@fb.com>
We already serialize connection state changes,
and other, non-connection state changes (role changes)
while we are establishing a connection.
But if we have an established connection,
then trigger a resync handshake (by primary --force or similar),
until now we just had to be "lucky".
Consider this sequence (e.g. deployment scenario):
create-md; up;
-> Connected Secondary/Secondary Inconsistent/Inconsistent
then do a racy primary --force on both peers.
block drbd0: drbd_sync_handshake:
block drbd0: self 0000000000000004:0000000000000000:0000000000000000:0000000000000000 bits:25590 flags:0
block drbd0: peer 0000000000000004:0000000000000000:0000000000000000:0000000000000000 bits:25590 flags:0
block drbd0: peer( Unknown -> Secondary ) conn( WFReportParams -> Connected ) pdsk( DUnknown -> Inconsistent )
block drbd0: peer( Secondary -> Primary ) pdsk( Inconsistent -> UpToDate )
*** HERE things go wrong. ***
block drbd0: role( Secondary -> Primary )
block drbd0: drbd_sync_handshake:
block drbd0: self 0000000000000005:0000000000000000:0000000000000000:0000000000000000 bits:25590 flags:0
block drbd0: peer C90D2FC716D232AB:0000000000000004:0000000000000000:0000000000000000 bits:25590 flags:0
block drbd0: Becoming sync target due to disk states.
block drbd0: Writing the whole bitmap, full sync required after drbd_sync_handshake.
block drbd0: Remote failed to finish a request within 6007ms > ko-count (2) * timeout (30 * 0.1s)
drbd s0: peer( Primary -> Unknown ) conn( Connected -> Timeout ) pdsk( UpToDate -> DUnknown )
The problem here is that the local promotion happens before the sync handshake
triggered by the remote promotion was completed. Some assumptions elsewhere
become wrong, and when the expected resync handshake is then received and
processed, we get stuck in a deadlock, which can only be recovered by reboot :-(
Fix: if we know the peer has good data,
and our own disk is present, but NOT good,
and there is no resync going on yet,
we expect a sync handshake to happen "soon".
So reject a racy promotion with SS_IN_TRANSIENT_STATE.
Result:
... as above ...
block drbd0: peer( Secondary -> Primary ) pdsk( Inconsistent -> UpToDate )
*** local promotion being postponed until ... ***
block drbd0: drbd_sync_handshake:
block drbd0: self 0000000000000004:0000000000000000:0000000000000000:0000000000000000 bits:25590 flags:0
block drbd0: peer 77868BDA836E12A5:0000000000000004:0000000000000000:0000000000000000 bits:25590 flags:0
...
block drbd0: conn( WFBitMapT -> WFSyncUUID )
block drbd0: updated sync uuid 85D06D0E8887AD44:0000000000000000:0000000000000000:0000000000000000
block drbd0: conn( WFSyncUUID -> SyncTarget )
*** ... after the resync handshake ***
block drbd0: role( Secondary -> Primary )
Signed-off-by: Philipp Reisner <philipp.reisner@linbit.com>
Signed-off-by: Lars Ellenberg <lars.ellenberg@linbit.com>
Signed-off-by: Jens Axboe <axboe@fb.com>
If in a two-primary scenario, we lost our peer, freeze IO,
and are still frozen (no UUID rotation) when the peer comes back
as Secondary after a hard crash, we will see identical UUIDs.
The "rule_nr = 40" chose to use the "CRASHED_PRIMARY" bit as
arbitration, but that would cause the still running (but frozen) Primary
to become SyncTarget (which it typically refuses), and the handshake is
declined.
Fix: check current roles.
If we have *one* current primary, the Primary wins.
(rule_nr = 41)
Since that is a protocol change, use the newly introduced DRBD_FF_WSAME
to determine if rule_nr = 41 can be applied.
Signed-off-by: Philipp Reisner <philipp.reisner@linbit.com>
Signed-off-by: Lars Ellenberg <lars.ellenberg@linbit.com>
Signed-off-by: Jens Axboe <axboe@fb.com>
We will support WRITE_SAME, if
* all peers support WRITE_SAME (both in kernel and DRBD version),
* all peer devices support WRITE_SAME
* logical_block_size is identical on all peers.
We may at some point introduce a fallback on the receiving side
for devices/kernels that do not support WRITE_SAME,
by open-coding a submit loop. But not yet.
Signed-off-by: Philipp Reisner <philipp.reisner@linbit.com>
Signed-off-by: Lars Ellenberg <lars.ellenberg@linbit.com>
Signed-off-by: Jens Axboe <axboe@fb.com>
Even if discard_zeroes_data != 0,
if discard_zeroes_if_aligned is set, we assume we can reliably
zero-out/discard using the drbd_issue_peer_discard() helper.
Signed-off-by: Philipp Reisner <philipp.reisner@linbit.com>
Signed-off-by: Lars Ellenberg <lars.ellenberg@linbit.com>
Signed-off-by: Jens Axboe <axboe@fb.com>
When re-attaching the local backend device to a C_STANDALONE D_DISKLESS
R_PRIMARY with OND_SUSPEND_IO, we may only resume IO if we recognize the
backend that is being attached as D_UP_TO_DATE.
Signed-off-by: Philipp Reisner <philipp.reisner@linbit.com>
Signed-off-by: Lars Ellenberg <lars.ellenberg@linbit.com>
Signed-off-by: Jens Axboe <axboe@fb.com>
If DRBD lost all path to good data,
and the on-no-data-accessible policy is OND_SUSPEND_IO,
all pending and new IO requests are suspended (will block).
If that setting is OND_IO_ERROR, IO will still be completed.
READ to "clean" areas (e.g. on an D_INCONSISTENT device,
and bitmap indicates a block is already in sync) will succeed.
READ to "unclean" areas (bitmap indicates block is out-of-sync),
will return EIO.
If we are already D_DISKLESS (or D_FAILED), we also return EIO.
Unfortunately, on a former R_PRIMARY C_SYNC_TARGET D_INCONSISTENT,
after replication link loss, new WRITE requests still went through OK.
The would also set the "out-of-sync" bit on their way, so READ after
WRITE would still return EIO. Also, the data generation UUIDs had not
been bumped, we would cause data divergence, without being able to
detect it on the next sync handshake, given the right sequence of events
in a multiple error scenario and "improper" order of recovery actions.
The right thing to do is to return EIO for all new writes,
unless we have access to good, current, D_UP_TO_DATE data.
The "established best practices" way to avoid these situations in the
first place is to set OND_SUSPEND_IO, or even do a hard-reset from
the pri-on-incon-degr policy helper hook.
Signed-off-by: Philipp Reisner <philipp.reisner@linbit.com>
Signed-off-by: Lars Ellenberg <lars.ellenberg@linbit.com>
Signed-off-by: Jens Axboe <axboe@fb.com>
Possibly sequence of events:
SyncTarget is made Primary, then loses replication link
(only path to good data on SyncSource).
Behavior is then controlled by the on-no-data-accessible policy,
which defaults to OND_IO_ERROR (may be set to OND_SUSPEND_IO).
If OND_IO_ERROR is in fact the current policy, we clear the susp_fen
(IO suspended due to fencing policy) flag, do NOT set the susp_nod
(IO suspended due to no data) flag.
But we forgot to call the IO error completion for all pending,
suspended, requests.
While at it, also add a race check for a theoretically possible
race with a new handshake (network hickup), we may be able to
re-send requests, and can avoid passing IO errors up the stack.
Signed-off-by: Philipp Reisner <philipp.reisner@linbit.com>
Signed-off-by: Lars Ellenberg <lars.ellenberg@linbit.com>
Signed-off-by: Jens Axboe <axboe@fb.com>
When resync is finished, we already call the "after-resync-target"
handler (on the former sync target, obviously), once per volume.
Paired with the before-resync-target handler, you can create snapshots,
before the resync causes the volumes to become inconsistent,
and discard those snapshots again, once they are no longer needed.
It was also overloaded to be paired with the "fence-peer" handler,
to "unfence" once the volumes are up-to-date and known good.
This has some disadvantages, though: we call "fence-peer" for the whole
connection (once for the group of volumes), but would call unfence as
side-effect of after-resync-target once for each volume.
Also, we fence on a (current, or about to become) Primary,
which will later become the sync-source.
Calling unfence only as a side effect of the after-resync-target
handler opens a race window, between a new fence on the Primary
(SyncTarget) and the unfence on the SyncTarget, which is difficult to
close without some kind of "cluster wide lock" in those handlers.
We would not need those handlers if we could still communicate.
Which makes trying to aquire a cluster wide lock from those handlers
seem like a very bad idea.
This introduces the "unfence-peer" handler, which will be called
per connection (once for the group of volumes), just like the fence
handler, only once all volumes are back in sync, and on the SyncSource.
Which is expected to be the node that previously called "fence", the
node that is currently allowed to be Primary, and thus the only node
that could trigger a new "fence" that could race with this unfence.
Which makes us not need any cluster wide synchronization here,
serializing two scripts running on the same node is trivial.
Signed-off-by: Philipp Reisner <philipp.reisner@linbit.com>
Signed-off-by: Lars Ellenberg <lars.ellenberg@linbit.com>
Signed-off-by: Jens Axboe <axboe@fb.com>
If the replication link breaks exactly during "resync finished" detection,
finishing too early on the sync source could again lead to UUIDs rotated
too fast, and potentially a spurious full resync on next handshake.
Always wait for explicit resync finished state change notification from
the sync target.
Signed-off-by: Philipp Reisner <philipp.reisner@linbit.com>
Signed-off-by: Lars Ellenberg <lars.ellenberg@linbit.com>
Signed-off-by: Jens Axboe <axboe@fb.com>
Make sure we have at least 67 (> AL_UPDATES_PER_TRANSACTION)
al-extents available, and allow up to half of that to be
discarded in one bio.
Signed-off-by: Philipp Reisner <philipp.reisner@linbit.com>
Signed-off-by: Lars Ellenberg <lars.ellenberg@linbit.com>
Signed-off-by: Jens Axboe <axboe@fb.com>
For consistency, also zero-out partial unaligned chunks of discard
requests on the local backend.
Signed-off-by: Philipp Reisner <philipp.reisner@linbit.com>
Signed-off-by: Lars Ellenberg <lars.ellenberg@linbit.com>
Signed-off-by: Jens Axboe <axboe@fb.com>
Now that we have the discard_zeroes_if_aligned setting, we should also
check it when setting up our queue parameters on the primary,
not only on the receiving side.
We announce discard support,
UNLESS
* we are connected to a peer that does not support TRIM
on the DRBD protocol level. Otherwise, it would either discard, or
do a fallback to zero-out, depending on its backend and configuration.
* our local backend does not support discards,
or (discard_zeroes_data=0 AND discard_zeroes_if_aligned=no).
Signed-off-by: Philipp Reisner <philipp.reisner@linbit.com>
Signed-off-by: Lars Ellenberg <lars.ellenberg@linbit.com>
Signed-off-by: Jens Axboe <axboe@fb.com>
We can avoid spurious data divergence caused by partially-ignored
discards on certain backends with discard_zeroes_data=0, if we
translate partial unaligned discard requests into explicit zero-out.
The relevant use case is LVM/DM thin.
If on different nodes, DRBD is backed by devices with differing
discard characteristics, discards may lead to data divergence
(old data or garbage left over on one backend, zeroes due to
unmapped areas on the other backend). Online verify would now
potentially report tons of spurious differences.
While probably harmless for most use cases (fstrim on a file system),
DRBD cannot have that, it would violate our promise to upper layers
that our data instances on the nodes are identical.
To be correct and play safe (make sure data is identical on both copies),
we would have to disable discard support, if our local backend (on a
Primary) does not support "discard_zeroes_data=true".
We'd also have to translate discards to explicit zero-out on the
receiving (typically: Secondary) side, unless the receiving side
supports "discard_zeroes_data=true".
Which both would allocate those blocks, instead of unmapping them,
in contrast with expectations.
LVM/DM thin does set discard_zeroes_data=0,
because it silently ignores discards to partial chunks.
We can work around this by checking the alignment first.
For unaligned (wrt. alignment and granularity) or too small discards,
we zero-out the initial (and/or) trailing unaligned partial chunks,
but discard all the aligned full chunks.
At least for LVM/DM thin, the result is effectively "discard_zeroes_data=1".
Arguably it should behave this way internally, by default,
and we'll try to make that happen.
But our workaround is still valid for already deployed setups,
and for other devices that may behave this way.
Setting discard-zeroes-if-aligned=yes will allow DRBD to use
discards, and to announce discard_zeroes_data=true, even on
backends that announce discard_zeroes_data=false.
Setting discard-zeroes-if-aligned=no will cause DRBD to always
fall-back to zero-out on the receiving side, and to not even
announce discard capabilities on the Primary, if the respective
backend announces discard_zeroes_data=false.
We used to ignore the discard_zeroes_data setting completely.
To not break established and expected behaviour, and suddenly
cause fstrim on thin-provisioned LVs to run out-of-space,
instead of freeing up space, the default value is "yes".
Signed-off-by: Philipp Reisner <philipp.reisner@linbit.com>
Signed-off-by: Lars Ellenberg <lars.ellenberg@linbit.com>
Signed-off-by: Jens Axboe <axboe@fb.com>
To maintain write-order fidelity accros all volumes in a DRBD resource,
the receiver of a P_BARRIER needs to issue flushes to all volumes.
We used to do this by calling blkdev_issue_flush(), synchronously,
one volume at a time.
We now submit all flushes to all volumes in parallel, then wait for all
completions, to reduce worst-case latencies on multi-volume resources.
Signed-off-by: Philipp Reisner <philipp.reisner@linbit.com>
Signed-off-by: Lars Ellenberg <lars.ellenberg@linbit.com>
Signed-off-by: Jens Axboe <axboe@fb.com>
The command line parameter the kernel module uses to communicate the
device minor to userland helper is flawed in a way that the device
indentifier "minor-%d" is being truncated to minors with a maximum
of 5 digits.
But DRBD 8.4 allows 2^20 == 1048576 minors,
thus a minimum of 7 digits must be supported.
Reported by Veit Wahlich on drbd-dev.
Signed-off-by: Lars Ellenberg <lars.ellenberg@linbit.com>
Signed-off-by: Jens Axboe <axboe@fb.com>
Regression introduced with 8.4.5
drbd: application writes may set-in-sync in protocol != C
Overwriting the same block (LBA) while a former version is still
"in-flight" to the peer (to be exact: we did not receive the
P_BARRIER_ACK for its epoch yet) would wait for the full epoch of that
former version to be acknowledged by the peer.
In synchronous and quasi-synchronous protocols C and B,
this may double the latency on overwrites.
With protocol A, which is supposed to be asynchronous and only wait for
local completion, it is even worse: it would make overwrites
quasi-synchronous, they would be hit by the full RTT, which protocol A
was specifically meant to avoid, and possibly the additional time it
takes to drain the buffers first.
Particularly bad for databases, or anything else that
does frequent updates to the same blocks (various file system meta data).
No impact if >= rtt passes between updates to the same block.
Signed-off-by: Philipp Reisner <philipp.reisner@linbit.com>
Signed-off-by: Lars Ellenberg <lars.ellenberg@linbit.com>
Signed-off-by: Jens Axboe <axboe@fb.com>
If thinly provisioned volumes are used, during a resync the sync source
tries to find out if a block is deallocated. If it is deallocated, then
the resync target uses block_dev_issue_zeroout() on the range in
question.
Signed-off-by: Philipp Reisner <philipp.reisner@linbit.com>
Signed-off-by: Lars Ellenberg <lars.ellenberg@linbit.com>
Signed-off-by: Jens Axboe <axboe@fb.com>
As long as the value is 0 the feature is disabled. With setting
it to a positive value, DRBD limits and aligns its resync requests
to the rs-discard-granularity setting. If the sync source detects
all zeros in such a block, the resync target discards the range
on disk.
Signed-off-by: Philipp Reisner <philipp.reisner@linbit.com>
Signed-off-by: Lars Ellenberg <lars.ellenberg@linbit.com>
Signed-off-by: Jens Axboe <axboe@fb.com>
If during resync we read only zeroes for a range of sectors assume
that these secotors can be discarded on the sync target node.
Signed-off-by: Philipp Reisner <philipp.reisner@linbit.com>
Signed-off-by: Lars Ellenberg <lars.ellenberg@linbit.com>
Signed-off-by: Jens Axboe <axboe@fb.com>
When leaving resync states because of disconnect,
do the bitmap write-out synchronously in the drbd_disconnected() path.
When leaving resync states because we go back to AHEAD/BEHIND, or
because resync actually finished, or some disk was lost during resync,
trigger the write-out from after_state_ch().
The bitmap write-out for resync -> ahead/behind was missing completely before.
Note that this is all only an optimization to avoid double-resyncs of
already completed blocks in case this node crashes.
Signed-off-by: Philipp Reisner <philipp.reisner@linbit.com>
Signed-off-by: Lars Ellenberg <lars.ellenberg@linbit.com>
Signed-off-by: Jens Axboe <axboe@fb.com>
The intention was to only suspend IO if some normal bitmap operation is
supposed to be locked out, not always. If the bulk operation is flaged
as BM_LOCKED_CHANGE_ALLOWED, we do not need to suspend IO.
Signed-off-by: Philipp Reisner <philipp.reisner@linbit.com>
Signed-off-by: Lars Ellenberg <lars.ellenberg@linbit.com>
Signed-off-by: Jens Axboe <axboe@fb.com>
Use BIO_MAX_PAGES instead and we will remove BIO_MAX_SIZE.
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Ming Lei <ming.lei@canonical.com>
Tested-by: Hannes Reinecke <hare@suse.com>
Signed-off-by: Jens Axboe <axboe@fb.com>
Instead of overloading the discard support with the REQ_SECURE flag.
Use the opportunity to rename the queue flag as well, and remove the
dead checks for this flag in the RAID 1 and RAID 10 drivers that don't
claim support for secure erase.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@fb.com>
Because we define WRITE/READ as REQ_OPs, we cannot do
switch (rq_data_dir(request))
case READ
....
case WRITE
...
without getting warnings about handling other REQ_OPs.
This just has mq_disk do a if/else like it does in other
places.
Signed-off-by: Mike Christie <mchristi@redhat.com>
Signed-off-by: Jens Axboe <axboe@fb.com>
After a migrate to another host (which may not have multiqueue
support), the number of rings (block hardware queues)
may be changed and the ring info structure will also be reallocated.
This patch fixes two related bugs:
* call blk_mq_update_nr_hw_queues() to make blk-core know the number
of hardware queues have been changed.
* Don't store rinfo pointer to hctx->driver_data, because rinfo may be
reallocated so use hctx->queue_num to get the rinfo structure instead.
Signed-off-by: Bob Liu <bob.liu@oracle.com>
Signed-off-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
Sometimes blkfront may twice receive blkback_changed() notification
(XenbusStateConnected) after migration, which will cause
talk_to_blkback() to be called twice too and confuse xen-blkback.
The flow is as follow:
blkfront blkback
blkfront_resume()
> talk_to_blkback()
> Set blkfront to XenbusStateInitialised
front changed()
> Connect()
> Set blkback to XenbusStateConnected
blkback_changed()
> Skip talk_to_blkback()
because frontstate == XenbusStateInitialised
> blkfront_connect()
> Set blkfront to XenbusStateConnected
-----
And here we get another XenbusStateConnected notification leading
to:
-----
blkback_changed()
> because now frontstate != XenbusStateInitialised
talk_to_blkback() is also called again
> blkfront state changed from
XenbusStateConnected to XenbusStateInitialised
(Which is not correct!)
front_changed():
> Do nothing because blkback
already in XenbusStateConnected
Now blkback is in XenbusStateConnected but blkfront is still
in XenbusStateInitialised - leading to no disks.
Poking of the XenbusStateConnected state is allowed (to deal with
block disk change) and has to be dealt with. The most likely
cause of this bug are custom udev scripts hooking up the disks
and then validating the size.
Signed-off-by: Bob Liu <bob.liu@oracle.com>
Signed-off-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
We were passing in &nbd for the private data in debugfs_create_file() for the
flags entry. We expect it to just be nbd, fix this so we get proper output from
this debugfs entry.
Signed-off-by: Josef Bacik <jbacik@fb.com>
Signed-off-by: Jens Axboe <axboe@fb.com>
To avoid confusion between REQ_OP_FLUSH, which is handled by
request_fn drivers, and upper layers requesting the block layer
perform a flush sequence along with possibly a WRITE, this patch
renames REQ_FLUSH to REQ_PREFLUSH.
Signed-off-by: Mike Christie <mchristi@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Hannes Reinecke <hare@suse.com>
Signed-off-by: Jens Axboe <axboe@fb.com>
The last patch added a REQ_OP_FLUSH for request_fn drivers
and the next patch renames REQ_FLUSH to REQ_PREFLUSH which
will be used by file systems and make_request_fn drivers so
they can send a write/flush combo.
This patch drops xen's use of REQ_FLUSH to track if it supports
REQ_OP_FLUSH requests, so REQ_FLUSH can be deleted.
Signed-off-by: Mike Christie <mchristi@redhat.com>
Reviewed-by: Hannes Reinecke <hare@suse.com>
Signed-off-by: Juergen Gross <kernel@pfupf.net>
Signed-off-by: Jens Axboe <axboe@fb.com>
This adds a REQ_OP_FLUSH operation that is sent to request_fn
based drivers by the block layer's flush code, instead of
sending requests with the request->cmd_flags REQ_FLUSH bit set.
Signed-off-by: Mike Christie <mchristi@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Hannes Reinecke <hare@suse.com>
Signed-off-by: Jens Axboe <axboe@fb.com>
The req operation REQ_OP is separated from the rq_flag_bits
definition. This converts the block layer drivers to
use req_op to get the op from the request struct.
Signed-off-by: Mike Christie <mchristi@redhat.com>
Reviewed-by: Hannes Reinecke <hare@suse.com>
Signed-off-by: Jens Axboe <axboe@fb.com>
Separate the op from the rq_flag_bits and have xen
set/get the bio using bio_set_op_attrs/bio_op.
Signed-off-by: Mike Christie <mchristi@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Hannes Reinecke <hare@suse.com>
Signed-off-by: Jens Axboe <axboe@fb.com>
Separate the op from the rq_flag_bits and have drbd
set/get the bio using bio_set_op_attrs/bio_op.
Signed-off-by: Mike Christie <mchristi@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Hannes Reinecke <hare@suse.com>
Signed-off-by: Jens Axboe <axboe@fb.com>
This patch converts the simple bi_rw use cases in the block,
drivers, mm and fs code to set/get the bio operation using
bio_set_op_attrs/bio_op
These should be simple one or two liner cases, so I just did them
in one patch. The next patches handle the more complicated
cases in a module per patch.
Signed-off-by: Mike Christie <mchristi@redhat.com>
Reviewed-by: Hannes Reinecke <hare@suse.com>
Signed-off-by: Jens Axboe <axboe@fb.com>
We currently set REQ_WRITE/WRITE for all non READ IOs
like discard, flush, writesame, etc. In the next patches where we
no longer set up the op as a bitmap, we will not be able to
detect a operation direction like writesame by testing if REQ_WRITE is
set.
This patch converts the drivers and cgroup to use the
op_is_write helper. This should just cover the simple
cases. I did dm, md and bcache in their own patches
because they were more involved.
Signed-off-by: Mike Christie <mchristi@redhat.com>
Reviewed-by: Hannes Reinecke <hare@suse.com>
Signed-off-by: Jens Axboe <axboe@fb.com>
This has callers of submit_bio/submit_bio_wait set the bio->bi_rw
instead of passing it in. This makes that use the same as
generic_make_request and how we set the other bio fields.
Signed-off-by: Mike Christie <mchristi@redhat.com>
Fixed up fs/ext4/crypto.c
Signed-off-by: Jens Axboe <axboe@fb.com>
- Until now, dax has been disabled if media errors were found on
any device. This enables the use of DAX in the presence of these
errors by making all sector-aligned zeroing go through the driver.
- The driver (already) has the ability to clear errors on writes that
are sent through the block layer using 'DSMs' defined in ACPI 6.1.
Other misc changes:
- When mounting DAX filesystems, check to make sure the partition
is page aligned. This is a requirement for DAX, and previously, we
allowed such unaligned mounts to succeed, but subsequent reads/writes
would fail.
- Misc/cleanup fixes from Jan that remove unused code from DAX related to
zeroing, writeback, and some size checks.
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1
iQIcBAABAgAGBQJXQ4GKAAoJEHr6Yb6juE3/zowP/iclIhgXXXMQJRUHJlePMXC8
15sGZ32JS1ak9g7vrsmNVEDNynfNtiMYdBxtUyRuj6xqgwdZvFk3F55KOCPtaeA1
+yADkgeRkTAcwzmHw9WQVEzBCqyzSisdrwtEfH817qdq9FJdH66x2Kos6i+HeAVr
5Q/e4gs7lKrjf384/QBl+wxNZOndJaQAPd2VRHQqx2A9F33v0ljdwRaUG1r4fjK2
dtmhcZCqdQyuAGXW3piTnZc5ZFc3DPqO4FkEfqkEK3lFOflK0fd8wMsAZRp/Jd0j
GJsgnVSWSqG0Dz476djlG0w8t2p5Jv1g9cKChV+ZZEdFLKWHCOUFqXNj8uI8I4k5
cOEKCHyJ3IwfSHhNQqktEWrQN4T8ZXhWtuc9GuV4UZYuqJqHci6EdR/YsWsJjV+L
lm/qvK4ipDS1pivxOy8KX/iN0z7Io8J9GXpStDx3g8iWjLlh4YYlbJLWeeRepo/z
aPlV/QAKcHiGY6jzLExrZIyCWkzwo6O+0p1Kxerv9/7K/32HWbOodZ+tC8eD+N25
pV69nCGf+u50T2TtIx1+iann4NC1r7zg5yqnT9AgpyZpiwR5joCDzI5sXW+D0rcS
vPtfM84Ccdeq/e6mvfIpZgR0/npQapKnrmUest0J7P2BFPHiFPji1KzZ7M+1aFOo
9R6JdrAj0Sc+FBa+cGzH
=v6Of
-----END PGP SIGNATURE-----
Merge tag 'dax-misc-for-4.7' of git://git.kernel.org/pub/scm/linux/kernel/git/nvdimm/nvdimm
Pull misc DAX updates from Vishal Verma:
"DAX error handling for 4.7
- Until now, dax has been disabled if media errors were found on any
device. This enables the use of DAX in the presence of these
errors by making all sector-aligned zeroing go through the driver.
- The driver (already) has the ability to clear errors on writes that
are sent through the block layer using 'DSMs' defined in ACPI 6.1.
Other misc changes:
- When mounting DAX filesystems, check to make sure the partition is
page aligned. This is a requirement for DAX, and previously, we
allowed such unaligned mounts to succeed, but subsequent
reads/writes would fail.
- Misc/cleanup fixes from Jan that remove unused code from DAX
related to zeroing, writeback, and some size checks"
* tag 'dax-misc-for-4.7' of git://git.kernel.org/pub/scm/linux/kernel/git/nvdimm/nvdimm:
dax: fix a comment in dax_zero_page_range and dax_truncate_page
dax: for truncate/hole-punch, do zeroing through the driver if possible
dax: export a low-level __dax_zero_page_range helper
dax: use sb_issue_zerout instead of calling dax_clear_sectors
dax: enable dax in the presence of known media errors (badblocks)
dax: fallback from pmd to pte on error
block: Update blkdev_dax_capable() for consistency
xfs: Add alignment check for DAX mount
ext2: Add alignment check for DAX mount
ext4: Add alignment check for DAX mount
block: Add bdev_dax_supported() for dax mount checks
block: Add vfs_msg() interface
dax: Remove redundant inode size checks
dax: Remove pointless writeback from dax_do_io()
dax: Remove zeroing from dax_io()
dax: Remove dead zeroing code from fault handlers
ext2: Avoid DAX zeroing to corrupt data
ext2: Fix block zeroing in ext2_get_blocks() for DAX
dax: Remove complete_unwritten argument
DAX: move RADIX_DAX_ definitions to dax.c
Pull Ceph updates from Sage Weil:
"This changeset has a few main parts:
- Ilya has finished a huge refactoring effort to sync up the
client-side logic in libceph with the user-space client code, which
has evolved significantly over the last couple years, with lots of
additional behaviors (e.g., how requests are handled when cluster
is full and transitions from full to non-full).
This structure of the code is more closely aligned with userspace
now such that it will be much easier to maintain going forward when
behavior changes take place. There are some locking improvements
bundled in as well.
- Zheng adds multi-filesystem support (multiple namespaces within the
same Ceph cluster)
- Zheng has changed the readdir offsets and directory enumeration so
that dentry offsets are hash-based and therefore stable across
directory fragmentation events on the MDS.
- Zheng has a smorgasbord of bug fixes across fs/ceph"
* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/sage/ceph-client: (71 commits)
ceph: fix wake_up_session_cb()
ceph: don't use truncate_pagecache() to invalidate read cache
ceph: SetPageError() for writeback pages if writepages fails
ceph: handle interrupted ceph_writepage()
ceph: make ceph_update_writeable_page() uninterruptible
libceph: make ceph_osdc_wait_request() uninterruptible
ceph: handle -EAGAIN returned by ceph_update_writeable_page()
ceph: make fault/page_mkwrite return VM_FAULT_OOM for -ENOMEM
ceph: block non-fatal signals for fault/page_mkwrite
ceph: make logical calculation functions return bool
ceph: tolerate bad i_size for symlink inode
ceph: improve fragtree change detection
ceph: keep leaf frag when updating fragtree
ceph: fix dir_auth check in ceph_fill_dirfrag()
ceph: don't assume frag tree splits in mds reply are sorted
ceph: fix inode reference leak
ceph: using hash value to compose dentry offset
ceph: don't forbid marking directory complete after forward seek
ceph: record 'offset' for each entry of readdir result
ceph: define 'end/complete' in readdir reply as bit flags
...
For map check, we are going to need to send CEPH_MSG_MON_GET_VERSION
messages asynchronously and get a callback on completion. Refactor MON
client to allow firing off generic requests asynchronously and add an
async variant of ceph_monc_get_version(). ceph_monc_do_statfs() is
switched over and remains sync.
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
This adds support and switches rbd to a new, more reliable version of
watch/notify protocol. As with the OSD client update, this is mostly
about getting the right structures linked into the right places so that
reconnects are properly sent when needed. watch/notify v2 also
requires sending regular pings to the OSDs - send_linger_ping().
A major change from the old watch/notify implementation is the
introduction of ceph_osd_linger_request - linger requests no longer
piggy back on ceph_osd_request. ceph_osd_event has been merged into
ceph_osd_linger_request.
All the details are now hidden within libceph, the interface consists
of a simple pair of watch/unwatch functions and ceph_osdc_notify_ack().
ceph_osdc_watch() does return ceph_osd_linger_request, but only to keep
the lifetime management simple.
ceph_osdc_notify_ack() accepts an optional data payload, which is
relayed back to the notifier.
Portions of this patch are loosely based on work by Douglas Fuller
<dfuller@redhat.com> and Mike Christie <michaelc@cs.wisc.edu>.
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
Introduce __rbd_dev_header_unwatch_sync(), which doesn't flush notify
callbacks. This is for the new rados_watcherrcb_t, which would be
called from a notify callback.
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
finish_read(), its only user, uses it to get to hdr.data_len, which is
what ->r_result is set to on success. This gains us the ability to
safely call callbacks from contexts other than reply, e.g. map check.
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
The crux of this is getting rid of ceph_osdc_build_request(), so that
MOSDOp can be encoded not before but after calc_target() calculates the
actual target. Encoding now happens within ceph_osdc_start_request().
Also nuked is the accompanying bunch of pointers into the encoded
buffer that was used to update fields on each send - instead, the
entire front is re-encoded. If we want to support target->name_len !=
base->name_len in the future, there is no other way, because oid is
surrounded by other fields in the encoded buffer.
Encoding OSD ops and adding data items to the request message were
mixed together in osd_req_encode_op(). While we want to re-encode OSD
ops, we don't want to add duplicate data items to the message when
resending, so all call to ceph_osdc_msg_data_add() are factored out
into a new setup_request_data().
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
Switch to ceph_object_id and use ceph_oid_aprintf() instead of a bare
const char *. This reduces noise in rbd_dev_header_name().
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
Currently ceph_object_id can hold object names of up to 100
(CEPH_MAX_OID_NAME_LEN) characters. This is enough for all use cases,
expect one - long rbd image names:
- a format 1 header is named "<imgname>.rbd"
- an object that points to a format 2 header is named "rbd_id.<imgname>"
We operate on these potentially long-named objects during rbd map, and,
for format 1 images, during header refresh. (A format 2 header name is
a small system-generated string.)
Lift this 100 character limit by making ceph_object_id be able to point
to an externally-allocated string. Apart from being able to work with
almost arbitrarily-long named objects, this allows us to reduce the
size of ceph_object_id from >100 bytes to 64 bytes.
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
The size of ->r_request and ->r_reply messages depends on the size of
the object name (ceph_object_id), while the size of ceph_osd_request is
fixed. Move message allocation into a separate function that would
have to be called after ceph_object_id and ceph_object_locator (which
is also going to become variable in size with RADOS namespaces) have
been filled in:
req = ceph_osdc_alloc_request(...);
<fill in req->r_base_oid>
<fill in req->r_base_oloc>
ceph_osdc_alloc_messages(req);
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
By the time we get to checking for_each_obj_request_safe(img_request)
terminating condition, all obj_requests may be complete and img_request
ref, that rbd_img_request_submit() takes away from its caller, may be
put. Moving the next_obj_request cursor is then a use-after-free on
img_request.
It's totally benign, as the value that's read is never used, but
I think it's still worth fixing.
Cc: Alex Elder <elder@linaro.org>
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
debug_stat sysfs is read-only and represents various debugging data that
zram developers may need. This file is not meant to be used by anyone
else: its content is not documented and will change any time w/o any
notice. Therefore, the output of debug_stat file contains a version
string. To avoid any confusion, we will increase the version number
every time we modify the output.
At the moment this file exports only one value -- the number of
re-compressions, IOW, the number of times compression fast path has
failed. This stat is temporary any will be useful in case if any
per-cpu compression streams regressions will be reported.
Link: http://lkml.kernel.org/r/20160513230834.GB26763@bbox
Link: http://lkml.kernel.org/r/20160511134553.12655-1-sergey.senozhatsky@gmail.com
Signed-off-by: Sergey Senozhatsky <sergey.senozhatsky@gmail.com>
Signed-off-by: Minchan Kim <minchan@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Remove the internal part of max_comp_streams interface, since we
switched to per-cpu streams. We will keep RW max_comp_streams attr
around, because:
a) we may (silently) switch back to idle compression streams list and
don't want to disturb user space
b) max_comp_streams attr must wait for the next 'lay off cycle'; we
give user space 2 years to adjust before we remove/downgrade the attr,
and there are already several attrs scheduled for removal in 4.11, so
it's too late for max_comp_streams.
This slightly change a user visible behaviour:
- First, reading from max_comp_stream file now will always return the
number of online CPUs.
- Second, writing to max_comp_stream will not take any effect.
Link: http://lkml.kernel.org/r/20160503165546.25201-1-sergey.senozhatsky@gmail.com
Signed-off-by: Sergey Senozhatsky <sergey.senozhatsky@gmail.com>
Cc: Minchan Kim <minchan@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Pass GFP flags to zs_malloc() instead of using a fixed mask supplied to
zs_create_pool(), so we can be more flexible, but, more importantly, we
need this to switch zram to per-cpu compression streams -- zram will try
to allocate handle with preemption disabled in a fast path and switch to
a slow path (using different gfp mask) if the fast one has failed.
Apart from that, this also align zs_malloc() interface with zspool/zbud.
[sergey.senozhatsky@gmail.com: pass GFP flags to zs_malloc() instead of using a fixed mask]
Link: http://lkml.kernel.org/r/20160429150942.GA637@swordfish
Link: http://lkml.kernel.org/r/20160429150942.GA637@swordfish
Signed-off-by: Sergey Senozhatsky <sergey.senozhatsky@gmail.com>
Acked-by: Minchan Kim <minchan@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Many developers already know that field for reference count of the
struct page is _count and atomic type. They would try to handle it
directly and this could break the purpose of page reference count
tracepoint. To prevent direct _count modification, this patch rename it
to _refcount and add warning message on the code. After that, developer
who need to handle reference count will find that field should not be
accessed directly.
[akpm@linux-foundation.org: fix comments, per Vlastimil]
[akpm@linux-foundation.org: Documentation/vm/transhuge.txt too]
[sfr@canb.auug.org.au: sync ethernet driver changes]
Signed-off-by: Joonsoo Kim <iamjoonsoo.kim@lge.com>
Signed-off-by: Stephen Rothwell <sfr@canb.auug.org.au>
Cc: Vlastimil Babka <vbabka@suse.cz>
Cc: Hugh Dickins <hughd@google.com>
Cc: Johannes Berg <johannes@sipsolutions.net>
Cc: "David S. Miller" <davem@davemloft.net>
Cc: Sunil Goutham <sgoutham@cavium.com>
Cc: Chris Metcalf <cmetcalf@mellanox.com>
Cc: Manish Chopra <manish.chopra@qlogic.com>
Cc: Yuval Mintz <yuval.mintz@qlogic.com>
Cc: Tariq Toukan <tariqt@mellanox.com>
Cc: Saeed Mahameed <saeedm@mellanox.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
1/ If a mapping overlaps a bad sector fail the request.
2/ Do not opportunistically report more dax-capable capacity than is
requested when errors present.
Reviewed-by: Jeff Moyer <jmoyer@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Dan Williams <dan.j.williams@intel.com>
[vishal: fix a conflict with system RAM collision patches]
[vishal: add a 'size' parameter to ->direct_access]
[vishal: fix a conflict with DAX alignment check patches]
Signed-off-by: Vishal Verma <vishal.l.verma@intel.com>
Pull networking updates from David Miller:
"Highlights:
1) Support SPI based w5100 devices, from Akinobu Mita.
2) Partial Segmentation Offload, from Alexander Duyck.
3) Add GMAC4 support to stmmac driver, from Alexandre TORGUE.
4) Allow cls_flower stats offload, from Amir Vadai.
5) Implement bpf blinding, from Daniel Borkmann.
6) Optimize _ASYNC_ bit twiddling on sockets, unless the socket is
actually using FASYNC these atomics are superfluous. From Eric
Dumazet.
7) Run TCP more preemptibly, also from Eric Dumazet.
8) Support LED blinking, EEPROM dumps, and rxvlan offloading in mlx5e
driver, from Gal Pressman.
9) Allow creating ppp devices via rtnetlink, from Guillaume Nault.
10) Improve BPF usage documentation, from Jesper Dangaard Brouer.
11) Support tunneling offloads in qed, from Manish Chopra.
12) aRFS offloading in mlx5e, from Maor Gottlieb.
13) Add RFS and RPS support to SCTP protocol, from Marcelo Ricardo
Leitner.
14) Add MSG_EOR support to TCP, this allows controlling packet
coalescing on application record boundaries for more accurate
socket timestamp sampling. From Martin KaFai Lau.
15) Fix alignment of 64-bit netlink attributes across the board, from
Nicolas Dichtel.
16) Per-vlan stats in bridging, from Nikolay Aleksandrov.
17) Several conversions of drivers to ethtool ksettings, from Philippe
Reynes.
18) Checksum neutral ILA in ipv6, from Tom Herbert.
19) Factorize all of the various marvell dsa drivers into one, from
Vivien Didelot
20) Add VF support to qed driver, from Yuval Mintz"
* git://git.kernel.org/pub/scm/linux/kernel/git/davem/net-next: (1649 commits)
Revert "phy dp83867: Fix compilation with CONFIG_OF_MDIO=m"
Revert "phy dp83867: Make rgmii parameters optional"
r8169: default to 64-bit DMA on recent PCIe chips
phy dp83867: Make rgmii parameters optional
phy dp83867: Fix compilation with CONFIG_OF_MDIO=m
bpf: arm64: remove callee-save registers use for tmp registers
asix: Fix offset calculation in asix_rx_fixup() causing slow transmissions
switchdev: pass pointer to fib_info instead of copy
net_sched: close another race condition in tcf_mirred_release()
tipc: fix nametable publication field in nl compat
drivers: net: Don't print unpopulated net_device name
qed: add support for dcbx.
ravb: Add missing free_irq() calls to ravb_close()
qed: Remove a stray tab
net: ethernet: fec-mpc52xx: use phy_ethtool_{get|set}_link_ksettings
net: ethernet: fec-mpc52xx: use phydev from struct net_device
bpf, doc: fix typo on bpf_asm descriptions
stmmac: hardware TX COE doesn't work when force_thresh_dma_mode is set
net: ethernet: fs-enet: use phy_ethtool_{get|set}_link_ksettings
net: ethernet: fs-enet: use phydev from struct net_device
...
Pull block driver updates from Jens Axboe:
"On top of the core pull request, this is the drivers pull request for
this merge window. This contains:
- Switch drivers to the new write back cache API, and kill off the
flush flags. From me.
- Kill the discard support for the STEC pci-e flash driver. It's
trivially broken, and apparently unmaintained, so it's safer to
just remove it. From Jeff Moyer.
- A set of lightnvm updates from the usual suspects (Matias/Javier,
and Simon), and fixes from Arnd, Jeff Mahoney, Sagi, and Wenwei
Tao.
- A set of updates for NVMe:
- Turn the controller state management into a proper state
machine. From Christoph.
- Shuffling of code in preparation for NVMe-over-fabrics, also
from Christoph.
- Cleanup of the command prep part from Ming Lin.
- Rewrite of the discard support from Ming Lin.
- Deadlock fix for namespace removal from Ming Lin.
- Use the now exported blk-mq tag helper for IO termination.
From Sagi.
- Various little fixes from Christoph, Guilherme, Keith, Ming
Lin, Wang Sheng-Hui.
- Convert mtip32xx to use the now exported blk-mq tag iter function,
from Keith"
* 'for-4.7/drivers' of git://git.kernel.dk/linux-block: (74 commits)
lightnvm: reserved space calculation incorrect
lightnvm: rename nr_pages to nr_ppas on nvm_rq
lightnvm: add is_cached entry to struct ppa_addr
lightnvm: expose gennvm_mark_blk to targets
lightnvm: remove mgt targets on mgt removal
lightnvm: pass dma address to hardware rather than pointer
lightnvm: do not assume sequential lun alloc.
nvme/lightnvm: Log using the ctrl named device
lightnvm: rename dma helper functions
lightnvm: enable metadata to be sent to device
lightnvm: do not free unused metadata on rrpc
lightnvm: fix out of bound ppa lun id on bb tbl
lightnvm: refactor set_bb_tbl for accepting ppa list
lightnvm: move responsibility for bad blk mgmt to target
lightnvm: make nvm_set_rqd_ppalist() aware of vblks
lightnvm: remove struct factory_blks
lightnvm: refactor device ops->get_bb_tbl()
lightnvm: introduce nvm_for_each_lun_ppa() macro
lightnvm: refactor dev->online_target to global nvm_targets
lightnvm: rename nvm_targets to nvm_tgt_type
...
Pull core block layer updates from Jens Axboe:
"This is the core block IO changes for this merge window. Nothing
earth shattering in here, it's mostly just fixes. In detail:
- Fix for a long standing issue where wrong ordering in blk-mq caused
order_to_size() to spew a warning. From Bart.
- Async discard support from Christoph. Basically just splitting our
sync interface into a submit + wait part.
- Add a cleaner interface for flagging whether a device has a write
back cache or not. We've previously overloaded blk_queue_flush()
with this, but let's make it more explicit. Drivers cleaned up and
updated in the drivers pull request. From me.
- Fix for a double check for whether IO accounting is enabled or not.
From Michael Callahan.
- Fix for the async discard from Mike Snitzer, reinstating the early
EOPNOTSUPP return if the device doesn't support discards.
- Also from Mike, export bio_inc_remaining() so dm can drop it's
private copy of it.
- From Ming Lin, add support for passing in an offset for request
payloads.
- Tag function export from Sagi, which will be used in NVMe in the
drivers pull.
- Two blktrace related fixes from Shaohua.
- Propagate NOMERGE flag when making a request from a bio, also from
Shaohua.
- An optimization to not parse cgroup paths in blk-throttle, if we
don't need to. From Shaohua"
* 'for-4.7/core' of git://git.kernel.dk/linux-block:
blk-mq: fix undefined behaviour in order_to_size()
blk-throttle: don't parse cgroup path if trace isn't enabled
blktrace: add missed mask name
blktrace: delete garbage for message trace
block: make bio_inc_remaining() interface accessible again
block: reinstate early return of -EOPNOTSUPP from blkdev_issue_discard
block: Minor blk_account_io_start usage cleanup
block: add __blkdev_issue_discard
block: remove struct bio_batch
block: copy NOMERGE flag from bio to request
block: add ability to flag write back caching on a device
blk-mq: Export tagset iter function
block: add offset in blk_add_request_payload()
writeback: Fix performance regression in wb_over_bg_thresh()
The attribute 0 is never used in drbd, so let's use it as pad attribute
in netlink messages. This minimizes the patch.
Note that this patch is only compile-tested.
Signed-off-by: Nicolas Dichtel <nicolas.dichtel@6wind.com>
Signed-off-by: Lars Ellenberg <lars.ellenberg@linbit.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
A while ago, commit 9875201e10 ("rbd: fix use-after free of
rbd_dev->disk") fixed rbd unmap vs notify race by introducing
an exported wrapper for flushing notifies and sticking it into
do_rbd_remove().
A similar problem exists on the rbd map path, though: the watch is
registered in rbd_dev_image_probe(), while the disk is set up quite
a few steps later, in rbd_dev_device_setup(). Nothing prevents
a notify from coming in and crashing on a NULL rbd_dev->disk:
BUG: unable to handle kernel NULL pointer dereference at 0000000000000050
Call Trace:
[<ffffffffa0508344>] rbd_watch_cb+0x34/0x180 [rbd]
[<ffffffffa04bd290>] do_event_work+0x40/0xb0 [libceph]
[<ffffffff8109d5db>] process_one_work+0x17b/0x470
[<ffffffff8109e3ab>] worker_thread+0x11b/0x400
[<ffffffff8109e290>] ? rescuer_thread+0x400/0x400
[<ffffffff810a5acf>] kthread+0xcf/0xe0
[<ffffffff810b41b3>] ? finish_task_switch+0x53/0x170
[<ffffffff810a5a00>] ? kthread_create_on_node+0x140/0x140
[<ffffffff81645dd8>] ret_from_fork+0x58/0x90
[<ffffffff810a5a00>] ? kthread_create_on_node+0x140/0x140
RIP [<ffffffffa050828a>] rbd_dev_refresh+0xfa/0x180 [rbd]
If an error occurs during rbd map, we have to error out, potentially
tearing down a watch. Just like on rbd unmap, notifies have to be
flushed, otherwise rbd_watch_cb() may end up trying to read in the
image header after rbd_dev_image_release() has run:
Assertion failure in rbd_dev_header_info() at line 4722:
rbd_assert(rbd_image_format_valid(rbd_dev->image_format));
Call Trace:
[<ffffffff81cccee0>] ? rbd_parent_request_create+0x150/0x150
[<ffffffff81cd4e59>] rbd_dev_refresh+0x59/0x390
[<ffffffff81cd5229>] rbd_watch_cb+0x69/0x290
[<ffffffff81fde9bf>] do_event_work+0x10f/0x1c0
[<ffffffff81107799>] process_one_work+0x689/0x1a80
[<ffffffff811076f7>] ? process_one_work+0x5e7/0x1a80
[<ffffffff81132065>] ? finish_task_switch+0x225/0x640
[<ffffffff81107110>] ? pwq_dec_nr_in_flight+0x2b0/0x2b0
[<ffffffff81108c69>] worker_thread+0xd9/0x1320
[<ffffffff81108b90>] ? process_one_work+0x1a80/0x1a80
[<ffffffff8111b02d>] kthread+0x21d/0x2e0
[<ffffffff8111ae10>] ? kthread_stop+0x550/0x550
[<ffffffff82022802>] ret_from_fork+0x22/0x40
[<ffffffff8111ae10>] ? kthread_stop+0x550/0x550
RIP [<ffffffff81ccd8f9>] rbd_dev_header_info+0xa19/0x1e30
To fix this, a) check if RBD_DEV_FLAG_EXISTS is set before calling
revalidate_disk(), b) move ceph_osdc_flush_notifies() call into
rbd_dev_header_unwatch_sync() to cover rbd map error paths and c) turn
header read-in into a critical section. The latter also happens to
take care of rbd map foo@bar vs rbd snap rm foo@bar race.
Fixes: http://tracker.ceph.com/issues/15490
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
Reviewed-by: Josh Durgin <jdurgin@redhat.com>
Simply creating a file system on an skd device, followed by mount and
fstrim will result in errors in the logs and then a BUG(). Let's remove
discard support from that driver. As far as I can tell, it hasn't
worked right since it was merged. This patch also has a side-effect of
cleaning up an unintentional shadowed declaration inside of
skd_end_request.
I tested to ensure that I can still do I/O to the device using xfstests
./check -g quick. I didn't do anything more extensive than that,
though.
Signed-off-by: Jeff Moyer <jmoyer@redhat.com>
Signed-off-by: Jens Axboe <axboe@fb.com>
Pull block fixes from Jens Axboe:
"A few fixes for the current series. This contains:
- Two fixes for NVMe:
One fixes a reset race that can be triggered by repeated
insert/removal of the module.
The other fixes an issue on some platforms, where we get probe
timeouts since legacy interrupts isn't working. This used not to
be a problem since we had the worker thread poll for completions,
but since that was killed off, it means those poor souls can't
successfully probe their NVMe device. Use a proper IRQ check and
probe (msi-x -> msi ->legacy), like most other drivers to work
around this. Both from Keith.
- A loop corruption issue with offset in iters, from Ming Lei.
- A fix for not having the partition stat per cpu ref count
initialized before sending out the KOBJ_ADD, which could cause user
space to access the counter prior to initialization. Also from
Ming Lei.
- A fix for using the wrong congestion state, from Kaixu Xia"
* 'for-linus' of git://git.kernel.dk/linux-block:
block: loop: fix filesystem corruption in case of aio/dio
NVMe: Always use MSI/MSI-x interrupts
NVMe: Fix reset/remove race
writeback: fix the wrong congested state variable definition
block: partition: initialize percpuref before sending out KOBJ_ADD
Starting from commit e36f620428(block: split bios to max possible length),
block core starts to split bio in the middle of bvec.
Unfortunately loop dio/aio doesn't consider this situation, and
always treat 'iter.iov_offset' as zero. Then filesystem corruption
is observed.
This patch figures out the offset of the base bvevc via
'bio->bi_iter.bi_bvec_done' and fixes the issue by passing the offset
to iov iterator.
Fixes: e36f620428 (block: split bios to max possible length)
Cc: Keith Busch <keith.busch@intel.com>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: stable@vger.kernel.org (4.5)
Signed-off-by: Ming Lei <ming.lei@canonical.com>
Signed-off-by: Jens Axboe <axboe@fb.com>
Now that we converted everything to the newer block write cache
interface, kill off the queue flush_flags and queueable flush
entries.
Signed-off-by: Jens Axboe <axboe@fb.com>
The driver calls it with 0 for flags, since it doesn't have a writeback
cache. Just remove the call, as it's a no-op right now.
Signed-off-by: Jens Axboe <axboe@fb.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Only a single tags array anyway.
Signed-off-by: Keith Busch <keith.busch@intel.com>
Reviewed-by: Johannes Thumshirn <jthumshirn@suse.de>
Signed-off-by: Sagi Grimberg <sagig@mellanox.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@fb.com>
We could kmalloc() the payload, so need the offset in page.
Signed-off-by: Ming Lin <ming.l@ssi.samsung.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@fb.com>
Pull Ceph fix from Sage Weil:
"This just fixes a few remaining memory allocations in RBD to use
GFP_NOIO instead of GFP_ATOMIC"
* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/sage/ceph-client:
rbd: use GFP_NOIO consistently for request allocations
As of 5a60e87603, RBD object request
allocations are made via rbd_obj_request_create() with GFP_NOIO.
However, subsequent OSD request allocations in rbd_osd_req_create*()
use GFP_ATOMIC.
With heavy page cache usage (e.g. OSDs running on same host as krbd
client), rbd_osd_req_create() order-1 GFP_ATOMIC allocations have been
observed to fail, where direct reclaim would have allowed GFP_NOIO
allocations to succeed.
Cc: stable@vger.kernel.org # 3.18+
Suggested-by: Vlastimil Babka <vbabka@suse.cz>
Suggested-by: Neil Brown <neilb@suse.com>
Signed-off-by: David Disseldorp <ddiss@suse.de>
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
Mostly direct substitution with occasional adjustment or removing
outdated comments.
Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Acked-by: Michal Hocko <mhocko@suse.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
PAGE_CACHE_{SIZE,SHIFT,MASK,ALIGN} macros were introduced *long* time
ago with promise that one day it will be possible to implement page
cache with bigger chunks than PAGE_SIZE.
This promise never materialized. And unlikely will.
We have many places where PAGE_CACHE_SIZE assumed to be equal to
PAGE_SIZE. And it's constant source of confusion on whether
PAGE_CACHE_* or PAGE_* constant should be used in a particular case,
especially on the border between fs and mm.
Global switching to PAGE_CACHE_SIZE != PAGE_SIZE would cause to much
breakage to be doable.
Let's stop pretending that pages in page cache are special. They are
not.
The changes are pretty straight-forward:
- <foo> << (PAGE_CACHE_SHIFT - PAGE_SHIFT) -> <foo>;
- <foo> >> (PAGE_CACHE_SHIFT - PAGE_SHIFT) -> <foo>;
- PAGE_CACHE_{SIZE,SHIFT,MASK,ALIGN} -> PAGE_{SIZE,SHIFT,MASK,ALIGN};
- page_cache_get() -> get_page();
- page_cache_release() -> put_page();
This patch contains automated changes generated with coccinelle using
script below. For some reason, coccinelle doesn't patch header files.
I've called spatch for them manually.
The only adjustment after coccinelle is revert of changes to
PAGE_CAHCE_ALIGN definition: we are going to drop it later.
There are few places in the code where coccinelle didn't reach. I'll
fix them manually in a separate patch. Comments and documentation also
will be addressed with the separate patch.
virtual patch
@@
expression E;
@@
- E << (PAGE_CACHE_SHIFT - PAGE_SHIFT)
+ E
@@
expression E;
@@
- E >> (PAGE_CACHE_SHIFT - PAGE_SHIFT)
+ E
@@
@@
- PAGE_CACHE_SHIFT
+ PAGE_SHIFT
@@
@@
- PAGE_CACHE_SIZE
+ PAGE_SIZE
@@
@@
- PAGE_CACHE_MASK
+ PAGE_MASK
@@
expression E;
@@
- PAGE_CACHE_ALIGN(E)
+ PAGE_ALIGN(E)
@@
expression E;
@@
- page_cache_get(E)
+ get_page(E)
@@
expression E;
@@
- page_cache_release(E)
+ put_page(E)
Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Acked-by: Michal Hocko <mhocko@suse.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Pull Ceph updates from Sage Weil:
"There is quite a bit here, including some overdue refactoring and
cleanup on the mon_client and osd_client code from Ilya, scattered
writeback support for CephFS and a pile of bug fixes from Zheng, and a
few random cleanups and fixes from others"
[ I already decided not to pull this because of it having been rebased
recently, but ended up changing my mind after all. Next time I'll
really hold people to it. Oh well. - Linus ]
* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/sage/ceph-client: (34 commits)
libceph: use KMEM_CACHE macro
ceph: use kmem_cache_zalloc
rbd: use KMEM_CACHE macro
ceph: use lookup request to revalidate dentry
ceph: kill ceph_get_dentry_parent_inode()
ceph: fix security xattr deadlock
ceph: don't request vxattrs from MDS
ceph: fix mounting same fs multiple times
ceph: remove unnecessary NULL check
ceph: avoid updating directory inode's i_size accidentally
ceph: fix race during filling readdir cache
libceph: use sizeof_footer() more
ceph: kill ceph_empty_snapc
ceph: fix a wrong comparison
ceph: replace CURRENT_TIME by current_fs_time()
ceph: scattered page writeback
libceph: add helper that duplicates last extent operation
libceph: enable large, variable-sized OSD requests
libceph: osdc->req_mempool should be backed by a slab pool
libceph: make r_request msg_size calculation clearer
...
Use KMEM_CACHE() instead of kmem_cache_create() to simplify the code.
Signed-off-by: Geliang Tang <geliangtang@163.com>
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
Turn r_ops into a flexible array member to enable large, consisting of
up to 16 ops, OSD requests. The use case is scattered writeback in
cephfs and, as far as the kernel client is concerned, 16 is just a made
up number.
r_ops had size 3 for copyup+hint+write, but copyup is really a special
case - it can only happen once. ceph_osd_request_cache is therefore
stuffed with num_ops=2 requests, anything bigger than that is allocated
with kmalloc(). req_mempool is backed by ceph_osd_request_cache, which
means either num_ops=1 or num_ops=2 for use_mempool=true - all existing
users (ceph_writepages_start(), ceph_osdc_writepages()) are fine with
that.
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
This avoids defining large array of r_reply_op_{len,result} in
in struct ceph_osd_request.
Signed-off-by: Yan, Zheng <zyan@redhat.com>
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
Pull block fixes from Jens Axboe:
"Final round of fixes for this merge window - some of this has come up
after the initial pull request, and some of it was put in a post-merge
branch before the merge window.
This contains:
- Fix for a bad check for an error on dma mapping in the mtip32xx
driver, from Alexey Khoroshilov.
- A set of fixes for lightnvm, from Javier, Matias, and Wenwei.
- An NVMe completion record corruption fix from Marta, ensuring that
we read things in the right order.
- Two writeback fixes from Tejun, marked for stable@ as well.
- A blk-mq sw queue iterator fix from Thomas, fixing an oops for
sparse CPU maps. They hit this in the hot plug/unplug rework"
* 'for-linus' of git://git.kernel.dk/linux-block:
nvme: avoid cqe corruption when update at the same time as read
writeback, cgroup: fix use of the wrong bdi_writeback which mismatches the inode
writeback, cgroup: fix premature wb_put() in locked_inode_to_wb_and_lock_list()
blk-mq: Use proper cpumask iterator
mtip32xx: fix checks for dma mapping errors
lightnvm: do not load L2P table if not supported
lightnvm: do not reserve lun on l2p loading
nvme: lightnvm: return ppa completion status
lightnvm: add a bitmap of luns
lightnvm: specify target's logical address area
null_blk: add lightnvm null_blk device to the nullb_list
This adds basic polling support for vhost.
Reworks virtio to optionally use DMA API, fixing it on Xen.
Balloon stats gained a new entry.
Using the new napi_alloc_skb speeds up virtio net.
virtio blk stats can now be read while another VCPU
us busy inflating or deflating the balloon.
Plus misc cleanups in various places.
Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1
iQEcBAABAgAGBQJW7qRJAAoJECgfDbjSjVRpVNoH/A7z+lZ6nooSJ9fUBtAAlwit
mE1VKi8g0G6naV1NVLFVe7hPAejExGiHfR3ZUrVoenJKj2yeW/DFojFC10YR/KTe
ac7Imuc+owA3UOE/QpeGBs59+EEWKTZUYt6r8HSJVwoodeosw9v2ecP/Iwhbax8H
a4V3HqOADjKnHg73R9o3u+bAgA1GrGYHeK0AfhCBSTNwlPdxkvf0463HgfOpM4nl
/sNoFWO3vOyekk+loIk+jpmWVIoIfG2NFzW4lPwEPkfqUBX7r0ei/NR23hIqHL7r
QZ6vMj1Ew9qctUONbJu4kXjuV2Vk9NhxwbDjoJtm8plKL2hz2prJynUEogkHh2g=
=VMD0
-----END PGP SIGNATURE-----
Merge tag 'for_linus' of git://git.kernel.org/pub/scm/linux/kernel/git/mst/vhost
Pull virtio/vhost updates from Michael Tsirkin:
"New features, performance improvements, cleanups:
- basic polling support for vhost
- rework virtio to optionally use DMA API, fixing it on Xen
- balloon stats gained a new entry
- using the new napi_alloc_skb speeds up virtio net
- virtio blk stats can now be read while another VCPU is busy
inflating or deflating the balloon
plus misc cleanups in various places"
* tag 'for_linus' of git://git.kernel.org/pub/scm/linux/kernel/git/mst/vhost:
virtio_net: replace netdev_alloc_skb_ip_align() with napi_alloc_skb()
vhost_net: basic polling support
vhost: introduce vhost_vq_avail_empty()
vhost: introduce vhost_has_work()
virtio_balloon: Allow to resize and update the balloon stats in parallel
virtio_balloon: Use a workqueue instead of "vballoon" kthread
virtio/s390: size of SET_IND payload
virtio/s390: use dev_to_virtio
vhost: rename vhost_init_used()
vhost: rename cross-endian helpers
virtio_blk: VIRTIO_BLK_F_WCE->VIRTIO_BLK_F_FLUSH
vring: Use the DMA API on Xen
virtio_pci: Use the DMA API if enabled
virtio_mmio: Use the DMA API if enabled
virtio: Add improved queue allocation API
virtio_ring: Support DMA APIs
vring: Introduce vring_use_dma_api()
s390/dma: Allow per device dma ops
alpha/dma: use common noop dma ops
dma: Provide simple noop dma ops
Merge second patch-bomb from Andrew Morton:
- a couple of hotfixes
- the rest of MM
- a new timer slack control in procfs
- a couple of procfs fixes
- a few misc things
- some printk tweaks
- lib/ updates, notably to radix-tree.
- add my and Nick Piggin's old userspace radix-tree test harness to
tools/testing/radix-tree/. Matthew said it was a godsend during the
radix-tree work he did.
- a few code-size improvements, switching to __always_inline where gcc
screwed up.
- partially implement character sets in sscanf
* emailed patches from Andrew Morton <akpm@linux-foundation.org>: (118 commits)
sscanf: implement basic character sets
lib/bug.c: use common WARN helper
param: convert some "on"/"off" users to strtobool
lib: add "on"/"off" support to kstrtobool
lib: update single-char callers of strtobool()
lib: move strtobool() to kstrtobool()
include/linux/unaligned: force inlining of byteswap operations
include/uapi/linux/byteorder, swab: force inlining of some byteswap operations
include/asm-generic/atomic-long.h: force inlining of some atomic_long operations
usb: common: convert to use match_string() helper
ide: hpt366: convert to use match_string() helper
ata: hpt366: convert to use match_string() helper
power: ab8500: convert to use match_string() helper
power: charger_manager: convert to use match_string() helper
drm/edid: convert to use match_string() helper
pinctrl: convert to use match_string() helper
device property: convert to use match_string() helper
lib/string: introduce match_string() helper
radix-tree tests: add test for radix_tree_iter_next
radix-tree tests: add regression3 test
...
exec_drive_taskfile() checks for dma mapping errors by comparison
returned address with zero, while pci_dma_mapping_error() should be used.
Found by Linux Driver Verification project (linuxtesting.org).
Signed-off-by: Alexey Khoroshilov <khoroshilov@ispras.ru>
Signed-off-by: Jens Axboe <axboe@fb.com>
After register null_blk devices into lightnvm, we forget
to add these devices to the the nullb_list, makes them
invisible to the null_blk driver.
Signed-off-by: Wenwei Tao <ww.tao0320@gmail.com>
Fixes: a514379b0c ("null_blk: oops when initializing without lightnvm")
Signed-off-by: Jens Axboe <axboe@fb.com>
Pull block driver updates from Jens Axboe:
"This is the block driver pull request for this merge window. It sits
on top of for-4.6/core, that was just sent out.
This contains:
- A set of fixes for lightnvm. One from Alan, fixing an overflow,
and the rest from the usual suspects, Javier and Matias.
- A set of fixes for nbd from Markus and Dan, and a fixup from Arnd
for correct usage of the signed 64-bit divider.
- A set of bug fixes for the Micron mtip32xx, from Asai.
- A fix for the brd discard handling from Bart.
- Update the maintainers entry for cciss, since that hardware has
transferred ownership.
- Three bug fixes for bcache from Eric Wheeler.
- Set of fixes for xen-blk{back,front} from Jan and Konrad.
- Removal of the cpqarray driver. It has been disabled in Kconfig
since 2013, and we were initially scheduled to remove it in 3.15.
- Various updates and fixes for NVMe, with the most important being:
- Removal of the per-device NVMe thread, replacing that with a
watchdog timer instead. From Christoph.
- Exposing the namespace WWID through sysfs, from Keith.
- Set of cleanups from Ming Lin.
- Logging the controller device name instead of the underlying
PCI device name, from Sagi.
- And a bunch of fixes and optimizations from the usual suspects
in this area"
* 'for-4.6/drivers' of git://git.kernel.dk/linux-block: (49 commits)
NVMe: Expose ns wwid through single sysfs entry
drivers:block: cpqarray clean up
brd: Fix discard request processing
cpqarray: remove it from the kernel
cciss: update MAINTAINERS
NVMe: Remove unused sq_head read in completion path
bcache: fix cache_set_flush() NULL pointer dereference on OOM
bcache: cleaned up error handling around register_cache()
bcache: fix race of writeback thread starting before complete initialization
NVMe: Create discard zero quirk white list
nbd: use correct div_s64 helper
mtip32xx: remove unneeded variable in mtip_cmd_timeout()
lightnvm: generalize rrpc ppa calculations
lightnvm: remove struct nvm_dev->total_blocks
lightnvm: rename ->nr_pages to ->nr_sects
lightnvm: update closed list outside of intr context
xen/blback: Fit the important information of the thread in 17 characters
lightnvm: fold get bb tbl when using dual/quad plane mode
lightnvm: fix up nonsensical configure overrun checking
xen-blkback: advertise indirect segment support earlier
...
The success of CMA allocation largely depends on the success of
migration and key factor of it is page reference count. Until now, page
reference is manipulated by direct calling atomic functions so we cannot
follow up who and where manipulate it. Then, it is hard to find actual
reason of CMA allocation failure. CMA allocation should be guaranteed
to succeed so finding offending place is really important.
In this patch, call sites where page reference is manipulated are
converted to introduced wrapper function. This is preparation step to
add tracepoint to each page reference manipulation function. With this
facility, we can easily find reason of CMA allocation failure. There is
no functional change in this patch.
In addition, this patch also converts reference read sites. It will
help a second step that renames page._count to something else and
prevents later attempt to direct access to it (Suggested by Andrew).
Signed-off-by: Joonsoo Kim <iamjoonsoo.kim@lge.com>
Acked-by: Michal Nazarewicz <mina86@mina86.com>
Acked-by: Vlastimil Babka <vbabka@suse.cz>
Cc: Minchan Kim <minchan@kernel.org>
Cc: Mel Gorman <mgorman@techsingularity.net>
Cc: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>
Cc: Sergey Senozhatsky <sergey.senozhatsky.work@gmail.com>
Cc: Steven Rostedt <rostedt@goodmis.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Pull crypto update from Herbert Xu:
"Here is the crypto update for 4.6:
API:
- Convert remaining crypto_hash users to shash or ahash, also convert
blkcipher/ablkcipher users to skcipher.
- Remove crypto_hash interface.
- Remove crypto_pcomp interface.
- Add crypto engine for async cipher drivers.
- Add akcipher documentation.
- Add skcipher documentation.
Algorithms:
- Rename crypto/crc32 to avoid name clash with lib/crc32.
- Fix bug in keywrap where we zero the wrong pointer.
Drivers:
- Support T5/M5, T7/M7 SPARC CPUs in n2 hwrng driver.
- Add PIC32 hwrng driver.
- Support BCM6368 in bcm63xx hwrng driver.
- Pack structs for 32-bit compat users in qat.
- Use crypto engine in omap-aes.
- Add support for sama5d2x SoCs in atmel-sha.
- Make atmel-sha available again.
- Make sahara hashing available again.
- Make ccp hashing available again.
- Make sha1-mb available again.
- Add support for multiple devices in ccp.
- Improve DMA performance in caam.
- Add hashing support to rockchip"
* 'linus' of git://git.kernel.org/pub/scm/linux/kernel/git/herbert/crypto-2.6: (116 commits)
crypto: qat - remove redundant arbiter configuration
crypto: ux500 - fix checks of error code returned by devm_ioremap_resource()
crypto: atmel - fix checks of error code returned by devm_ioremap_resource()
crypto: qat - Change the definition of icp_qat_uof_regtype
hwrng: exynos - use __maybe_unused to hide pm functions
crypto: ccp - Add abstraction for device-specific calls
crypto: ccp - CCP versioning support
crypto: ccp - Support for multiple CCPs
crypto: ccp - Remove check for x86 family and model
crypto: ccp - memset request context to zero during import
lib/mpi: use "static inline" instead of "extern inline"
lib/mpi: avoid assembler warning
hwrng: bcm63xx - fix non device tree compatibility
crypto: testmgr - allow rfc3686 aes-ctr variants in fips mode.
crypto: qat - The AE id should be less than the maximal AE number
lib/mpi: Endianness fix
crypto: rockchip - add hash support for crypto engine in rk3288
crypto: xts - fix compile errors
crypto: doc - add skcipher API documentation
crypto: doc - update AEAD AD handling
...
gcc-6.0 found an ancient bug in the paride driver, which had a
"module_param(verbose, bool, 0);" since before 2.6.12, but actually uses
it to accept '0', '1' or '2' as arguments:
drivers/block/paride/pd.c: In function 'pd_init_dev_parms':
drivers/block/paride/pd.c:298:29: warning: comparison of constant '1' with boolean expression is always false [-Wbool-compare]
#define DBMSG(msg) ((verbose>1)?(msg):NULL)
In 2012, Rusty did a cleanup patch that also changed the type of the
variable to 'bool', which introduced what is now a gcc warning.
This changes the type back to 'int' and adapts the module_param() line
instead, so it should work as documented in case anyone ever cares about
running the ancient driver with debugging.
Fixes: 90ab5ee941 ("module_param: make bool parameters really bool (drivers & misc)")
Signed-off-by: Arnd Bergmann <arnd@arndb.de>
Rusty Russell <rusty@rustcorp.com.au>
Cc: Tim Waugh <tim@cyberelk.net>
Cc: Sudip Mukherjee <sudipm.mukherjee@gmail.com>
Cc: Jens Axboe <axboe@fb.com>
Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Commit d436641439 ("cpqarray: remove it from the kernel") removes the
Kconfig option BLK_CPQ_DA and cpqarray.
Remove the dead build rule in the Makefile.
Signed-off-by: Valentin Rothberg <valentinrothberg@gmail.com>
Signed-off-by: Jens Axboe <axboe@fb.com>
Avoid that discard requests with size => PAGE_SIZE fail with
-EIO. Refuse discard requests if the discard size is not a
multiple of the page size.
Fixes: 2dbe549576 ("brd: Refuse improperly aligned discard requests")
Signed-off-by: Bart Van Assche <bart.vanassche@sandisk.com>
Reviewed-by: Jan Kara <jack@suse.com>
Cc: Christoph Hellwig <hch@lst.de>
Cc: Robert Elliot <elliott@hp.com>
Cc: stable <stable@vger.kernel.org> # v4.4+
Signed-off-by: Jens Axboe <axboe@fb.com>
We disabled the ability to enable this driver back in October of 2013,
we should be able to safely remove it at this point. The initial goal
was to remove it in 3.15, so now is the time.
Signed-off-by: Jens Axboe <axboe@fb.com>
The do_div() macro now checks its arguments for the correct type,
and refuses anything other than u64, so we get a warning about
nbd_ioctl passing in an loff_t:
drivers/block/nbd.c: In function '__nbd_ioctl':
drivers/block/nbd.c:757:77: error: comparison of distinct pointer types lacks a cast [-Werror]
This changes the nbd code to use div_s64() instead, which takes
a signed argument.
Signed-off-by: Arnd Bergmann <arnd@arndb.de>
Fixes: 37091fdd83 ("nbd: Create size change events for userspace")
Signed-off-by: Jens Axboe <axboe@fb.com>
The processes names are truncated to 17, while we had the length
of the process as name 20 - which meant that while we filled
it out with various details - the last 3 characters (which had
the queue number) never surfaced to the user-space.
To simplify this and be able to fit the device name, domain id,
and the queue number we remove the 'blkback' from the name.
Prior to this patch the device name is "blkback.<domid>.<name>"
for example: blkback.8.xvda, blkback.11.hda.
With the multiqueue block backend we add "-%d" for the queue.
But sadly this is already way past the limit so it gets stripped.
Possible solution had been identified by Ian:
http://lists.xenproject.org/archives/html/xen-devel/2015-05/msg03516.html
"
If you are pressed for space then the "xvd" is probably a bit redundant
in a string which starts blkbk.
The guest may not even call the device xvdN (iirc BSD has another
prefix) any how, so having blkback say so seems of limited use anyway.
Since this seems to not include a partition number how does this work in
the split partition scheme? (i.e. one where the guest is given xvda1 and
xvda2 rather than xvda with a partition table)
[It will be 'blkback.8.xvda1', and 'blkback.11.xvda2']
Perhaps something derived from one of the schemes in
http://xenbits.xen.org/docs/unstable/misc/vbd-interface.txt might be a
better fit?
After a bit of discussion (see
http://lists.xenproject.org/archives/html/xen-devel/2015-12/msg01588.html)
we settled on dropping the "blback" part.
This will make it possible to have the <domid>.<name>-<queue>:
[1.xvda-0]
[1.xvda-1]
And we enough space to make it go up to:
[32100.xvdfg9-5]
Acked-by: Roger Pau Monné <roger.pau@citrix.com>
Reported-by: Jan Beulich <jbeulich@suse.com>
Signed-off-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
There's no reason to defer this until the connect phase, and in fact
there are frontend implementations expecting this to be available
earlier. Move it into the probe function.
Acked-by: Roger Pau Monné <roger.pau@citrix.com>
Signed-off-by: Jan Beulich <jbeulich@suse.com>
Cc: Bob Liu <bob.liu@oracle.com>
Signed-off-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
"max" is rather ambiguous and carries pretty little meaning, the more
that there are also "max_queues" and "max_ring_page_order". Make this
"max_indirect_segments" instead, and at once change the type from int
to uint (to match the respective variable's type).
Acked-by: Roger Pau Monné <roger.pau@citrix.com>
Signed-off-by: Jan Beulich <jbeulich@suse.com>
Signed-off-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
Fail all pending requests after surprise removal of a drive.
Signed-off-by: Vignesh Gunasekaran <vgunasekaran@micron.com>
Signed-off-by: Selvan Mani <smani@micron.com>
Signed-off-by: Asai Thambi S P <asamymuthupa@micron.com>
Cc: stable@vger.kernel.org
Signed-off-by: Jens Axboe <axboe@fb.com>
Allow device initialization to finish gracefully when it is in
FTL rebuild failure state. Also, recover device out of this state
after successfully secure erasing it.
Signed-off-by: Selvan Mani <smani@micron.com>
Signed-off-by: Vignesh Gunasekaran <vgunasekaran@micron.com>
Signed-off-by: Asai Thambi S P <asamymuthupa@micron.com>
Cc: stable@vger.kernel.org
Signed-off-by: Jens Axboe <axboe@fb.com>
Flush inflight IOs using fsync_bdev() when the device is safely
removed. Also, block further IOs in device open function.
Signed-off-by: Selvan Mani <smani@micron.com>
Signed-off-by: Rajesh Kumar Sambandam <rsambandam@micron.com>
Signed-off-by: Asai Thambi S P <asamymuthupa@micron.com>
Cc: stable@vger.kernel.org
Signed-off-by: Jens Axboe <axboe@fb.com>
When FTL rebuild is in progress, alloc_disk() initializes the disk
but device node will be created by add_disk() only after successful
completion of FTL rebuild. So, skip deletion of device node in
removal path when FTL rebuild is in progress.
Signed-off-by: Selvan Mani <smani@micron.com>
Signed-off-by: Asai Thambi S P <asamymuthupa@micron.com>
Cc: stable@vger.kernel.org
Signed-off-by: Jens Axboe <axboe@fb.com>
Prevent standby immediate command from being issued in remove,
suspend and shutdown paths, while drive is in FTL rebuild process.
Signed-off-by: Selvan Mani <smani@micron.com>
Signed-off-by: Vignesh Gunasekaran <vgunasekaran@micron.com>
Signed-off-by: Asai Thambi S P <asamymuthupa@micron.com>
Cc: stable@vger.kernel.org
Signed-off-by: Jens Axboe <axboe@fb.com>
Print exact time when an internal command is interrupted.
Signed-off-by: Selvan Mani <smani@micron.com>
Signed-off-by: Rajesh Kumar Sambandam <rsambandam@micron.com>
Signed-off-by: Asai Thambi S P <asamymuthupa@micron.com>
Cc: stable@vger.kernel.org
Signed-off-by: Jens Axboe <axboe@fb.com>
Remove setting and clearing MTIP_PF_EH_ACTIVE_BIT flag in
mtip_handle_tfe() as they are redundant. Also avoid waking
up service thread from mtip_handle_tfe() because it is
already woken up in case of taskfile error.
Signed-off-by: Selvan Mani <smani@micron.com>
Signed-off-by: Rajesh Kumar Sambandam <rsambandam@micron.com>
Signed-off-by: Asai Thambi S P <asamymuthupa@micron.com>
Cc: stable@vger.kernel.org
Signed-off-by: Jens Axboe <axboe@fb.com>
Service thread does not detect the need for taskfile error hanlding. Fixed the
flag condition to process taskfile error.
Signed-off-by: Selvan Mani <smani@micron.com>
Signed-off-by: Asai Thambi S P <asamymuthupa@micron.com>
Cc: stable@vger.kernel.org
Signed-off-by: Jens Axboe <axboe@fb.com>
Latest virtio spec says the feature bit name is VIRTIO_BLK_F_FLUSH,
VIRTIO_BLK_F_WCE is the legacy name. virtio blk header says exactly the
reverse - fix that and update driver code to match.
Cc: Paolo Bonzini <pbonzini@redhat.com>
Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
The userspace needs to know when nbd devices are ready for use.
Currently no events are created for the userspace which doesn't work for
systemd.
See the discussion here: https://github.com/systemd/systemd/pull/358
This patch uses a central point to setup the nbd-internal sizes. A ioctl
to set a size does not lead to a visible size change. The size of the
block device will be kept at 0 until nbd is connected. As soon as it
connects, the size will be changed to the real value and a uevent is
created. When disconnecting, the blockdevice is set to 0 size and
another uevent is generated.
Signed-off-by: Markus Pargmann <mpa@pengutronix.de>
If the LightNVM subsystem is not compiled into the kernel, and the
null_blk device driver requests lightnvm to be initialized. The call to
nvm_register fails and the null_add_dev function cleans up the
initialization. However, at this point the null block device has
already been added to the nullb_list and thus a second cleanup will
occur when the function has returned, that leads to a double call to
blk_cleanup_queue.
Signed-off-by: Matias Bjørling <m@bjorling.me>
Signed-off-by: Jens Axboe <axboe@fb.com>
In case /dev/fdX is open with O_NDELAY / O_NONBLOCK, floppy_open() immediately
succeeds, without performing any further media / controller preparations.
That's "correct" wrt. the NODELAY flag, but is hardly correct wrt. the rest
of the floppy driver, that is not really O_NONBLOCK ready, at all. Therefore
it's not too surprising, that subsequent attempts to work with the
filedescriptor produce bad results. Namely, syzkaller tool has been able
to livelock mmap() on the returned fd to keep waiting on the page unlock
bit forever.
Quite frankly, I have trouble defining what non-blocking behavior would be for
floppies. Is waiting ages for the driver to actually succeed reading a sector
blocking operation? Is waiting for drive motor to start blocking operation? How
about in case of virtualized floppies?
One option would be returning EWOULDBLOCK in case O_NDLEAY / O_NONBLOCK is
being passed to open(). That has a theoretical potential of breaking some
arcane and archaic userspace though.
Let's take a more conservative aproach, and accept the O_NDLEAY flag, and let
the driver behave as usual.
While at it, clean up a bit handling of !(mode & (FMODE_READ|FMODE_WRITE))
case and return EINVAL instead of succeeding as well.
Spotted by syzkaller tool.
Reported-by: Dmitry Vyukov <dvyukov@google.com>
Tested-by: Dmitry Vyukov <dvyukov@google.com>
Signed-off-by: Jiri Kosina <jkosina@suse.cz>
Make the "Attempted send on closed socket" error messages generated in
nbd_request_handler() ratelimited.
When the nbd socket is shutdown, the nbd_request_handler() function emits
an error message for every request remaining in its queue. If the queue
is large, this will spam a large amount of messages to the log. There's
no need for a separate error message for each request, so this patch
ratelimits it.
In the specific case this was found, the system was virtual and the error
messages were logged to the serial port, which overwhelmed it.
Fixes: 4d48a542b4 ("nbd: fix I/O hang on disconnected nbds")
Signed-off-by: Dan Streetman <dan.streetman@canonical.com>
Signed-off-by: Markus Pargmann <mpa@pengutronix.de>
nbd changes properties of the blockdevice depending on flags that were
received. This patch moves this flag parsing into a separate function
nbd_parse_flags().
Signed-off-by: Markus Pargmann <mpa@pengutronix.de>
Group all variables that are reset after a disconnect into reset
functions. This patch adds two of these functions, nbd_reset() and
nbd_bdev_reset().
Signed-off-by: Markus Pargmann <mpa@pengutronix.de>
It may be useful to know in the client that a connection timed out. The
current code returns success for a timeout.
This patch reports the error code -ETIMEDOUT for a timeout.
Signed-off-by: Markus Pargmann <mpa@pengutronix.de>
As discussed on the mailing list, the usage of signals for timeout
handling has a lot of potential issues. The nbd driver used for some
time signals for timeouts. These signals where able to get the threads
out of the blocking socket operations.
This patch removes all signal usage and uses a socket shutdown instead.
The socket descriptor itself is cleared later when the whole nbd device
is closed.
The tasks_lock is removed as we do not depend on this anymore. Instead
a new lock for the socket is introduced so we can safely work with the
socket in the timeout handler outside of the two main threads.
Cc: Oleg Nesterov <oleg@redhat.com>
Cc: Christoph Hellwig <hch@infradead.org>
Signed-off-by: Markus Pargmann <mpa@pengutronix.de>
Reviewed-by: Christoph Hellwig <hch@lst.de>
System block allows the device to initialize with its configured media
manager. The system blocks is written to disk, and read again when media
manager is determined. For this to work, the backend must store the
data. Device drivers, such as null_blk, does not have any backend
storage. This patch allows the media manager to be initialized without a
storage backend.
It also fix incorrect configuration of capabilities in null_blk, as it
does not support get/set bad block interface.
Signed-off-by: Matias Bjørling <m@bjorling.me>
Signed-off-by: Jens Axboe <axboe@fb.com>
Static checker complains about the implemented error handling. It is
indeed wrong. We don't care about the return values of created debugfs
files.
We only have to check the return values of created dirs for NULL
pointer. If we use a null pointer as parent directory for files, this
may lead to debugfs files in wrong places.
Signed-off-by: Markus Pargmann <mpa@pengutronix.de>
floppy_revalidate() doesn't perform any error handling on lock_fdc()
result. lock_fdc() might actually be interrupted by a signal (it waits for
fdc becoming non-busy interruptibly). In such case, floppy_revalidate()
proceeds as if it had claimed the lock, but it fact it doesn't.
In case of multiple threads trying to open("/dev/fdX"), this leads to
serious corruptions all over the place, because all of a sudden there is
no critical section protection (that'd otherwise be guaranteed by locked
fd) whatsoever.
While at this, fix the fact that the 'interruptible' parameter to
lock_fdc() doesn't make any sense whatsoever, because we always wait
interruptibly anyway.
Most of the lock_fdc() callsites do properly handle error (and propagate
EINTR), but floppy_revalidate() and floppy_check_events() don't. Fix this.
Spotted by 'syzkaller' tool.
Reported-by: Dmitry Vyukov <dvyukov@google.com>
Tested-by: Dmitry Vyukov <dvyukov@google.com>
Signed-off-by: Jiri Kosina <jkosina@suse.cz>
Need to reallocate ring info in the resume path, because info->rinfo was freed
in blkif_free(). And 'multi-queue-max-queues' backend reports may have been
changed.
Signed-off-by: Bob Liu <bob.liu@oracle.com>
Reported-and-Tested-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
Signed-off-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
This patch replaces uses of the long obsolete hash interface with
either shash (for non-SG users) or ahash.
Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
Pull Ceph updates from Sage Weil:
"The two main changes are aio support in CephFS, and a series that
fixes several issues in the authentication key timeout/renewal code.
On top of that are a variety of cleanups and minor bug fixes"
* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/sage/ceph-client:
libceph: remove outdated comment
libceph: kill off ceph_x_ticket_handler::validity
libceph: invalidate AUTH in addition to a service ticket
libceph: fix authorizer invalidation, take 2
libceph: clear messenger auth_retry flag if we fault
libceph: fix ceph_msg_revoke()
libceph: use list_for_each_entry_safe
ceph: use i_size_{read,write} to get/set i_size
ceph: re-send AIO write request when getting -EOLDSNAP error
ceph: Asynchronous IO support
ceph: Avoid to propagate the invalid page point
ceph: fix double page_unlock() in page_mkwrite()
rbd: delete an unnecessary check before rbd_dev_destroy()
libceph: use list_next_entry instead of list_entry_next
ceph: ceph_frag_contains_value can be boolean
ceph: remove unused functions in ceph_frag.h
Pull final vfs updates from Al Viro:
- The ->i_mutex wrappers (with small prereq in lustre)
- a fix for too early freeing of symlink bodies on shmem (they need to
be RCU-delayed) (-stable fodder)
- followup to dedupe stuff merged this cycle
* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs:
vfs: abort dedupe loop if fatal signals are pending
make sure that freeing shmem fast symlinks is RCU-delayed
wrappers for ->i_mutex access
lustre: remove unused declaration
There are many locations that do
if (memory_was_allocated_by_vmalloc)
vfree(ptr);
else
kfree(ptr);
but kvfree() can handle both kmalloc()ed memory and vmalloc()ed memory
using is_vmalloc_addr(). Unless callers have special reasons, we can
replace this branch with kvfree(). Please check and reply if you found
problems.
Signed-off-by: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
Acked-by: Michal Hocko <mhocko@suse.com>
Acked-by: Jan Kara <jack@suse.com>
Acked-by: Russell King <rmk+kernel@arm.linux.org.uk>
Reviewed-by: Andreas Dilger <andreas.dilger@intel.com>
Acked-by: "Rafael J. Wysocki" <rjw@rjwysocki.net>
Acked-by: David Rientjes <rientjes@google.com>
Cc: "Luck, Tony" <tony.luck@intel.com>
Cc: Oleg Drokin <oleg.drokin@intel.com>
Cc: Boris Petkov <bp@suse.de>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
parallel to mutex_{lock,unlock,trylock,is_locked,lock_nested},
inode_foo(inode) being mutex_foo(&inode->i_mutex).
Please, use those for access to ->i_mutex; over the coming cycle
->i_mutex will become rwsem, with ->lookup() done with it held
only shared.
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
Pull lightnvm fixes and updates from Jens Axboe:
"This should have been part of the drivers branch, but it arrived a bit
late and wasn't based on the official core block driver branch. So
they got a small scolding, but got a pass since it's still new. Hence
it's in a separate branch.
This is mostly pure fixes, contained to lightnvm/, and minor feature
additions"
* 'for-4.5/lightnvm' of git://git.kernel.dk/linux-block: (26 commits)
lightnvm: ensure that nvm_dev_ops can be used without CONFIG_NVM
lightnvm: introduce factory reset
lightnvm: use system block for mm initialization
lightnvm: introduce ioctl to initialize device
lightnvm: core on-disk initialization
lightnvm: introduce mlc lower page table mappings
lightnvm: add mccap support
lightnvm: manage open and closed blocks separately
lightnvm: fix missing grown bad block type
lightnvm: reference rrpc lun in rrpc block
lightnvm: introduce nvm_submit_ppa
lightnvm: move rq->error to nvm_rq->error
lightnvm: support multiple ppas in nvm_erase_ppa
lightnvm: move the pages per block check out of the loop
lightnvm: sectors first in ppa list
lightnvm: fix locking and mempool in rrpc_lun_gc
lightnvm: put block back to gc list on its reclaim fail
lightnvm: check bi_error in gc
lightnvm: return the get_bb_tbl return value
lightnvm: refactor end_io functions for sync
...
Pull block driver updates from Jens Axboe:
"This is the block driver pull request for 4.5, with the exception of
NVMe, which is in a separate branch and will be posted after this one.
This pull request contains:
- A set of bcache stability fixes, which have been acked by Kent.
These have been used and tested for more than a year by the
community, so it's about time that they got in.
- A set of drbd updates from the drbd team (Andreas, Lars, Philipp)
and Markus Elfring, Oleg Drokin.
- A set of fixes for xen blkback/front from the usual suspects, (Bob,
Konrad) as well as community based fixes from Kiri, Julien, and
Peng.
- A 2038 time fix for sx8 from Shraddha, with a fix from me.
- A small mtip32xx cleanup from Zhu Yanjun.
- A null_blk division fix from Arnd"
* 'for-4.5/drivers' of git://git.kernel.dk/linux-block: (71 commits)
null_blk: use sector_div instead of do_div
mtip32xx: restrict variables visible in current code module
xen/blkfront: Fix crash if backend doesn't follow the right states.
xen/blkback: Fix two memory leaks.
xen/blkback: make st_ statistics per ring
xen/blkfront: Handle non-indirect grant with 64KB pages
xen-blkfront: Introduce blkif_ring_get_request
xen-blkback: clear PF_NOFREEZE for xen_blkif_schedule()
xen/blkback: Free resources if connect_ring failed.
xen/blocks: Return -EXX instead of -1
xen/blkback: make pool of persistent grants and free pages per-queue
xen/blkback: get the number of hardware queues/rings from blkfront
xen/blkback: pseudo support for multi hardware queues/rings
xen/blkback: separate ring information out of struct xen_blkif
xen/blkfront: correct setting for xen_blkif_max_ring_order
xen/blkfront: make persistent grants pool per-queue
xen/blkfront: Remove duplicate setting of ->xbdev.
xen/blkfront: Cleanup of comments, fix unaligned variables, and syntax errors.
xen/blkfront: negotiate number of queues/rings to be used with backend
xen/blkfront: split per device io_lock
...
The rbd_dev_destroy() function tests whether its argument is NULL
and then returns immediately. Thus the test around the call is not needed.
This issue was detected by using the Coccinelle software.
Signed-off-by: Markus Elfring <elfring@users.sourceforge.net>
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
Pull core block updates from Jens Axboe:
"We don't have a lot of core changes this time around, it's mostly in
drivers, which will come in a subsequent pull.
The cores changes include:
- blk-mq
- Prep patch from Christoph, changing blk_mq_alloc_request() to
take flags instead of just using gfp_t for sleep/nosleep.
- Doc patch from me, clarifying the difference between legacy
and blk-mq for timer usage.
- Fixes from Raghavendra for memory-less numa nodes, and a reuse
of CPU masks.
- Cleanup from Geliang Tang, using offset_in_page() instead of open
coding it.
- From Ilya, rename request_queue slab to it reflects what it holds,
and a fix for proper use of bdgrab/put.
- A real fix for the split across stripe boundaries from Keith. We
yanked a broken version of this from 4.4-rc final, this one works.
- From Mike Krinkin, emit a trace message when we split.
- From Wei Tang, two small cleanups, not explicitly clearing memory
that is already cleared"
* 'for-4.5/core' of git://git.kernel.dk/linux-block:
block: use bd{grab,put}() instead of open-coding
block: split bios to max possible length
block: add call to split trace point
blk-mq: Avoid memoryless numa node encoded in hctx numa_node
blk-mq: Reuse hardware context cpumask for tags
blk-mq: add a flags parameter to blk_mq_alloc_request
Revert "blk-flush: Queue through IO scheduler when flush not required"
block: clarify blk_add_timer() use case for blk-mq
bio: use offset_in_page macro
block: do not initialise statics to 0 or NULL
block: do not initialise globals to 0 or NULL
block: rename request_queue slab cache
For the purpose of communicating the optional presence of a 'struct
page' for the pfn returned from ->direct_access(), introduce a type that
encapsulates a page-frame-number plus flags. These flags contain the
historical "page_link" encoding for a scatterlist entry, but can also
denote "device memory". Where "device memory" is a set of pfns that are
not part of the kernel's linear mapping by default, but are accessed via
the same memory controller as ram.
The motivation for this new type is large capacity persistent memory
that needs struct page entries in the 'memmap' to support 3rd party DMA
(i.e. O_DIRECT I/O with a persistent memory source/target). However,
we also need it in support of maintaining a list of mapped inodes which
need to be unmapped at driver teardown or freeze_bdev() time.
Signed-off-by: Dan Williams <dan.j.williams@intel.com>
Cc: Christoph Hellwig <hch@lst.de>
Cc: Dave Hansen <dave@sr71.net>
Cc: Ross Zwisler <ross.zwisler@linux.intel.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
The use of idr_remove() is forbidden in the callback functions of
idr_for_each(). It is therefore unsafe to call idr_remove in
zram_remove().
This patch moves the call to idr_remove() from zram_remove() to
hot_remove_store(). In the detroy_devices() path, idrs are removed by
idr_destroy(). This solves an use-after-free detected by KASan.
[akpm@linux-foundation.org: fix coding stype, per Sergey]
Signed-off-by: Jerome Marchand <jmarchan@redhat.com>
Acked-by: Sergey Senozhatsky <sergey.senozhatsky@gmail.com>
Cc: Minchan Kim <minchan@kernel.org>
Cc: <stable@vger.kernel.org> [4.2+]
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Merge first patch-bomb from Andrew Morton:
- A few hotfixes which missed 4.4 becasue I was asleep. cc'ed to
-stable
- A few misc fixes
- OCFS2 updates
- Part of MM. Including pretty large changes to page-flags handling
and to thp management which have been buffered up for 2-3 cycles now.
I have a lot of MM material this time.
[ It turns out the THP part wasn't quite ready, so that got dropped from
this series - Linus ]
* emailed patches from Andrew Morton <akpm@linux-foundation.org>: (117 commits)
zsmalloc: reorganize struct size_class to pack 4 bytes hole
mm/zbud.c: use list_last_entry() instead of list_tail_entry()
zram/zcomp: do not zero out zcomp private pages
zram: pass gfp from zcomp frontend to backend
zram: try vmalloc() after kmalloc()
zram/zcomp: use GFP_NOIO to allocate streams
mm: add tracepoint for scanning pages
drivers/base/memory.c: fix kernel warning during memory hotplug on ppc64
mm/page_isolation: use macro to judge the alignment
mm: fix noisy sparse warning in LIBCFS_ALLOC_PRE()
mm: rework virtual memory accounting
include/linux/memblock.h: fix ordering of 'flags' argument in comments
mm: move lru_to_page to mm_inline.h
Documentation/filesystems: describe the shared memory usage/accounting
memory-hotplug: don't BUG() in register_memory_resource()
hugetlb: make mm and fs code explicitly non-modular
mm/swapfile.c: use list_for_each_entry_safe in free_swap_count_continuations
mm: /proc/pid/clear_refs: no need to clear VM_SOFTDIRTY in clear_soft_dirty_pmd()
mm: make sure isolate_lru_page() is never called for tail page
vmstat: make vmstat_updater deferrable again and shut down on idle
...
Do not __GFP_ZERO allocated zcomp ->private pages. We keep allocated
streams around and use them for read/write requests, so we supply a
zeroed out ->private to compression algorithm as a scratch buffer only
once -- the first time we use that stream. For the rest of IO requests
served by this stream ->private usually contains some temporarily data
from the previous requests.
Signed-off-by: Sergey Senozhatsky <sergey.senozhatsky@gmail.com>
Acked-by: Minchan Kim <minchan@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Each zcomp backend uses own gfp flag but it's pointless because the
context they could be called is driven by upper layer(ie, zcomp
frontend). As well, zcomp frondend could call them in different
context. One context(ie, zram init part) is it should be better to make
sure successful allocation other context(ie, further stream allocation
part for accelarating I/O speed) is just optional so let's pass gfp down
from driver (ie, zcomp frontend) like normal MM convention.
[sergey.senozhatsky@gmail.com: add missing __vmalloc zero and highmem gfps]
Signed-off-by: Minchan Kim <minchan@kernel.org>
Signed-off-by: Sergey Senozhatsky <sergey.senozhatsky@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
When we're using LZ4 multi compression streams for zram swap, we found
out page allocation failure message in system running test. That was
not only once, but a few(2 - 5 times per test). Also, some failure
cases were continually occurring to try allocation order 3.
In order to make parallel compression private data, we should call
kzalloc() with order 2/3 in runtime(lzo/lz4). But if there is no order
2/3 size memory to allocate in that time, page allocation fails. This
patch makes to use vmalloc() as fallback of kmalloc(), this prevents
page alloc failure warning.
After using this, we never found warning message in running test, also
It could reduce process startup latency about 60-120ms in each case.
For reference a call trace :
Binder_1: page allocation failure: order:3, mode:0x10c0d0
CPU: 0 PID: 424 Comm: Binder_1 Tainted: GW 3.10.49-perf-g991d02b-dirty #20
Call trace:
dump_backtrace+0x0/0x270
show_stack+0x10/0x1c
dump_stack+0x1c/0x28
warn_alloc_failed+0xfc/0x11c
__alloc_pages_nodemask+0x724/0x7f0
__get_free_pages+0x14/0x5c
kmalloc_order_trace+0x38/0xd8
zcomp_lz4_create+0x2c/0x38
zcomp_strm_alloc+0x34/0x78
zcomp_strm_multi_find+0x124/0x1ec
zcomp_strm_find+0xc/0x18
zram_bvec_rw+0x2fc/0x780
zram_make_request+0x25c/0x2d4
generic_make_request+0x80/0xbc
submit_bio+0xa4/0x15c
__swap_writepage+0x218/0x230
swap_writepage+0x3c/0x4c
shrink_page_list+0x51c/0x8d0
shrink_inactive_list+0x3f8/0x60c
shrink_lruvec+0x33c/0x4cc
shrink_zone+0x3c/0x100
try_to_free_pages+0x2b8/0x54c
__alloc_pages_nodemask+0x514/0x7f0
__get_free_pages+0x14/0x5c
proc_info_read+0x50/0xe4
vfs_read+0xa0/0x12c
SyS_read+0x44/0x74
DMA: 3397*4kB (MC) 26*8kB (RC) 0*16kB 0*32kB 0*64kB 0*128kB 0*256kB
0*512kB 0*1024kB 0*2048kB 0*4096kB = 13796kB
[minchan@kernel.org: change vmalloc gfp and adding comment about gfp]
[sergey.senozhatsky@gmail.com: tweak comments and styles]
Signed-off-by: Kyeongdon Kim <kyeongdon.kim@lge.com>
Signed-off-by: Minchan Kim <minchan@kernel.org>
Acked-by: Sergey Senozhatsky <sergey.senozhatsky@gmail.com>
Sergey Senozhatsky <sergey.senozhatsky.work@gmail.com>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
We can end up allocating a new compression stream with GFP_KERNEL from
within the IO path, which may result is nested (recursive) IO
operations. That can introduce problems if the IO path in question is a
reclaimer, holding some locks that will deadlock nested IOs.
Allocate streams and working memory using GFP_NOIO flag, forbidding
recursive IO and FS operations.
An example:
inconsistent {IN-RECLAIM_FS-W} -> {RECLAIM_FS-ON-W} usage.
git/20158 [HC0[0]:SC0[0]:HE1:SE1] takes:
(jbd2_handle){+.+.?.}, at: start_this_handle+0x4ca/0x555
{IN-RECLAIM_FS-W} state was registered at:
__lock_acquire+0x8da/0x117b
lock_acquire+0x10c/0x1a7
start_this_handle+0x52d/0x555
jbd2__journal_start+0xb4/0x237
__ext4_journal_start_sb+0x108/0x17e
ext4_dirty_inode+0x32/0x61
__mark_inode_dirty+0x16b/0x60c
iput+0x11e/0x274
__dentry_kill+0x148/0x1b8
shrink_dentry_list+0x274/0x44a
prune_dcache_sb+0x4a/0x55
super_cache_scan+0xfc/0x176
shrink_slab.part.14.constprop.25+0x2a2/0x4d3
shrink_zone+0x74/0x140
kswapd+0x6b7/0x930
kthread+0x107/0x10f
ret_from_fork+0x3f/0x70
irq event stamp: 138297
hardirqs last enabled at (138297): debug_check_no_locks_freed+0x113/0x12f
hardirqs last disabled at (138296): debug_check_no_locks_freed+0x33/0x12f
softirqs last enabled at (137818): __do_softirq+0x2d3/0x3e9
softirqs last disabled at (137813): irq_exit+0x41/0x95
other info that might help us debug this:
Possible unsafe locking scenario:
CPU0
----
lock(jbd2_handle);
<Interrupt>
lock(jbd2_handle);
*** DEADLOCK ***
5 locks held by git/20158:
#0: (sb_writers#7){.+.+.+}, at: [<ffffffff81155411>] mnt_want_write+0x24/0x4b
#1: (&type->i_mutex_dir_key#2/1){+.+.+.}, at: [<ffffffff81145087>] lock_rename+0xd9/0xe3
#2: (&sb->s_type->i_mutex_key#11){+.+.+.}, at: [<ffffffff8114f8e2>] lock_two_nondirectories+0x3f/0x6b
#3: (&sb->s_type->i_mutex_key#11/4){+.+.+.}, at: [<ffffffff8114f909>] lock_two_nondirectories+0x66/0x6b
#4: (jbd2_handle){+.+.?.}, at: [<ffffffff811e31db>] start_this_handle+0x4ca/0x555
stack backtrace:
CPU: 2 PID: 20158 Comm: git Not tainted 4.1.0-rc7-next-20150615-dbg-00016-g8bdf555-dirty #211
Call Trace:
dump_stack+0x4c/0x6e
mark_lock+0x384/0x56d
mark_held_locks+0x5f/0x76
lockdep_trace_alloc+0xb2/0xb5
kmem_cache_alloc_trace+0x32/0x1e2
zcomp_strm_alloc+0x25/0x73 [zram]
zcomp_strm_multi_find+0xe7/0x173 [zram]
zcomp_strm_find+0xc/0xe [zram]
zram_bvec_rw+0x2ca/0x7e0 [zram]
zram_make_request+0x1fa/0x301 [zram]
generic_make_request+0x9c/0xdb
submit_bio+0xf7/0x120
ext4_io_submit+0x2e/0x43
ext4_bio_write_page+0x1b7/0x300
mpage_submit_page+0x60/0x77
mpage_map_and_submit_buffers+0x10f/0x21d
ext4_writepages+0xc8c/0xe1b
do_writepages+0x23/0x2c
__filemap_fdatawrite_range+0x84/0x8b
filemap_flush+0x1c/0x1e
ext4_alloc_da_blocks+0xb8/0x117
ext4_rename+0x132/0x6dc
? mark_held_locks+0x5f/0x76
ext4_rename2+0x29/0x2b
vfs_rename+0x540/0x636
SyS_renameat2+0x359/0x44d
SyS_rename+0x1e/0x20
entry_SYSCALL_64_fastpath+0x12/0x6f
[minchan@kernel.org: add stable mark]
Signed-off-by: Sergey Senozhatsky <sergey.senozhatsky@gmail.com>
Acked-by: Minchan Kim <minchan@kernel.org>
Cc: Kyeongdon Kim <kyeongdon.kim@lge.com>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Pull trivial tree updates from Jiri Kosina.
* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/jikos/trivial:
floppy: make local variable non-static
exynos: fixes an incorrect header guard
dt-bindings: fixes some incorrect header guards
cpufreq-dt: correct dead link in documentation
cpufreq: ARM big LITTLE: correct dead link in documentation
treewide: Fix typos in printk
Documentation: filesystem: Fix typo in fs/eventfd.c
fs/super.c: use && instead of & for warn_on condition
Documentation: fix sysfs-ptp
lib: scatterlist: fix Kconfig description
This pull includes driver updates from the usual suspects (bfa, arcmsr,
scsi_dh_alua, lpfc, storvsc, cxlflash). The major change is the addition of
the hisi_sas driver, which is an ARM platform device for SAS. The other
change of note is an enormous style transformation to the atp870u driver
(which is our worst written SCSI driver).
Signed-off-by: James Bottomley <James.Bottomley@HansenPartnership.com>
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v2
iQEcBAABAgAGBQJWluqqAAoJEDeqqVYsXL0McuwH/1oqvFOagsvoDcwDyNAUR/eW
VAH454ndIJ0eSXORNfA7ko3ZQKa53x1WN9eKr+RHI7lpGCjwBz2MjnvQsnKISvXp
K0owkJTcAAF+Wdq7rdNlm1VlQHuLvG8TMTnno+NY3CtxCR2yiRWlctkNkjr0rWUv
leXJkXZSThkkiY/rEDZZXee8Ajwac87QT+ELEqCT2HueGZD+J8s59JpsOtZdt6Bj
n94ydOuct8hF3Xt3pdu1oDRpWpoJIyjHtYhdrvzSiKKBHTWtuq1oN0Cwnp0qtEDD
X3K1Mr0yBmAjTOsK+y+bZnJ1y7qJLLt5ZHmVixkzFWujXPNbrIsyYkV5eI432XA=
=ggNi
-----END PGP SIGNATURE-----
Merge tag 'scsi-misc' of git://git.kernel.org/pub/scm/linux/kernel/git/jejb/scsi
Pull first round of SCSI updates from James Bottomley:
"This includes driver updates from the usual suspects (bfa, arcmsr,
scsi_dh_alua, lpfc, storvsc, cxlflash).
The major change is the addition of the hisi_sas driver, which is an
ARM platform device for SAS. The other change of note is an enormous
style transformation to the atp870u driver (which is our worst written
SCSI driver)"
* tag 'scsi-misc' of git://git.kernel.org/pub/scm/linux/kernel/git/jejb/scsi: (169 commits)
cxlflash: Enable device id for future IBM CXL adapter
cxlflash: Resolve oops in wait_port_offline
cxlflash: Fix to resolve cmd leak after host reset
cxlflash: Removed driver date print
cxlflash: Fix to avoid virtual LUN failover failure
cxlflash: Fix to escalate LINK_RESET also on port 1
storvsc: Tighten up the interrupt path
storvsc: Refactor the code in storvsc_channel_init()
storvsc: Properly support Fibre Channel devices
storvsc: Fix a bug in the layout of the hv_fc_wwn_packet
mvsas: Add SGPIO support to Marvell 94xx
mpt3sas: A correction in unmap_resources
hpsa: Add box and bay information for enclosure devices
hpsa: Change SAS transport devices to bus 0.
hpsa: fix path_info_show
cciss: print max outstanding commands as a hex value
scsi_debug: Increase the reported optimal transfer length
lpfc: Update version to 11.0.0.10 for upstream patch set
lpfc: Use kzalloc instead of kmalloc
lpfc: Delete unnecessary checks before the function call "mempool_destroy"
...
Dividing a sector_t number should be done using sector_div rather than do_div
to optimize the 32-bit sector_t case, and with the latest do_div optimizations,
we now get a compile-time warning for this:
arch/arm/include/asm/div64.h:32:95: note: expected 'uint64_t * {aka long long unsigned int *}' but argument is of type 'sector_t * {aka long unsigned int *}'
drivers/block/null_blk.c:521:81: warning: comparison of distinct pointer types lacks a cast
This changes the newly added code to use sector_div. It is a simplified version
of the original patch, as Linus Torvalds pointed out that we should not be using
an expensive division function in the first place.
This version was suggested by Matias Bjorling.
Signed-off-by: Arnd Bergmann <arnd@arndb.de>
Cc: Matias Bjorling <m@bjorling.me>
Fixes: b2b7e00148 ("null_blk: register as a LightNVM device")
Signed-off-by: Jens Axboe <axboe@fb.com>
Konrad writes:
The pull is based on converting the backend driver into an multiqueue
driver and exposing more than one queue to the frontend. As such we had
to modify the frontend and also fix a bunch of bugs around this.
The original work is based on Arianna Avanzini's work as an OPW intern.
Bob took over the work and had been massaging it for quite some time.
Also included are are features to 64KB page support for ARM and various
bug-fixes.
To implement sync I/O support within the LightNVM core, the end_io
functions are refactored to take an end_io function pointer instead of
testing for initialized media manager, followed by calling its end_io
function.
Sync I/O can then be implemented using a callback that signal I/O
completion. This is similar to the logic found in blk_to_execute_io().
By implementing it this way, the underlying device I/Os submission logic
is abstracted away from core, targets, and media managers.
Signed-off-by: Matias Bjørling <m@bjorling.me>
Signed-off-by: Jens Axboe <axboe@fb.com>