linux

Commit Graph

Author	SHA1	Message	Date
Chaitanya Huilgol	ba988f87f5	libceph: tcp_nodelay support TCP_NODELAY socket option set on connection sockets, disables Nagle’s algorithm and improves latency characteristics. tcp_nodelay(default)/notcp_nodelay option flags provided to enable/disable setting the socket option. Signed-off-by: Chaitanya Huilgol <chaitanya.huilgol@sandisk.com> [idryomov@redhat.com: NO_TCP_NODELAY -> TCP_NODELAY, minor adjustments] Signed-off-by: Ilya Dryomov <idryomov@redhat.com>	2015-02-19 13:31:40 +03:00
Ilya Dryomov	f646912d10	libceph: use mon_client.c/put_generic_request() more Signed-off-by: Ilya Dryomov <idryomov@redhat.com>	2015-02-19 13:31:37 +03:00
Ilya Dryomov	7a6fdeb2b1	libceph: nuke pool op infrastructure On Mon, Dec 22, 2014 at 5:35 PM, Sage Weil <sage@newdream.net> wrote: > On Mon, 22 Dec 2014, Ilya Dryomov wrote: >> Actually, pool op stuff has been unused for over two years - looks like >> it was added for rbd create_snap and that got ripped out in 2012. It's >> unlikely we'd ever need to manage pools or snaps from the kernel client >> so I think it makes sense to nuke it. Sage? > > Yep! Signed-off-by: Ilya Dryomov <idryomov@redhat.com>	2015-02-19 13:31:37 +03:00
Andrea Arcangeli	7e33912849	mm: gup: use get_user_pages_unlocked This allows those get_user_pages calls to pass FAULT_FLAG_ALLOW_RETRY to the page fault in order to release the mmap_sem during the I/O. Signed-off-by: Andrea Arcangeli <aarcange@redhat.com> Reviewed-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com> Cc: Andres Lagar-Cavilla <andreslc@google.com> Cc: Peter Feiner <pfeiner@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2015-02-11 17:06:05 -08:00
Ilya Dryomov	d7d5a007b1	libceph: fix sparse endianness warnings The only real issue is the one in auth_x.c and it came with 3.19-rc1 merge. Signed-off-by: Ilya Dryomov <idryomov@redhat.com>	2015-01-08 20:36:57 +03:00
Yan, Zheng	715e4cd405	libceph: specify position of extent operation allow specifying position of extent operation in multi-operations osd request. This is required for cephfs to convert inline data to normal data (compare xattr, then write object). Signed-off-by: Yan, Zheng <zyan@redhat.com> Reviewed-by: Ilya Dryomov <idryomov@redhat.com>	2014-12-17 20:09:52 +03:00
Yan, Zheng	864e9197f1	libceph: add CREATE osd operation support Add CEPH_OSD_OP_CREATE support. Also change libceph to not treat CEPH_OSD_OP_DELETE as an extent op and add an assert to that end. Signed-off-by: Yan, Zheng <zyan@redhat.com> Reviewed-by: Ilya Dryomov <idryomov@redhat.com>	2014-12-17 20:09:51 +03:00
Yan, Zheng	d74b50bed0	libceph: add SETXATTR/CMPXATTR osd operations support Signed-off-by: Yan, Zheng <zyan@redhat.com> Reviewed-by: Ilya Dryomov <idryomov@redhat.com>	2014-12-17 20:09:51 +03:00
Yan, Zheng	a3fc98005c	libceph: require cephx message signature by default Signed-off-by: Yan, Zheng <zyan@redhat.com> Reviewed-by: Ilya Dryomov <idryomov@redhat.com>	2014-12-17 20:09:51 +03:00
Yan, Zheng	33d0733796	libceph: message signature support Signed-off-by: Yan, Zheng <zyan@redhat.com>	2014-12-17 20:09:50 +03:00
Yan, Zheng	ae385eaf24	libceph: store session key in cephx authorizer Session key is required when calculating message signature. Save the session key in authorizer, this avoid lookup ticket handler for each message Signed-off-by: Yan, Zheng <zyan@redhat.com>	2014-12-17 20:09:50 +03:00
Ilya Dryomov	4965fc38c4	libceph: nuke ceph_kvfree() Use kvfree() from linux/mm.h instead, which is identical. Also fix the ceph_buffer comment: we will allocate with kmalloc() up to 32k - the value of PAGE_ALLOC_COSTLY_ORDER, but that really is just an implementation detail so don't mention it at all. Signed-off-by: Ilya Dryomov <idryomov@redhat.com>	2014-12-17 20:09:50 +03:00
Ilya Dryomov	cc9f1f518c	libceph: change from BUG to WARN for __remove_osd() asserts No reason to use BUG_ON for osd request list assertions. Signed-off-by: Ilya Dryomov <idryomov@redhat.com> Reviewed-by: Alex Elder <elder@linaro.org>	2014-11-13 22:26:34 +03:00
Ilya Dryomov	ba9d114ec5	libceph: clear r_req_lru_item in __unregister_linger_request() kick_requests() can put linger requests on the notarget list. This means we need to clear the much-overloaded req->r_req_lru_item in __unregister_linger_request() as well, or we get an assertion failure in ceph_osdc_release_request() - !list_empty(&req->r_req_lru_item). AFAICT the assumption was that registered linger requests cannot be on any of req->r_req_lru_item lists, but that's clearly not the case. Signed-off-by: Ilya Dryomov <idryomov@redhat.com> Reviewed-by: Alex Elder <elder@linaro.org>	2014-11-13 22:21:14 +03:00
Ilya Dryomov	a390de0208	libceph: unlink from o_linger_requests when clearing r_osd Requests have to be unlinked from both osd->o_requests (normal requests) and osd->o_linger_requests (linger requests) lists when clearing req->r_osd. Otherwise __unregister_linger_request() gets confused and we trip over a !list_empty(&osd->o_linger_requests) assert in __remove_osd(). MON=1 OSD=1: # cat remove-osd.sh #!/bin/bash rbd create --size 1 test DEV=$(rbd map test) ceph osd out 0 sleep 3 rbd map dne/dne # obtain a new osdmap as a side effect rbd unmap $DEV & # will block sleep 3 ceph osd in 0 Signed-off-by: Ilya Dryomov <idryomov@redhat.com> Reviewed-by: Alex Elder <elder@linaro.org>	2014-11-13 22:21:13 +03:00
Ilya Dryomov	aaef31703a	libceph: do not crash on large auth tickets Large (greater than 32k, the value of PAGE_ALLOC_COSTLY_ORDER) auth tickets will have their buffers vmalloc'ed, which leads to the following crash in crypto: [ 28.685082] BUG: unable to handle kernel paging request at ffffeb04000032c0 [ 28.686032] IP: [<ffffffff81392b42>] scatterwalk_pagedone+0x22/0x80 [ 28.686032] PGD 0 [ 28.688088] Oops: 0000 [#1] PREEMPT SMP [ 28.688088] Modules linked in: [ 28.688088] CPU: 0 PID: 878 Comm: kworker/0:2 Not tainted 3.17.0-vm+ #305 [ 28.688088] Hardware name: Bochs Bochs, BIOS Bochs 01/01/2007 [ 28.688088] Workqueue: ceph-msgr con_work [ 28.688088] task: ffff88011a7f9030 ti: ffff8800d903c000 task.ti: ffff8800d903c000 [ 28.688088] RIP: 0010:[<ffffffff81392b42>] [<ffffffff81392b42>] scatterwalk_pagedone+0x22/0x80 [ 28.688088] RSP: 0018:ffff8800d903f688 EFLAGS: 00010286 [ 28.688088] RAX: ffffeb04000032c0 RBX: ffff8800d903f718 RCX: ffffeb04000032c0 [ 28.688088] RDX: 0000000000000000 RSI: 0000000000000001 RDI: ffff8800d903f750 [ 28.688088] RBP: ffff8800d903f688 R08: 00000000000007de R09: ffff8800d903f880 [ 28.688088] R10: 18df467c72d6257b R11: 0000000000000000 R12: 0000000000000010 [ 28.688088] R13: ffff8800d903f750 R14: ffff8800d903f8a0 R15: 0000000000000000 [ 28.688088] FS: 00007f50a41c7700(0000) GS:ffff88011fc00000(0000) knlGS:0000000000000000 [ 28.688088] CS: 0010 DS: 0000 ES: 0000 CR0: 000000008005003b [ 28.688088] CR2: ffffeb04000032c0 CR3: 00000000da3f3000 CR4: 00000000000006b0 [ 28.688088] Stack: [ 28.688088] ffff8800d903f698 ffffffff81392ca8 ffff8800d903f6e8 ffffffff81395d32 [ 28.688088] ffff8800dac96000 ffff880000000000 ffff8800d903f980 ffff880119b7e020 [ 28.688088] ffff880119b7e010 0000000000000000 0000000000000010 0000000000000010 [ 28.688088] Call Trace: [ 28.688088] [<ffffffff81392ca8>] scatterwalk_done+0x38/0x40 [ 28.688088] [<ffffffff81392ca8>] scatterwalk_done+0x38/0x40 [ 28.688088] [<ffffffff81395d32>] blkcipher_walk_done+0x182/0x220 [ 28.688088] [<ffffffff813990bf>] crypto_cbc_encrypt+0x15f/0x180 [ 28.688088] [<ffffffff81399780>] ? crypto_aes_set_key+0x30/0x30 [ 28.688088] [<ffffffff8156c40c>] ceph_aes_encrypt2+0x29c/0x2e0 [ 28.688088] [<ffffffff8156d2a3>] ceph_encrypt2+0x93/0xb0 [ 28.688088] [<ffffffff8156d7da>] ceph_x_encrypt+0x4a/0x60 [ 28.688088] [<ffffffff8155b39d>] ? ceph_buffer_new+0x5d/0xf0 [ 28.688088] [<ffffffff8156e837>] ceph_x_build_authorizer.isra.6+0x297/0x360 [ 28.688088] [<ffffffff8112089b>] ? kmem_cache_alloc_trace+0x11b/0x1c0 [ 28.688088] [<ffffffff8156b496>] ? ceph_auth_create_authorizer+0x36/0x80 [ 28.688088] [<ffffffff8156ed83>] ceph_x_create_authorizer+0x63/0xd0 [ 28.688088] [<ffffffff8156b4b4>] ceph_auth_create_authorizer+0x54/0x80 [ 28.688088] [<ffffffff8155f7c0>] get_authorizer+0x80/0xd0 [ 28.688088] [<ffffffff81555a8b>] prepare_write_connect+0x18b/0x2b0 [ 28.688088] [<ffffffff81559289>] try_read+0x1e59/0x1f10 This is because we set up crypto scatterlists as if all buffers were kmalloc'ed. Fix it. Cc: stable@vger.kernel.org Signed-off-by: Ilya Dryomov <idryomov@redhat.com> Reviewed-by: Sage Weil <sage@redhat.com>	2014-11-13 22:21:12 +03:00
Ilya Dryomov	e9226d7c9f	libceph: eliminate unnecessary allocation in process_one_ticket() Commit `c27a3e4d66` ("libceph: do not hard code max auth ticket len") while fixing a buffer overlow tried to keep the same as much of the surrounding code as possible and introduced an unnecessary kmalloc() in the unencrypted ticket path. It is likely to fail on huge tickets, so get rid of it. Signed-off-by: Ilya Dryomov <idryomov@redhat.com> Reviewed-by: Sage Weil <sage@redhat.com>	2014-10-31 23:43:08 +03:00
Mike Christie	89baaa570a	libceph: use memalloc flags for net IO This patch has ceph's lib code use the memalloc flags. If the VM layer needs to write data out to free up memory to handle new allocation requests, the block layer must be able to make forward progress. To handle that requirement we use structs like mempools to reserve memory for objects like bios and requests. The problem is when we send/receive block layer requests over the network layer, net skb allocations can fail and the system can lock up. To solve this, the memalloc related flags were added. NBD, iSCSI and NFS uses these flags to tell the network/vm layer that it should use memory reserves to fullfill allcation requests for structs like skbs. I am running ceph in a bunch of VMs in my laptop, so this patch was not tested very harshly. Signed-off-by: Mike Christie <michaelc@cs.wisc.edu> Reviewed-by: Ilya Dryomov <idryomov@redhat.com>	2014-10-30 13:11:50 +03:00
Linus Torvalds	6b04908166	Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/sage/ceph-client Pull Ceph updates from Sage Weil: "There is the long-awaited discard support for RBD (Guangliang Zhao, Josh Durgin), a pile of RBD bug fixes that didn't belong in late -rc's (Ilya Dryomov, Li RongQing), a pile of fs/ceph bug fixes and performance and debugging improvements (Yan, Zheng, John Spray), and a smattering of cleanups (Chao Yu, Fabian Frederick, Joe Perches)" * 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/sage/ceph-client: (40 commits) ceph: fix divide-by-zero in __validate_layout() rbd: rbd workqueues need a resque worker libceph: ceph-msgr workqueue needs a resque worker ceph: fix bool assignments libceph: separate multiple ops with commas in debugfs output libceph: sync osd op definitions in rados.h libceph: remove redundant declaration ceph: additional debugfs output ceph: export ceph_session_state_name function ceph: include the initial ACL in create/mkdir/mknod MDS requests ceph: use pagelist to present MDS request data libceph: reference counting pagelist ceph: fix llistxattr on symlink ceph: send client metadata to MDS ceph: remove redundant code for max file size verification ceph: remove redundant io_iter_advance() ceph: move ceph_find_inode() outside the s_mutex ceph: request xattrs if xattr_version is zero rbd: set the remaining discard properties to enable support rbd: use helpers to handle discard for layered images correctly ...	2014-10-15 06:46:01 +02:00
Ilya Dryomov	f9865f06f7	libceph: ceph-msgr workqueue needs a resque worker Commit `f363e45fd1` ("net/ceph: make ceph_msgr_wq non-reentrant") effectively removed WQ_MEM_RECLAIM flag from ceph_msgr_wq. This is wrong - libceph is very much a memory reclaim path, so restore it. Cc: stable@vger.kernel.org # needs backporting for < 3.12 Signed-off-by: Ilya Dryomov <idryomov@redhat.com> Tested-by: Micha Krause <micha@krausam.de> Reviewed-by: Sage Weil <sage@redhat.com>	2014-10-14 12:57:04 -07:00
Ilya Dryomov	25f897773b	libceph: separate multiple ops with commas in debugfs output For requests with multiple ops, separate ops with commas instead of \t, which is a field separator here. Signed-off-by: Ilya Dryomov <idryomov@redhat.com> Reviewed-by: Sage Weil <sage@redhat.com>	2014-10-14 12:57:03 -07:00
Ilya Dryomov	70b5bfa360	libceph: sync osd op definitions in rados.h Bring in missing osd ops and strings, use macros to eliminate multiple points of maintenance. Signed-off-by: Ilya Dryomov <idryomov@redhat.com> Reviewed-by: Sage Weil <sage@redhat.com>	2014-10-14 12:57:02 -07:00
Yan, Zheng	e4339d28f6	libceph: reference counting pagelist this allow pagelist to present data that may be sent multiple times. Signed-off-by: Yan, Zheng <zyan@redhat.com> Reviewed-by: Sage Weil <sage@redhat.com>	2014-10-14 12:56:48 -07:00
Ilya Dryomov	91883cd27c	libceph: don't try checking queue_work() return value queue_work() doesn't "fail to queue", it returns false if work was already on a queue, which can't happen here since we allocate event_work right before we queue it. So don't bother at all. Signed-off-by: Ilya Dryomov <ilya.dryomov@inktank.com> Reviewed-by: Alex Elder <elder@linaro.org>	2014-10-14 21:03:25 +04:00
Joe Perches	b9a678994b	libceph: Convert pr_warning to pr_warn Use the more common pr_warn. Other miscellanea: o Coalesce formats o Realign arguments Signed-off-by: Joe Perches <joe@perches.com> Signed-off-by: Ilya Dryomov <ilya.dryomov@inktank.com>	2014-10-14 21:03:23 +04:00
Li RongQing	589506f1e7	libceph: fix a use after free issue in osdmap_set_max_osd If the state variable is krealloced successfully, map->osd_state will be freed, once following two reallocation failed, and exit the function without resetting map->osd_state, map->osd_state become a wild pointer. fix it by resetting them after krealloc successfully. Signed-off-by: Li RongQing <roy.qing.li@gmail.com> Signed-off-by: Ilya Dryomov <ilya.dryomov@inktank.com>	2014-10-14 21:03:21 +04:00
Ilya Dryomov	dc220db03f	libceph: select CRYPTO_CBC in addition to CRYPTO_AES We want "cbc(aes)" algorithm, so select CRYPTO_CBC too, not just CRYPTO_AES. Otherwise on !CRYPTO_CBC kernels we fail rbd map/mount with libceph: error -2 building auth method x request Signed-off-by: Ilya Dryomov <ilya.dryomov@inktank.com>	2014-10-14 21:03:20 +04:00
Ilya Dryomov	2cc6128ab2	libceph: resend lingering requests with a new tid Both not yet registered (r_linger && list_empty(&r_linger_item)) and registered linger requests should use the new tid on resend to avoid the dup op detection logic on the OSDs, yet we were doing this only for "registered" case. Factor out and simplify the "registered" logic and use the new helper for "not registered" case as well. Fixes: http://tracker.ceph.com/issues/8806 Signed-off-by: Ilya Dryomov <ilya.dryomov@inktank.com> Reviewed-by: Alex Elder <elder@linaro.org>	2014-10-14 21:03:19 +04:00
Ilya Dryomov	f671b581f1	libceph: abstract out ceph_osd_request enqueue logic Introduce __enqueue_request() and switch to it. Signed-off-by: Ilya Dryomov <ilya.dryomov@inktank.com> Reviewed-by: Alex Elder <elder@linaro.org>	2014-10-14 21:03:18 +04:00
Linus Torvalds	5e40d331bd	Merge branch 'next' of git://git.kernel.org/pub/scm/linux/kernel/git/jmorris/linux-security Pull security subsystem updates from James Morris. Mostly ima, selinux, smack and key handling updates. * 'next' of git://git.kernel.org/pub/scm/linux/kernel/git/jmorris/linux-security: (65 commits) integrity: do zero padding of the key id KEYS: output last portion of fingerprint in /proc/keys KEYS: strip 'id:' from ca_keyid KEYS: use swapped SKID for performing partial matching KEYS: Restore partial ID matching functionality for asymmetric keys X.509: If available, use the raw subjKeyId to form the key description KEYS: handle error code encoded in pointer selinux: normalize audit log formatting selinux: cleanup error reporting in selinux_nlmsg_perm() KEYS: Check hex2bin()'s return when generating an asymmetric key ID ima: detect violations for mmaped files ima: fix race condition on ima_rdwr_violation_check and process_measurement ima: added ima_policy_flag variable ima: return an error code from ima_add_boot_aggregate() ima: provide 'ima_appraise=log' kernel option ima: move keyring initialization to ima_init() PKCS#7: Handle PKCS#7 messages that contain no X.509 certs PKCS#7: Better handling of unsupported crypto KEYS: Overhaul key identification when searching for asymmetric keys KEYS: Implement binary asymmetric key ID handling ...	2014-10-12 10:13:55 -04:00
David Howells	c06cfb08b8	KEYS: Remove key_type::match in favour of overriding default by match_preparse A previous patch added a ->match_preparse() method to the key type. This is allowed to override the function called by the iteration algorithm. Therefore, we can just set a default that simply checks for an exact match of the key description with the original criterion data and allow match_preparse to override it as needed. The key_type::match op is then redundant and can be removed, as can the user_match() function. Signed-off-by: David Howells <dhowells@redhat.com> Acked-by: Vivek Goyal <vgoyal@redhat.com>	2014-09-16 17:36:06 +01:00
Ilya Dryomov	c27a3e4d66	libceph: do not hard code max auth ticket len We hard code cephx auth ticket buffer size to 256 bytes. This isn't enough for any moderate setups and, in case tickets themselves are not encrypted, leads to buffer overflows (ceph_x_decrypt() errors out, but ceph_decode_copy() doesn't - it's just a memcpy() wrapper). Since the buffer is allocated dynamically anyway, allocated it a bit later, at the point where we know how much is going to be needed. Fixes: http://tracker.ceph.com/issues/8979 Cc: stable@vger.kernel.org Signed-off-by: Ilya Dryomov <ilya.dryomov@inktank.com> Reviewed-by: Sage Weil <sage@redhat.com>	2014-09-10 20:08:36 +04:00
Ilya Dryomov	597cda3577	libceph: add process_one_ticket() helper Add a helper for processing individual cephx auth tickets. Needed for the next commit, which deals with allocating ticket buffers. (Most of the diff here is whitespace - view with git diff -b). Cc: stable@vger.kernel.org Signed-off-by: Ilya Dryomov <ilya.dryomov@inktank.com> Reviewed-by: Sage Weil <sage@redhat.com>	2014-09-10 20:08:35 +04:00
Sage Weil	73c3d4812b	libceph: gracefully handle large reply messages from the mon We preallocate a few of the message types we get back from the mon. If we get a larger message than we are expecting, fall back to trying to allocate a new one instead of blindly using the one we have. CC: stable@vger.kernel.org Signed-off-by: Sage Weil <sage@redhat.com> Reviewed-by: Ilya Dryomov <ilya.dryomov@inktank.com>	2014-09-10 20:08:32 +04:00
Linus Torvalds	8d2d441ac4	Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/sage/ceph-client Pull Ceph updates from Sage Weil: "There is a lot of refactoring and hardening of the libceph and rbd code here from Ilya that fix various smaller bugs, and a few more important fixes with clone overlap. The main fix is a critical change to the request_fn handling to not sleep that was exposed by the recent mutex changes (which will also go to the 3.16 stable series). Yan Zheng has several fixes in here for CephFS fixing ACL handling, time stamps, and request resends when the MDS restarts. Finally, there are a few cleanups from Himangi Saraogi based on Coccinelle" * 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/sage/ceph-client: (39 commits) libceph: set last_piece in ceph_msg_data_pages_cursor_init() correctly rbd: remove extra newlines from rbd_warn() messages rbd: allocate img_request with GFP_NOIO instead GFP_ATOMIC rbd: rework rbd_request_fn() ceph: fix kick_requests() ceph: fix append mode write ceph: fix sizeof(struct tYpO *) typo ceph: remove redundant memset(0) rbd: take snap_id into account when reading in parent info rbd: do not read in parent info before snap context rbd: update mapping size only on refresh rbd: harden rbd_dev_refresh() and callers a bit rbd: split rbd_dev_spec_update() into two functions rbd: remove unnecessary asserts in rbd_dev_image_probe() rbd: introduce rbd_dev_header_info() rbd: show the entire chain of parent images ceph: replace comma with a semicolon rbd: use rbd_segment_name_free() instead of kfree() ceph: check zero length in ceph_sync_read() ceph: reset r_resend_mds after receiving -ESTALE ...	2014-08-13 17:43:29 -06:00
Ilya Dryomov	5f740d7e15	libceph: set last_piece in ceph_msg_data_pages_cursor_init() correctly Determining ->last_piece based on the value of ->page_offset + length is incorrect because length here is the length of the entire message. ->last_piece set to false even if page array data item length is <= PAGE_SIZE, which results in invalid length passed to ceph_tcp_{send,recv}page() and causes various asserts to fire. # cat pages-cursor-init.sh #!/bin/bash rbd create --size 10 --image-format 2 foo FOO_DEV=$(rbd map foo) dd if=/dev/urandom of=$FOO_DEV bs=1M &>/dev/null rbd snap create foo@snap rbd snap protect foo@snap rbd clone foo@snap bar # rbd_resize calls librbd rbd_resize(), size is in bytes ./rbd_resize bar $(((4 << 20) + 512)) rbd resize --size 10 bar BAR_DEV=$(rbd map bar) # trigger a 512-byte copyup -- 512-byte page array data item dd if=/dev/urandom of=$BAR_DEV bs=1M count=1 seek=5 The problem exists only in ceph_msg_data_pages_cursor_init(), ceph_msg_data_pages_advance() does the right thing. The size_t cast is unnecessary. Cc: stable@vger.kernel.org # 3.10+ Signed-off-by: Ilya Dryomov <ilya.dryomov@inktank.com> Reviewed-by: Sage Weil <sage@redhat.com> Reviewed-by: Alex Elder <elder@linaro.org>	2014-08-09 11:27:32 +04:00
David Howells	7c3bec0a1f	KEYS: Ceph: Use user_match() Ceph can use user_match() instead of defining its own identical function. Signed-off-by: David Howells <dhowells@redhat.com> Acked-by: Steve Dickson <steved@redhat.com> Reviewed-by: Sage Weil <sage@redhat.com> cc: Tommi Virtanen <tommi.virtanen@dreamhost.com>	2014-07-22 21:46:30 +01:00
David Howells	efa64c0978	KEYS: Ceph: Use key preparsing Make use of key preparsing in Ceph so that quota size determination can take place prior to keyring locking when a key is being added. Signed-off-by: David Howells <dhowells@redhat.com> Acked-by: Steve Dickson <steved@redhat.com> Reviewed-by: Sage Weil <sage@redhat.com> cc: Tommi Virtanen <tommi.virtanen@dreamhost.com>	2014-07-22 21:46:23 +01:00
Ilya Dryomov	37ab77ac29	libceph: drop osd ref when canceling con work queue_con() bumps osd ref count. We should do the reverse when canceling con work. Signed-off-by: Ilya Dryomov <ilya.dryomov@inktank.com> Reviewed-by: Alex Elder <elder@linaro.org>	2014-07-08 15:08:46 +04:00
Ilya Dryomov	2d05f082cb	libceph: nuke ceph_osdc_unregister_linger_request() Remove now unused ceph_osdc_unregister_linger_request(). Signed-off-by: Ilya Dryomov <ilya.dryomov@inktank.com> Reviewed-by: Alex Elder <elder@linaro.org>	2014-07-08 15:08:45 +04:00
Ilya Dryomov	c9f9b93ddf	libceph: introduce ceph_osdc_cancel_request() Introduce ceph_osdc_cancel_request() intended for canceling requests from the higher layers (rbd and cephfs). Because higher layers are in charge and are supposed to know what and when they are canceling, the request is not completed, only unref'ed and removed from the libceph data structures. __cancel_request() is no longer called before __unregister_request(), because __unregister_request() unconditionally revokes r_request and there is no point in trying to do it twice. Signed-off-by: Ilya Dryomov <ilya.dryomov@inktank.com> Reviewed-by: Alex Elder <elder@linaro.org>	2014-07-08 15:08:44 +04:00
Ilya Dryomov	4f23409e0c	libceph: fix linger request check in __unregister_request() We should check if request is on the linger request list of any of the OSDs, not whether request is registered or not. Signed-off-by: Ilya Dryomov <ilya.dryomov@inktank.com> Reviewed-by: Alex Elder <elder@linaro.org>	2014-07-08 15:08:44 +04:00
Ilya Dryomov	af59306455	libceph: unregister only registered linger requests Linger requests that have not yet been registered should not be unregistered by __unregister_linger_request(). This messes up ref count and leads to use-after-free. Signed-off-by: Ilya Dryomov <ilya.dryomov@inktank.com> Reviewed-by: Alex Elder <elder@linaro.org>	2014-07-08 15:08:44 +04:00
Ilya Dryomov	7c6e6fc53e	libceph: assert both regular and lingering lists in __remove_osd() It is important that both regular and lingering requests lists are empty when the OSD is removed. Signed-off-by: Ilya Dryomov <ilya.dryomov@inktank.com> Reviewed-by: Alex Elder <elder@linaro.org>	2014-07-08 15:08:43 +04:00
Ilya Dryomov	6562d661d2	libceph: harden ceph_osdc_request_release() a bit Add some WARN_ONs to alert us when we try to destroy requests that are still registered. Signed-off-by: Ilya Dryomov <ilya.dryomov@inktank.com> Reviewed-by: Alex Elder <elder@linaro.org>	2014-07-08 15:08:43 +04:00
Ilya Dryomov	9e94af202a	libceph: move and add dout()s to ceph_osdc_request_{get,put}() Add dout()s to ceph_osdc_request_{get,put}(). Also move them to .c and turn kref release callback into a static function. Signed-off-by: Ilya Dryomov <ilya.dryomov@inktank.com> Reviewed-by: Alex Elder <elder@linaro.org>	2014-07-08 15:08:43 +04:00
Ilya Dryomov	0215e44bb3	libceph: move and add dout()s to ceph_msg_{get,put}() Add dout()s to ceph_msg_{get,put}(). Also move them to .c and turn kref release callback into a static function. Signed-off-by: Ilya Dryomov <ilya.dryomov@inktank.com> Reviewed-by: Alex Elder <elder@linaro.org>	2014-07-08 15:08:43 +04:00
Ilya Dryomov	bbf37ec3a6	libceph: add maybe_move_osd_to_lru() and switch to it Abstract out __move_osd_to_lru() logic from __unregister_request() and __unregister_linger_request(). Signed-off-by: Ilya Dryomov <ilya.dryomov@inktank.com> Reviewed-by: Alex Elder <elder@linaro.org>	2014-07-08 15:08:43 +04:00
Ilya Dryomov	1d0326b13b	libceph: rename ceph_osd_request::r_linger_osd to r_linger_osd_item So that: req->r_osd_item --> osd->o_requests list req->r_linger_osd_item --> osd->o_linger_requests list Signed-off-by: Ilya Dryomov <ilya.dryomov@inktank.com> Reviewed-by: Alex Elder <elder@linaro.org>	2014-07-08 15:08:42 +04:00
Linus Torvalds	6d87c225f5	Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/sage/ceph-client Pull Ceph updates from Sage Weil: "This has a mix of bug fixes and cleanups. Alex's patch fixes a rare race in RBD. Ilya's patches fix an ENOENT check when a second rbd image is mapped and a couple memory leaks. Zheng fixes several issues with fragmented directories and multiple MDSs. Josh fixes a spin/sleep issue, and Josh and Guangliang's patches fix setting and unsetting RBD images read-only. Naturally there are several other cleanups mixed in for good measure" * 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/sage/ceph-client: (23 commits) rbd: only set disk to read-only once rbd: move calls that may sleep out of spin lock range rbd: add ioctl for rbd ceph: use truncate_pagecache() instead of truncate_inode_pages() ceph: include time stamp in every MDS request rbd: fix ida/idr memory leak rbd: use reference counts for image requests rbd: fix osd_request memory leak in __rbd_dev_header_watch_sync() rbd: make sure we have latest osdmap on 'rbd map' libceph: add ceph_monc_wait_osdmap() libceph: mon_get_version request infrastructure libceph: recognize poolop requests in debugfs ceph: refactor readpage_nounlock() to make the logic clearer mds: check cap ID when handling cap export message ceph: remember subtree root dirfrag's auth MDS ceph: introduce ceph_fill_fragtree() ceph: handle cap import atomically ceph: pre-allocate ceph_cap struct for ceph_add_cap() ceph: update inode fields according to issued caps rbd: replace IS_ERR and PTR_ERR with PTR_ERR_OR_ZERO ...	2014-06-12 23:06:23 -07:00
Linus Torvalds	f9da455b93	Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net-next Pull networking updates from David Miller: 1) Seccomp BPF filters can now be JIT'd, from Alexei Starovoitov. 2) Multiqueue support in xen-netback and xen-netfront, from Andrew J Benniston. 3) Allow tweaking of aggregation settings in cdc_ncm driver, from Bjørn Mork. 4) BPF now has a "random" opcode, from Chema Gonzalez. 5) Add more BPF documentation and improve test framework, from Daniel Borkmann. 6) Support TCP fastopen over ipv6, from Daniel Lee. 7) Add software TSO helper functions and use them to support software TSO in mvneta and mv643xx_eth drivers. From Ezequiel Garcia. 8) Support software TSO in fec driver too, from Nimrod Andy. 9) Add Broadcom SYSTEMPORT driver, from Florian Fainelli. 10) Handle broadcasts more gracefully over macvlan when there are large numbers of interfaces configured, from Herbert Xu. 11) Allow more control over fwmark used for non-socket based responses, from Lorenzo Colitti. 12) Do TCP congestion window limiting based upon measurements, from Neal Cardwell. 13) Support busy polling in SCTP, from Neal Horman. 14) Allow RSS key to be configured via ethtool, from Venkata Duvvuru. 15) Bridge promisc mode handling improvements from Vlad Yasevich. 16) Don't use inetpeer entries to implement ID generation any more, it performs poorly, from Eric Dumazet. * git://git.kernel.org/pub/scm/linux/kernel/git/davem/net-next: (1522 commits) rtnetlink: fix userspace API breakage for iproute2 < v3.9.0 tcp: fixing TLP's FIN recovery net: fec: Add software TSO support net: fec: Add Scatter/gather support net: fec: Increase buffer descriptor entry number net: fec: Factorize feature setting net: fec: Enable IP header hardware checksum net: fec: Factorize the .xmit transmit function bridge: fix compile error when compiling without IPv6 support bridge: fix smatch warning / potential null pointer dereference via-rhine: fix full-duplex with autoneg disable bnx2x: Enlarge the dorq threshold for VFs bnx2x: Check for UNDI in uncommon branch bnx2x: Fix 1G-baseT link bnx2x: Fix link for KR with swapped polarity lane sctp: Fix sk_ack_backlog wrap-around problem net/core: Add VF link state control policy net/fsl: xgmac_mdio is dependent on OF_MDIO net/fsl: Make xgmac_mdio read error message useful net_sched: drr: warn when qdisc is not work conserving ...	2014-06-12 14:27:40 -07:00
Al Viro	9c1d5284c7	Merge commit '9f12600fe425bc28f0ccba034a77783c09c15af4' into for-linus Backmerge of dcache.c changes from mainline. It's that, or complete rebase... Conflicts: fs/splice.c Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>	2014-06-12 00:28:09 -04:00
stephen hemminger	f647944995	ceph: remove bogus extern Sparse complained about this bogus extern on definition of a function. Signed-off-by: Stephen Hemminger <stephen@networkplumber.org> Signed-off-by: David S. Miller <davem@davemloft.net>	2014-06-11 15:39:19 -07:00
Ilya Dryomov	6044cde6f2	libceph: add ceph_monc_wait_osdmap() Add ceph_monc_wait_osdmap(), which will block until the osdmap with the specified epoch is received or timeout occurs. Export both of these as they are going to be needed by rbd. Signed-off-by: Ilya Dryomov <ilya.dryomov@inktank.com> Reviewed-by: Sage Weil <sage@inktank.com>	2014-06-06 09:29:57 +08:00
Ilya Dryomov	513a8243d6	libceph: mon_get_version request infrastructure Add support for mon_get_version requests to libceph. This reuses much of the ceph_mon_generic_request infrastructure, with one exception. Older OSDs don't set mon_get_version reply hdr->tid even if the original request had a non-zero tid, which makes it impossible to lookup ceph_mon_generic_request contexts by tid in get_generic_reply() for such replies. As a workaround, we allocate a reply message on the reply path. This can probably interfere with revoke, but I don't see a better way. Signed-off-by: Ilya Dryomov <ilya.dryomov@inktank.com> Reviewed-by: Sage Weil <sage@inktank.com>	2014-06-06 09:29:57 +08:00
Ilya Dryomov	002b36ba5e	libceph: recognize poolop requests in debugfs Recognize poolop requests in debugfs monc dump, fix prink format specifiers - tid is unsigned. Signed-off-by: Ilya Dryomov <ilya.dryomov@inktank.com> Reviewed-by: Sage Weil <sage@inktank.com>	2014-06-06 09:29:56 +08:00
Ilya Dryomov	f140662f35	crush: decode and initialize chooseleaf_vary_r Commit `e2b149cc4b` ("crush: add chooseleaf_vary_r tunable") added the crush_map::chooseleaf_vary_r field but missed the decode part. This lead to misdirected requests caused by incorrect raw crush mapping sets. Fixes: http://tracker.ceph.com/issues/8226 Reported-and-Tested-by: Dmitry Smirnov <onlyjob@member.fsf.org> Signed-off-by: Ilya Dryomov <ilya.dryomov@inktank.com> Reviewed-by: Sage Weil <sage@inktank.com>	2014-05-16 21:29:55 +04:00
Chunwei Chen	178eda29ca	libceph: fix corruption when using page_count 0 page in rbd It has been reported that using ZFSonLinux on rbd will result in memory corruption. The bug report can be found here: https://github.com/zfsonlinux/spl/issues/241 http://tracker.ceph.com/issues/7790 The reason is that ZFS will send pages with page_count 0 into rbd, which in turns send them to tcp_sendpage. However, tcp_sendpage cannot deal with page_count 0, as it will do get_page and put_page, and erroneously free the page. This type of issue has been noted before, and handled in iscsi, drbd, etc. So, rbd should also handle this. This fix address this issue by fall back to slower sendmsg when page_count 0 detected. Cc: Sage Weil <sage@inktank.com> Cc: Yehuda Sadeh <yehuda@inktank.com> Cc: stable@vger.kernel.org Signed-off-by: Chunwei Chen <tuxoko@gmail.com> Reviewed-by: Ilya Dryomov <ilya.dryomov@inktank.com>	2014-05-16 21:29:26 +04:00
Al Viro	2b777c9dd9	ceph_sync_read: stop poking into iov_iter guts Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>	2014-05-06 17:39:42 -04:00
Linus Torvalds	5575eeb7b9	Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/sage/ceph-client Pull Ceph fixes from Sage Weil: "First, there is a critical fix for the new primary-affinity function that went into -rc1. The second batch of patches from Zheng fix a range of problems with directory fragmentation, readdir, and a few odds and ends for cephfs" * 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/sage/ceph-client: ceph: reserve caps for file layout/lock MDS requests ceph: avoid releasing caps that are being used ceph: clear directory's completeness when creating file libceph: fix non-default values check in apply_primary_affinity() ceph: use fpos_cmp() to compare dentry positions ceph: check directory's completeness before emitting directory entry	2014-05-05 15:17:02 -07:00
Ilya Dryomov	92b2e75158	libceph: fix non-default values check in apply_primary_affinity() osd_primary_affinity array is indexed into incorrectly when checking for non-default primary-affinity values. This nullifies the impact of the rest of the apply_primary_affinity() and results in misdirected requests. if (osds[i] != CRUSH_ITEM_NONE && osdmap->osd_primary_affinity[i] != ^^^ CEPH_OSD_DEFAULT_PRIMARY_AFFINITY) { For a pool with size 2, this always ends up checking osd0 and osd1 primary_affinity values, instead of the values that correspond to the osds in question. E.g., given a [2,3] up set and a [max,max,0,max] primary affinity vector, requests are still sent to osd2, because both osd0 and osd1 happen to have max primary_affinity values and therefore we return from apply_primary_affinity() early on the premise that all osds in the given set have max (default) values. Fix it. Fixes: http://tracker.ceph.com/issues/7954 Signed-off-by: Ilya Dryomov <ilya.dryomov@inktank.com> Reviewed-by: Sage Weil <sage@inktank.com>	2014-04-28 12:54:10 -07:00
David S. Miller	676d23690f	net: Fix use after free by removing length arg from sk_data_ready callbacks. Several spots in the kernel perform a sequence like: skb_queue_tail(&sk->s_receive_queue, skb); sk->sk_data_ready(sk, skb->len); But at the moment we place the SKB onto the socket receive queue it can be consumed and freed up. So this skb->len access is potentially to freed up memory. Furthermore, the skb->len can be modified by the consumer so it is possible that the value isn't accurate. And finally, no actual implementation of this callback actually uses the length argument. And since nobody actually cared about it's value, lots of call sites pass arbitrary values in such as '0' and even '1'. So just remove the length argument from the callback, that way there is no confusion whatsoever and all of these use-after-free cases get fixed as a side effect. Based upon a patch by Eric Dumazet and his suggestion to audit this issue tree-wide. Signed-off-by: David S. Miller <davem@davemloft.net>	2014-04-11 16:15:36 -04:00
Linus Torvalds	240cd6a817	Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/sage/ceph-client Pull Ceph updates from Sage Weil: "The biggest chunk is a series of patches from Ilya that add support for new Ceph osd and crush map features, including some new tunables, primary affinity, and the new encoding that is needed for erasure coding support. This brings things into parity with the server side and the looming firefly release. There is also support for allocation hints in RBD that help limit fragmentation on the server side. There is also a series of patches from Zheng fixing NFS reexport, directory fragmentation support, flock vs fnctl behavior, and some issues with clustered MDS. Finally, there are some miscellaneous fixes from Yunchuan Wen for fscache, Fabian Frederick for ACLs, and from me for fsync(dirfd) behavior" * 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/sage/ceph-client: (79 commits) ceph: skip invalid dentry during dcache readdir libceph: dump pool {read,write}_tier to debugfs libceph: output primary affinity values on osdmap updates ceph: flush cap release queue when trimming session caps ceph: don't grabs open file reference for aborted request ceph: drop extra open file reference in ceph_atomic_open() ceph: preallocate buffer for readdir reply libceph: enable PRIMARY_AFFINITY feature bit libceph: redo ceph_calc_pg_primary() in terms of ceph_calc_pg_acting() libceph: add support for osd primary affinity libceph: add support for primary_temp mappings libceph: return primary from ceph_calc_pg_acting() libceph: switch ceph_calc_pg_acting() to new helpers libceph: introduce apply_temps() helper libceph: introduce pg_to_raw_osds() and raw_to_up_osds() helpers libceph: ceph_can_shift_osds(pool) and pool type defines libceph: ceph_osd_{exists,is_up,is_down}(osd) definitions libceph: enable OSDMAP_ENC feature bit libceph: primary_affinity decode bits libceph: primary_affinity infrastructure ...	2014-04-07 11:09:13 -07:00
Ilya Dryomov	8a53f23fcd	libceph: dump pool {read,write}_tier to debugfs Dump pool {read,write}_tier to debugfs. While at it, fixup printk type specifiers and remove the unnecessary cast to unsigned long long. Signed-off-by: Ilya Dryomov <ilya.dryomov@inktank.com>	2014-04-04 21:08:29 -07:00
Ilya Dryomov	f31da0f3e1	libceph: output primary affinity values on osdmap updates Similar to osd weights, output primary affinity values on incremental osdmap updates. Signed-off-by: Ilya Dryomov <ilya.dryomov@inktank.com>	2014-04-04 21:08:28 -07:00
Ilya Dryomov	c4c1228525	libceph: redo ceph_calc_pg_primary() in terms of ceph_calc_pg_acting() Reimplement ceph_calc_pg_primary() in terms of ceph_calc_pg_acting() and get rid of the now unused calc_pg_raw(). Signed-off-by: Ilya Dryomov <ilya.dryomov@inktank.com> Reviewed-by: Alex Elder <elder@linaro.org>	2014-04-04 21:08:19 -07:00
Ilya Dryomov	47ec1f3cc4	libceph: add support for osd primary affinity Respond to non-default primary_affinity values accordingly. (Primary affinity allows the admin to shift 'primary responsibility' away from specific osds, effectively shifting around the read side of the workload and whatever overhead is incurred by peering and writes by virtue of being the primary). Signed-off-by: Ilya Dryomov <ilya.dryomov@inktank.com> Reviewed-by: Alex Elder <elder@linaro.org>	2014-04-04 21:08:17 -07:00
Ilya Dryomov	5e8d4d36bf	libceph: add support for primary_temp mappings Change apply_temp() to override primary in the same way pg_temp overrides osd set. primary_temp overrides pg_temp primary too. Signed-off-by: Ilya Dryomov <ilya.dryomov@inktank.com> Reviewed-by: Alex Elder <elder@linaro.org>	2014-04-04 21:08:16 -07:00
Ilya Dryomov	8008ab1080	libceph: return primary from ceph_calc_pg_acting() In preparation for adding support for primary_temp, stop assuming primaryness: add a primary out parameter to ceph_calc_pg_acting() and change call sites accordingly. Primary is now specified separately from the order of osds in the set. Signed-off-by: Ilya Dryomov <ilya.dryomov@inktank.com> Reviewed-by: Alex Elder <elder@linaro.org>	2014-04-04 21:08:14 -07:00
Ilya Dryomov	ac972230e2	libceph: switch ceph_calc_pg_acting() to new helpers Switch ceph_calc_pg_acting() to new helpers: pg_to_raw_osds(), raw_to_up_osds() and apply_temps(). Signed-off-by: Ilya Dryomov <ilya.dryomov@inktank.com> Reviewed-by: Alex Elder <elder@linaro.org>	2014-04-04 21:08:13 -07:00
Ilya Dryomov	45966c3467	libceph: introduce apply_temps() helper apply_temp() helper for applying various temporary mappings (at this point only pg_temp mappings) to the up set, therefore transforming it into an acting set. Signed-off-by: Ilya Dryomov <ilya.dryomov@inktank.com> Reviewed-by: Alex Elder <elder@linaro.org>	2014-04-04 21:08:11 -07:00
Ilya Dryomov	2bd93d4d7e	libceph: introduce pg_to_raw_osds() and raw_to_up_osds() helpers pg_to_raw_osds() helper for computing a raw (crush) set, which can contain non-existant and down osds. raw_to_up_osds() helper for pruning non-existant and down osds from the raw set, therefore transforming it into an up set, and determining up primary. Signed-off-by: Ilya Dryomov <ilya.dryomov@inktank.com> Reviewed-by: Alex Elder <elder@linaro.org>	2014-04-04 21:08:10 -07:00
Ilya Dryomov	63a6993f52	libceph: primary_affinity decode bits Add two helpers to decode primary_affinity (full map, vector<u32>) and new_primary_affinity (inc map, map<u32, u32>) and switch to them. Signed-off-by: Ilya Dryomov <ilya.dryomov@inktank.com> Reviewed-by: Alex Elder <elder@linaro.org>	2014-04-04 21:08:04 -07:00
Ilya Dryomov	2cfa34f2d6	libceph: primary_affinity infrastructure Add primary_affinity infrastructure. primary_affinity values are stored in an max_osd-sized array, hanging off ceph_osdmap, similar to a osd_weight array. Introduce {get,set}_primary_affinity() helpers, primarily to return CEPH_OSD_DEFAULT_PRIMARY_AFFINITY when no affinity has been set and to abstract out osd_primary_affinity array allocation and initialization. Signed-off-by: Ilya Dryomov <ilya.dryomov@inktank.com> Reviewed-by: Alex Elder <elder@linaro.org>	2014-04-04 21:08:02 -07:00
Ilya Dryomov	d286de796a	libceph: primary_temp decode bits Add a common helper to decode both primary_temp (full map, map<pg_t, u32>) and new_primary_temp (inc map, same) and switch to it. Signed-off-by: Ilya Dryomov <ilya.dryomov@inktank.com> Reviewed-by: Alex Elder <elder@linaro.org>	2014-04-04 21:08:00 -07:00
Ilya Dryomov	9686f94c8c	libceph: primary_temp infrastructure Add primary_temp mappings infrastructure. struct ceph_pg_mapping is overloaded, primary_temp mappings are stored in an rb-tree, rooted at ceph_osdmap, in a manner similar to pg_temp mappings. Dump primary_temp mappings to /sys/kernel/debug/ceph/<client>/osdmap, one 'primary_temp <pgid> <osd>' per line, e.g: primary_temp 2.6 4 Signed-off-by: Ilya Dryomov <ilya.dryomov@inktank.com> Reviewed-by: Alex Elder <elder@linaro.org>	2014-04-04 21:07:58 -07:00
Ilya Dryomov	35a935d75d	libceph: generalize ceph_pg_mapping In preparation for adding support for primary_temp mappings, generalize struct ceph_pg_mapping so it can hold mappings other than pg_temp. Signed-off-by: Ilya Dryomov <ilya.dryomov@inktank.com> Reviewed-by: Alex Elder <elder@linaro.org>	2014-04-04 21:07:57 -07:00
Ilya Dryomov	ec7af97258	libceph: introduce get_osdmap_client_data_v() Full and incremental osdmaps are structured identically and have identical headers. Add a helper to decode both "old" (16-bit version, v6) and "new" (8-bit struct_v+struct_compat+struct_len, v7) osdmap enconding headers and switch to it. Signed-off-by: Ilya Dryomov <ilya.dryomov@inktank.com> Reviewed-by: Alex Elder <elder@linaro.org>	2014-04-04 21:07:55 -07:00
Ilya Dryomov	10db634e20	libceph: introduce decode{,_new}_pg_temp() and switch to them Consolidate pg_temp (full map, map<pg_t, vector<u32>>) and new_pg_temp (inc map, same) decoding logic into a common helper and switch to it. Signed-off-by: Ilya Dryomov <ilya.dryomov@inktank.com> Reviewed-by: Alex Elder <elder@linaro.org>	2014-04-04 21:07:53 -07:00
Ilya Dryomov	4d60351f90	libceph: switch osdmap_set_max_osd() to krealloc() Use krealloc() instead of rolling our own. (krealloc() with a NULL first argument acts as a kmalloc()). Properly initalize the new array elements. This is needed to make future additions to osdmap easier. Signed-off-by: Ilya Dryomov <ilya.dryomov@inktank.com> Reviewed-by: Alex Elder <elder@linaro.org>	2014-04-04 21:07:52 -07:00
Ilya Dryomov	433fbdd31d	libceph: introduce decode{,_new}_pools() and switch to them Consolidate pools (full map, map<u64, pg_pool_t>) and new_pools (inc map, same) decoding logic into a common helper and switch to it. Signed-off-by: Ilya Dryomov <ilya.dryomov@inktank.com> Reviewed-by: Alex Elder <elder@linaro.org>	2014-04-04 21:07:50 -07:00
Ilya Dryomov	0f70c7eedb	libceph: rename __decode_pool{,_names}() to decode_pool{,_names}() To be in line with all the other osdmap decode helpers. Signed-off-by: Ilya Dryomov <ilya.dryomov@inktank.com> Reviewed-by: Alex Elder <elder@linaro.org>	2014-04-04 21:07:49 -07:00
Ilya Dryomov	53bbaba9d8	libceph: fix and clarify ceph_decode_need() sizes Sum up sizeof(...) results instead of (incorrectly) hard-coding the number of bytes, expressed in ints and longs. Signed-off-by: Ilya Dryomov <ilya.dryomov@inktank.com> Reviewed-by: Alex Elder <elder@linaro.org>	2014-04-04 21:07:47 -07:00
Ilya Dryomov	9464d00862	libceph: nuke bogus encoding version check in osdmap_apply_incremental() Only version 6 of osdmap encoding is supported, anything other than version 6 results in an error and halts the decoding process. Checking if version is >= 5 is therefore bogus. Signed-off-by: Ilya Dryomov <ilya.dryomov@inktank.com> Reviewed-by: Alex Elder <elder@linaro.org>	2014-04-04 21:07:46 -07:00
Ilya Dryomov	86f1742b94	libceph: fixup error handling in osdmap_apply_incremental() The existing error handling scheme requires resetting err to -EINVAL prior to calling any ceph_decode_* macro. This is ugly and fragile, and there already are a few places where we would return 0 on error, due to a missing reset. Follow osdmap_decode() and fix this by adding a special e_inval label to be used by all ceph_decode_* macros. Signed-off-by: Ilya Dryomov <ilya.dryomov@inktank.com> Reviewed-by: Alex Elder <elder@linaro.org>	2014-04-04 21:07:44 -07:00
Ilya Dryomov	9902e682c7	libceph: fix crush_decode() call site in osdmap_decode() The size of the memory area feeded to crush_decode() should be limited not only by osdmap end, but also by the crush map length. Also, drop unnecessary dout() (dout() in crush_decode() conveys the same info) and step past crush map only if it is decoded successfully. Signed-off-by: Ilya Dryomov <ilya.dryomov@inktank.com> Reviewed-by: Alex Elder <elder@linaro.org>	2014-04-04 21:07:43 -07:00
Ilya Dryomov	2d88b2e081	libceph: check length of osdmap osd arrays Check length of osd_state, osd_weight and osd_addr arrays. They should all have exactly max_osd elements after the call to osdmap_set_max_osd(). Signed-off-by: Ilya Dryomov <ilya.dryomov@inktank.com> Reviewed-by: Alex Elder <elder@linaro.org>	2014-04-04 21:07:41 -07:00
Ilya Dryomov	3977058c46	libceph: safely decode max_osd value in osdmap_decode() max_osd value is not covered by any ceph_decode_need(). Use a safe version of ceph_decode_* macro to decode it. Signed-off-by: Ilya Dryomov <ilya.dryomov@inktank.com> Reviewed-by: Alex Elder <elder@linaro.org>	2014-04-04 21:07:40 -07:00
Ilya Dryomov	597b52f6ca	libceph: fixup error handling in osdmap_decode() The existing error handling scheme requires resetting err to -EINVAL prior to calling any ceph_decode_* macro. This is ugly and fragile, and there already are a few places where we would return 0 on error, due to a missing reset. Fix this by adding a special e_inval label to be used by all ceph_decode_* macros. Signed-off-by: Ilya Dryomov <ilya.dryomov@inktank.com> Reviewed-by: Alex Elder <elder@linaro.org>	2014-04-04 21:07:38 -07:00
Ilya Dryomov	a2505d63ee	libceph: split osdmap allocation and decode steps Split osdmap allocation and initialization into a separate function, ceph_osdmap_decode(). Signed-off-by: Ilya Dryomov <ilya.dryomov@inktank.com> Reviewed-by: Alex Elder <elder@linaro.org>	2014-04-04 21:07:37 -07:00
Ilya Dryomov	38a8d56023	libceph: dump osdmap and enhance output on decode errors Dump osdmap in hex on both full and incremental decode errors, to make it easier to match the contents with error offset. dout() map epoch and max_osd value on success. Signed-off-by: Ilya Dryomov <ilya.dryomov@inktank.com> Reviewed-by: Alex Elder <elder@linaro.org>	2014-04-04 21:07:35 -07:00
Ilya Dryomov	1c00240e00	libceph: dump pg_temp mappings to debugfs Dump pg_temp mappings to /sys/kernel/debug/ceph/<client>/osdmap, one 'pg_temp <pgid> [<osd>, ..., <osd>]' per line, e.g: pg_temp 2.6 [2,3,4] Signed-off-by: Ilya Dryomov <ilya.dryomov@inktank.com> Reviewed-by: Alex Elder <elder@linaro.org>	2014-04-04 21:07:34 -07:00
Ilya Dryomov	0a2800d728	libceph: do not prefix osd lines with \t in debugfs output To save screen space in anticipation of more fields (e.g. primary affinity). Signed-off-by: Ilya Dryomov <ilya.dryomov@inktank.com> Reviewed-by: Alex Elder <elder@linaro.org>	2014-04-04 21:07:32 -07:00
Ilya Dryomov	35fea3a18a	libceph: refer to osdmap directly in osdmap_show() To make it more readable and save screen space. Signed-off-by: Ilya Dryomov <ilya.dryomov@inktank.com> Reviewed-by: Alex Elder <elder@linaro.org>	2014-04-04 21:07:31 -07:00
Ilya Dryomov	d83ed858f1	crush: add SET_CHOOSELEAF_VARY_R step This lets you adjust the vary_r tunable on a per-rule basis. Reflects ceph.git commit f944ccc20aee60a7d8da7e405ec75ad1cd449fac. Signed-off-by: Ilya Dryomov <ilya.dryomov@inktank.com> Reviewed-by: Josh Durgin <josh.durgin@inktank.com>	2014-04-04 21:07:28 -07:00
Ilya Dryomov	e2b149cc4b	crush: add chooseleaf_vary_r tunable The current crush_choose_firstn code will re-use the same 'r' value for the recursive call. That means that if we are hitting a collision or rejection for some reason (say, an OSD that is marked out) and need to retry, we will keep making the same (bad) choice in that recursive selection. Introduce a tunable that fixes that behavior by incorporating the parent 'r' value into the recursive starting point, so that a different path will be taken in subsequent placement attempts. Note that this was done from the get-go for the new crush_choose_indep algorithm. This was exposed by a user who was seeing PGs stuck in active+remapped after reweight-by-utilization because the up set mapped to a single OSD. Reflects ceph.git commit a8e6c9fbf88bad056dd05d3eb790e98a5e43451a. Signed-off-by: Ilya Dryomov <ilya.dryomov@inktank.com> Reviewed-by: Josh Durgin <josh.durgin@inktank.com>	2014-04-04 21:07:26 -07:00
Ilya Dryomov	6ed1002f36	crush: allow crush rules to set (re)tries counts to 0 These two fields are misnomers; they are retry counts. Reflects ceph.git commit f17caba8ae0cad7b6f8f35e53e5f73b444696835. Signed-off-by: Ilya Dryomov <ilya.dryomov@inktank.com> Reviewed-by: Josh Durgin <josh.durgin@inktank.com>	2014-04-04 21:07:25 -07:00
Ilya Dryomov	48a163dbb5	crush: fix off-by-one errors in total_tries refactor Back in 27f4d1f6bc32c2ed7b2c5080cbd58b14df622607 we refactored the CRUSH code to allow adjustment of the retry counts on a per-pool basis. That commit had an off-by-one bug: the previous "tries" counter was a retry count, not a try count, but the new code was passing in 1 meaning there should be no retries. Fix the ftotal vs tries comparison to use < instead of <= to fix the problem. Note that the original code used <= here, which means the global "choose_total_tries" tunable is actually counting retries. Compensate for that by adding 1 in crush_do_rule when we pull the tunable into the local variable. This was noticed looking at output from a user provided osdmap. Unfortunately the map doesn't illustrate the change in mapping behavior and I haven't managed to construct one yet that does. Inspection of the crush debug output now aligns with prior versions, though. Reflects ceph.git commit 795704fd615f0b008dcc81aa088a859b2d075138. Signed-off-by: Ilya Dryomov <ilya.dryomov@inktank.com> Reviewed-by: Josh Durgin <josh.durgin@inktank.com>	2014-04-04 21:07:23 -07:00
Yan, Zheng	d90deda69c	libceph: fix oops in ceph_msg_data_{pages,pagelist}_advance() When there is no more data, ceph_msg_data_{pages,pagelist}_advance() should not move on to the next page. Signed-off-by: Yan, Zheng <zheng.z.yan@intel.com>	2014-04-04 21:07:15 -07:00
Ilya Dryomov	c647b8a8c6	libceph: add support for CEPH_OSD_OP_SETALLOCHINT osd op This is primarily for rbd's benefit and is supposed to combat fragmentation: "... knowing that rbd images have a 4m size, librbd can pass a hint that will let the osd do the xfs allocation size ioctl on new files so that they are allocated in 1m or 4m chunks. We've seen cases where users with rbd workloads have very high levels of fragmentation in xfs and this would mitigate that and probably have a pretty nice performance benefit." SETALLOCHINT is considered advisory, so our backwards compatibility mechanism here is to set FAILOK flag for all SETALLOCHINT ops. Signed-off-by: Ilya Dryomov <ilya.dryomov@inktank.com> Reviewed-by: Sage Weil <sage@inktank.com> Reviewed-by: Alex Elder <elder@linaro.org>	2014-04-03 10:33:51 +08:00

1 2 3 4 5 ...

680 Commits