linux

Commit Graph

Author	SHA1	Message	Date
Linus Torvalds	0c76c6ba24	Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/sage/ceph-client Pull Ceph updates from Sage Weil: "We have a pile of bug fixes from Ilya, including a few patches that sync up the CRUSH code with the latest from userspace. There is also a long series from Zheng that fixes various issues with snapshots, inline data, and directory fsync, some simplification and improvement in the cap release code, and a rework of the caching of directory contents. To top it off there are a few small fixes and cleanups from Benoit and Hong" * 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/sage/ceph-client: (40 commits) rbd: use GFP_NOIO in rbd_obj_request_create() crush: fix a bug in tree bucket decode libceph: Fix ceph_tcp_sendpage()'s more boolean usage libceph: Remove spurious kunmap() of the zero page rbd: queue_depth map option rbd: store rbd_options in rbd_device rbd: terminate rbd_opts_tokens with Opt_err ceph: fix ceph_writepages_start() rbd: bump queue_max_segments ceph: rework dcache readdir crush: sync up with userspace crush: fix crash from invalid 'take' argument ceph: switch some GFP_NOFS memory allocation to GFP_KERNEL ceph: pre-allocate data structure that tracks caps flushing ceph: re-send flushing caps (which are revoked) in reconnect stage ceph: send TID of the oldest pending caps flush to MDS ceph: track pending caps flushing globally ceph: track pending caps flushing accurately libceph: fix wrong name "Ceph filesystem for Linux" ceph: fix directory fsync ...	2015-07-02 11:35:00 -07:00
Ilya Dryomov	82cd003a77	crush: fix a bug in tree bucket decode struct crush_bucket_tree::num_nodes is u8, so ceph_decode_8_safe() should be used. -Wconversion catches this, but I guess it went unnoticed in all the noise it spews. The actual problem (at least for common crushmaps) isn't the u32 -> u8 truncation though - it's the advancement by 4 bytes instead of 1 in the crushmap buffer. Fixes: http://tracker.ceph.com/issues/2759 Cc: stable@vger.kernel.org Signed-off-by: Ilya Dryomov <idryomov@gmail.com> Reviewed-by: Josh Durgin <jdurgin@redhat.com>	2015-07-01 00:46:35 +03:00
Benoît Canet	c2cfa19400	libceph: Fix ceph_tcp_sendpage()'s more boolean usage From struct ceph_msg_data_cursor in include/linux/ceph/messenger.h: bool last_piece; /* current is last piece / In ceph_msg_data_next(): last_piece = cursor->last_piece; A call to ceph_msg_data_next() is followed by: ret = ceph_tcp_sendpage(con->sock, page, page_offset, length, last_piece); while ceph_tcp_sendpage() is: static int ceph_tcp_sendpage(struct socket sock, struct page page, int offset, size_t size, bool more) The logic is inverted: correct it. Signed-off-by: Benoît Canet <benoit.canet@nodalink.com> Reviewed-by: Alex Elder <elder@linaro.org> Signed-off-by: Ilya Dryomov <idryomov@gmail.com>	2015-06-29 20:03:46 +03:00
Benoît Canet	6ba8edc0bc	libceph: Remove spurious kunmap() of the zero page ceph_tcp_sendpage already does the work of mapping/unmapping the zero page if needed. Signed-off-by: Benoît Canet <benoit.canet@nodalink.com> Reviewed-by: Alex Elder <elder@linaro.org> Signed-off-by: Ilya Dryomov <idryomov@gmail.com>	2015-06-25 18:30:56 +03:00
Ilya Dryomov	b459be739f	crush: sync up with userspace .. up to ceph.git commit 1db1abc8328d ("crush: eliminate ad hoc diff between kernel and userspace"). This fixes a bunch of recently pulled coding style issues and makes includes a bit cleaner. A patch "crush:Make the function crush_ln static" from Nicholas Krause <xerofoify@gmail.com> is folded in as crush_ln() has been made static in userspace as well. Signed-off-by: Ilya Dryomov <idryomov@gmail.com>	2015-06-25 11:49:31 +03:00
Ilya Dryomov	8f529795ba	crush: fix crash from invalid 'take' argument Verify that the 'take' argument is a valid device or bucket. Otherwise ignore it (do not add the value to the working vector). Reflects ceph.git commit 9324d0a1af61e1c234cc48e2175b4e6320fff8f4. Signed-off-by: Ilya Dryomov <idryomov@gmail.com>	2015-06-25 11:49:31 +03:00
Hong Zhiguo	6c13a6bb55	libceph: fix wrong name "Ceph filesystem for Linux" modinfo libceph prints the module name "Ceph filesystem for Linux", which is same as the real fs module ceph. It's confusing. Signed-off-by: Hong Zhiguo <zhiguohong@tencent.com> Signed-off-by: Ilya Dryomov <idryomov@gmail.com>	2015-06-25 11:49:30 +03:00
Ilya Dryomov	216639dd50	libceph: a couple tweaks for wait loops - return -ETIMEDOUT instead of -EIO in case of timeout - wait_event_interruptible_timeout() returns time left until timeout and since it can be almost LONG_MAX we had better assign it to long Signed-off-by: Ilya Dryomov <idryomov@gmail.com> Reviewed-by: Alex Elder <elder@linaro.org>	2015-06-25 11:49:29 +03:00
Ilya Dryomov	a319bf56a6	libceph: store timeouts in jiffies, verify user input There are currently three libceph-level timeouts that the user can specify on mount: mount_timeout, osd_idle_ttl and osdkeepalive. All of these are in seconds and no checking is done on user input: negative values are accepted, we multiply them all by HZ which may or may not overflow, arbitrarily large jiffies then get added together, etc. There is also a bug in the way mount_timeout=0 is handled. It's supposed to mean "infinite timeout", but that's not how wait.h APIs treat it and so __ceph_open_session() for example will busy loop without much chance of being interrupted if none of ceph-mons are there. Fix all this by verifying user input, storing timeouts capped by msecs_to_jiffies() in jiffies and using the new ceph_timeout_jiffies() helper for all user-specified waits to handle infinite timeouts correctly. Signed-off-by: Ilya Dryomov <idryomov@gmail.com> Reviewed-by: Alex Elder <elder@linaro.org>	2015-06-25 11:49:29 +03:00
Ilya Dryomov	b01da6a08c	libceph: use kvfree() instead of open-coding it This one sneaked in through vfs tree with commit `2b777c9dd9` ("ceph_sync_read: stop poking into iov_iter guts"). Signed-off-by: Ilya Dryomov <idryomov@gmail.com>	2015-06-25 11:49:28 +03:00
Yan, Zheng	144cba1493	libceph: allow setting osd_req_op's flags Signed-off-by: Yan, Zheng <zyan@redhat.com> Reviewed-by: Alex Elder <elder@linaro.org>	2015-06-25 11:49:27 +03:00
Yan, Zheng	66ba609f7b	libceph: properly release STAT request's raw_data_in Signed-off-by: Yan, Zheng <zyan@redhat.com> Reviewed-by: Alex Elder <elder@linaro.org>	2015-06-25 11:49:27 +03:00
David S. Miller	dda922c831	Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net Conflicts: drivers/net/phy/amd-xgbe-phy.c drivers/net/wireless/iwlwifi/Kconfig include/net/mac80211.h iwlwifi/Kconfig and mac80211.h were both trivial overlapping changes. The drivers/net/phy/amd-xgbe-phy.c file got removed in 'net-next' and the bug fix that happened on the 'net' side is already integrated into the rest of the amd-xgbe driver. Signed-off-by: David S. Miller <davem@davemloft.net>	2015-06-01 22:51:30 -07:00
Ilya Dryomov	521a04d06a	Revert "libceph: clear r_req_lru_item in __unregister_linger_request()" This reverts commit `ba9d114ec5`. .. which introduced a regression that prevented all lingering requests requeued in kick_requests() from ever being sent to the OSDs, resulting in a lot of missed notifies. In retrospect it's pretty obvious that r_req_lru_item item in the case of lingering requests can be used not only for notarget, but also for unsent linkage due to how tightly actual map and enqueue operations are coupled in __map_request(). The assertion that was being silenced is taken care of in the previous ("libceph: request a new osdmap if lingering request maps to no osd") commit: by always kicking homeless lingering requests we ensure that none of them ends up on the notarget list outside of the critical section guarded by request_mutex. Cc: stable@vger.kernel.org # 3.18+, needs `b049453221` "libceph: request a new osdmap if lingering request maps to no osd" Signed-off-by: Ilya Dryomov <idryomov@gmail.com> Reviewed-by: Sage Weil <sage@redhat.com>	2015-05-20 21:02:46 +03:00
Ilya Dryomov	b049453221	libceph: request a new osdmap if lingering request maps to no osd This commit does two things. First, if there are any homeless lingering requests, we now request a new osdmap even if the osdmap that is being processed brought no changes, i.e. if a given lingering request turned homeless in one of the previous epochs and remained homeless in the current epoch. Not doing so leaves us with a stale osdmap and as a result we may miss our window for reestablishing the watch and lose notifies. MON=1 OSD=1: # cat linger-needmap.sh #!/bin/bash rbd create --size 1 test DEV=$(rbd map test) ceph osd out 0 rbd map dne/dne # obtain a new osdmap as a side effect (!) sleep 1 ceph osd in 0 rbd resize --size 2 test # rbd info test \| grep size -> 2M # blockdev --getsize $DEV -> 1M N.B.: Not obtaining a new osdmap in between "osd out" and "osd in" above is enough to make it miss that resize notify, but that is a bug^Wlimitation of ceph watch/notify v1. Second, homeless lingering requests are now kicked just like those lingering requests whose mapping has changed. This is mainly to recognize that a homeless lingering request makes no sense and to preserve the invariant that a registered lingering request is not sitting on any of r_req_lru_item lists. This spares us a WARN_ON, which commit `ba9d114ec5` ("libceph: clear r_req_lru_item in __unregister_linger_request()") tried to fix the _wrong_ way. Cc: stable@vger.kernel.org # 3.10+ Signed-off-by: Ilya Dryomov <idryomov@gmail.com> Reviewed-by: Sage Weil <sage@redhat.com>	2015-05-20 21:02:14 +03:00
Eric W. Biederman	eeb1bd5c40	net: Add a struct net parameter to sock_create_kern This is long overdue, and is part of cleaning up how we allocate kernel sockets that don't reference count struct net. Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2015-05-11 10:50:17 -04:00
Ilya Dryomov	958a27658d	crush: straw2 bucket type with an efficient 64-bit crush_ln() This is an improved straw bucket that correctly avoids any data movement between items A and B when neither A nor B's weights are changed. Said differently, if we adjust the weight of item C (including adding it anew or removing it completely), we will only see inputs move to or from C, never between other items in the bucket. Notably, there is not intermediate scaling factor that needs to be calculated. The mapping function is a simple function of the item weights. The below commits were squashed together into this one (mostly to avoid adding and then yanking a ~6000 lines worth of crush_ln_table): - crush: add a straw2 bucket type - crush: add crush_ln to calculate nature log efficently - crush: improve straw2 adjustment slightly - crush: change crush_ln to provide 32 more digits - crush: fix crush_get_bucket_item_weight and bucket destroy for straw2 - crush/mapper: fix divide-by-0 in straw2 (with div64_s64() for draw = ln / w and INT64_MIN -> S64_MIN - need to create a proper compat.h in ceph.git) Reflects ceph.git commits 242293c908e923d474910f2b8203fa3b41eb5a53, 32a1ead92efcd351822d22a5fc37d159c65c1338, 6289912418c4a3597a11778bcf29ed5415117ad9, 35fcb04e2945717cf5cfe150b9fa89cb3d2303a1, 6445d9ee7290938de1e4ee9563912a6ab6d8ee5f, b5921d55d16796e12d66ad2c4add7305f9ce2353. Signed-off-by: Ilya Dryomov <idryomov@gmail.com>	2015-04-22 18:33:43 +03:00
Ilya Dryomov	45002267e8	crush: ensuring at most num-rep osds are selected Crush temporary buffers are allocated as per replica size configured by the user. When there are more final osds (to be selected as per rule) than the replicas, buffer overlaps and it causes crash. Now, it ensures that at most num-rep osds are selected even if more number of osds are allowed by the rule. Reflects ceph.git commits 6b4d1aa99718e3b367496326c1e64551330fabc0, 234b066ba04976783d15ff2abc3e81b6cc06fb10. Signed-off-by: Ilya Dryomov <idryomov@gmail.com>	2015-04-22 18:33:42 +03:00
Ilya Dryomov	9be6df215a	crush: drop unnecessary include from mapper.c Signed-off-by: Ilya Dryomov <idryomov@gmail.com>	2015-04-22 18:33:42 +03:00
Ilya Dryomov	5cf7bd3012	libceph: expose client options through debugfs Add a client_options attribute for showing libceph options. Signed-off-by: Ilya Dryomov <idryomov@gmail.com>	2015-04-20 18:55:39 +03:00
Ilya Dryomov	ff40f9ae95	libceph, ceph: split ceph_show_options() Split ceph_show_options() into two pieces and move the piece responsible for printing client (libceph) options into net/ceph. This way people adding a libceph option wouldn't have to remember to update code in fs/ceph. Signed-off-by: Ilya Dryomov <idryomov@gmail.com>	2015-04-20 18:55:38 +03:00
Ilya Dryomov	67c64eb742	libceph: don't overwrite specific con error msgs - specific con->error_msg messages (e.g. "protocol version mismatch") end up getting overwritten by a catch-all "socket error on read / write", introduced in commit `3a140a0d5c` ("libceph: report socket read/write error message") - "bad message sequence # for incoming message" loses to "bad crc" due to the fact that -EBADMSG is used for both Fix it, and tidy up con->error_msg assignments and pr_errs while at it. Signed-off-by: Ilya Dryomov <idryomov@gmail.com>	2015-04-20 18:55:37 +03:00
Ilya Dryomov	6d7fdb0ab3	Revert "libceph: use memalloc flags for net IO" This reverts commit `89baaa570a`. Dirty page throttling should be sufficient for us in the general case so there is no need to use __GFP_MEMALLOC - it would be needed only in the swap-over-rbd case, which we currently don't support. (It would probably take approximately the commit that is being reverted to add that support, but we would also need the "swap" option to distinguish from the general case and make sure swap ceph_client-s aren't shared with anything else.) See ceph-devel threads [1] and [2] for the details of why enabling pfmemalloc reserves for all cases is a bad thing. On top of potential system lockups related to drained emergency reserves, this turned out to cause ceph lockups in case peers are on the same host and communicating via loopback due to sk_filter() dropping pfmemalloc skbs on the receiving side because the receiving loopback socket is not tagged with SOCK_MEMALLOC. [1] "SOCK_MEMALLOC vs loopback" http://www.spinics.net/lists/ceph-devel/msg22998.html [2] "[PATCH] libceph: don't set memalloc flags in loopback case" http://www.spinics.net/lists/ceph-devel/msg23392.html Conflicts: net/ceph/messenger.c [ context: tcp_nodelay option ] Cc: Mike Christie <michaelc@cs.wisc.edu> Cc: Mel Gorman <mgorman@suse.de> Cc: Sage Weil <sage@redhat.com> Cc: stable@vger.kernel.org # 3.18+, needs backporting Signed-off-by: Ilya Dryomov <idryomov@gmail.com> Acked-by: Mike Christie <michaelc@cs.wisc.edu> Acked-by: Mel Gorman <mgorman@suse.de>	2015-04-07 19:08:35 +03:00
Linus Torvalds	4533f6e27a	Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/sage/ceph-client Pull Ceph changes from Sage Weil: "On the RBD side, there is a conversion to blk-mq from Christoph, several long-standing bug fixes from Ilya, and some cleanup from Rickard Strandqvist. On the CephFS side there is a long list of fixes from Zheng, including improved session handling, a few IO path fixes, some dcache management correctness fixes, and several blocking while !TASK_RUNNING fixes. The core code gets a few cleanups and Chaitanya has added support for TCP_NODELAY (which has been used on the server side for ages but we somehow missed on the kernel client). There is also an update to MAINTAINERS to fix up some email addresses and reflect that Ilya and Zheng are doing most of the maintenance for RBD and CephFS these days. Do not be surprised to see a pull request come from one of them in the future if I am unavailable for some reason" * 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/sage/ceph-client: (27 commits) MAINTAINERS: update Ceph and RBD maintainers libceph: kfree() in put_osd() shouldn't depend on authorizer libceph: fix double __remove_osd() problem rbd: convert to blk-mq ceph: return error for traceless reply race ceph: fix dentry leaks ceph: re-send requests when MDS enters reconnecting stage ceph: show nocephx_require_signatures and notcp_nodelay options libceph: tcp_nodelay support rbd: do not treat standalone as flatten ceph: fix atomic_open snapdir ceph: properly mark empty directory as complete client: include kernel version in client metadata ceph: provide seperate {inode,file}_operations for snapdir ceph: fix request time stamp encoding ceph: fix reading inline data when i_size > PAGE_SIZE ceph: avoid block operation when !TASK_RUNNING (ceph_mdsc_close_sessions) ceph: avoid block operation when !TASK_RUNNING (ceph_get_caps) ceph: avoid block operation when !TASK_RUNNING (ceph_mdsc_sync) rbd: fix error paths in rbd_dev_refresh() ...	2015-02-19 14:14:42 -08:00
Ilya Dryomov	b28ec2f37e	libceph: kfree() in put_osd() shouldn't depend on authorizer `a255651d4c` ("ceph: ensure auth ops are defined before use") made kfree() in put_osd() conditional on the authorizer. A mechanical mistake most likely - fix it. Cc: Alex Elder <elder@linaro.org> Signed-off-by: Ilya Dryomov <idryomov@gmail.com> Reviewed-by: Sage Weil <sage@redhat.com> Reviewed-by: Alex Elder <elder@linaro.org>	2015-02-19 14:27:51 +03:00
Ilya Dryomov	7eb71e0351	libceph: fix double __remove_osd() problem It turns out it's possible to get __remove_osd() called twice on the same OSD. That doesn't sit well with rb_erase() - depending on the shape of the tree we can get a NULL dereference, a soft lockup or a random crash at some point in the future as we end up touching freed memory. One scenario that I was able to reproduce is as follows: <osd3 is idle, on the osd lru list> <con reset - osd3> con_fault_finish() osd_reset() <osdmap - osd3 down> ceph_osdc_handle_map() <takes map_sem> kick_requests() <takes request_mutex> reset_changed_osds() __reset_osd() __remove_osd() <releases request_mutex> <releases map_sem> <takes map_sem> <takes request_mutex> __kick_osd_requests() __reset_osd() __remove_osd() <-- !!! A case can be made that osd refcounting is imperfect and reworking it would be a proper resolution, but for now Sage and I decided to fix this by adding a safe guard around __remove_osd(). Fixes: http://tracker.ceph.com/issues/8087 Cc: Sage Weil <sage@redhat.com> Cc: stable@vger.kernel.org # 3.9+: 7c6e6fc53e73: libceph: assert both regular and lingering lists in __remove_osd() Cc: stable@vger.kernel.org # 3.9+: cc9f1f518cec: libceph: change from BUG to WARN for __remove_osd() asserts Cc: stable@vger.kernel.org # 3.9+ Signed-off-by: Ilya Dryomov <idryomov@gmail.com> Reviewed-by: Sage Weil <sage@redhat.com> Reviewed-by: Alex Elder <elder@linaro.org>	2015-02-19 14:27:50 +03:00
Chaitanya Huilgol	ba988f87f5	libceph: tcp_nodelay support TCP_NODELAY socket option set on connection sockets, disables Nagle’s algorithm and improves latency characteristics. tcp_nodelay(default)/notcp_nodelay option flags provided to enable/disable setting the socket option. Signed-off-by: Chaitanya Huilgol <chaitanya.huilgol@sandisk.com> [idryomov@redhat.com: NO_TCP_NODELAY -> TCP_NODELAY, minor adjustments] Signed-off-by: Ilya Dryomov <idryomov@redhat.com>	2015-02-19 13:31:40 +03:00
Ilya Dryomov	f646912d10	libceph: use mon_client.c/put_generic_request() more Signed-off-by: Ilya Dryomov <idryomov@redhat.com>	2015-02-19 13:31:37 +03:00
Ilya Dryomov	7a6fdeb2b1	libceph: nuke pool op infrastructure On Mon, Dec 22, 2014 at 5:35 PM, Sage Weil <sage@newdream.net> wrote: > On Mon, 22 Dec 2014, Ilya Dryomov wrote: >> Actually, pool op stuff has been unused for over two years - looks like >> it was added for rbd create_snap and that got ripped out in 2012. It's >> unlikely we'd ever need to manage pools or snaps from the kernel client >> so I think it makes sense to nuke it. Sage? > > Yep! Signed-off-by: Ilya Dryomov <idryomov@redhat.com>	2015-02-19 13:31:37 +03:00
Andrea Arcangeli	7e33912849	mm: gup: use get_user_pages_unlocked This allows those get_user_pages calls to pass FAULT_FLAG_ALLOW_RETRY to the page fault in order to release the mmap_sem during the I/O. Signed-off-by: Andrea Arcangeli <aarcange@redhat.com> Reviewed-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com> Cc: Andres Lagar-Cavilla <andreslc@google.com> Cc: Peter Feiner <pfeiner@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2015-02-11 17:06:05 -08:00
Ilya Dryomov	d7d5a007b1	libceph: fix sparse endianness warnings The only real issue is the one in auth_x.c and it came with 3.19-rc1 merge. Signed-off-by: Ilya Dryomov <idryomov@redhat.com>	2015-01-08 20:36:57 +03:00
Yan, Zheng	715e4cd405	libceph: specify position of extent operation allow specifying position of extent operation in multi-operations osd request. This is required for cephfs to convert inline data to normal data (compare xattr, then write object). Signed-off-by: Yan, Zheng <zyan@redhat.com> Reviewed-by: Ilya Dryomov <idryomov@redhat.com>	2014-12-17 20:09:52 +03:00
Yan, Zheng	864e9197f1	libceph: add CREATE osd operation support Add CEPH_OSD_OP_CREATE support. Also change libceph to not treat CEPH_OSD_OP_DELETE as an extent op and add an assert to that end. Signed-off-by: Yan, Zheng <zyan@redhat.com> Reviewed-by: Ilya Dryomov <idryomov@redhat.com>	2014-12-17 20:09:51 +03:00
Yan, Zheng	d74b50bed0	libceph: add SETXATTR/CMPXATTR osd operations support Signed-off-by: Yan, Zheng <zyan@redhat.com> Reviewed-by: Ilya Dryomov <idryomov@redhat.com>	2014-12-17 20:09:51 +03:00
Yan, Zheng	a3fc98005c	libceph: require cephx message signature by default Signed-off-by: Yan, Zheng <zyan@redhat.com> Reviewed-by: Ilya Dryomov <idryomov@redhat.com>	2014-12-17 20:09:51 +03:00
Yan, Zheng	33d0733796	libceph: message signature support Signed-off-by: Yan, Zheng <zyan@redhat.com>	2014-12-17 20:09:50 +03:00
Yan, Zheng	ae385eaf24	libceph: store session key in cephx authorizer Session key is required when calculating message signature. Save the session key in authorizer, this avoid lookup ticket handler for each message Signed-off-by: Yan, Zheng <zyan@redhat.com>	2014-12-17 20:09:50 +03:00
Ilya Dryomov	4965fc38c4	libceph: nuke ceph_kvfree() Use kvfree() from linux/mm.h instead, which is identical. Also fix the ceph_buffer comment: we will allocate with kmalloc() up to 32k - the value of PAGE_ALLOC_COSTLY_ORDER, but that really is just an implementation detail so don't mention it at all. Signed-off-by: Ilya Dryomov <idryomov@redhat.com>	2014-12-17 20:09:50 +03:00
Ilya Dryomov	cc9f1f518c	libceph: change from BUG to WARN for __remove_osd() asserts No reason to use BUG_ON for osd request list assertions. Signed-off-by: Ilya Dryomov <idryomov@redhat.com> Reviewed-by: Alex Elder <elder@linaro.org>	2014-11-13 22:26:34 +03:00
Ilya Dryomov	ba9d114ec5	libceph: clear r_req_lru_item in __unregister_linger_request() kick_requests() can put linger requests on the notarget list. This means we need to clear the much-overloaded req->r_req_lru_item in __unregister_linger_request() as well, or we get an assertion failure in ceph_osdc_release_request() - !list_empty(&req->r_req_lru_item). AFAICT the assumption was that registered linger requests cannot be on any of req->r_req_lru_item lists, but that's clearly not the case. Signed-off-by: Ilya Dryomov <idryomov@redhat.com> Reviewed-by: Alex Elder <elder@linaro.org>	2014-11-13 22:21:14 +03:00
Ilya Dryomov	a390de0208	libceph: unlink from o_linger_requests when clearing r_osd Requests have to be unlinked from both osd->o_requests (normal requests) and osd->o_linger_requests (linger requests) lists when clearing req->r_osd. Otherwise __unregister_linger_request() gets confused and we trip over a !list_empty(&osd->o_linger_requests) assert in __remove_osd(). MON=1 OSD=1: # cat remove-osd.sh #!/bin/bash rbd create --size 1 test DEV=$(rbd map test) ceph osd out 0 sleep 3 rbd map dne/dne # obtain a new osdmap as a side effect rbd unmap $DEV & # will block sleep 3 ceph osd in 0 Signed-off-by: Ilya Dryomov <idryomov@redhat.com> Reviewed-by: Alex Elder <elder@linaro.org>	2014-11-13 22:21:13 +03:00
Ilya Dryomov	aaef31703a	libceph: do not crash on large auth tickets Large (greater than 32k, the value of PAGE_ALLOC_COSTLY_ORDER) auth tickets will have their buffers vmalloc'ed, which leads to the following crash in crypto: [ 28.685082] BUG: unable to handle kernel paging request at ffffeb04000032c0 [ 28.686032] IP: [<ffffffff81392b42>] scatterwalk_pagedone+0x22/0x80 [ 28.686032] PGD 0 [ 28.688088] Oops: 0000 [#1] PREEMPT SMP [ 28.688088] Modules linked in: [ 28.688088] CPU: 0 PID: 878 Comm: kworker/0:2 Not tainted 3.17.0-vm+ #305 [ 28.688088] Hardware name: Bochs Bochs, BIOS Bochs 01/01/2007 [ 28.688088] Workqueue: ceph-msgr con_work [ 28.688088] task: ffff88011a7f9030 ti: ffff8800d903c000 task.ti: ffff8800d903c000 [ 28.688088] RIP: 0010:[<ffffffff81392b42>] [<ffffffff81392b42>] scatterwalk_pagedone+0x22/0x80 [ 28.688088] RSP: 0018:ffff8800d903f688 EFLAGS: 00010286 [ 28.688088] RAX: ffffeb04000032c0 RBX: ffff8800d903f718 RCX: ffffeb04000032c0 [ 28.688088] RDX: 0000000000000000 RSI: 0000000000000001 RDI: ffff8800d903f750 [ 28.688088] RBP: ffff8800d903f688 R08: 00000000000007de R09: ffff8800d903f880 [ 28.688088] R10: 18df467c72d6257b R11: 0000000000000000 R12: 0000000000000010 [ 28.688088] R13: ffff8800d903f750 R14: ffff8800d903f8a0 R15: 0000000000000000 [ 28.688088] FS: 00007f50a41c7700(0000) GS:ffff88011fc00000(0000) knlGS:0000000000000000 [ 28.688088] CS: 0010 DS: 0000 ES: 0000 CR0: 000000008005003b [ 28.688088] CR2: ffffeb04000032c0 CR3: 00000000da3f3000 CR4: 00000000000006b0 [ 28.688088] Stack: [ 28.688088] ffff8800d903f698 ffffffff81392ca8 ffff8800d903f6e8 ffffffff81395d32 [ 28.688088] ffff8800dac96000 ffff880000000000 ffff8800d903f980 ffff880119b7e020 [ 28.688088] ffff880119b7e010 0000000000000000 0000000000000010 0000000000000010 [ 28.688088] Call Trace: [ 28.688088] [<ffffffff81392ca8>] scatterwalk_done+0x38/0x40 [ 28.688088] [<ffffffff81392ca8>] scatterwalk_done+0x38/0x40 [ 28.688088] [<ffffffff81395d32>] blkcipher_walk_done+0x182/0x220 [ 28.688088] [<ffffffff813990bf>] crypto_cbc_encrypt+0x15f/0x180 [ 28.688088] [<ffffffff81399780>] ? crypto_aes_set_key+0x30/0x30 [ 28.688088] [<ffffffff8156c40c>] ceph_aes_encrypt2+0x29c/0x2e0 [ 28.688088] [<ffffffff8156d2a3>] ceph_encrypt2+0x93/0xb0 [ 28.688088] [<ffffffff8156d7da>] ceph_x_encrypt+0x4a/0x60 [ 28.688088] [<ffffffff8155b39d>] ? ceph_buffer_new+0x5d/0xf0 [ 28.688088] [<ffffffff8156e837>] ceph_x_build_authorizer.isra.6+0x297/0x360 [ 28.688088] [<ffffffff8112089b>] ? kmem_cache_alloc_trace+0x11b/0x1c0 [ 28.688088] [<ffffffff8156b496>] ? ceph_auth_create_authorizer+0x36/0x80 [ 28.688088] [<ffffffff8156ed83>] ceph_x_create_authorizer+0x63/0xd0 [ 28.688088] [<ffffffff8156b4b4>] ceph_auth_create_authorizer+0x54/0x80 [ 28.688088] [<ffffffff8155f7c0>] get_authorizer+0x80/0xd0 [ 28.688088] [<ffffffff81555a8b>] prepare_write_connect+0x18b/0x2b0 [ 28.688088] [<ffffffff81559289>] try_read+0x1e59/0x1f10 This is because we set up crypto scatterlists as if all buffers were kmalloc'ed. Fix it. Cc: stable@vger.kernel.org Signed-off-by: Ilya Dryomov <idryomov@redhat.com> Reviewed-by: Sage Weil <sage@redhat.com>	2014-11-13 22:21:12 +03:00
Ilya Dryomov	e9226d7c9f	libceph: eliminate unnecessary allocation in process_one_ticket() Commit `c27a3e4d66` ("libceph: do not hard code max auth ticket len") while fixing a buffer overlow tried to keep the same as much of the surrounding code as possible and introduced an unnecessary kmalloc() in the unencrypted ticket path. It is likely to fail on huge tickets, so get rid of it. Signed-off-by: Ilya Dryomov <idryomov@redhat.com> Reviewed-by: Sage Weil <sage@redhat.com>	2014-10-31 23:43:08 +03:00
Mike Christie	89baaa570a	libceph: use memalloc flags for net IO This patch has ceph's lib code use the memalloc flags. If the VM layer needs to write data out to free up memory to handle new allocation requests, the block layer must be able to make forward progress. To handle that requirement we use structs like mempools to reserve memory for objects like bios and requests. The problem is when we send/receive block layer requests over the network layer, net skb allocations can fail and the system can lock up. To solve this, the memalloc related flags were added. NBD, iSCSI and NFS uses these flags to tell the network/vm layer that it should use memory reserves to fullfill allcation requests for structs like skbs. I am running ceph in a bunch of VMs in my laptop, so this patch was not tested very harshly. Signed-off-by: Mike Christie <michaelc@cs.wisc.edu> Reviewed-by: Ilya Dryomov <idryomov@redhat.com>	2014-10-30 13:11:50 +03:00
Linus Torvalds	6b04908166	Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/sage/ceph-client Pull Ceph updates from Sage Weil: "There is the long-awaited discard support for RBD (Guangliang Zhao, Josh Durgin), a pile of RBD bug fixes that didn't belong in late -rc's (Ilya Dryomov, Li RongQing), a pile of fs/ceph bug fixes and performance and debugging improvements (Yan, Zheng, John Spray), and a smattering of cleanups (Chao Yu, Fabian Frederick, Joe Perches)" * 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/sage/ceph-client: (40 commits) ceph: fix divide-by-zero in __validate_layout() rbd: rbd workqueues need a resque worker libceph: ceph-msgr workqueue needs a resque worker ceph: fix bool assignments libceph: separate multiple ops with commas in debugfs output libceph: sync osd op definitions in rados.h libceph: remove redundant declaration ceph: additional debugfs output ceph: export ceph_session_state_name function ceph: include the initial ACL in create/mkdir/mknod MDS requests ceph: use pagelist to present MDS request data libceph: reference counting pagelist ceph: fix llistxattr on symlink ceph: send client metadata to MDS ceph: remove redundant code for max file size verification ceph: remove redundant io_iter_advance() ceph: move ceph_find_inode() outside the s_mutex ceph: request xattrs if xattr_version is zero rbd: set the remaining discard properties to enable support rbd: use helpers to handle discard for layered images correctly ...	2014-10-15 06:46:01 +02:00
Ilya Dryomov	f9865f06f7	libceph: ceph-msgr workqueue needs a resque worker Commit `f363e45fd1` ("net/ceph: make ceph_msgr_wq non-reentrant") effectively removed WQ_MEM_RECLAIM flag from ceph_msgr_wq. This is wrong - libceph is very much a memory reclaim path, so restore it. Cc: stable@vger.kernel.org # needs backporting for < 3.12 Signed-off-by: Ilya Dryomov <idryomov@redhat.com> Tested-by: Micha Krause <micha@krausam.de> Reviewed-by: Sage Weil <sage@redhat.com>	2014-10-14 12:57:04 -07:00
Ilya Dryomov	25f897773b	libceph: separate multiple ops with commas in debugfs output For requests with multiple ops, separate ops with commas instead of \t, which is a field separator here. Signed-off-by: Ilya Dryomov <idryomov@redhat.com> Reviewed-by: Sage Weil <sage@redhat.com>	2014-10-14 12:57:03 -07:00
Ilya Dryomov	70b5bfa360	libceph: sync osd op definitions in rados.h Bring in missing osd ops and strings, use macros to eliminate multiple points of maintenance. Signed-off-by: Ilya Dryomov <idryomov@redhat.com> Reviewed-by: Sage Weil <sage@redhat.com>	2014-10-14 12:57:02 -07:00
Yan, Zheng	e4339d28f6	libceph: reference counting pagelist this allow pagelist to present data that may be sent multiple times. Signed-off-by: Yan, Zheng <zyan@redhat.com> Reviewed-by: Sage Weil <sage@redhat.com>	2014-10-14 12:56:48 -07:00
Ilya Dryomov	91883cd27c	libceph: don't try checking queue_work() return value queue_work() doesn't "fail to queue", it returns false if work was already on a queue, which can't happen here since we allocate event_work right before we queue it. So don't bother at all. Signed-off-by: Ilya Dryomov <ilya.dryomov@inktank.com> Reviewed-by: Alex Elder <elder@linaro.org>	2014-10-14 21:03:25 +04:00

1 2 3 4 5 ...

656 Commits