linux_old1

Commit Graph

Author	SHA1	Message	Date
Ming Lei	744889b7cb	block: don't deal with discard limit in blkdev_issue_discard() blk_queue_split() does respect this limit via bio splitting, so no need to do that in blkdev_issue_discard(), then we can align to normal bio submit(bio_add_page() & submit_bio()). More importantly, this patch fixes one issue introduced in `a22c4d7e34` ("block: re-add discard_granularity and alignment checks"), in which zero discard bio may be generated in case of zero alignment. Fixes: `a22c4d7e34` ("block: re-add discard_granularity and alignment checks") Cc: stable@vger.kernel.org Cc: Ming Lin <ming.l@ssi.samsung.com> Cc: Mike Snitzer <snitzer@redhat.com> Cc: Christoph Hellwig <hch@lst.de> Cc: Xiao Ni <xni@redhat.com> Tested-by: Mariusz Dabrowski <mariusz.dabrowski@intel.com> Signed-off-by: Ming Lei <ming.lei@redhat.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>	2018-10-18 07:23:40 -06:00
Josef Bacik	5e65a20341	blk-wbt: wake up all when we scale up, not down Tetsuo brought to my attention that I screwed up the scale_up/scale_down helpers when I factored out the rq-qos code. We need to wake up all the waiters when we add slots for requests to make, not when we shrink the slots. Otherwise we'll end up things waiting forever. This was a mistake and simply puts everything back the way it was. cc: stable@vger.kernel.org Fixes: `a79050434b` ("blk-rq-qos: refactor out common elements of blk-wbt") eported-by: Tetsuo Handa <penguin-kernel@i-love.sakura.ne.jp> Signed-off-by: Josef Bacik <josef@toxicpanda.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>	2018-10-11 13:31:28 -06:00
Ilya Dryomov	587562d0c7	blk-mq: I/O and timer unplugs are inverted in blktrace trace_block_unplug() takes true for explicit unplugs and false for implicit unplugs. schedule() unplugs are implicit and should be reported as timer unplugs. While correct in the legacy code, this has been inverted in blk-mq since 4.11. Cc: stable@vger.kernel.org Fixes: `bd166ef183` ("blk-mq-sched: add framework for MQ capable IO schedulers") Reviewed-by: Omar Sandoval <osandov@fb.com> Signed-off-by: Ilya Dryomov <idryomov@gmail.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>	2018-09-27 13:12:44 -06:00
Damien Le Moal	854f31ccdd	block: fix deadline elevator drain for zoned block devices When the deadline scheduler is used with a zoned block device, writes to a zone will be dispatched one at a time. This causes the warning message: deadline: forced dispatching is broken (nr_sorted=X), please report this to be displayed when switching to another elevator with the legacy I/O path while write requests to a zone are being retained in the scheduler queue. Prevent this message from being displayed when executing elv_drain_elevator() for a zoned block device. __blk_drain_queue() will loop until all writes are dispatched and completed, resulting in the desired elevator queue drain without extensive modifications to the deadline code itself to handle forced-dispatch calls. Signed-off-by: Damien Le Moal <damien.lemoal@wdc.com> Fixes: `8dc8146f9c` ("deadline-iosched: Introduce zone locking support") Cc: stable@vger.kernel.org Signed-off-by: Jens Axboe <axboe@kernel.dk>	2018-09-26 19:57:24 -06:00
Keith Busch	530ca2c9bd	blk-mq: Allow blocking queue tag iter callbacks A recent commit runs tag iterator callbacks under the rcu read lock, but existing callbacks do not satisfy the non-blocking requirement. The commit intended to prevent an iterator from accessing a queue that's being modified. This patch fixes the original issue by taking a queue reference instead of reading it, which allows callbacks to make blocking calls. Fixes: `f5bbbbe4d6` ("blk-mq: sync the update nr_hw_queues with blk_mq_queue_tag_busy_iter") Acked-by: Jianchao Wang <jianchao.w.wang@oracle.com> Signed-off-by: Keith Busch <keith.busch@intel.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>	2018-09-25 20:17:59 -06:00
Omar Sandoval	b57e99b4b8	block: use nanosecond resolution for iostat Klaus Kusche reported that the I/O busy time in /proc/diskstats was not updating properly on 4.18. This is because we started using ktime to track elapsed time, and we convert nanoseconds to jiffies when we update the partition counter. However, this gets rounded down, so any I/Os that take less than a jiffy are not accounted for. Previously in this case, the value of jiffies would sometimes increment while we were doing I/O, so at least some I/Os were accounted for. Let's convert the stats to use nanoseconds internally. We still report milliseconds as before, now more accurately than ever. The value is still truncated to 32 bits for backwards compatibility. Fixes: `522a777566` ("block: consolidate struct request timestamp fields") Cc: stable@vger.kernel.org Reported-by: Klaus Kusche <klaus.kusche@computerix.info> Signed-off-by: Omar Sandoval <osandov@fb.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>	2018-09-21 20:26:59 -06:00
Jens Axboe	01c5f85aeb	blk-cgroup: increase number of supported policies After merging the iolatency policy, we potentially now have 4 policies being registered, but only support 3. This causes one of them to fail loading. Takashi reports that BFQ no longer works for him, because it fails to load due to policy registration failure. Bump to 5 policies, and also add a warning for when we have exceeded the global amount. If we have to touch this again, we should switch to a dynamic scheme instead. Reported-by: Takashi Iwai <tiwai@suse.de> Reviewed-by: Jeff Moyer <jmoyer@redhat.com> Tested-by: Takashi Iwai <tiwai@suse.de> Signed-off-by: Jens Axboe <axboe@kernel.dk>	2018-09-11 10:59:53 -06:00
Konstantin Khlebnikov	d5274b3cd6	block: bfq: swap puts in bfqg_and_blkg_put Fix trivial use-after-free. This could be last reference to bfqg. Fixes: `8f9bebc33d` ("block, bfq: access and cache blkg data only when safe") Acked-by: Paolo Valente <paolo.valente@linaro.org> Signed-off-by: Konstantin Khlebnikov <khlebnikov@yandex-team.ru> Signed-off-by: Jens Axboe <axboe@kernel.dk>	2018-09-06 11:32:58 -06:00
Mikulas Patocka	8b2ded1c94	block: don't warn when doing fsync on read-only devices It is possible to call fsync on a read-only handle (for example, fsck.ext2 does it when doing read-only check), and this call results in kernel warning. The patch `b089cfd95d` ("block: don't warn for flush on read-only device") attempted to disable the warning, but it is buggy and it doesn't (op_is_flush tests flags, but bio_op strips off the flags). Signed-off-by: Mikulas Patocka <mpatocka@redhat.com> Fixes: `721c7fc701` ("block: fail op_is_write() requests to read-only partitions") Cc: stable@vger.kernel.org # 4.18 Signed-off-by: Jens Axboe <axboe@kernel.dk>	2018-09-05 16:14:36 -06:00
Dennis Zhou (Facebook)	3111885015	blkcg: use tryget logic when associating a blkg with a bio There is a very small change a bio gets caught up in a really unfortunate race between a task migration, cgroup exiting, and itself trying to associate with a blkg. This is due to css offlining being performed after the css->refcnt is killed which triggers removal of blkgs that reach their blkg->refcnt of 0. To avoid this, association with a blkg should use tryget and fallback to using the root_blkg. Fixes: `08e18eab0c` ("block: add bi_blkg to the bio for cgroups") Reviewed-by: Josef Bacik <josef@toxicpanda.com> Signed-off-by: Dennis Zhou <dennisszhou@gmail.com> Cc: Jiufei Xue <jiufei.xue@linux.alibaba.com> Cc: Joseph Qi <joseph.qi@linux.alibaba.com> Cc: Tejun Heo <tj@kernel.org> Cc: Josef Bacik <josef@toxicpanda.com> Cc: Jens Axboe <axboe@kernel.dk> Signed-off-by: Jens Axboe <axboe@kernel.dk>	2018-08-31 14:48:58 -06:00
Dennis Zhou (Facebook)	59b57717ff	blkcg: delay blkg destruction until after writeback has finished Currently, blkcg destruction relies on a sequence of events: 1. Destruction starts. blkcg_css_offline() is called and blkgs release their reference to the blkcg. This immediately destroys the cgwbs (writeback). 2. With blkgs giving up their reference, the blkcg ref count should become zero and eventually call blkcg_css_free() which finally frees the blkcg. Jiufei Xue reported that there is a race between blkcg_bio_issue_check() and cgroup_rmdir(). To remedy this, blkg destruction becomes contingent on the completion of all writeback associated with the blkcg. A count of the number of cgwbs is maintained and once that goes to zero, blkg destruction can follow. This should prevent premature blkg destruction related to writeback. The new process for blkcg cleanup is as follows: 1. Destruction starts. blkcg_css_offline() is called which offlines writeback. Blkg destruction is delayed on the cgwb_refcnt count to avoid punting potentially large amounts of outstanding writeback to root while maintaining any ongoing policies. Here, the base cgwb_refcnt is put back. 2. When the cgwb_refcnt becomes zero, blkcg_destroy_blkgs() is called and handles destruction of blkgs. This is where the css reference held by each blkg is released. 3. Once the blkcg ref count goes to zero, blkcg_css_free() is called. This finally frees the blkg. It seems in the past blk-throttle didn't do the most understandable things with taking data from a blkg while associating with current. So, the simplification and unification of what blk-throttle is doing caused this. Fixes: `08e18eab0c` ("block: add bi_blkg to the bio for cgroups") Reviewed-by: Josef Bacik <josef@toxicpanda.com> Signed-off-by: Dennis Zhou <dennisszhou@gmail.com> Cc: Jiufei Xue <jiufei.xue@linux.alibaba.com> Cc: Joseph Qi <joseph.qi@linux.alibaba.com> Cc: Tejun Heo <tj@kernel.org> Cc: Josef Bacik <josef@toxicpanda.com> Cc: Jens Axboe <axboe@kernel.dk> Signed-off-by: Jens Axboe <axboe@kernel.dk>	2018-08-31 14:48:56 -06:00
Dennis Zhou (Facebook)	6b06546206	Revert "blk-throttle: fix race between blkcg_bio_issue_check() and cgroup_rmdir()" This reverts commit `4c6994806f`. Destroying blkgs is tricky because of the nature of the relationship. A blkg should go away when either a blkcg or a request_queue goes away. However, blkg's pin the blkcg to ensure they remain valid. To break this cycle, when a blkcg is offlined, blkgs put back their css ref. This eventually lets css_free() get called which frees the blkcg. The above commit (`4c6994806f`) breaks this order of events by trying to destroy blkgs in css_free(). As the blkgs still hold references to the blkcg, css_free() is never called. The race between blkcg_bio_issue_check() and cgroup_rmdir() will be addressed in the following patch by delaying destruction of a blkg until all writeback associated with the blkcg has been finished. Fixes: `4c6994806f` ("blk-throttle: fix race between blkcg_bio_issue_check() and cgroup_rmdir()") Reviewed-by: Josef Bacik <josef@toxicpanda.com> Signed-off-by: Dennis Zhou <dennisszhou@gmail.com> Cc: Jiufei Xue <jiufei.xue@linux.alibaba.com> Cc: Joseph Qi <joseph.qi@linux.alibaba.com> Cc: Tejun Heo <tj@kernel.org> Cc: Jens Axboe <axboe@kernel.dk> Signed-off-by: Jens Axboe <axboe@kernel.dk>	2018-08-31 14:48:54 -06:00
John Pittman	db193954ed	block: bsg: move atomic_t ref_count variable to refcount API Currently, variable ref_count within the bsg_device struct is of type atomic_t. For variables being used as reference counters, the refcount API should be used instead of atomic. The newer refcount API works to prevent counter overflows and use-after-free bugs. So, move this varable from the atomic API to refcount, potentially avoiding the issues mentioned. Signed-off-by: John Pittman <jpittman@redhat.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>	2018-08-27 19:17:02 -06:00
Chengguang Xu	62d2a19407	block: remove unnecessary condition check kmem_cache_destroy() can handle NULL pointer correctly, so there is no need to check e->icq_cache before calling kmem_cache_destroy(). Signed-off-by: Chengguang Xu <cgxu519@gmx.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>	2018-08-27 19:16:06 -06:00
Jens Axboe	b0a84beb2e	blk-wbt: remove dead code We already note and mark discard and swap IO from bio_to_wbt_flags(). Signed-off-by: Jens Axboe <axboe@kernel.dk>	2018-08-27 13:32:12 -06:00
Jens Axboe	38cfb5a45e	blk-wbt: improve waking of tasks We have two potential issues: 1) After commit `2887e41b91`, we only wake one process at the time when we finish an IO. We really want to wake up as many tasks as can queue IO. Before this commit, we woke up everyone, which could cause a thundering herd issue. 2) A task can potentially consume two wakeups, causing us to (in practice) miss a wakeup. Fix both by providing our own wakeup function, which stops __wake_up_common() from waking up more tasks if we fail to get a queueing token. With the strict ordering we have on the wait list, this wakes the right tasks and the right amount of tasks. Based on a patch from Jianchao Wang <jianchao.w.wang@oracle.com>. Tested-by: Agarwal, Anchal <anchalag@amazon.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>	2018-08-27 11:27:26 -06:00
Jens Axboe	061a542753	blk-wbt: abstract out end IO completion handler Prep patch for calling the handler from a different context, no functional changes in this patch. Tested-by: Agarwal, Anchal <anchalag@amazon.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>	2018-08-27 11:27:24 -06:00
Jens Axboe	c125311d96	blk-wbt: don't maintain inflight counts if disabled A previous commit removed the ability to have per-rq flags. We used those flags to maintain inflight counts. Since we don't have those anymore, we have to always maintain inflight counts, even if wbt is disabled. This is clearly suboptimal. Add a queue quiesce around changing the wbt latency settings from sysfs to work around this. With that, we can reliably put the enabled check in our bio_to_wbt_flags(), since we know the WBT_TRACKED flag will be consistent for the lifetime of the request. Fixes: `c1c80384c8` ("block: remove external dependency on wbt_flags") Reviewed-by: Josef Bacik <josef@toxicpanda.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>	2018-08-23 09:34:46 -06:00
Jens Axboe	c45e6a037a	blk-wbt: fix has-sleeper queueing check We need to do this inside the loop as well, or we can allow new IO to supersede previous IO. Tested-by: Anchal Agarwal <anchalag@amazon.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>	2018-08-22 15:07:32 -06:00
Jens Axboe	b78820937b	blk-wbt: use wq_has_sleeper() for wq active check We need the memory barrier before checking the list head, use the appropriate helper for this. The matching queue side memory barrier is provided by set_current_state(). Tested-by: Anchal Agarwal <anchalag@amazon.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>	2018-08-22 15:07:31 -06:00
Jens Axboe	ffa358dcaa	blk-wbt: move disable check into get_limit() Check it in one place, instead of in multiple places. Tested-by: Anchal Agarwal <anchalag@amazon.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>	2018-08-22 15:07:31 -06:00
Linus Torvalds	5bed49adfe	for-4.19/post-20180822 -----BEGIN PGP SIGNATURE----- iQJEBAABCAAuFiEEwPw5LcreJtl1+l5K99NY+ylx4KYFAlt9on8QHGF4Ym9lQGtl cm5lbC5kawAKCRD301j7KXHgpj1xEADKBmJlV9aVyxc5w6XggqAGeHqI4afFrl+v 9fW6WUQMAaBUrr7PMIEJQ0Zm4B7KxgBaEWNtuuj4ULkjpgYm2AuGUuTJSyKz41rS Ma+KNyCA2Zmq4SvwGFbcdCuCbUqnoxTycscAgCjuDvIYLW0+nFGNc47ibmu9lZIV 33Ef5LrxuCjhC2zyNxEdWpUxDCjoYzock85LW+wYyIYLU9uKdoExS+YmT8U+ebA/ AkXBcxPztNDxwkcsIwgGVoTjwxiowqGz3uueWfyEmYgaCPiNOsxkoNQAtjX4ykQE hnqnHWyzJkRwbYo7Vd/bRAZXvszKGYE1YcJmu5QrNf0dK5MSq2o5OYJAEJWbucPj m0R2u7O9qbS2JEnxGrm5+oYJwBzwNY5/Lajr15WkljTqobKnqcvn/Hdgz/XdGtek 0S1QHkkBsF7e+cax8sePWK+O3ilY7pl9CzyZKB/tJngl8A45Jv8xVojg0v3O7oS+ zZib0rwWg/bwR/uN6uPCDcEsQusqL5YovB7m6NRVshwz6cV1zVNp2q+iOulk7KuC MprW4Du9CJf8HA19XtyJIG1XLstnuz+Exy+i5BiimUJ5InoEFDuj/6OZa6Qaczbo SrDDvpGtSf4h7czKpE5kV4uZiTOrjuI30TrI+4csdZ7HQIlboxNL72seNTLJs55F nbLjRM8L6g== =FS7e -----END PGP SIGNATURE----- Merge tag 'for-4.19/post-20180822' of git://git.kernel.dk/linux-block Pull more block updates from Jens Axboe: - Set of bcache fixes and changes (Coly) - The flush warn fix (me) - Small series of BFQ fixes (Paolo) - wbt hang fix (Ming) - blktrace fix (Steven) - blk-mq hardware queue count update fix (Jianchao) - Various little fixes * tag 'for-4.19/post-20180822' of git://git.kernel.dk/linux-block: (31 commits) block/DAC960.c: make some arrays static const, shrinks object size blk-mq: sync the update nr_hw_queues with blk_mq_queue_tag_busy_iter blk-mq: init hctx sched after update ctx and hctx mapping block: remove duplicate initialization tracing/blktrace: Fix to allow setting same value pktcdvd: fix setting of 'ret' error return for a few cases block: change return type to bool block, bfq: return nbytes and not zero from struct cftype .write() method block, bfq: improve code of bfq_bfqq_charge_time block, bfq: reduce write overcharge block, bfq: always update the budget of an entity when needed block, bfq: readd missing reset of parent-entity service blk-wbt: fix IO hang in wbt_wait() block: don't warn for flush on read-only device bcache: add the missing comments for smp_mb()/smp_wmb() bcache: remove unnecessary space before ioctl function pointer arguments bcache: add missing SPDX header bcache: move open brace at end of function definitions to next line bcache: add static const prefix to char * array declarations bcache: fix code comments style ...	2018-08-22 13:38:05 -07:00
Jianchao Wang	f5bbbbe4d6	blk-mq: sync the update nr_hw_queues with blk_mq_queue_tag_busy_iter For blk-mq, part_in_flight/rw will invoke blk_mq_in_flight/rw to account the inflight requests. It will access the queue_hw_ctx and nr_hw_queues w/o any protection. When updating nr_hw_queues and blk_mq_in_flight/rw occur concurrently, panic comes up. Before update nr_hw_queues, the q will be frozen. So we could use q_usage_counter to avoid the race. percpu_ref_is_zero is used here so that we will not miss any in-flight request. The access to nr_hw_queues and queue_hw_ctx in blk_mq_queue_tag_busy_iter are under rcu critical section, __blk_mq_update_nr_hw_queues could use synchronize_rcu to ensure the zeroed q_usage_counter to be globally visible. Signed-off-by: Jianchao Wang <jianchao.w.wang@oracle.com> Reviewed-by: Ming Lei <ming.lei@redhat.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>	2018-08-21 09:02:56 -06:00
Jianchao Wang	d48ece209f	blk-mq: init hctx sched after update ctx and hctx mapping Currently, when update nr_hw_queues, IO scheduler's init_hctx will be invoked before the mapping between ctx and hctx is adapted correctly by blk_mq_map_swqueue. The IO scheduler init_hctx (kyber) may depend on this mapping and get wrong result and panic finally. A simply way to fix this is that switch the IO scheduler to 'none' before update the nr_hw_queues, and then switch it back after update nr_hw_queues. blk_mq_sched_init_/exit_hctx are removed due to nobody use them any more. Signed-off-by: Jianchao Wang <jianchao.w.wang@oracle.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>	2018-08-21 09:02:55 -06:00
Chaitanya Kulkarni	fcedba42d9	block: remove duplicate initialization This patch removes the duplicate initialization of q->queue_head in the blk_alloc_queue_node(). This removes the 2nd initialization so that we preserve the initialization order same as declaration present in struct request_queue. Reviewed-by: Omar Sandoval <osandov@fb.com> Signed-off-by: Chaitanya Kulkarni <chaitanya.kulkarni@wdc.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>	2018-08-17 12:33:36 -06:00
Chengguang Xu	599d067dd3	block: change return type to bool Because blk_do_io_stat() only does a judgement about the request contributes to IO statistics, it better changes return type to bool. Signed-off-by: Chengguang Xu <cgxu519@gmx.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>	2018-08-16 13:44:17 -06:00
Maciej S. Szmigiero	fc8ebd01de	block, bfq: return nbytes and not zero from struct cftype .write() method The value that struct cftype .write() method returns is then directly returned to userspace as the value returned by write() syscall, so it should be the number of bytes actually written (or consumed) and not zero. Returning zero from write() syscall makes programs like /bin/echo or bash spin. Signed-off-by: Maciej S. Szmigiero <mail@maciej.szmigiero.name> Fixes: `e21b7a0b98` ("block, bfq: add full hierarchical scheduling and cgroups support") Cc: stable@vger.kernel.org Signed-off-by: Jens Axboe <axboe@kernel.dk>	2018-08-16 13:11:16 -06:00
Paolo Valente	f812164869	block, bfq: improve code of bfq_bfqq_charge_time bfq_bfqq_charge_time contains some lengthy and redundant code. This commit trims and condenses that code. Signed-off-by: Paolo Valente <paolo.valente@linaro.org> Signed-off-by: Jens Axboe <axboe@kernel.dk>	2018-08-16 13:08:15 -06:00
Paolo Valente	d5801088a7	block, bfq: reduce write overcharge When a sync request is dispatched, the queue that contains that request, and all the ancestor entities of that queue, are charged with the number of sectors of the request. In constrast, if the request is async, then the queue and its ancestor entities are charged with the number of sectors of the request, multiplied by an overcharge factor. This throttles the bandwidth for async I/O, w.r.t. to sync I/O, and it is done to counter the tendency of async writes to steal I/O throughput to reads. On the opposite end, the lower this parameter, the stabler I/O control, in the following respect. The lower this parameter is, the less the bandwidth enjoyed by a group decreases - when the group does writes, w.r.t. to when it does reads; - when other groups do reads, w.r.t. to when they do writes. The fixes "block, bfq: always update the budget of an entity when needed" and "block, bfq: readd missing reset of parent-entity service" improved I/O control in bfq to such an extent that it has been possible to revise this overcharge factor downwards. This commit introduces the resulting, new value. Signed-off-by: Paolo Valente <paolo.valente@linaro.org> Signed-off-by: Jens Axboe <axboe@kernel.dk>	2018-08-16 13:08:13 -06:00
Paolo Valente	e02a0aa26b	block, bfq: always update the budget of an entity when needed When the next child entity to serve changes for a given parent entity, the budget of that parent entity must be updated accordingly. Unfortunately, this update is not performed, by mistake, for the entities that happen to switch from having no child entity to serve, to having one child entity to serve. Signed-off-by: Paolo Valente <paolo.valente@linaro.org> Signed-off-by: Jens Axboe <axboe@kernel.dk>	2018-08-16 13:08:12 -06:00
Paolo Valente	8a511ba5fe	block, bfq: readd missing reset of parent-entity service The received-service counter needs to be equal to 0 when an entity is set in service. Unfortunately, commit "block, bfq: fix service being wrongly set to zero in case of preemption" mistakenly removed the resetting of this counter for the parent entities of the bfq_queue being set in service. This commit fixes this issue by resetting service for parent entities, directly on the expiration of the in-service bfq_queue. Fixes: `9fae8dd59f` ("block, bfq: fix service being wrongly set to zero in case of preemption") Signed-off-by: Paolo Valente <paolo.valente@linaro.org> Signed-off-by: Jens Axboe <axboe@kernel.dk>	2018-08-16 13:08:10 -06:00
Linus Torvalds	73ba2fb33c	for-4.19/block-20180812 -----BEGIN PGP SIGNATURE----- iQJEBAABCAAuFiEEwPw5LcreJtl1+l5K99NY+ylx4KYFAltwvasQHGF4Ym9lQGtl cm5lbC5kawAKCRD301j7KXHgpv65EACTq5gSLnJBI6ZPr1RAHruVDnjfzO2Veitl tUtjm0XfWmnEiwQ3dYvnyhy99xbyaG3900d9BClCTlH6xaUdSiQkDpcKG/R2F36J 5mZitYukQcpFAQJWF8YKsTTE7JPl4VglCIDqYiC4+C3rOSVi8lrKn2qp4J4MMCFn thRg3jCcq7c5s9Eigsop1pXWQSasubkXfk55Krcp4oybKYpYRKXXf74Mj14QAbwJ QHN3VisyAUWoBRg7UQZo1Npe2oPk6bbnJypnjf8M0M2EnlvddEkIlHob91sodka8 6p4APOEu5cbyXOBCAQsw/koff14mb8aEadqeQA68WvXfIdX9ZjfxCX0OoC3sBEXk yqJhZ0C980AM13zIBD8ejv4uasGcPca8W+47mE5P8sRiI++5kBsFWDZPCtUBna0X 2Kh24NsmEya9XRR5vsB84dsIPQ3tLMkxg/IgQRVDaSnfJz0c/+zm54xDyKRaFT4l 5iERk2WSkm9+8jNfVmWG0edrv6nRAXjpGwFfOCPh6/LCSCi4xQRULYN7sVzsX8ZK FRjt24HftBI8mJbh4BtweJvg+ppVe1gAk3IO3HvxAQhv29Hz+uvFYe9kL+3N8LJA Qosr9n9O4+wKYizJcDnw+5iPqCHfAwOm9th4pyedR+R7SmNcP3yNC8AbbheNBiF5 Zolos5H+JA== =b9ib -----END PGP SIGNATURE----- Merge tag 'for-4.19/block-20180812' of git://git.kernel.dk/linux-block Pull block updates from Jens Axboe: "First pull request for this merge window, there will also be a followup request with some stragglers. This pull request contains: - Fix for a thundering heard issue in the wbt block code (Anchal Agarwal) - A few NVMe pull requests: * Improved tracepoints (Keith) * Larger inline data support for RDMA (Steve Wise) * RDMA setup/teardown fixes (Sagi) * Effects log suppor for NVMe target (Chaitanya Kulkarni) * Buffered IO suppor for NVMe target (Chaitanya Kulkarni) * TP4004 (ANA) support (Christoph) * Various NVMe fixes - Block io-latency controller support. Much needed support for properly containing block devices. (Josef) - Series improving how we handle sense information on the stack (Kees) - Lightnvm fixes and updates/improvements (Mathias/Javier et al) - Zoned device support for null_blk (Matias) - AIX partition fixes (Mauricio Faria de Oliveira) - DIF checksum code made generic (Max Gurtovoy) - Add support for discard in iostats (Michael Callahan / Tejun) - Set of updates for BFQ (Paolo) - Removal of async write support for bsg (Christoph) - Bio page dirtying and clone fixups (Christoph) - Set of bcache fix/changes (via Coly) - Series improving blk-mq queue setup/teardown speed (Ming) - Series improving merging performance on blk-mq (Ming) - Lots of other fixes and cleanups from a slew of folks" * tag 'for-4.19/block-20180812' of git://git.kernel.dk/linux-block: (190 commits) blkcg: Make blkg_root_lookup() work for queues in bypass mode bcache: fix error setting writeback_rate through sysfs interface null_blk: add lock drop/acquire annotation Blk-throttle: reduce tail io latency when iops limit is enforced block: paride: pd: mark expected switch fall-throughs block: Ensure that a request queue is dissociated from the cgroup controller block: Introduce blk_exit_queue() blkcg: Introduce blkg_root_lookup() block: Remove two superfluous #include directives blk-mq: count the hctx as active before allocating tag block: bvec_nr_vecs() returns value for wrong slab bcache: trivial - remove tailing backslash in macro BTREE_FLAG bcache: make the pr_err statement used for ENOENT only in sysfs_attatch section bcache: set max writeback rate when I/O request is idle bcache: add code comments for bset.c bcache: fix mistaken comments in request.c bcache: fix mistaken code comments in bcache.h bcache: add a comment in super.c bcache: avoid unncessary cache prefetch bch_btree_node_get() bcache: display rate debug parameters to 0 when writeback is not running ...	2018-08-14 10:23:25 -07:00
Ming Lei	df60f6e835	blk-wbt: fix IO hang in wbt_wait() On wbt invariant is that if one IO is tracked via WBT_TRACKED, rqw->inflight should be updated for tracking this IO. But commit `c1c80384c8` ("block: remove external dependency on wbt_flags") forgets to remove the early handling of !rwb_enabled(rwb) inside wbt_wait(), then the inflight counter may not be increased in wbt_wait(), but decreased in wbt_done() for this kind of IO, so this counter may become negative, then wbt_wait() may wait forever. This patch fixes the report in the following link: https://marc.info/?l=linux-block&m=153221542021033&w=2 Fixes: `c1c80384c8` ("block: remove external dependency on wbt_flags") Cc: Josef Bacik <jbacik@fb.com> Reported-by: Ming Lei <ming.lei@redhat.com> Signed-off-by: Ming Lei <ming.lei@redhat.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>	2018-08-14 11:05:52 -06:00
Jens Axboe	b089cfd95d	block: don't warn for flush on read-only device Don't warn for a flush issued to a read-only device. It's not strictly a writable command, as it doesn't change any on-media data by itself. Reported-by: Stefan Agner <stefan@agner.ch> Fixes: `721c7fc701` ("block: fail op_is_write() requests to read-only partitions") Signed-off-by: Jens Axboe <axboe@kernel.dk>	2018-08-14 10:52:40 -06:00
Bart Van Assche	b86d865cb1	blkcg: Make blkg_root_lookup() work for queues in bypass mode For legacy queues the only call of blkg_root_lookup() happens after bypass mode has been enabled. Since blkg_lookup() returns NULL for queues in bypass mode, modify the blkg_root_lookup() such that it no longer depends on bypass mode. Rename the function into blk_queue_root_blkg() as suggested by Tejun. Suggested-by: Tejun Heo <tj@kernel.org> Fixes: `6bad9b210a` ("blkcg: Introduce blkg_root_lookup()") Signed-off-by: Bart Van Assche <bart.vanassche@wdc.com> Cc: Tejun Heo <tj@kernel.org> Signed-off-by: Jens Axboe <axboe@kernel.dk>	2018-08-11 15:41:25 -06:00
Liu Bo	991f61fe7e	Blk-throttle: reduce tail io latency when iops limit is enforced When an application's iops has exceeded its cgroup's iops limit, surely it is throttled and kernel will set a timer for dispatching, thus IO latency includes the delay. However, the dispatch delay which is calculated by the limit and the elapsed jiffies is suboptimal. As the dispatch delay is only calculated once the application's iops is (iops limit + 1), it doesn't need to wait any longer than the remaining time of the current slice. The difference can be proved by the following fio job and cgroup iops setting, ----- $ echo 4 > /mnt/config/nullb/disk1/mbps # limit nullb's bandwidth to 4MB/s for testing. $ echo "253:1 riops=100 rbps=max" > /sys/fs/cgroup/unified/cg1/io.max $ cat r2.job [global] name=fio-rand-read filename=/dev/nullb1 rw=randread bs=4k direct=1 numjobs=1 time_based=1 runtime=60 group_reporting=1 [file1] size=4G ioengine=libaio iodepth=1 rate_iops=50000 norandommap=1 thinktime=4ms ----- wo patch: file1: (g=0): rw=randread, bs=(R) 4096B-4096B, (W) 4096B-4096B, (T) 4096B-4096B, ioengine=libaio, iodepth=1 fio-3.7-66-gedfc Starting 1 process read: IOPS=99, BW=400KiB/s (410kB/s)(23.4MiB/60001msec) slat (usec): min=10, max=336, avg=27.71, stdev=17.82 clat (usec): min=2, max=28887, avg=5929.81, stdev=7374.29 lat (usec): min=24, max=28901, avg=5958.73, stdev=7366.22 clat percentiles (usec): \| 1.00th=[ 4], 5.00th=[ 4], 10.00th=[ 4], 20.00th=[ 4], \| 30.00th=[ 4], 40.00th=[ 4], 50.00th=[ 6], 60.00th=[11731], \| 70.00th=[11863], 80.00th=[11994], 90.00th=[12911], 95.00th=[22676], \| 99.00th=[23725], 99.50th=[23987], 99.90th=[23987], 99.95th=[25035], \| 99.99th=[28967] w/ patch: file1: (g=0): rw=randread, bs=(R) 4096B-4096B, (W) 4096B-4096B, (T) 4096B-4096B, ioengine=libaio, iodepth=1 fio-3.7-66-gedfc Starting 1 process read: IOPS=100, BW=400KiB/s (410kB/s)(23.4MiB/60005msec) slat (usec): min=10, max=155, avg=23.24, stdev=16.79 clat (usec): min=2, max=12393, avg=5961.58, stdev=5959.25 lat (usec): min=23, max=12412, avg=5985.91, stdev=5951.92 clat percentiles (usec): \| 1.00th=[ 3], 5.00th=[ 3], 10.00th=[ 4], 20.00th=[ 4], \| 30.00th=[ 4], 40.00th=[ 5], 50.00th=[ 47], 60.00th=[11863], \| 70.00th=[11994], 80.00th=[11994], 90.00th=[11994], 95.00th=[11994], \| 99.00th=[11994], 99.50th=[11994], 99.90th=[12125], 99.95th=[12125], \| 99.99th=[12387] Signed-off-by: Liu Bo <bo.liu@linux.alibaba.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>	2018-08-09 12:43:16 -06:00
Bart Van Assche	24ecc35853	block: Ensure that a request queue is dissociated from the cgroup controller Several block drivers call alloc_disk() followed by put_disk() if something fails before device_add_disk() is called without calling blk_cleanup_queue(). Make sure that also for this scenario a request queue is dissociated from the cgroup controller. This patch avoids that loading the parport_pc, paride and pf drivers triggers the following kernel crash: BUG: KASAN: null-ptr-deref in pi_init+0x42e/0x580 [paride] Read of size 4 at addr 0000000000000008 by task modprobe/744 Call Trace: dump_stack+0x9a/0xeb kasan_report+0x139/0x350 pi_init+0x42e/0x580 [paride] pf_init+0x2bb/0x1000 [pf] do_one_initcall+0x8e/0x405 do_init_module+0xd9/0x2f2 load_module+0x3ab4/0x4700 SYSC_finit_module+0x176/0x1a0 do_syscall_64+0xee/0x2b0 entry_SYSCALL_64_after_hwframe+0x42/0xb7 Reported-by: Alexandru Moise <00moses.alexander00@gmail.com> Fixes: `a063057d7c` ("block: Fix a race between request queue removal and the block cgroup controller") # v4.17 Signed-off-by: Bart Van Assche <bart.vanassche@wdc.com> Tested-by: Alexandru Moise <00moses.alexander00@gmail.com> Reviewed-by: Johannes Thumshirn <jthumshirn@suse.de> Cc: Tejun Heo <tj@kernel.org> Cc: Christoph Hellwig <hch@lst.de> Cc: Ming Lei <ming.lei@redhat.com> Cc: Alexandru Moise <00moses.alexander00@gmail.com> Cc: Joseph Qi <joseph.qi@linux.alibaba.com> Cc: <stable@vger.kernel.org> Signed-off-by: Jens Axboe <axboe@kernel.dk>	2018-08-09 09:13:00 -06:00
Bart Van Assche	4cf6324b17	block: Introduce blk_exit_queue() This patch does not change any functionality. Signed-off-by: Bart Van Assche <bart.vanassche@wdc.com> Reviewed-by: Johannes Thumshirn <jthumshirn@suse.de> Cc: Christoph Hellwig <hch@lst.de> Cc: Ming Lei <ming.lei@redhat.com> Cc: Omar Sandoval <osandov@fb.com> Cc: Alexandru Moise <00moses.alexander00@gmail.com> Cc: Joseph Qi <joseph.qi@linux.alibaba.com> Cc: <stable@vger.kernel.org> Signed-off-by: Jens Axboe <axboe@kernel.dk>	2018-08-09 09:12:59 -06:00
Jianchao Wang	d263ed9926	blk-mq: count the hctx as active before allocating tag Currently, we count the hctx as active after allocate driver tag successfully. If a previously inactive hctx try to get tag first time, it may fails and need to wait. However, due to the stale tag ->active_queues, the other shared-tags users are still able to occupy all driver tags while there is someone waiting for tag. Consequently, even if the previously inactive hctx is waked up, it still may not be able to get a tag and could be starved. To fix it, we count the hctx as active before try to allocate driver tag, then when it is waiting the tag, the other shared-tag users will reserve budget for it. Reviewed-by: Ming Lei <ming.lei@redhat.com> Signed-off-by: Jianchao Wang <jianchao.w.wang@oracle.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>	2018-08-09 08:34:17 -06:00
Greg Edwards	d6c02a9beb	block: bvec_nr_vecs() returns value for wrong slab In commit `ed996a52c8` ("block: simplify and cleanup bvec pool handling"), the value of the slab index is incremented by one in bvec_alloc() after the allocation is done to indicate an index value of 0 does not need to be later freed. bvec_nr_vecs() was not updated accordingly, and thus returns the wrong value. Decrement idx before performing the lookup. Fixes: `ed996a52c8` ("block: simplify and cleanup bvec pool handling") Signed-off-by: Greg Edwards <gedwards@ddn.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>	2018-08-09 08:25:04 -06:00
Bart Van Assche	f7ecb1b109	cfq: Suppress compiler warnings about comparisons This patch does not change any functionality but avoids that gcc reports the following warnings when building with W=1: block/cfq-iosched.c: In function ?cfq_back_seek_max_store?: block/cfq-iosched.c:4741:13: warning: comparison of unsigned expression < 0 is always false [-Wtype-limits] if (__data < (MIN)) \ ^ block/cfq-iosched.c:4756:1: note: in expansion of macro ?STORE_FUNCTION? STORE_FUNCTION(cfq_back_seek_max_store, &cfqd->cfq_back_max, 0, UINT_MAX, 0); ^~~~~~~~~~~~~~ block/cfq-iosched.c: In function ?cfq_slice_idle_store?: block/cfq-iosched.c:4741:13: warning: comparison of unsigned expression < 0 is always false [-Wtype-limits] if (__data < (MIN)) \ ^ block/cfq-iosched.c:4759:1: note: in expansion of macro ?STORE_FUNCTION? STORE_FUNCTION(cfq_slice_idle_store, &cfqd->cfq_slice_idle, 0, UINT_MAX, 1); ^~~~~~~~~~~~~~ block/cfq-iosched.c: In function ?cfq_group_idle_store?: block/cfq-iosched.c:4741:13: warning: comparison of unsigned expression < 0 is always false [-Wtype-limits] if (__data < (MIN)) \ ^ block/cfq-iosched.c:4760:1: note: in expansion of macro ?STORE_FUNCTION? STORE_FUNCTION(cfq_group_idle_store, &cfqd->cfq_group_idle, 0, UINT_MAX, 1); ^~~~~~~~~~~~~~ block/cfq-iosched.c: In function ?cfq_low_latency_store?: block/cfq-iosched.c:4741:13: warning: comparison of unsigned expression < 0 is always false [-Wtype-limits] if (__data < (MIN)) \ ^ block/cfq-iosched.c:4765:1: note: in expansion of macro ?STORE_FUNCTION? STORE_FUNCTION(cfq_low_latency_store, &cfqd->cfq_latency, 0, 1, 0); ^~~~~~~~~~~~~~ block/cfq-iosched.c: In function ?cfq_slice_idle_us_store?: block/cfq-iosched.c:4775:13: warning: comparison of unsigned expression < 0 is always false [-Wtype-limits] if (__data < (MIN)) \ ^ block/cfq-iosched.c:4782:1: note: in expansion of macro ?USEC_STORE_FUNCTION? USEC_STORE_FUNCTION(cfq_slice_idle_us_store, &cfqd->cfq_slice_idle, 0, UINT_MAX); ^~~~~~~~~~~~~~~~~~~ block/cfq-iosched.c: In function ?cfq_group_idle_us_store?: block/cfq-iosched.c:4775:13: warning: comparison of unsigned expression < 0 is always false [-Wtype-limits] if (__data < (MIN)) \ ^ block/cfq-iosched.c:4783:1: note: in expansion of macro ?USEC_STORE_FUNCTION? USEC_STORE_FUNCTION(cfq_group_idle_us_store, &cfqd->cfq_group_idle, 0, UINT_MAX); ^~~~~~~~~~~~~~~~~~~ Signed-off-by: Bart Van Assche <bart.vanassche@wdc.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>	2018-08-07 17:57:13 -06:00
Bart Van Assche	9b4f43460d	cfq: Annotate fall-through in a switch statement This patch avoids that gcc complains about fall-through when building with W=1. Signed-off-by: Bart Van Assche <bart.vanassche@wdc.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>	2018-08-07 17:57:11 -06:00
Anchal Agarwal	2887e41b91	blk-wbt: Avoid lock contention and thundering herd issue in wbt_wait I am currently running a large bare metal instance (i3.metal) on EC2 with 72 cores, 512GB of RAM and NVME drives, with a 4.18 kernel. I have a workload that simulates a database workload and I am running into lockup issues when writeback throttling is enabled,with the hung task detector also kicking in. Crash dumps show that most CPUs (up to 50 of them) are all trying to get the wbt wait queue lock while trying to add themselves to it in __wbt_wait (see stack traces below). [ 0.948118] CPU: 45 PID: 0 Comm: swapper/45 Not tainted 4.14.51-62.38.amzn1.x86_64 #1 [ 0.948119] Hardware name: Amazon EC2 i3.metal/Not Specified, BIOS 1.0 10/16/2017 [ 0.948120] task: ffff883f7878c000 task.stack: ffffc9000c69c000 [ 0.948124] RIP: 0010:native_queued_spin_lock_slowpath+0xf8/0x1a0 [ 0.948125] RSP: 0018:ffff883f7fcc3dc8 EFLAGS: 00000046 [ 0.948126] RAX: 0000000000000000 RBX: ffff887f7709ca68 RCX: ffff883f7fce2a00 [ 0.948128] RDX: 000000000000001c RSI: 0000000000740001 RDI: ffff887f7709ca68 [ 0.948129] RBP: 0000000000000002 R08: 0000000000b80000 R09: 0000000000000000 [ 0.948130] R10: ffff883f7fcc3d78 R11: 000000000de27121 R12: 0000000000000002 [ 0.948131] R13: 0000000000000003 R14: 0000000000000000 R15: 0000000000000000 [ 0.948132] FS: 0000000000000000(0000) GS:ffff883f7fcc0000(0000) knlGS:0000000000000000 [ 0.948134] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 [ 0.948135] CR2: 000000c424c77000 CR3: 0000000002010005 CR4: 00000000003606e0 [ 0.948136] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000 [ 0.948137] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400 [ 0.948138] Call Trace: [ 0.948139] <IRQ> [ 0.948142] do_raw_spin_lock+0xad/0xc0 [ 0.948145] _raw_spin_lock_irqsave+0x44/0x4b [ 0.948149] ? __wake_up_common_lock+0x53/0x90 [ 0.948150] __wake_up_common_lock+0x53/0x90 [ 0.948155] wbt_done+0x7b/0xa0 [ 0.948158] blk_mq_free_request+0xb7/0x110 [ 0.948161] __blk_mq_complete_request+0xcb/0x140 [ 0.948166] nvme_process_cq+0xce/0x1a0 [nvme] [ 0.948169] nvme_irq+0x23/0x50 [nvme] [ 0.948173] __handle_irq_event_percpu+0x46/0x300 [ 0.948176] handle_irq_event_percpu+0x20/0x50 [ 0.948179] handle_irq_event+0x34/0x60 [ 0.948181] handle_edge_irq+0x77/0x190 [ 0.948185] handle_irq+0xaf/0x120 [ 0.948188] do_IRQ+0x53/0x110 [ 0.948191] common_interrupt+0x87/0x87 [ 0.948192] </IRQ> .... [ 0.311136] CPU: 4 PID: 9737 Comm: run_linux_amd64 Not tainted 4.14.51-62.38.amzn1.x86_64 #1 [ 0.311137] Hardware name: Amazon EC2 i3.metal/Not Specified, BIOS 1.0 10/16/2017 [ 0.311138] task: ffff883f6e6a8000 task.stack: ffffc9000f1ec000 [ 0.311141] RIP: 0010:native_queued_spin_lock_slowpath+0xf5/0x1a0 [ 0.311142] RSP: 0018:ffffc9000f1efa28 EFLAGS: 00000046 [ 0.311144] RAX: 0000000000000000 RBX: ffff887f7709ca68 RCX: ffff883f7f722a00 [ 0.311145] RDX: 0000000000000035 RSI: 0000000000d80001 RDI: ffff887f7709ca68 [ 0.311146] RBP: 0000000000000202 R08: 0000000000140000 R09: 0000000000000000 [ 0.311147] R10: ffffc9000f1ef9d8 R11: 000000001a249fa0 R12: ffff887f7709ca68 [ 0.311148] R13: ffffc9000f1efad0 R14: 0000000000000000 R15: ffff887f7709ca00 [ 0.311149] FS: 000000c423f30090(0000) GS:ffff883f7f700000(0000) knlGS:0000000000000000 [ 0.311150] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 [ 0.311151] CR2: 00007feefcea4000 CR3: 0000007f7016e001 CR4: 00000000003606e0 [ 0.311152] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000 [ 0.311153] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400 [ 0.311154] Call Trace: [ 0.311157] do_raw_spin_lock+0xad/0xc0 [ 0.311160] _raw_spin_lock_irqsave+0x44/0x4b [ 0.311162] ? prepare_to_wait_exclusive+0x28/0xb0 [ 0.311164] prepare_to_wait_exclusive+0x28/0xb0 [ 0.311167] wbt_wait+0x127/0x330 [ 0.311169] ? finish_wait+0x80/0x80 [ 0.311172] ? generic_make_request+0xda/0x3b0 [ 0.311174] blk_mq_make_request+0xd6/0x7b0 [ 0.311176] ? blk_queue_enter+0x24/0x260 [ 0.311178] ? generic_make_request+0xda/0x3b0 [ 0.311181] generic_make_request+0x10c/0x3b0 [ 0.311183] ? submit_bio+0x5c/0x110 [ 0.311185] submit_bio+0x5c/0x110 [ 0.311197] ? __ext4_journal_stop+0x36/0xa0 [ext4] [ 0.311210] ext4_io_submit+0x48/0x60 [ext4] [ 0.311222] ext4_writepages+0x810/0x11f0 [ext4] [ 0.311229] ? do_writepages+0x3c/0xd0 [ 0.311239] ? ext4_mark_inode_dirty+0x260/0x260 [ext4] [ 0.311240] do_writepages+0x3c/0xd0 [ 0.311243] ? _raw_spin_unlock+0x24/0x30 [ 0.311245] ? wbc_attach_and_unlock_inode+0x165/0x280 [ 0.311248] ? __filemap_fdatawrite_range+0xa3/0xe0 [ 0.311250] __filemap_fdatawrite_range+0xa3/0xe0 [ 0.311253] file_write_and_wait_range+0x34/0x90 [ 0.311264] ext4_sync_file+0x151/0x500 [ext4] [ 0.311267] do_fsync+0x38/0x60 [ 0.311270] SyS_fsync+0xc/0x10 [ 0.311272] do_syscall_64+0x6f/0x170 [ 0.311274] entry_SYSCALL_64_after_hwframe+0x42/0xb7 In the original patch, wbt_done is waking up all the exclusive processes in the wait queue, which can cause a thundering herd if there is a large number of writer threads in the queue. The original intention of the code seems to be to wake up one thread only however, it uses wake_up_all() in __wbt_done(), and then uses the following check in __wbt_wait to have only one thread actually get out of the wait loop: if (waitqueue_active(&rqw->wait) && rqw->wait.head.next != &wait->entry) return false; The problem with this is that the wait entry in wbt_wait is define with DEFINE_WAIT, which uses the autoremove wakeup function. That means that the above check is invalid - the wait entry will have been removed from the queue already by the time we hit the check in the loop. Secondly, auto-removing the wait entries also means that the wait queue essentially gets reordered "randomly" (e.g. threads re-add themselves in the order they got to run after being woken up). Additionally, new requests entering wbt_wait might overtake requests that were queued earlier, because the wait queue will be (temporarily) empty after the wake_up_all, so the waitqueue_active check will not stop them. This can cause certain threads to starve under high load. The fix is to leave the woken up requests in the queue and remove them in finish_wait() once the current thread breaks out of the wait loop in __wbt_wait. This will ensure new requests always end up at the back of the queue, and they won't overtake requests that are already in the wait queue. With that change, the loop in wbt_wait is also in line with many other wait loops in the kernel. Waking up just one thread drastically reduces lock contention, as does moving the wait queue add/remove out of the loop. A significant drop in lockdep's lock contention numbers is seen when running the test application on the patched kernel. Signed-off-by: Anchal Agarwal <anchalag@amazon.com> Signed-off-by: Frank van der Linden <fllinden@amazon.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>	2018-08-07 14:40:49 -06:00
Jens Axboe	05b9ba4b55	Linux 4.18-rc6 -----BEGIN PGP SIGNATURE----- iQFSBAABCAA8FiEEq68RxlopcLEwq+PEeb4+QwBBGIYFAltU8z0eHHRvcnZhbGRz QGxpbnV4LWZvdW5kYXRpb24ub3JnAAoJEHm+PkMAQRiG5X8H/2fJr7m3k242+t76 sitwvx1eoPqTgryW59dRKm9IuXAGA+AjauvHzaz1QxomeQa50JghGWefD0eiJfkA 1AphQ/24EOiAbbVk084dAI/C2p122dE4D5Fy7CrfLnuouyrbFaZI5STbnrRct7sR 9deeYW0GDHO1Uenp4WDCj0baaqJqaevZ+7GG09DnWpya2nQtSkGBjqn6GpYmrfOU mqFuxAX8mEOW6cwK16y/vYtnVjuuMAiZ63/OJ8AQ6d6ArGLwAsdn7f8Fn4I4tEr2 L0d3CRLUyegms4++Dmlu05k64buQu46WlPhjCZc5/Ts4kjrNxBuHejj2/jeSnUSt vJJlibI= =42a5 -----END PGP SIGNATURE----- Merge tag 'v4.18-rc6' into for-4.19/block2 Pull in 4.18-rc6 to get the NVMe core AEN change to avoid a merge conflict down the line. Signed-of-by: Jens Axboe <axboe@kernel.dk>	2018-08-05 19:32:09 -06:00
Linus Torvalds	a32e236eb9	Partially revert "block: fail op_is_write() requests to read-only partitions" It turns out that commit `721c7fc701` ("block: fail op_is_write() requests to read-only partitions"), while obviously correct, causes problems for some older lvm2 installations. The reason is that the lvm snapshotting will continue to write to the snapshow COW volume, even after the volume has been marked read-only. End result: snapshot failure. This has actually been fixed in newer version of the lvm2 tool, but the old tools still exist, and the breakage was reported both in the kernel bugzilla and in the Debian bugzilla: https://bugzilla.kernel.org/show_bug.cgi?id=200439 https://bugs.debian.org/cgi-bin/bugreport.cgi?bug=900442 The lvm2 fix is here https://sourceware.org/git/?p=lvm2.git;a=commit;h=a6fdb9d9d70f51c49ad11a87ab4243344e6701a3 but until everybody has updated to recent versions, we'll have to weaken the "never write to read-only partitions" check. It now allows the write to happen, but causes a warning, something like this: generic_make_request: Trying to write to read-only block-device dm-3 (partno X) Modules linked in: nf_tables xt_cgroup xt_owner kvm_intel iwlmvm kvm irqbypass iwlwifi CPU: 1 PID: 77 Comm: kworker/1:1 Not tainted 4.17.9-gentoo #3 Hardware name: LENOVO 20B6A019RT/20B6A019RT, BIOS GJET91WW (2.41 ) 09/21/2016 Workqueue: ksnaphd do_metadata RIP: 0010:generic_make_request_checks+0x4ac/0x600 ... Call Trace: generic_make_request+0x64/0x400 submit_bio+0x6c/0x140 dispatch_io+0x287/0x430 sync_io+0xc3/0x120 dm_io+0x1f8/0x220 do_metadata+0x1d/0x30 process_one_work+0x1b9/0x3e0 worker_thread+0x2b/0x3c0 kthread+0x113/0x130 ret_from_fork+0x35/0x40 Note that this is a "revert" in behavior only. I'm leaving alone the actual code cleanups in commit `721c7fc701`, but letting the previously uncaught request go through with a warning instead of stopping it. Fixes: `721c7fc701` ("block: fail op_is_write() requests to read-only partitions") Reported-and-tested-by: WGH <wgh@torlan.ru> Acked-by: Mike Snitzer <snitzer@redhat.com> Cc: Sagi Grimberg <sagi@grimberg.me> Cc: Ilya Dryomov <idryomov@gmail.com> Cc: Jens Axboe <axboe@kernel.dk> Cc: Zdenek Kabelac <zkabelac@redhat.com> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2018-08-04 18:19:55 -07:00
Linus Torvalds	71abe04269	for-linus-20180803 -----BEGIN PGP SIGNATURE----- iQJEBAABCAAuFiEEwPw5LcreJtl1+l5K99NY+ylx4KYFAltkcI4QHGF4Ym9lQGtl cm5lbC5kawAKCRD301j7KXHgpuByD/9Pf8mfsv/WDJposa5pRqnY18dtBNdfHjsb t6vuBZiLxe86LDo1Y3e7663u+kgraARyswUpNTv5iAKseVkwIcz0RTArQJkaDLy7 AD2BEfLvpcacAoWZ96MiayjQabGXEnTWqiKnIlcbc6J4/LIfdYYZHd+SXUXH0R81 BaXPANL222srxuGbBifsfZ/yXxNPmBYfT1mTaFU6cKAftVO/24RV/MbwkU4cgym7 0ZM/iBDkFDy6eV3XhhGeX1miDhGa2zNM/NHvsnSsP8jQqlyJW+lpI5dRPEoH1Fzp ecpxSH/HtjW5GYMwP7qhHMth/XHhjkOmNDRblkRWYrMGxgy7mAcSR8axdjqTdDxD +vlez+b9uPF3qXUk1D2NkW3RmTSkxjV6ztzc+yddbFSzqmQcHSqH+7ohaHnQRefX IGngTOq5pSmluNmrul48y/wmkSwVI1/jrteOfJfhJXWL0pY65g+8KWQxLmeZHke7 ytFu6msOsVmZjDQ7tckSxSLMB+frvzuc/h+eOBTx8ida0bjopVlfNUcLQRD4kZfy xHn6nPxGsdiq2PM71nxYvqoUfmZnIE3o0A9+IB4OIp5bLVAGt2/fbWV3llunGK0A JM8UmyOysh71xh288VyE/hPz/7tyea/KgTaTm0jdTkTeoHH2isHsgWWvYB3zjeEn /1c7o2T13Q== =Ml3z -----END PGP SIGNATURE----- Merge tag 'for-linus-20180803' of git://git.kernel.dk/linux-block Pull block fix from Jens Axboe: "Just a single fix, from Ming, fixing a regression in this cycle where the busy tag iteration was changed to only calling the callback function for requests that are started. We really want all non-free requests. This fixes a boot regression on certain VM setups" * tag 'for-linus-20180803' of git://git.kernel.dk/linux-block: blk-mq: fix blk_mq_tagset_busy_iter	2018-08-03 10:43:56 -07:00
Ming Lei	2d5ba0e2de	blk-mq: fix blk_mq_tagset_busy_iter Commit d250bf4e776ff09d5("blk-mq: only iterate over inflight requests in blk_mq_tagset_busy_iter") uses 'blk_mq_rq_state(rq) == MQ_RQ_IN_FLIGHT' to replace 'blk_mq_request_started(req)', this way is wrong, and causes lots of test system hang during booting. Fix the issue by using blk_mq_request_started(req) inside bt_tags_iter(). Fixes: `d250bf4e77` ("blk-mq: only iterate over inflight requests in blk_mq_tagset_busy_iter") Cc: Josef Bacik <josef@toxicpanda.com> Cc: Christoph Hellwig <hch@lst.de> Cc: Guenter Roeck <linux@roeck-us.net> Cc: Mark Brown <broonie@kernel.org> Cc: Matt Hart <matthew.hart@linaro.org> Cc: Johannes Thumshirn <jthumshirn@suse.de> Cc: John Garry <john.garry@huawei.com> Cc: Hannes Reinecke <hare@suse.com>, Cc: "Martin K. Petersen" <martin.petersen@oracle.com>, Cc: James Bottomley <James.Bottomley@hansenpartnership.com> Cc: linux-scsi@vger.kernel.org Cc: linux-kernel@vger.kernel.org Reviewed-by: Bart Van Assche <bart.vanassche@wdc.com> Tested-by: Guenter Roeck <linux@roeck-us.net> Reported-by: Mark Brown <broonie@kernel.org> Reported-by: Guenter Roeck <linux@roeck-us.net> Signed-off-by: Ming Lei <ming.lei@redhat.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>	2018-08-02 14:47:20 -06:00
Ming Lei	75d6e175fc	blk-mq: fix updating tags depth The passed 'nr' from userspace represents the total depth, meantime inside 'struct blk_mq_tags', 'nr_tags' stores the total tag depth, and 'nr_reserved_tags' stores the reserved part. There are two issues in blk_mq_tag_update_depth() now: 1) for growing tags, we should have used the passed 'nr', and keep the number of reserved tags not changed. 2) the passed 'nr' should have been used for checking against 'tags->nr_tags', instead of number of the normal part. This patch fixes the above two cases, and avoids kernel crash caused by wrong resizing sbitmap queue. Cc: "Ewan D. Milne" <emilne@redhat.com> Cc: Christoph Hellwig <hch@lst.de> Cc: Bart Van Assche <bart.vanassche@sandisk.com> Cc: Omar Sandoval <osandov@fb.com> Tested by: Marco Patalano <mpatalan@redhat.com> Signed-off-by: Ming Lei <ming.lei@redhat.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>	2018-08-02 14:41:58 -06:00
Ming Lei	b233f12704	block: really disable runtime-pm for blk-mq Runtime PM isn't ready for blk-mq yet, and commit `765e40b675` ("block: disable runtime-pm for blk-mq") tried to disable it. Unfortunately, it can't take effect in that way since user space still can switch it on via 'echo auto > /sys/block/sdN/device/power/control'. This patch disables runtime-pm for blk-mq really by pm_runtime_disable() and fixes all kinds of PM related kernel crash. Cc: Tomas Janousek <tomi@nomi.cz> Cc: Przemek Socha <soprwa@gmail.com> Cc: Alan Stern <stern@rowland.harvard.edu> Cc: <stable@vger.kernel.org> Reviewed-by: Bart Van Assche <bart.vanassche@wdc.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Tested-by: Patrick Steinhardt <ps@pks.im> Signed-off-by: Ming Lei <ming.lei@redhat.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>	2018-08-02 10:36:02 -06:00
Dennis Zhou (Facebook)	c480bcf97b	block: make iolatency avg_lat exponentially decay Currently, avg_lat is calculated by accumulating the mean of every window in a long running cumulative average. As time goes on, the metric becomes less and less useful due to the accumulated history. This patch reuses the same calculation done in load averages to make the avg_lat metric more lively. Unlike load averages, the avg only advances when a window elapses (due to an io). Idle periods extend the most recent window. Bucketing is used to limit the history of avg_lat by binding it to the window size. So, the window range for 1/exp (decay rate) is [1 min, 2.5 min) when windows elapse immediately. The current sample window size is exposed in the debug info to enable calculation of the window range. Signed-off-by: Dennis Zhou <dennisszhou@gmail.com> Acked-by: Tejun Heo <tj@kernel.org> Acked-by: Johannes Weiner <hannes@cmpxchg.org> Acked-by: Josef Bacik <josef@toxicpanda.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>	2018-08-02 09:58:14 -06:00

1 2 3 4 5 ...

4013 Commits