mirror of https://gitee.com/openkylin/linux.git
5246 Commits
Author | SHA1 | Message | Date |
---|---|---|---|
Tang Junhui | ca71df3166 |
bcache: fix using of loop variable in memory shrink
In bch_mca_scan(), There are some confusion and logical error in the use of loop variables. In this patch, we clarify them as: 1) nr: the number of btree nodes needs to scan, which will decrease after we scan a btree node, and should not be less than 0; 2) i: the number of btree nodes have scanned, includes both btree_cache_freeable and btree_cache, which should not be bigger than btree_cache_used; 3) freed: the number of btree nodes have freed. Signed-off-by: Tang Junhui <tang.junhui@zte.com.cn> Reviewed-by: Michael Lyle <mlyle@lyle.org> Signed-off-by: Jens Axboe <axboe@kernel.dk> |
|
Tang Junhui | f3641c3abd |
bcache: fix error return value in memory shrink
In bch_mca_scan(), the return value should not be the number of freed btree nodes, but the number of pages of freed btree nodes. Signed-off-by: Tang Junhui <tang.junhui@zte.com.cn> Reviewed-by: Michael Lyle <mlyle@lyle.org> Signed-off-by: Jens Axboe <axboe@kernel.dk> |
|
Tang Junhui | 688892b3bc |
bcache: fix incorrect sysfs output value of strip size
Stripe size is shown as zero when no strip in back end device: [root@ceph132 ~]# cat /sys/block/sdd/bcache/stripe_size 0.0k Actually it should be 1T Bytes (1 << 31 sectors), but in sysfs interface, stripe_size was changed from sectors to bytes, and move 9 bits left, so the 32 bits variable overflows. This patch change the variable to a 64 bits type before moving bits. Signed-off-by: Tang Junhui <tang.junhui@zte.com.cn> Reviewed-by: Michael Lyle <mlyle@lyle.org> Signed-off-by: Jens Axboe <axboe@kernel.dk> |
|
Tang Junhui | bc082a55d2 |
bcache: fix inaccurate io state for detached bcache devices
When we run IO in a detached device, and run iostat to shows IO status, normally it will show like bellow (Omitted some fields): Device: ... avgrq-sz avgqu-sz await r_await w_await svctm %util sdd ... 15.89 0.53 1.82 0.20 2.23 1.81 52.30 bcache0 ... 15.89 115.42 0.00 0.00 0.00 2.40 69.60 but after IO stopped, there are still very big avgqu-sz and %util values as bellow: Device: ... avgrq-sz avgqu-sz await r_await w_await svctm %util bcache0 ... 0 5326.32 0.00 0.00 0.00 0.00 100.10 The reason for this issue is that, only generic_start_io_acct() called and no generic_end_io_acct() called for detached device in cached_dev_make_request(). See the code: //start generic_start_io_acct() generic_start_io_acct(q, rw, bio_sectors(bio), &d->disk->part0); if (cached_dev_get(dc)) { //will callback generic_end_io_acct() } else { //will not call generic_end_io_acct() } This patch calls generic_end_io_acct() in the end of IO for detached devices, so we can show IO state correctly. (Modified to use GFP_NOIO in kzalloc() by Coly Li) Changelog: v2: fix typo. v1: the initial version. Signed-off-by: Tang Junhui <tang.junhui@zte.com.cn> Reviewed-by: Coly Li <colyli@suse.de> Reviewed-by: Hannes Reinecke <hare@suse.com> Reviewed-by: Michael Lyle <mlyle@lyle.org> Signed-off-by: Jens Axboe <axboe@kernel.dk> |
|
Coly Li | 7e027ca4b5 |
bcache: add stop_when_cache_set_failed option to backing device
When there are too many I/O errors on cache device, current bcache code will retire the whole cache set, and detach all bcache devices. But the detached bcache devices are not stopped, which is problematic when bcache is in writeback mode. If the retired cache set has dirty data of backing devices, continue writing to bcache device will write to backing device directly. If the LBA of write request has a dirty version cached on cache device, next time when the cache device is re-registered and backing device re-attached to it again, the stale dirty data on cache device will be written to backing device, and overwrite latest directly written data. This situation causes a quite data corruption. But we cannot simply stop all attached bcache devices when the cache set is broken or disconnected. For example, use bcache to accelerate performance of an email service. In such workload, if cache device is broken but no dirty data lost, keep the bcache device alive and permit email service continue to access user data might be a better solution for the cache device failure. Nix <nix@esperi.org.uk> points out the issue and provides the above example to explain why it might be necessary to not stop bcache device for broken cache device. Pavel Goran <via-bcache@pvgoran.name> provides a brilliant suggestion to provide "always" and "auto" options to per-cached device sysfs file stop_when_cache_set_failed. If cache set is retiring and the backing device has no dirty data on cache, it should be safe to keep the bcache device alive. In this case, if stop_when_cache_set_failed is set to "auto", the device failure handling code will not stop this bcache device and permit application to access the backing device with a unattached bcache device. Changelog: [mlyle: edited to not break string constants across lines] v3: fix typos pointed out by Nix. v2: change option values of stop_when_cache_set_failed from 1/0 to "auto"/"always". v1: initial version, stop_when_cache_set_failed can be 0 (not stop) or 1 (always stop). Signed-off-by: Coly Li <colyli@suse.de> Reviewed-by: Michael Lyle <mlyle@lyle.org> Signed-off-by: Michael Lyle <mlyle@lyle.org> Cc: Nix <nix@esperi.org.uk> Cc: Pavel Goran <via-bcache@pvgoran.name> Cc: Junhui Tang <tang.junhui@zte.com.cn> Cc: Hannes Reinecke <hare@suse.com> Signed-off-by: Jens Axboe <axboe@kernel.dk> |
|
Coly Li | 771f393e8f |
bcache: add CACHE_SET_IO_DISABLE to struct cache_set flags
When too many I/Os failed on cache device, bch_cache_set_error() is called in the error handling code path to retire whole problematic cache set. If new I/O requests continue to come and take refcount dc->count, the cache set won't be retired immediately, this is a problem. Further more, there are several kernel thread and self-armed kernel work may still running after bch_cache_set_error() is called. It needs to wait quite a while for them to stop, or they won't stop at all. They also prevent the cache set from being retired. The solution in this patch is, to add per cache set flag to disable I/O request on this cache and all attached backing devices. Then new coming I/O requests can be rejected in *_make_request() before taking refcount, kernel threads and self-armed kernel worker can stop very fast when flags bit CACHE_SET_IO_DISABLE is set. Because bcache also do internal I/Os for writeback, garbage collection, bucket allocation, journaling, this kind of I/O should be disabled after bch_cache_set_error() is called. So closure_bio_submit() is modified to check whether CACHE_SET_IO_DISABLE is set on cache_set->flags. If set, closure_bio_submit() will set bio->bi_status to BLK_STS_IOERR and return, generic_make_request() won't be called. A sysfs interface is also added to set or clear CACHE_SET_IO_DISABLE bit from cache_set->flags, to disable or enable cache set I/O for debugging. It is helpful to trigger more corner case issues for failed cache device. Changelog v4, add wait_for_kthread_stop(), and call it before exits writeback and gc kernel threads. v3, change CACHE_SET_IO_DISABLE from 4 to 3, since it is bit index. remove "bcache: " prefix when printing out kernel message. v2, more changes by previous review, - Use CACHE_SET_IO_DISABLE of cache_set->flags, suggested by Junhui. - Check CACHE_SET_IO_DISABLE in bch_btree_gc() to stop a while-loop, this is reported and inspired from origal patch of Pavel Vazharov. v1, initial version. Signed-off-by: Coly Li <colyli@suse.de> Reviewed-by: Hannes Reinecke <hare@suse.com> Reviewed-by: Michael Lyle <mlyle@lyle.org> Cc: Junhui Tang <tang.junhui@zte.com.cn> Cc: Michael Lyle <mlyle@lyle.org> Cc: Pavel Vazharov <freakpv@gmail.com> Signed-off-by: Jens Axboe <axboe@kernel.dk> |
|
Coly Li | 3fd47bfe55 |
bcache: stop dc->writeback_rate_update properly
struct delayed_work writeback_rate_update in struct cache_dev is a delayed worker to call function update_writeback_rate() in period (the interval is defined by dc->writeback_rate_update_seconds). When a metadate I/O error happens on cache device, bcache error handling routine bch_cache_set_error() will call bch_cache_set_unregister() to retire whole cache set. On the unregister code path, this delayed work is stopped by calling cancel_delayed_work_sync(&dc->writeback_rate_update). dc->writeback_rate_update is a special delayed work from others in bcache. In its routine update_writeback_rate(), this delayed work is re-armed itself. That means when cancel_delayed_work_sync() returns, this delayed work can still be executed after several seconds defined by dc->writeback_rate_update_seconds. The problem is, after cancel_delayed_work_sync() returns, the cache set unregister code path will continue and release memory of struct cache set. Then the delayed work is scheduled to run, __update_writeback_rate() will reference the already released cache_set memory, and trigger a NULL pointer deference fault. This patch introduces two more bcache device flags, - BCACHE_DEV_WB_RUNNING bit set: bcache device is in writeback mode and running, it is OK for dc->writeback_rate_update to re-arm itself. bit clear:bcache device is trying to stop dc->writeback_rate_update, this delayed work should not re-arm itself and quit. - BCACHE_DEV_RATE_DW_RUNNING bit set: routine update_writeback_rate() is executing. bit clear: routine update_writeback_rate() quits. This patch also adds a function cancel_writeback_rate_update_dwork() to wait for dc->writeback_rate_update quits before cancel it by calling cancel_delayed_work_sync(). In order to avoid a deadlock by unexpected quit dc->writeback_rate_update, after time_out seconds this function will give up and continue to call cancel_delayed_work_sync(). And here I explain how this patch stops self re-armed delayed work properly with the above stuffs. update_writeback_rate() sets BCACHE_DEV_RATE_DW_RUNNING at its beginning and clears BCACHE_DEV_RATE_DW_RUNNING at its end. Before calling cancel_writeback_rate_update_dwork() clear flag BCACHE_DEV_WB_RUNNING. Before calling cancel_delayed_work_sync() wait utill flag BCACHE_DEV_RATE_DW_RUNNING is clear. So when calling cancel_delayed_work_sync(), dc->writeback_rate_update must be already re- armed, or quite by seeing BCACHE_DEV_WB_RUNNING cleared. In both cases delayed work routine update_writeback_rate() won't be executed after cancel_delayed_work_sync() returns. Inside update_writeback_rate() before calling schedule_delayed_work(), flag BCACHE_DEV_WB_RUNNING is checked before. If this flag is cleared, it means someone is about to stop the delayed work. Because flag BCACHE_DEV_RATE_DW_RUNNING is set already and cancel_delayed_work_sync() has to wait for this flag to be cleared, we don't need to worry about race condition here. If update_writeback_rate() is scheduled to run after checking BCACHE_DEV_RATE_DW_RUNNING and before calling cancel_delayed_work_sync() in cancel_writeback_rate_update_dwork(), it is also safe. Because at this moment BCACHE_DEV_WB_RUNNING is cleared with memory barrier. As I mentioned previously, update_writeback_rate() will see BCACHE_DEV_WB_RUNNING is clear and quit immediately. Because there are more dependences inside update_writeback_rate() to struct cache_set memory, dc->writeback_rate_update is not a simple self re-arm delayed work. After trying many different methods (e.g. hold dc->count, or use locks), this is the only way I can find which works to properly stop dc->writeback_rate_update delayed work. Changelog: v3: change values of BCACHE_DEV_WB_RUNNING and BCACHE_DEV_RATE_DW_RUNNING to bit index, for test_bit(). v2: Try to fix the race issue which is pointed out by Junhui. v1: The initial version for review Signed-off-by: Coly Li <colyli@suse.de> Reviewed-by: Junhui Tang <tang.junhui@zte.com.cn> Reviewed-by: Michael Lyle <mlyle@lyle.org> Cc: Michael Lyle <mlyle@lyle.org> Cc: Hannes Reinecke <hare@suse.com> Signed-off-by: Jens Axboe <axboe@kernel.dk> |
|
Coly Li | fadd94e05c |
bcache: quit dc->writeback_thread when BCACHE_DEV_DETACHING is set
In patch "bcache: fix cached_dev->count usage for bch_cache_set_error()", cached_dev_get() is called when creating dc->writeback_thread, and cached_dev_put() is called when exiting dc->writeback_thread. This modification works well unless people detach the bcache device manually by 'echo 1 > /sys/block/bcache<N>/bcache/detach' Because this sysfs interface only calls bch_cached_dev_detach() which wakes up dc->writeback_thread but does not stop it. The reason is, before patch "bcache: fix cached_dev->count usage for bch_cache_set_error()", inside bch_writeback_thread(), if cache is not dirty after writeback, cached_dev_put() will be called here. And in cached_dev_make_request() when a new write request makes cache from clean to dirty, cached_dev_get() will be called there. Since we don't operate dc->count in these locations, refcount d->count cannot be dropped after cache becomes clean, and cached_dev_detach_finish() won't be called to detach bcache device. This patch fixes the issue by checking whether BCACHE_DEV_DETACHING is set inside bch_writeback_thread(). If this bit is set and cache is clean (no existing writeback_keys), break the while-loop, call cached_dev_put() and quit the writeback thread. Please note if cache is still dirty, even BCACHE_DEV_DETACHING is set the writeback thread should continue to perform writeback, this is the original design of manually detach. It is safe to do the following check without locking, let me explain why, + if (!test_bit(BCACHE_DEV_DETACHING, &dc->disk.flags) && + (!atomic_read(&dc->has_dirty) || !dc->writeback_running)) { If the kenrel thread does not sleep and continue to run due to conditions are not updated in time on the running CPU core, it just consumes more CPU cycles and has no hurt. This should-sleep-but-run is safe here. We just focus on the should-run-but-sleep condition, which means the writeback thread goes to sleep in mistake while it should continue to run. 1, First of all, no matter the writeback thread is hung or not, kthread_stop() from cached_dev_detach_finish() will wake up it and terminate by making kthread_should_stop() return true. And in normal run time, bit on index BCACHE_DEV_DETACHING is always cleared, the condition !test_bit(BCACHE_DEV_DETACHING, &dc->disk.flags) is always true and can be ignored as constant value. 2, If one of the following conditions is true, the writeback thread should go to sleep, "!atomic_read(&dc->has_dirty)" or "!dc->writeback_running)" each of them independently controls the writeback thread should sleep or not, let's analyse them one by one. 2.1 condition "!atomic_read(&dc->has_dirty)" If dc->has_dirty is set from 0 to 1 on another CPU core, bcache will call bch_writeback_queue() immediately or call bch_writeback_add() which indirectly calls bch_writeback_queue() too. In bch_writeback_queue(), wake_up_process(dc->writeback_thread) is called. It sets writeback thread's task state to TASK_RUNNING and following an implicit memory barrier, then tries to wake up the writeback thread. In writeback thread, its task state is set to TASK_INTERRUPTIBLE before doing the condition check. If other CPU core sets the TASK_RUNNING state after writeback thread setting TASK_INTERRUPTIBLE, the writeback thread will be scheduled to run very soon because its state is not TASK_INTERRUPTIBLE. If other CPU core sets the TASK_RUNNING state before writeback thread setting TASK_INTERRUPTIBLE, the implict memory barrier of wake_up_process() will make sure modification of dc->has_dirty on other CPU core is updated and observed on the CPU core of writeback thread. Therefore the condition check will correctly be false, and continue writeback code without sleeping. 2.2 condition "!dc->writeback_running)" dc->writeback_running can be changed via sysfs file, every time it is modified, a following bch_writeback_queue() is alwasy called. So the change is always observed on the CPU core of writeback thread. If dc->writeback_running is changed from 0 to 1 on other CPU core, this condition check will observe the modification and allow writeback thread to continue to run without sleeping. Now we can see, even without a locking protection, multiple conditions check is safe here, no deadlock or process hang up will happen. I compose a separte patch because that patch "bcache: fix cached_dev->count usage for bch_cache_set_error()" already gets a "Reviewed-by:" from Hannes Reinecke. Also this fix is not trivial and good for a separate patch. Signed-off-by: Coly Li <colyli@suse.de> Reviewed-by: Michael Lyle <mlyle@lyle.org> Cc: Hannes Reinecke <hare@suse.com> Cc: Huijun Tang <tang.junhui@zte.com.cn> Signed-off-by: Jens Axboe <axboe@kernel.dk> |
|
Coly Li | 804f3c6981 |
bcache: fix cached_dev->count usage for bch_cache_set_error()
When bcache metadata I/O fails, bcache will call bch_cache_set_error() to retire the whole cache set. The expected behavior to retire a cache set is to unregister the cache set, and unregister all backing device attached to this cache set, then remove sysfs entries of the cache set and all attached backing devices, finally release memory of structs cache_set, cache, cached_dev and bcache_device. In my testing when journal I/O failure triggered by disconnected cache device, sometimes the cache set cannot be retired, and its sysfs entry /sys/fs/bcache/<uuid> still exits and the backing device also references it. This is not expected behavior. When metadata I/O failes, the call senquence to retire whole cache set is, bch_cache_set_error() bch_cache_set_unregister() bch_cache_set_stop() __cache_set_unregister() <- called as callback by calling clousre_queue(&c->caching) cache_set_flush() <- called as a callback when refcount of cache_set->caching is 0 cache_set_free() <- called as a callback when refcount of catch_set->cl is 0 bch_cache_set_release() <- called as a callback when refcount of catch_set->kobj is 0 I find if kernel thread bch_writeback_thread() quits while-loop when kthread_should_stop() is true and searched_full_index is false, clousre callback cache_set_flush() set by continue_at() will never be called. The result is, bcache fails to retire whole cache set. cache_set_flush() will be called when refcount of closure c->caching is 0, and in function bcache_device_detach() refcount of closure c->caching is released to 0 by clousre_put(). In metadata error code path, function bcache_device_detach() is called by cached_dev_detach_finish(). This is a callback routine being called when cached_dev->count is 0. This refcount is decreased by cached_dev_put(). The above dependence indicates, cache_set_flush() will be called when refcount of cache_set->cl is 0, and refcount of cache_set->cl to be 0 when refcount of cache_dev->count is 0. The reason why sometimes cache_dev->count is not 0 (when metadata I/O fails and bch_cache_set_error() called) is, in bch_writeback_thread(), refcount of cache_dev is not decreased properly. In bch_writeback_thread(), cached_dev_put() is called only when searched_full_index is true and cached_dev->writeback_keys is empty, a.k.a there is no dirty data on cache. In most of run time it is correct, but when bch_writeback_thread() quits the while-loop while cache is still dirty, current code forget to call cached_dev_put() before this kernel thread exits. This is why sometimes cache_set_flush() is not executed and cache set fails to be retired. The reason to call cached_dev_put() in bch_writeback_rate() is, when the cache device changes from clean to dirty, cached_dev_get() is called, to make sure during writeback operatiions both backing and cache devices won't be released. Adding following code in bch_writeback_thread() does not work, static int bch_writeback_thread(void *arg) } + if (atomic_read(&dc->has_dirty)) + cached_dev_put() + return 0; } because writeback kernel thread can be waken up and start via sysfs entry: echo 1 > /sys/block/bcache<N>/bcache/writeback_running It is difficult to check whether backing device is dirty without race and extra lock. So the above modification will introduce potential refcount underflow in some conditions. The correct fix is, to take cached dev refcount when creating the kernel thread, and put it before the kernel thread exits. Then bcache does not need to take a cached dev refcount when cache turns from clean to dirty, or to put a cached dev refcount when cache turns from ditry to clean. The writeback kernel thread is alwasy safe to reference data structure from cache set, cache and cached device (because a refcount of cache device is taken for it already), and no matter the kernel thread is stopped by I/O errors or system reboot, cached_dev->count can always be used correctly. The patch is simple, but understanding how it works is quite complicated. Changelog: v2: set dc->writeback_thread to NULL in this patch, as suggested by Hannes. v1: initial version for review. Signed-off-by: Coly Li <colyli@suse.de> Reviewed-by: Hannes Reinecke <hare@suse.com> Reviewed-by: Michael Lyle <mlyle@lyle.org> Cc: Michael Lyle <mlyle@lyle.org> Cc: Junhui Tang <tang.junhui@zte.com.cn> Signed-off-by: Jens Axboe <axboe@kernel.dk> |
|
Bart Van Assche | 8b904b5b6b |
block: Use blk_queue_flag_*() in drivers instead of queue_flag_*()
This patch has been generated as follows: for verb in set_unlocked clear_unlocked set clear; do replace-in-files queue_flag_${verb} blk_queue_flag_${verb%_unlocked} \ $(git grep -lw queue_flag_${verb} drivers block/bsg*) done Except for protecting all queue flag changes with the queue lock this patch does not change any functionality. Cc: Mike Snitzer <snitzer@redhat.com> Cc: Shaohua Li <shli@fb.com> Cc: Christoph Hellwig <hch@lst.de> Cc: Hannes Reinecke <hare@suse.de> Cc: Ming Lei <ming.lei@redhat.com> Signed-off-by: Bart Van Assche <bart.vanassche@wdc.com> Reviewed-by: Martin K. Petersen <martin.petersen@oracle.com> Reviewed-by: Johannes Thumshirn <jthumshirn@suse.de> Acked-by: Martin K. Petersen <martin.petersen@oracle.com> Signed-off-by: Jens Axboe <axboe@kernel.dk> |
|
Bart Van Assche | 44e1ebe2a3 |
bcache: Use the blk_queue_flag_{set,clear}() functions
Use the blk_queue_flag_{set,clear}() functions instead of open-coding these. Cc: Kent Overstreet <kent.overstreet@gmail.com> Cc: Christoph Hellwig <hch@lst.de> Cc: Hannes Reinecke <hare@suse.de> Cc: Ming Lei <ming.lei@redhat.com> Reviewed-by: Michael Lyle <mlyle@lyle.org> Reviewed-by: Johannes Thumshirn <jthumshirn@suse.de> Reviewed-by: Martin K. Petersen <martin.petersen@oracle.com> Signed-off-by: Bart Van Assche <bart.vanassche@wdc.com> Signed-off-by: Jens Axboe <axboe@kernel.dk> |
|
Bart Van Assche | 5ee0524ba1 |
block: Add 'lock' as third argument to blk_alloc_queue_node()
This patch does not change any functionality. Signed-off-by: Bart Van Assche <bart.vanassche@wdc.com> Reviewed-by: Joseph Qi <joseph.qi@linux.alibaba.com> Cc: Christoph Hellwig <hch@lst.de> Cc: Philipp Reisner <philipp.reisner@linbit.com> Cc: Ulf Hansson <ulf.hansson@linaro.org> Cc: Kees Cook <keescook@chromium.org> Signed-off-by: Jens Axboe <axboe@kernel.dk> |
|
Bart Van Assche | d8115c35bf |
md: Delete gendisk before cleaning up the request queue
Remove the disk, partition and bdi sysfs attributes before cleaning up the request queue associated with the disk. Signed-off-by: Bart Van Assche <bart.vanassche@wdc.com> Reviewed-by: Johannes Thumshirn <jthumshirn@suse.de> Reviewed-by: Joseph Qi <joseph.qi@linux.alibaba.com> Reviewed-by: Ming Lei <ming.lei@redhat.com> Cc: Shaohua Li <shli@kernel.org> Signed-off-by: Jens Axboe <axboe@kernel.dk> |
|
Tang Junhui | 60eb34ec55 |
bcache: fix kcrashes with fio in RAID5 backend dev
Kernel crashed when run fio in a RAID5 backend bcache device, the call trace is bellow: [ 440.012034] kernel BUG at block/blk-ioc.c:146! [ 440.012696] invalid opcode: 0000 [#1] SMP NOPTI [ 440.026537] CPU: 2 PID: 2205 Comm: md127_raid5 Not tainted 4.15.0 #8 [ 440.027441] Hardware name: HP ProLiant MicroServer Gen8, BIOS J06 07/16 /2015 [ 440.028615] RIP: 0010:put_io_context+0x8b/0x90 [ 440.029246] RSP: 0018:ffffa8c882b43af8 EFLAGS: 00010246 [ 440.029990] RAX: 0000000000000000 RBX: ffffa8c88294fca0 RCX: 0000000000 0f4240 [ 440.031006] RDX: 0000000000000004 RSI: 0000000000000286 RDI: ffffa8c882 94fca0 [ 440.032030] RBP: ffffa8c882b43b10 R08: 0000000000000003 R09: ffff949cb8 0c1700 [ 440.033206] R10: 0000000000000104 R11: 000000000000b71c R12: 00000000000 01000 [ 440.034222] R13: 0000000000000000 R14: ffff949cad84db70 R15: ffff949cb11 bd1e0 [ 440.035239] FS: 0000000000000000(0000) GS:ffff949cba280000(0000) knlGS: 0000000000000000 [ 440.060190] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 [ 440.084967] CR2: 00007ff0493ef000 CR3: 00000002f1e0a002 CR4: 00000000001 606e0 [ 440.110498] Call Trace: [ 440.135443] bio_disassociate_task+0x1b/0x60 [ 440.160355] bio_free+0x1b/0x60 [ 440.184666] bio_put+0x23/0x30 [ 440.208272] search_free+0x23/0x40 [bcache] [ 440.231448] cached_dev_write_complete+0x31/0x70 [bcache] [ 440.254468] closure_put+0xb6/0xd0 [bcache] [ 440.277087] request_endio+0x30/0x40 [bcache] [ 440.298703] bio_endio+0xa1/0x120 [ 440.319644] handle_stripe+0x418/0x2270 [raid456] [ 440.340614] ? load_balance+0x17b/0x9c0 [ 440.360506] handle_active_stripes.isra.58+0x387/0x5a0 [raid456] [ 440.380675] ? __release_stripe+0x15/0x20 [raid456] [ 440.400132] raid5d+0x3ed/0x5d0 [raid456] [ 440.419193] ? schedule+0x36/0x80 [ 440.437932] ? schedule_timeout+0x1d2/0x2f0 [ 440.456136] md_thread+0x122/0x150 [ 440.473687] ? wait_woken+0x80/0x80 [ 440.491411] kthread+0x102/0x140 [ 440.508636] ? find_pers+0x70/0x70 [ 440.524927] ? kthread_associate_blkcg+0xa0/0xa0 [ 440.541791] ret_from_fork+0x35/0x40 [ 440.558020] Code: c2 48 00 5b 41 5c 41 5d 5d c3 48 89 c6 4c 89 e7 e8 bb c2 48 00 48 8b 3d bc 36 4b 01 48 89 de e8 7c f7 e0 ff 5b 41 5c 41 5d 5d c3 <0f> 0b 0f 1f 00 0f 1f 44 00 00 55 48 8d 47 b8 48 89 e5 41 57 41 [ 440.610020] RIP: put_io_context+0x8b/0x90 RSP: ffffa8c882b43af8 [ 440.628575] ---[ end trace a1fd79d85643a73e ]-- All the crash issue happened when a bypass IO coming, in such scenario s->iop.bio is pointed to the s->orig_bio. In search_free(), it finishes the s->orig_bio by calling bio_complete(), and after that, s->iop.bio became invalid, then kernel would crash when calling bio_put(). Maybe its upper layer's faulty, since bio should not be freed before we calling bio_put(), but we'd better calling bio_put() first before calling bio_complete() to notify upper layer ending this bio. This patch moves bio_complete() under bio_put() to avoid kernel crash. [mlyle: fixed commit subject for character limits] Reported-by: Matthias Ferdinand <bcache@mfedv.net> Tested-by: Matthias Ferdinand <bcache@mfedv.net> Signed-off-by: Tang Junhui <tang.junhui@zte.com.cn> Reviewed-by: Michael Lyle <mlyle@lyle.org> Signed-off-by: Jens Axboe <axboe@kernel.dk> |
|
Coly Li | 02aa8a8b2b |
bcache: correct flash only vols (check all uuids)
Commit |
|
NeilBrown | 8dd601fa83 |
dm: correctly handle chained bios in dec_pending()
dec_pending() is given an error status (possibly 0) to be recorded
against a bio. It can be called several times on the one 'struct
dm_io', and it is careful to only assign a non-zero error to
io->status. However when it then assigned io->status to bio->bi_status,
it is not careful and could overwrite a genuine error status with 0.
This can happen when chained bios are in use. If a bio is chained
beneath the bio that this dm_io is handling, the child bio might
complete and set bio->bi_status before the dm_io completes.
This has been possible since chained bios were introduced in 3.14, and
has become a lot easier to trigger with commit
|
|
Linus Torvalds | a9a08845e9 |
vfs: do bulk POLL* -> EPOLL* replacement
This is the mindless scripted replacement of kernel use of POLL* variables as described by Al, done by this script: for V in IN OUT PRI ERR RDNORM RDBAND WRNORM WRBAND HUP RDHUP NVAL MSG; do L=`git grep -l -w POLL$V | grep -v '^t' | grep -v /um/ | grep -v '^sa' | grep -v '/poll.h$'|grep -v '^D'` for f in $L; do sed -i "-es/^\([^\"]*\)\(\<POLL$V\>\)/\\1E\\2/" $f; done done with de-mangling cleanups yet to come. NOTE! On almost all architectures, the EPOLL* constants have the same values as the POLL* constants do. But they keyword here is "almost". For various bad reasons they aren't the same, and epoll() doesn't actually work quite correctly in some cases due to this on Sparc et al. The next patch from Al will sort out the final differences, and we should be all done. Scripted-by: Al Viro <viro@zeniv.linux.org.uk> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> |
|
Tang Junhui | 73ac105be3 |
bcache: fix for data collapse after re-attaching an attached device
back-end device sdm has already attached a cache_set with ID f67ebe1f-f8bc-4d73-bfe5-9dc88607f119, then try to attach with another cache set, and it returns with an error: [root]# cd /sys/block/sdm/bcache [root]# echo 5ccd0a63-148e-48b8-afa2-aca9cbd6279f > attach -bash: echo: write error: Invalid argument After that, execute a command to modify the label of bcache device: [root]# echo data_disk1 > label Then we reboot the system, when the system power on, the back-end device can not attach to cache_set, a messages show in the log: Feb 5 12:05:52 ceph152 kernel: [922385.508498] bcache: bch_cached_dev_attach() couldn't find uuid for sdm in set In sysfs_attach(), dc->sb.set_uuid was assigned to the value which input through sysfs, no matter whether it is success or not in bch_cached_dev_attach(). For example, If the back-end device has already attached to an cache set, bch_cached_dev_attach() would fail, but dc->sb.set_uuid was changed. Then modify the label of bcache device, it will call bch_write_bdev_super(), which would write the dc->sb.set_uuid to the super block, so we record a wrong cache set ID in the super block, after the system reboot, the cache set couldn't find the uuid of the back-end device, so the bcache device couldn't exist and use any more. In this patch, we don't assigned cache set ID to dc->sb.set_uuid in sysfs_attach() directly, but input it into bch_cached_dev_attach(), and assigned dc->sb.set_uuid to the cache set ID after the back-end device attached to the cache set successful. Signed-off-by: Tang Junhui <tang.junhui@zte.com.cn> Reviewed-by: Michael Lyle <mlyle@lyle.org> Signed-off-by: Jens Axboe <axboe@kernel.dk> |
|
Tang Junhui | 7f4fc93d47 |
bcache: return attach error when no cache set exist
I attach a back-end device to a cache set, and the cache set is not registered yet, this back-end device did not attach successfully, and no error returned: [root]# echo 87859280-fec6-4bcc-20df7ca8f86b > /sys/block/sde/bcache/attach [root]# In sysfs_attach(), the return value "v" is initialized to "size" in the beginning, and if no cache set exist in bch_cache_sets, the "v" value would not change any more, and return to sysfs, sysfs regard it as success since the "size" is a positive number. This patch fixes this issue by assigning "v" with "-ENOENT" in the initialization. Signed-off-by: Tang Junhui <tang.junhui@zte.com.cn> Reviewed-by: Michael Lyle <mlyle@lyle.org> Signed-off-by: Jens Axboe <axboe@kernel.dk> |
|
Coly Li | 7a5e3ecbe5 |
bcache: set writeback_rate_update_seconds in range [1, 60] seconds
dc->writeback_rate_update_seconds can be set via sysfs and its value can be set to [1, ULONG_MAX]. It does not make sense to set such a large value, 60 seconds is long enough value considering the default 5 seconds works well for long time. Because dc->writeback_rate_update is a special delayed work, it re-arms itself inside the delayed work routine update_writeback_rate(). When stopping it by cancel_delayed_work_sync(), there should be a timeout to wait and make sure the re-armed delayed work is stopped too. A small max value of dc->writeback_rate_update_seconds is also helpful to decide a reasonable small timeout. This patch limits sysfs interface to set dc->writeback_rate_update_seconds in range of [1, 60] seconds, and replaces the hand-coded number by macros. Changelog: v2: fix a rebase typo in v4, which is pointed out by Michael Lyle. v1: initial version. Signed-off-by: Coly Li <colyli@suse.de> Reviewed-by: Hannes Reinecke <hare@suse.com> Reviewed-by: Michael Lyle <mlyle@lyle.org> Signed-off-by: Jens Axboe <axboe@kernel.dk> |
|
Tang Junhui | 682811b3ce |
bcache: fix for allocator and register thread race
After long time running of random small IO writing, I reboot the machine, and after the machine power on, I found bcache got stuck, the stack is: [root@ceph153 ~]# cat /proc/2510/task/*/stack [<ffffffffa06b2455>] closure_sync+0x25/0x90 [bcache] [<ffffffffa06b6be8>] bch_journal+0x118/0x2b0 [bcache] [<ffffffffa06b6dc7>] bch_journal_meta+0x47/0x70 [bcache] [<ffffffffa06be8f7>] bch_prio_write+0x237/0x340 [bcache] [<ffffffffa06a8018>] bch_allocator_thread+0x3c8/0x3d0 [bcache] [<ffffffff810a631f>] kthread+0xcf/0xe0 [<ffffffff8164c318>] ret_from_fork+0x58/0x90 [<ffffffffffffffff>] 0xffffffffffffffff [root@ceph153 ~]# cat /proc/2038/task/*/stack [<ffffffffa06b1abd>] __bch_btree_map_nodes+0x12d/0x150 [bcache] [<ffffffffa06b1bd1>] bch_btree_insert+0xf1/0x170 [bcache] [<ffffffffa06b637f>] bch_journal_replay+0x13f/0x230 [bcache] [<ffffffffa06c75fe>] run_cache_set+0x79a/0x7c2 [bcache] [<ffffffffa06c0cf8>] register_bcache+0xd48/0x1310 [bcache] [<ffffffff812f702f>] kobj_attr_store+0xf/0x20 [<ffffffff8125b216>] sysfs_write_file+0xc6/0x140 [<ffffffff811dfbfd>] vfs_write+0xbd/0x1e0 [<ffffffff811e069f>] SyS_write+0x7f/0xe0 [<ffffffff8164c3c9>] system_call_fastpath+0x16/0x1 The stack shows the register thread and allocator thread were getting stuck when registering cache device. I reboot the machine several times, the issue always exsit in this machine. I debug the code, and found the call trace as bellow: register_bcache() ==>run_cache_set() ==>bch_journal_replay() ==>bch_btree_insert() ==>__bch_btree_map_nodes() ==>btree_insert_fn() ==>btree_split() //node need split ==>btree_check_reserve() In btree_check_reserve(), It will check if there is enough buckets of RESERVE_BTREE type, since allocator thread did not work yet, so no buckets of RESERVE_BTREE type allocated, so the register thread waits on c->btree_cache_wait, and goes to sleep. Then the allocator thread initialized, the call trace is bellow: bch_allocator_thread() ==>bch_prio_write() ==>bch_journal_meta() ==>bch_journal() ==>journal_wait_for_write() In journal_wait_for_write(), It will check if journal is full by journal_full(), but the long time random small IO writing causes the exhaustion of journal buckets(journal.blocks_free=0), In order to release the journal buckets, the allocator calls btree_flush_write() to flush keys to btree nodes, and waits on c->journal.wait until btree nodes writing over or there has already some journal buckets space, then the allocator thread goes to sleep. but in btree_flush_write(), since bch_journal_replay() is not finished, so no btree nodes have journal (condition "if (btree_current_write(b)->journal)" never satisfied), so we got no btree node to flush, no journal bucket released, and allocator sleep all the times. Through the above analysis, we can see that: 1) Register thread wait for allocator thread to allocate buckets of RESERVE_BTREE type; 2) Alloctor thread wait for register thread to replay journal, so it can flush btree nodes and get journal bucket. then they are all got stuck by waiting for each other. Hua Rui provided a patch for me, by allocating some buckets of RESERVE_BTREE type in advance, so the register thread can get bucket when btree node splitting and no need to waiting for the allocator thread. I tested it, it has effect, and register thread run a step forward, but finally are still got stuck, the reason is only 8 bucket of RESERVE_BTREE type were allocated, and in bch_journal_replay(), after 2 btree nodes splitting, only 4 bucket of RESERVE_BTREE type left, then btree_check_reserve() is not satisfied anymore, so it goes to sleep again, and in the same time, alloctor thread did not flush enough btree nodes to release a journal bucket, so they all got stuck again. So we need to allocate more buckets of RESERVE_BTREE type in advance, but how much is enough? By experience and test, I think it should be as much as journal buckets. Then I modify the code as this patch, and test in the machine, and it works. This patch modified base on Hua Rui’s patch, and allocate more buckets of RESERVE_BTREE type in advance to avoid register thread and allocate thread going to wait for each other. [patch v2] ca->sb.njournal_buckets would be 0 in the first time after cache creation, and no journal exists, so just 8 btree buckets is OK. Signed-off-by: Hua Rui <huarui.dev@gmail.com> Signed-off-by: Tang Junhui <tang.junhui@zte.com.cn> Reviewed-by: Michael Lyle <mlyle@lyle.org> Signed-off-by: Jens Axboe <axboe@kernel.dk> |
|
Coly Li | 7ba0d830dc |
bcache: set error_limit correctly
Struct cache uses io_errors for two purposes, - Error decay: when cache set error_decay is set, io_errors is used to generate a small piece of delay when I/O error happens. - I/O errors counter: in order to generate big enough value for error decay, I/O errors counter value is stored by left shifting 20 bits (a.k.a IO_ERROR_SHIFT). In function bch_count_io_errors(), if I/O errors counter reaches cache set error limit, bch_cache_set_error() will be called to retire the whold cache set. But current code is problematic when checking the error limit, see the following code piece from bch_count_io_errors(), 90 if (error) { 91 char buf[BDEVNAME_SIZE]; 92 unsigned errors = atomic_add_return(1 << IO_ERROR_SHIFT, 93 &ca->io_errors); 94 errors >>= IO_ERROR_SHIFT; 95 96 if (errors < ca->set->error_limit) 97 pr_err("%s: IO error on %s, recovering", 98 bdevname(ca->bdev, buf), m); 99 else 100 bch_cache_set_error(ca->set, 101 "%s: too many IO errors %s", 102 bdevname(ca->bdev, buf), m); 103 } At line 94, errors is right shifting IO_ERROR_SHIFT bits, now it is real errors counter to compare at line 96. But ca->set->error_limit is initia- lized with an amplified value in bch_cache_set_alloc(), 1545 c->error_limit = 8 << IO_ERROR_SHIFT; It means by default, in bch_count_io_errors(), before 8<<20 errors happened bch_cache_set_error() won't be called to retire the problematic cache device. If the average request size is 64KB, it means bcache won't handle failed device until 512GB data is requested. This is too large to be an I/O threashold. So I believe the correct error limit should be much less. This patch sets default cache set error limit to 8, then in bch_count_io_errors() when errors counter reaches 8 (if it is default value), function bch_cache_set_error() will be called to retire the whole cache set. This patch also removes bits shifting when store or show io_error_limit value via sysfs interface. Nowadays most of SSDs handle internal flash failure automatically by LBA address re-indirect mapping. If an I/O error can be observed by upper layer code, it will be a notable error because that SSD can not re-indirect map the problematic LBA address to an available flash block. This situation indicates the whole SSD will be failed very soon. Therefore setting 8 as the default io error limit value makes sense, it is enough for most of cache devices. Changelog: v2: add reviewed-by from Hannes. v1: initial version for review. Signed-off-by: Coly Li <colyli@suse.de> Reviewed-by: Hannes Reinecke <hare@suse.com> Reviewed-by: Tang Junhui <tang.junhui@zte.com.cn> Reviewed-by: Michael Lyle <mlyle@lyle.org> Cc: Junhui Tang <tang.junhui@zte.com.cn> Signed-off-by: Jens Axboe <axboe@kernel.dk> |
|
Coly Li | 99361bbf26 |
bcache: properly set task state in bch_writeback_thread()
Kernel thread routine bch_writeback_thread() has the following code block, 447 down_write(&dc->writeback_lock); 448~450 if (check conditions) { 451 up_write(&dc->writeback_lock); 452 set_current_state(TASK_INTERRUPTIBLE); 453 454 if (kthread_should_stop()) 455 return 0; 456 457 schedule(); 458 continue; 459 } If condition check is true, its task state is set to TASK_INTERRUPTIBLE and call schedule() to wait for others to wake up it. There are 2 issues in current code, 1, Task state is set to TASK_INTERRUPTIBLE after the condition checks, if another process changes the condition and call wake_up_process(dc-> writeback_thread), then at line 452 task state is set back to TASK_INTERRUPTIBLE, the writeback kernel thread will lose a chance to be waken up. 2, At line 454 if kthread_should_stop() is true, writeback kernel thread will return to kernel/kthread.c:kthread() with TASK_INTERRUPTIBLE and call do_exit(). It is not good to enter do_exit() with task state TASK_INTERRUPTIBLE, in following code path might_sleep() is called and a warning message is reported by __might_sleep(): "WARNING: do not call blocking ops when !TASK_RUNNING; state=1 set at [xxxx]". For the first issue, task state should be set before condition checks. Ineed because dc->writeback_lock is required when modifying all the conditions, calling set_current_state() inside code block where dc-> writeback_lock is hold is safe. But this is quite implicit, so I still move set_current_state() before all the condition checks. For the second issue, frankley speaking it does not hurt when kernel thread exits with TASK_INTERRUPTIBLE state, but this warning message scares users, makes them feel there might be something risky with bcache and hurt their data. Setting task state to TASK_RUNNING before returning fixes this problem. In alloc.c:allocator_wait(), there is also a similar issue, and is also fixed in this patch. Changelog: v3: merge two similar fixes into one patch v2: fix the race issue in v1 patch. v1: initial buggy fix. Signed-off-by: Coly Li <colyli@suse.de> Reviewed-by: Hannes Reinecke <hare@suse.de> Reviewed-by: Michael Lyle <mlyle@lyle.org> Cc: Michael Lyle <mlyle@lyle.org> Cc: Junhui Tang <tang.junhui@zte.com.cn> Signed-off-by: Jens Axboe <axboe@kernel.dk> |
|
Tang Junhui | c4dc2497d5 |
bcache: fix high CPU occupancy during journal
After long time small writing I/O running, we found the occupancy of CPU
is very high and I/O performance has been reduced by about half:
[root@ceph151 internal]# top
top - 15:51:05 up 1 day,2:43, 4 users, load average: 16.89, 15.15, 16.53
Tasks: 2063 total, 4 running, 2059 sleeping, 0 stopped, 0 zombie
%Cpu(s):4.3 us, 17.1 sy 0.0 ni, 66.1 id, 12.0 wa, 0.0 hi, 0.5 si, 0.0 st
KiB Mem : 65450044 total, 24586420 free, 38909008 used, 1954616 buff/cache
KiB Swap: 65667068 total, 65667068 free, 0 used. 25136812 avail Mem
PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND
2023 root 20 0 0 0 0 S 55.1 0.0 0:04.42 kworker/11:191
14126 root 20 0 0 0 0 S 42.9 0.0 0:08.72 kworker/10:3
9292 root 20 0 0 0 0 S 30.4 0.0 1:10.99 kworker/6:1
8553 ceph 20 0 4242492 1.805g 18804 S 30.0 2.9 410:07.04 ceph-osd
12287 root 20 0 0 0 0 S 26.7 0.0 0:28.13 kworker/7:85
31019 root 20 0 0 0 0 S 26.1 0.0 1:30.79 kworker/22:1
1787 root 20 0 0 0 0 R 25.7 0.0 5:18.45 kworker/8:7
32169 root 20 0 0 0 0 S 14.5 0.0 1:01.92 kworker/23:1
21476 root 20 0 0 0 0 S 13.9 0.0 0:05.09 kworker/1:54
2204 root 20 0 0 0 0 S 12.5 0.0 1:25.17 kworker/9:10
16994 root 20 0 0 0 0 S 12.2 0.0 0:06.27 kworker/5:106
15714 root 20 0 0 0 0 R 10.9 0.0 0:01.85 kworker/19:2
9661 ceph 20 0
|
|
Tang Junhui | a728eacbbd |
bcache: add journal statistic
Sometimes, Journal takes up a lot of CPU, we need statistics to know what's the journal is doing. So this patch provide some journal statistics: 1) reclaim: how many times the journal try to reclaim resource, usually the journal bucket or/and the pin are exhausted. 2) flush_write: how many times the journal try to flush btree node to cache device, usually the journal bucket are exhausted. 3) retry_flush_write: how many times the journal retry to flush the next btree node, usually the previous tree node have been flushed by other thread. we show these statistic by sysfs interface. Through these statistics We can totally see the status of journal module when the CPU is too high. Signed-off-by: Tang Junhui <tang.junhui@zte.com.cn> Reviewed-by: Michael Lyle <mlyle@lyle.org> Signed-off-by: Jens Axboe <axboe@kernel.dk> |
|
Linus Torvalds | 64b28683de |
for-linus-20180204
-----BEGIN PGP SIGNATURE----- Version: GnuPG v1 iQIcBAABCAAGBQJadzbSAAoJEPfTWPspceCmt5QP/jo6MSsNVevAQOE75Jje+qa/ aF/BjHBdUmmI5WtPrtoz4igaJou7M2U0s8jdsc3c7uMw8dGTKc6ujIquSEn0wevY faJPTjWzLum3y50gwRHcrHCQIlxOe5/f9rJevW4+q76aMP3aWKjO4bgBExH+2XnA CaT+6d40skYt20Sy428H0yhVdDAMiQYXTeg4SssWQY9AvJSSiW7ax+vmP3r5BKpV dXHggwgzqDuMwLZG80Tfg4GHGv5qisIrqLOCxtXNYHDNb/aDmbTFTO2jPgobT8gW N2kWxsOkBayUdPw6Nt2Wlm4toQgR5GJGH04LH2vI5p4dp4Grvx/aFGvUbT7+sN1u g/mmqsUUnYuO5AJ8XY2s2F7ezaT6v9x8BbLHuA2vz0r5GsdFVXctZ/bXgQqkmh9i KLtfyOPldlczclVEuKL4xai1aXLcoBzDwyLxzbFp3+eAlhcgoSqxnMsE4fCJblCU dfShDChu1SbBD6dyGx8sol9cT48RFj2tBtpfcYxFW/NJJOQoh9FTqPQetYQxQ72c TadEf40hmw5Q2l0Hu5pwVbKHWUP0wn0VznkAOfT4VV1ysk93oExMbjgS2qh16xEZ oQwFDQMk3D8BXI9VwH8gUUnypkhcooMhznxSC3BQxjGn/R+byp7QEPvxSEZz/4nD BaBSbyAU5cpof+Eaqs4B =qeDb -----END PGP SIGNATURE----- Merge tag 'for-linus-20180204' of git://git.kernel.dk/linux-block Pull more block updates from Jens Axboe: "Most of this is fixes and not new code/features: - skd fix from Arnd, fixing a build error dependent on sla allocator type. - blk-mq scheduler discard merging fixes, one from me and one from Keith. This fixes a segment miscalculation for blk-mq-sched, where we mistakenly think two segments are physically contigious even though the request isn't carrying real data. Also fixes a bio-to-rq merge case. - Don't re-set a bit on the buffer_head flags, if it's already set. This can cause scalability concerns on bigger machines and workloads. From Kemi Wang. - Add BLK_STS_DEV_RESOURCE return value to blk-mq, allowing us to distuingish between a local (device related) resource starvation and a global one. The latter might happen without IO being in flight, so it has to be handled a bit differently. From Ming" * tag 'for-linus-20180204' of git://git.kernel.dk/linux-block: block: skd: fix incorrect linux/slab_def.h inclusion buffer: Avoid setting buffer bits that are already set blk-mq-sched: Enable merging discard bio into request blk-mq: fix discard merge with scheduler attached blk-mq: introduce BLK_STS_DEV_RESOURCE |
|
Linus Torvalds | 0be600a5ad |
- DM core fixes to ensure that bio submission follows a depth-first tree
walk; this is critical to allow forward progress without the need to use the bioset's BIOSET_NEED_RESCUER. - Remove DM core's BIOSET_NEED_RESCUER based dm_offload infrastructure. - DM core cleanups and improvements to make bio-based DM more efficient (e.g. reduced memory footprint as well leveraging per-bio-data more). - Introduce new bio-based mode (DM_TYPE_NVME_BIO_BASED) that leverages the more direct IO submission path in the block layer; this mode is used by DM multipath and also optimizes targets like DM thin-pool that stack directly on NVMe data device. - DM multipath improvements to factor out legacy SCSI-only (e.g. scsi_dh) code paths to allow for more optimized support for NVMe multipath. - A fix for DM multipath path selectors (service-time and queue-length) to select paths in a more balanced way; largely academic but doesn't hurt. - Numerous DM raid target fixes and improvements. - Add a new DM "unstriped" target that enables Intel to workaround firmware limitations in some NVMe drives that are striped internally (this target also works when stacked above the DM "striped" target). - Various Documentation fixes and improvements. - Misc. cleanups and fixes across various DM infrastructure and targets (e.g. bufio, flakey, log-writes, snapshot). -----BEGIN PGP SIGNATURE----- Version: GnuPG v1 iQEcBAABAgAGBQJacgwPAAoJEMUj8QotnQNaEw0H/0XRTcg8/lRuGl46kdeI3PgR ZxUy4XgUrCLiACWO5yCU/nKipB32+3xTlTDTBcjmaBfX8HolH147Pasb1KdHqLVC dOWLMpjlFztb5fnuOMitJA05qQAbgRlZ52QdVk/FDo9yWicgWjQZduh8aYX53pHw 6XOYWzSFAXQcaduPdz6TLiPw479xBwIpXxQbrO09f4qt3Ub4bqknEhzFXc+6M7zl ejmW/bG2Qg6WmsfAuaAhFTV0LpTPSEzvaq9TfR7yqFU3DvDIAi7Yh8eQinIUDo4u txpOGoESRAMPAMKH0/UJdr/u7jTsfgJox4QEavWfnViPvkouah5KdjVOL1veZ5U= =R3dN -----END PGP SIGNATURE----- Merge tag 'for-4.16/dm-changes' of git://git.kernel.org/pub/scm/linux/kernel/git/device-mapper/linux-dm Pull device mapper updates from Mike Snitzer: - DM core fixes to ensure that bio submission follows a depth-first tree walk; this is critical to allow forward progress without the need to use the bioset's BIOSET_NEED_RESCUER. - Remove DM core's BIOSET_NEED_RESCUER based dm_offload infrastructure. - DM core cleanups and improvements to make bio-based DM more efficient (e.g. reduced memory footprint as well leveraging per-bio-data more). - Introduce new bio-based mode (DM_TYPE_NVME_BIO_BASED) that leverages the more direct IO submission path in the block layer; this mode is used by DM multipath and also optimizes targets like DM thin-pool that stack directly on NVMe data device. - DM multipath improvements to factor out legacy SCSI-only (e.g. scsi_dh) code paths to allow for more optimized support for NVMe multipath. - A fix for DM multipath path selectors (service-time and queue-length) to select paths in a more balanced way; largely academic but doesn't hurt. - Numerous DM raid target fixes and improvements. - Add a new DM "unstriped" target that enables Intel to workaround firmware limitations in some NVMe drives that are striped internally (this target also works when stacked above the DM "striped" target). - Various Documentation fixes and improvements. - Misc cleanups and fixes across various DM infrastructure and targets (e.g. bufio, flakey, log-writes, snapshot). * tag 'for-4.16/dm-changes' of git://git.kernel.org/pub/scm/linux/kernel/git/device-mapper/linux-dm: (69 commits) dm cache: Documentation: update default migration_throttling value dm mpath selector: more evenly distribute ties dm unstripe: fix target length versus number of stripes size check dm thin: fix trailing semicolon in __remap_and_issue_shared_cell dm table: fix NVMe bio-based dm_table_determine_type() validation dm: various cleanups to md->queue initialization code dm mpath: delay the retry of a request if the target responded as busy dm mpath: return DM_MAPIO_DELAY_REQUEUE if QUEUE_IO or PG_INIT_REQUIRED dm mpath: return DM_MAPIO_REQUEUE on blk-mq rq allocation failure dm log writes: fix max length used for kstrndup dm: backfill missing calls to mutex_destroy() dm snapshot: use mutex instead of rw_semaphore dm flakey: check for null arg_name in parse_features() dm thin: extend thinpool status format string with omitted fields dm thin: fixes in thin-provisioning.txt dm thin: document representation of <highest mapped sector> when there is none dm thin: fix documentation relative to low water mark threshold dm cache: be consistent in specifying sectors and SI units in cache.txt dm cache: delete obsoleted paragraph in cache.txt dm cache: fix grammar in cache-policies.txt ... |
|
Linus Torvalds | 040639b7fc |
Merge branch 'for-next' of git://git.kernel.org/pub/scm/linux/kernel/git/shli/md
Pull MD updates from Shaohua Li: "Some small fixes for MD: - fix raid5-cache potential problems if raid5 cache isn't fully recovered - fix a wait-within-wait warning in raid1/10 - make raid5-PPL support disks with writeback cache enabled" * 'for-next' of git://git.kernel.org/pub/scm/linux/kernel/git/shli/md: raid5-ppl: PPL support for disks with write-back cache enabled md/r5cache: print more info of log recovery md/raid1,raid10: silence warning about wait-within-wait md: introduce new personality funciton start() |
|
Ming Lei | 86ff7c2a80 |
blk-mq: introduce BLK_STS_DEV_RESOURCE
This status is returned from driver to block layer if device related resource is unavailable, but driver can guarantee that IO dispatch will be triggered in future when the resource is available. Convert some drivers to return BLK_STS_DEV_RESOURCE. Also, if driver returns BLK_STS_RESOURCE and SCHED_RESTART is set, rerun queue after a delay (BLK_MQ_DELAY_QUEUE) to avoid IO stalls. BLK_MQ_DELAY_QUEUE is 3 ms because both scsi-mq and nvmefc are using that magic value. If a driver can make sure there is in-flight IO, it is safe to return BLK_STS_DEV_RESOURCE because: 1) If all in-flight IOs complete before examining SCHED_RESTART in blk_mq_dispatch_rq_list(), SCHED_RESTART must be cleared, so queue is run immediately in this case by blk_mq_dispatch_rq_list(); 2) if there is any in-flight IO after/when examining SCHED_RESTART in blk_mq_dispatch_rq_list(): - if SCHED_RESTART isn't set, queue is run immediately as handled in 1) - otherwise, this request will be dispatched after any in-flight IO is completed via blk_mq_sched_restart() 3) if SCHED_RESTART is set concurently in context because of BLK_STS_RESOURCE, blk_mq_delay_run_hw_queue() will cover the above two cases and make sure IO hang can be avoided. One invariant is that queue will be rerun if SCHED_RESTART is set. Suggested-by: Jens Axboe <axboe@kernel.dk> Tested-by: Laurence Oberman <loberman@redhat.com> Signed-off-by: Ming Lei <ming.lei@redhat.com> Signed-off-by: Mike Snitzer <snitzer@redhat.com> Signed-off-by: Jens Axboe <axboe@kernel.dk> |
|
Linus Torvalds | 168fe32a07 |
Merge branch 'misc.poll' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs
Pull poll annotations from Al Viro: "This introduces a __bitwise type for POLL### bitmap, and propagates the annotations through the tree. Most of that stuff is as simple as 'make ->poll() instances return __poll_t and do the same to local variables used to hold the future return value'. Some of the obvious brainos found in process are fixed (e.g. POLLIN misspelled as POLL_IN). At that point the amount of sparse warnings is low and most of them are for genuine bugs - e.g. ->poll() instance deciding to return -EINVAL instead of a bitmap. I hadn't touched those in this series - it's large enough as it is. Another problem it has caught was eventpoll() ABI mess; select.c and eventpoll.c assumed that corresponding POLL### and EPOLL### were equal. That's true for some, but not all of them - EPOLL### are arch-independent, but POLL### are not. The last commit in this series separates userland POLL### values from the (now arch-independent) kernel-side ones, converting between them in the few places where they are copied to/from userland. AFAICS, this is the least disruptive fix preserving poll(2) ABI and making epoll() work on all architectures. As it is, it's simply broken on sparc - try to give it EPOLLWRNORM and it will trigger only on what would've triggered EPOLLWRBAND on other architectures. EPOLLWRBAND and EPOLLRDHUP, OTOH, are never triggered at all on sparc. With this patch they should work consistently on all architectures" * 'misc.poll' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs: (37 commits) make kernel-side POLL... arch-independent eventpoll: no need to mask the result of epi_item_poll() again eventpoll: constify struct epoll_event pointers debugging printk in sg_poll() uses %x to print POLL... bitmap annotate poll(2) guts 9p: untangle ->poll() mess ->si_band gets POLL... bitmap stored into a user-visible long field ring_buffer_poll_wait() return value used as return value of ->poll() the rest of drivers/*: annotate ->poll() instances media: annotate ->poll() instances fs: annotate ->poll() instances ipc, kernel, mm: annotate ->poll() instances net: annotate ->poll() instances apparmor: annotate ->poll() instances tomoyo: annotate ->poll() instances sound: annotate ->poll() instances acpi: annotate ->poll() instances crypto: annotate ->poll() instances block: annotate ->poll() instances x86: annotate ->poll() instances ... |
|
Linus Torvalds | 0a4b6e2f80 |
Merge branch 'for-4.16/block' of git://git.kernel.dk/linux-block
Pull block updates from Jens Axboe: "This is the main pull request for block IO related changes for the 4.16 kernel. Nothing major in this pull request, but a good amount of improvements and fixes all over the map. This contains: - BFQ improvements, fixes, and cleanups from Angelo, Chiara, and Paolo. - Support for SMR zones for deadline and mq-deadline from Damien and Christoph. - Set of fixes for bcache by way of Michael Lyle, including fixes from himself, Kent, Rui, Tang, and Coly. - Series from Matias for lightnvm with fixes from Hans Holmberg, Javier, and Matias. Mostly centered around pblk, and the removing rrpc 1.2 in preparation for supporting 2.0. - A couple of NVMe pull requests from Christoph. Nothing major in here, just fixes and cleanups, and support for command tracing from Johannes. - Support for blk-throttle for tracking reads and writes separately. From Joseph Qi. A few cleanups/fixes also for blk-throttle from Weiping. - Series from Mike Snitzer that enables dm to register its queue more logically, something that's alwways been problematic on dm since it's a stacked device. - Series from Ming cleaning up some of the bio accessor use, in preparation for supporting multipage bvecs. - Various fixes from Ming closing up holes around queue mapping and quiescing. - BSD partition fix from Richard Narron, fixing a problem where we can't mount newer (10/11) FreeBSD partitions. - Series from Tejun reworking blk-mq timeout handling. The previous scheme relied on atomic bits, but it had races where we would think a request had timed out if it to reused at the wrong time. - null_blk now supports faking timeouts, to enable us to better exercise and test that functionality separately. From me. - Kill the separate atomic poll bit in the request struct. After this, we don't use the atomic bits on blk-mq anymore at all. From me. - sgl_alloc/free helpers from Bart. - Heavily contended tag case scalability improvement from me. - Various little fixes and cleanups from Arnd, Bart, Corentin, Douglas, Eryu, Goldwyn, and myself" * 'for-4.16/block' of git://git.kernel.dk/linux-block: (186 commits) block: remove smart1,2.h nvme: add tracepoint for nvme_complete_rq nvme: add tracepoint for nvme_setup_cmd nvme-pci: introduce RECONNECTING state to mark initializing procedure nvme-rdma: remove redundant boolean for inline_data nvme: don't free uuid pointer before printing it nvme-pci: Suspend queues after deleting them bsg: use pr_debug instead of hand crafted macros blk-mq-debugfs: don't allow write on attributes with seq_operations set nvme-pci: Fix queue double allocations block: Set BIO_TRACE_COMPLETION on new bio during split blk-throttle: use queue_is_rq_based block: Remove kblockd_schedule_delayed_work{,_on}() blk-mq: Avoid that blk_mq_delay_run_hw_queue() introduces unintended delays blk-mq: Rename blk_mq_request_direct_issue() into blk_mq_request_issue_directly() lib/scatterlist: Fix chaining support in sgl_alloc_order() blk-throttle: track read and write request individually block: add bdev_read_only() checks to common helpers block: fail op_is_write() requests to read-only partitions blk-throttle: export io_serviced_recursive, io_service_bytes_recursive ... |
|
Khazhismel Kumykov | f20426056f |
dm mpath selector: more evenly distribute ties
Move the last used path to the end of the list (least preferred) so that ties are more evenly distributed. For example, in case with three paths with one that is slower than others, the remaining two would be unevenly used if they tie. This is due to the rotation not being a truely fair distribution. Illustrated: paths a, b, c, 'c' has 1 outstanding IO, a and b are 'tied' Three possible rotations: (a, b, c) -> best path 'a' (b, c, a) -> best path 'b' (c, a, b) -> best path 'a' (a, b, c) -> best path 'a' (b, c, a) -> best path 'b' (c, a, b) -> best path 'a' ... So 'a' is used 2x more than 'b', although they should be used evenly. With this change, the most recently used path is always the least preferred, removing this bias resulting in even distribution. (a, b, c) -> best path 'a' (b, c, a) -> best path 'b' (c, a, b) -> best path 'a' (c, b, a) -> best path 'b' ... Signed-off-by: Khazhismel Kumykov <khazhy@google.com> Reviewed-by: Martin Wilck <mwilck@suse.com> Signed-off-by: Mike Snitzer <snitzer@redhat.com> |
|
Scott Bauer | cc65661981 |
dm unstripe: fix target length versus number of stripes size check
Since the unstripe target takes a target length which is the size of *one* striped member we're trying to expose, not the total size of *all* the striped members, the check does not make sense and fails for some striped setups. For example, say we have a 4TB striped device: or 3907018496 sectors per underlying device: if (sector_div(width, uc->stripes)) : 3907018496 / 2(num stripes) == 1953509248 tmp_len = width; if (sector_div(tmp_len, uc->chunk_size)) : 1953509248 / 256(chunk size) == 7630895.5 (fails) Fix this by removing the first check which isn't valid for unstriping. Signed-off-by: Scott Bauer <scott.bauer@intel.com> Signed-off-by: Mike Snitzer <snitzer@redhat.com> |
|
Luis de Bethencourt | bd6d1e0a5f |
dm thin: fix trailing semicolon in __remap_and_issue_shared_cell
The trailing semicolon is an empty statement that does no operation. Removing it since it doesn't do anything. Signed-off-by: Luis de Bethencourt <luisbg@kernel.org> Signed-off-by: Mike Snitzer <snitzer@redhat.com> |
|
Mike Snitzer | eaa160eded |
dm table: fix NVMe bio-based dm_table_determine_type() validation
The 'verify_rq_based:' code in dm_table_determine_type() was checking
all devices in the DM table rather than only checking the data devices.
Fix this by using the immutable target's iterate_devices method.
Also, tweak the block of dm_table_determine_type() code that decides
whether to upgrade from DM_TYPE_BIO_BASED to DM_TYPE_NVME_BIO_BASED so
that it makes sure the immutable_target doesn't support require
splitting IOs.
These changes have been verified to allow a "thin-pool" target whose
data device is an NVMe device to be upgraded to DM_TYPE_NVME_BIO_BASED.
Using the thin-pool in NVMe bio-based mode was verified to pass all the
device-mapper-test-suite's "thin-provisioning" tests.
Also verified that request-based DM multipath (with queue_mode "rq" and
"mq") works as expected using the 'mptest' harness.
Fixes:
|
|
Mike Snitzer | c12c9a3c38 |
dm: various cleanups to md->queue initialization code
Also, add dm_sysfs_init() error handling to dm_create(). Signed-off-by: Mike Snitzer <snitzer@redhat.com> |
|
Mike Snitzer | ac514ffc96 |
dm mpath: delay the retry of a request if the target responded as busy
Add DM_ENDIO_DELAY_REQUEUE to allow request-based multipath's multipath_end_io() to instruct dm-rq.c:dm_done() to delay a requeue. This is beneficial to do if BLK_STS_RESOURCE is returned from the target (because target is busy). Relative to blk-mq: kick the hw queues via blk_mq_requeue_work(), indirectly from dm-rq.c:__dm_mq_kick_requeue_list(), after a delay. For old .request_fn: use blk_delay_queue(). bio-based multipath doesn't have feature parity with request-based for retryable error requeues; that is something that'll need fixing in the future. Suggested-by: Bart Van Assche <bart.vanassche@wdc.com> Signed-off-by: Mike Snitzer <snitzer@redhat.com> Acked-by: Bart Van Assche <bart.vanassche@wdc.com> [as interpreted from Bart's "... patch looks fine to me."] |
|
Ming Lei | 396eaf21ee |
blk-mq: improve DM's blk-mq IO merging via blk_insert_cloned_request feedback
blk_insert_cloned_request() is called in the fast path of a dm-rq driver (e.g. blk-mq request-based DM mpath). blk_insert_cloned_request() uses blk_mq_request_bypass_insert() to directly append the request to the blk-mq hctx->dispatch_list of the underlying queue. 1) This way isn't efficient enough because the hctx spinlock is always used. 2) With blk_insert_cloned_request(), we completely bypass underlying queue's elevator and depend on the upper-level dm-rq driver's elevator to schedule IO. But dm-rq currently can't get the underlying queue's dispatch feedback at all. Without knowing whether a request was issued or not (e.g. due to underlying queue being busy) the dm-rq elevator will not be able to provide effective IO merging (as a side-effect of dm-rq currently blindly destaging a request from its elevator only to requeue it after a delay, which kills any opportunity for merging). This obviously causes very bad sequential IO performance. Fix this by updating blk_insert_cloned_request() to use blk_mq_request_direct_issue(). blk_mq_request_direct_issue() allows a request to be issued directly to the underlying queue and returns the dispatch feedback (blk_status_t). If blk_mq_request_direct_issue() returns BLK_SYS_RESOURCE the dm-rq driver will now use DM_MAPIO_REQUEUE to _not_ destage the request. Whereby preserving the opportunity to merge IO. With this, request-based DM's blk-mq sequential IO performance is vastly improved (as much as 3X in mpath/virtio-scsi testing). Signed-off-by: Ming Lei <ming.lei@redhat.com> [blk-mq.c changes heavily influenced by Ming Lei's initial solution, but they were refactored to make them less fragile and easier to read/review] Signed-off-by: Mike Snitzer <snitzer@redhat.com> Signed-off-by: Jens Axboe <axboe@kernel.dk> |
|
Ming Lei | 459b54019c |
dm mpath: return DM_MAPIO_DELAY_REQUEUE if QUEUE_IO or PG_INIT_REQUIRED
Avoid using DM_MAPIO_REQUEUE unless absolutely necessary because it results in dm-rq.c:dm_mq_queue_rq() returning BLK_STS_RESOURCE to blk-mq -- doing so should only ever be done if the underlying queue is out of resources. So switch to returning DM_MAPIO_DELAY_REQUEUE from multipath_clone_and_map() if either MPATHF_QUEUE_IO or MPATHF_PG_INIT_REQUIRED are set. Signed-off-by: Ming Lei <ming.lei@redhat.com> Signed-off-by: Mike Snitzer <snitzer@redhat.com> |
|
Ming Lei | 050af08ffb |
dm mpath: return DM_MAPIO_REQUEUE on blk-mq rq allocation failure
blk-mq will rerun queue via RESTART or dispatch wake after one request is completed, so not necessary to wait random time for requeuing, we should trust blk-mq to do it. More importantly, we need to return BLK_STS_RESOURCE to blk-mq so that dequeuing from the I/O scheduler can be stopped, this results in improved I/O merging. Signed-off-by: Ming Lei <ming.lei@redhat.com> Signed-off-by: Mike Snitzer <snitzer@redhat.com> |
|
Ma Shimiao | 4b259fc4a8 |
dm log writes: fix max length used for kstrndup
If source string is longer than max, kstrndup will allocate max+1 space. So make sure the result will not exceed max. Signed-off-by: Ma Shimiao <mashimiao.fnst@cn.fujitsu.com> Signed-off-by: Mike Snitzer <snitzer@redhat.com> |
|
Mike Snitzer | d5ffebdd79 |
dm: backfill missing calls to mutex_destroy()
Signed-off-by: Mike Snitzer <snitzer@redhat.com> |
|
Mikulas Patocka | ae1093be5a |
dm snapshot: use mutex instead of rw_semaphore
The rw_semaphore is acquired for read only in two places, neither is performance-critical. So replace it with a mutex -- which is more efficient. Signed-off-by: Mikulas Patocka <mpatocka@redhat.com> Signed-off-by: Mike Snitzer <snitzer@redhat.com> |
|
Goldwyn Rodrigues | 7690e25302 |
dm flakey: check for null arg_name in parse_features()
One can crash dm-flakey by specifying more feature arguments than the number of features supplied. Checking for null in arg_name avoids this. dmsetup create flakey-test --table "0 66076080 flakey /dev/sdb9 0 0 180 2 drop_writes" Signed-off-by: Goldwyn Rodrigues <rgoldwyn@suse.com> Signed-off-by: Mike Snitzer <snitzer@redhat.com> |
|
Brian Norris | f6e7baadd9 |
dm: move dm_table_destroy() to same header as dm_table_create()
If anyone is going to use dm_table_create(), they probably should be able to use dm_table_destroy() too. Move the dm_table_destroy() definition outside the private header, near dm_table_create() Signed-off-by: Brian Norris <briannorris@chromium.org> Signed-off-by: Mike Snitzer <snitzer@redhat.com> |
|
Wei Yongjun | 67ac901c55 |
dm raid: make raid_sets symbol static
Fixes the following sparse warning: drivers/md/dm-raid.c:33:1: warning: symbol 'raid_sets' was not declared. Should it be static? Signed-off-by: Wei Yongjun <weiyongjun1@huawei.com> Signed-off-by: Mike Snitzer <snitzer@redhat.com> |
|
Mike Snitzer | 0e696d385d |
dm bufio: eliminate unnecessary labels in dm_bufio_client_create()
Signed-off-by: Mike Snitzer <snitzer@redhat.com> |
|
Aliaksei Karaliou | 46898e9a7a |
dm bufio: check result of register_shrinker()
dm_bufio_client_create() does not check result of register_shrinker() which was tagged as __must_check recently, reported by sparse. Signed-off-by: Aliaksei Karaliou <akaraliou.dev@gmail.com> Signed-off-by: Mike Snitzer <snitzer@redhat.com> |
|
Aliaksei Karaliou | bde1418478 |
dm bufio: add missed destroys of client mutex
The client's mutex needs to be destroyed in dm_bufio_client_destroy() as well as the dm_bufio_client_create() error path. Signed-off-by: Aliaksei Karaliou <akaraliou.dev@gmail.com> Signed-off-by: Mike Snitzer <snitzer@redhat.com> |
|
Mikulas Patocka | 905be0a121 |
dm bufio: use REQ_OP_READ and REQ_OP_WRITE
Use REQ_OP_READ and REQ_OP_WRITE macros instead of READ and WRITE. They have the same value, but the block layer uses REQ_OP so bufio should too. Signed-off-by: Mikulas Patocka <mpatocka@redhat.com> Signed-off-by: Mike Snitzer <snitzer@redhat.com> |