mirror of https://gitee.com/openkylin/linux.git
39 Commits
Author | SHA1 | Message | Date |
---|---|---|---|
NeilBrown | c5a94f434c |
fscache: fix race between enablement and dropping of object
It was observed that a process blocked indefintely in __fscache_read_or_alloc_page(), waiting for FSCACHE_COOKIE_LOOKING_UP to be cleared via fscache_wait_for_deferred_lookup(). At this time, ->backing_objects was empty, which would normaly prevent __fscache_read_or_alloc_page() from getting to the point of waiting. This implies that ->backing_objects was cleared *after* __fscache_read_or_alloc_page was was entered. When an object is "killed" and then "dropped", FSCACHE_COOKIE_LOOKING_UP is cleared in fscache_lookup_failure(), then KILL_OBJECT and DROP_OBJECT are "called" and only in DROP_OBJECT is ->backing_objects cleared. This leaves a window where something else can set FSCACHE_COOKIE_LOOKING_UP and __fscache_read_or_alloc_page() can start waiting, before ->backing_objects is cleared There is some uncertainty in this analysis, but it seems to be fit the observations. Adding the wake in this patch will be handled correctly by __fscache_read_or_alloc_page(), as it checks if ->backing_objects is empty again, after waiting. Customer which reported the hang, also report that the hang cannot be reproduced with this fix. The backtrace for the blocked process looked like: PID: 29360 TASK: ffff881ff2ac0f80 CPU: 3 COMMAND: "zsh" #0 [ffff881ff43efbf8] schedule at ffffffff815e56f1 #1 [ffff881ff43efc58] bit_wait at ffffffff815e64ed #2 [ffff881ff43efc68] __wait_on_bit at ffffffff815e61b8 #3 [ffff881ff43efca0] out_of_line_wait_on_bit at ffffffff815e625e #4 [ffff881ff43efd08] fscache_wait_for_deferred_lookup at ffffffffa04f2e8f [fscache] #5 [ffff881ff43efd18] __fscache_read_or_alloc_page at ffffffffa04f2ffe [fscache] #6 [ffff881ff43efd58] __nfs_readpage_from_fscache at ffffffffa0679668 [nfs] #7 [ffff881ff43efd78] nfs_readpage at ffffffffa067092b [nfs] #8 [ffff881ff43efda0] generic_file_read_iter at ffffffff81187a73 #9 [ffff881ff43efe50] nfs_file_read at ffffffffa066544b [nfs] #10 [ffff881ff43efe70] __vfs_read at ffffffff811fc756 #11 [ffff881ff43efee8] vfs_read at ffffffff811fccfa #12 [ffff881ff43eff18] sys_read at ffffffff811fda62 #13 [ffff881ff43eff50] entry_SYSCALL_64_fastpath at ffffffff815e986e Signed-off-by: NeilBrown <neilb@suse.com> Signed-off-by: David Howells <dhowells@redhat.com> |
|
Kiran Kumar Modukuri | f29507ce66 |
fscache: Fix reference overput in fscache_attach_object() error handling
When a cookie is allocated that causes fscache_object structs to be
allocated, those objects are initialised with the cookie pointer, but
aren't blessed with a ref on that cookie unless the attachment is
successfully completed in fscache_attach_object().
If attachment fails because the parent object was dying or there was a
collision, fscache_attach_object() returns without incrementing the cookie
counter - but upon failure of this function, the object is released which
then puts the cookie, whether or not a ref was taken on the cookie.
Fix this by taking a ref on the cookie when it is assigned in
fscache_object_init(), even when we're creating a root object.
Analysis from Kiran Kumar:
This bug has been seen in 4.4.0-124-generic #148-Ubuntu kernel
BugLink: https://bugs.launchpad.net/ubuntu/+source/linux/+bug/1776277
fscache cookie ref count updated incorrectly during fscache object
allocation resulting in following Oops.
kernel BUG at /build/linux-Y09MKI/linux-4.4.0/fs/fscache/internal.h:321!
kernel BUG at /build/linux-Y09MKI/linux-4.4.0/fs/fscache/cookie.c:639!
[Cause]
Two threads are trying to do operate on a cookie and two objects.
(1) One thread tries to unmount the filesystem and in process goes over a
huge list of objects marking them dead and deleting the objects.
cookie->usage is also decremented in following path:
nfs_fscache_release_super_cookie
-> __fscache_relinquish_cookie
->__fscache_cookie_put
->BUG_ON(atomic_read(&cookie->usage) <= 0);
(2) A second thread tries to lookup an object for reading data in following
path:
fscache_alloc_object
1) cachefiles_alloc_object
-> fscache_object_init
-> assign cookie, but usage not bumped.
2) fscache_attach_object -> fails in cant_attach_object because the
cookie's backing object or cookie's->parent object are going away
3) fscache_put_object
-> cachefiles_put_object
->fscache_object_destroy
->fscache_cookie_put
->BUG_ON(atomic_read(&cookie->usage) <= 0);
[NOTE from dhowells] It's unclear as to the circumstances in which (2) can
take place, given that thread (1) is in nfs_kill_super(), however a
conflicting NFS mount with slightly different parameters that creates a
different superblock would do it. A backtrace from Kiran seems to show
that this is a possibility:
kernel BUG at/build/linux-Y09MKI/linux-4.4.0/fs/fscache/cookie.c:639!
...
RIP: __fscache_cookie_put+0x3a/0x40 [fscache]
Call Trace:
__fscache_relinquish_cookie+0x87/0x120 [fscache]
nfs_fscache_release_super_cookie+0x2d/0xb0 [nfs]
nfs_kill_super+0x29/0x40 [nfs]
deactivate_locked_super+0x48/0x80
deactivate_super+0x5c/0x60
cleanup_mnt+0x3f/0x90
__cleanup_mnt+0x12/0x20
task_work_run+0x86/0xb0
exit_to_usermode_loop+0xc2/0xd0
syscall_return_slowpath+0x4e/0x60
int_ret_from_sys_call+0x25/0x9f
[Fix] Bump up the cookie usage in fscache_object_init, when it is first
being assigned a cookie atomically such that the cookie is added and bumped
up if its refcount is not zero. Remove the assignment in
fscache_attach_object().
[Testcase]
I have run ~100 hours of NFS stress tests and not seen this bug recur.
[Regression Potential]
- Limited to fscache/cachefiles.
Fixes:
|
|
Matthew Wilcox | e5a9554196 |
fscache: use appropriate radix tree accessors
Don't open-code accesses to data structure internals. Link: http://lkml.kernel.org/r/20180313132639.17387-7-willy@infradead.org Signed-off-by: Matthew Wilcox <mawilcox@microsoft.com> Reviewed-by: Jeff Layton <jlayton@redhat.com> Cc: Darrick J. Wong <darrick.wong@oracle.com> Cc: Dave Chinner <david@fromorbit.com> Cc: Ryusuke Konishi <konishi.ryusuke@lab.ntt.co.jp> Cc: Will Deacon <will.deacon@arm.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> |
|
David Howells | ee1235a9a0 |
fscache: Pass object size in rather than calling back for it
Pass the object size in to fscache_acquire_cookie() and fscache_write_page() rather than the netfs providing a callback by which it can be received. This makes it easier to update the size of the object when a new page is written that extends the object. The current object size is also passed by fscache to the check_aux function, obviating the need to store it in the aux data. Signed-off-by: David Howells <dhowells@redhat.com> Acked-by: Anna Schumaker <anna.schumaker@netapp.com> Tested-by: Steve Dickson <steved@redhat.com> |
|
David Howells | 402cb8dda9 |
fscache: Attach the index key and aux data to the cookie
Attach copies of the index key and auxiliary data to the fscache cookie so that: (1) The callbacks to the netfs for this stuff can be eliminated. This can simplify things in the cache as the information is still available, even after the cache has relinquished the cookie. (2) Simplifies the locking requirements of accessing the information as we don't have to worry about the netfs object going away on us. (3) The cache can do lazy updating of the coherency information on disk. As long as the cache is flushed before reboot/poweroff, there's no need to update the coherency info on disk every time it changes. (4) Cookies can be hashed or put in a tree as the index key is easily available. This allows: (a) Checks for duplicate cookies can be made at the top fscache layer rather than down in the bowels of the cache backend. (b) Caching can be added to a netfs object that has a cookie if the cache is brought online after the netfs object is allocated. A certain amount of space is made in the cookie for inline copies of the data, but if it won't fit there, extra memory will be allocated for it. The downside of this is that live cache operation requires more memory. Signed-off-by: David Howells <dhowells@redhat.com> Acked-by: Anna Schumaker <anna.schumaker@netapp.com> Tested-by: Steve Dickson <steved@redhat.com> |
|
David Howells | 08c2e3d087 |
fscache: Add more tracepoints
Add more tracepoints to fscache, including: (*) fscache_page - Tracks netfs pages known to fscache. (*) fscache_check_page - Tracks the netfs querying whether a page is pending storage. (*) fscache_wake_cookie - Tracks cookies being woken up after a page completes/aborts storage in the cache. (*) fscache_op - Tracks operations being initialised. (*) fscache_wrote_page - Tracks return of the backend write_page op. (*) fscache_gang_lookup - Tracks lookup of pages to be stored in the write operation. Signed-off-by: David Howells <dhowells@redhat.com> |
|
David Howells | a18feb5576 |
fscache: Add tracepoints
Add some tracepoints to fscache: (*) fscache_cookie - Tracks a cookie's usage count. (*) fscache_netfs - Logs registration of a network filesystem, including the pointer to the cookie allocated. (*) fscache_acquire - Logs cookie acquisition. (*) fscache_relinquish - Logs cookie relinquishment. (*) fscache_enable - Logs enablement of a cookie. (*) fscache_disable - Logs disablement of a cookie. (*) fscache_osm - Tracks execution of states in the object state machine. and cachefiles: (*) cachefiles_ref - Tracks a cachefiles object's usage count. (*) cachefiles_lookup - Logs result of lookup_one_len(). (*) cachefiles_mkdir - Logs result of vfs_mkdir(). (*) cachefiles_create - Logs result of vfs_create(). (*) cachefiles_unlink - Logs calls to vfs_unlink(). (*) cachefiles_rename - Logs calls to vfs_rename(). (*) cachefiles_mark_active - Logs an object becoming active. (*) cachefiles_wait_active - Logs a wait for an old object to be destroyed. (*) cachefiles_mark_inactive - Logs an object becoming inactive. (*) cachefiles_mark_buried - Logs the burial of an object. Signed-off-by: David Howells <dhowells@redhat.com> |
|
David Howells | e26bfebdfc |
fscache: Fix dead object requeue
Under some circumstances, an fscache object can become queued such that it fscache_object_work_func() can be called once the object is in the OBJECT_DEAD state. This results in the kernel oopsing when it tries to invoke the handler for the state (which is hard coded to 0x2). The way this comes about is something like the following: (1) The object dispatcher is processing a work state for an object. This is done in workqueue context. (2) An out-of-band event comes in that isn't masked, causing the object to be queued, say EV_KILL. (3) The object dispatcher finishes processing the current work state on that object and then sees there's another event to process, so, without returning to the workqueue core, it processes that event too. It then follows the chain of events that initiates until we reach OBJECT_DEAD without going through a wait state (such as WAIT_FOR_CLEARANCE). At this point, object->events may be 0, object->event_mask will be 0 and oob_event_mask will be 0. (4) The object dispatcher returns to the workqueue processor, and in due course, this sees that the object's work item is still queued and invokes it again. (5) The current state is a work state (OBJECT_DEAD), so the dispatcher jumps to it - resulting in an OOPS. When I'm seeing this, the work state in (1) appears to have been either LOOK_UP_OBJECT or CREATE_OBJECT (object->oob_table is fscache_osm_lookup_oob). The window for (2) is very small: (A) object->event_mask is cleared whilst the event dispatch process is underway - though there's no memory barrier to force this to the top of the function. The window, therefore is from the time the object was selected by the workqueue processor and made requeueable to the time the mask was cleared. (B) fscache_raise_event() will only queue the object if it manages to set the event bit and the corresponding event_mask bit was set. The enqueuement is then deferred slightly whilst we get a ref on the object and get the per-CPU variable for workqueue congestion. This slight deferral slightly increases the probability by allowing extra time for the workqueue to make the item requeueable. Handle this by giving the dead state a processor function and checking the for the dead state address rather than seeing if the processor function is address 0x2. The dead state processor function can then set a flag to indicate that it's occurred and give a warning if it occurs more than once per object. If this race occurs, an oops similar to the following is seen (note the RIP value): BUG: unable to handle kernel NULL pointer dereference at 0000000000000002 IP: [<0000000000000002>] 0x1 PGD 0 Oops: 0010 [#1] SMP Modules linked in: ... CPU: 17 PID: 16077 Comm: kworker/u48:9 Not tainted 3.10.0-327.18.2.el7.x86_64 #1 Hardware name: HP ProLiant DL380 Gen9/ProLiant DL380 Gen9, BIOS P89 12/27/2015 Workqueue: fscache_object fscache_object_work_func [fscache] task: ffff880302b63980 ti: ffff880717544000 task.ti: ffff880717544000 RIP: 0010:[<0000000000000002>] [<0000000000000002>] 0x1 RSP: 0018:ffff880717547df8 EFLAGS: 00010202 RAX: ffffffffa0368640 RBX: ffff880edf7a4480 RCX: dead000000200200 RDX: 0000000000000002 RSI: 00000000ffffffff RDI: ffff880edf7a4480 RBP: ffff880717547e18 R08: 0000000000000000 R09: dfc40a25cb3a4510 R10: dfc40a25cb3a4510 R11: 0000000000000400 R12: 0000000000000000 R13: ffff880edf7a4510 R14: ffff8817f6153400 R15: 0000000000000600 FS: 0000000000000000(0000) GS:ffff88181f420000(0000) knlGS:0000000000000000 CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 CR2: 0000000000000002 CR3: 000000000194a000 CR4: 00000000001407e0 DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000 DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400 Stack: ffffffffa0363695 ffff880edf7a4510 ffff88093f16f900 ffff8817faa4ec00 ffff880717547e60 ffffffff8109d5db 00000000faa4ec18 0000000000000000 ffff8817faa4ec18 ffff88093f16f930 ffff880302b63980 ffff88093f16f900 Call Trace: [<ffffffffa0363695>] ? fscache_object_work_func+0xa5/0x200 [fscache] [<ffffffff8109d5db>] process_one_work+0x17b/0x470 [<ffffffff8109e4ac>] worker_thread+0x21c/0x400 [<ffffffff8109e290>] ? rescuer_thread+0x400/0x400 [<ffffffff810a5acf>] kthread+0xcf/0xe0 [<ffffffff810a5a00>] ? kthread_create_on_node+0x140/0x140 [<ffffffff816460d8>] ret_from_fork+0x58/0x90 [<ffffffff810a5a00>] ? kthread_create_on_node+0x140/0x140 Signed-off-by: David Howells <dhowells@redhat.com> Acked-by: Jeremy McNicoll <jeremymc@redhat.com> Tested-by: Frank Sorenson <sorenson@redhat.com> Tested-by: Benjamin Coddington <bcodding@redhat.com> Reviewed-by: Benjamin Coddington <bcodding@redhat.com> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk> |
|
David Howells | 6bdded59c8 |
fscache: Clear outstanding writes when disabling a cookie
fscache_disable_cookie() needs to clear the outstanding writes on the cookie it's disabling because they cannot be completed after. Without this, fscache_nfs_open_file() gets stuck because it disables the cookie when the file is opened for writing but can't uncache the pages till afterwards - otherwise there's a race between the open routine and anyone who already has it open R/O and is still reading from it. Looking in /proc/pid/stack of the offending process shows: [<ffffffffa0142883>] __fscache_wait_on_page_write+0x82/0x9b [fscache] [<ffffffffa014336e>] __fscache_uncache_all_inode_pages+0x91/0xe1 [fscache] [<ffffffffa01740fa>] nfs_fscache_open_file+0x59/0x9e [nfs] [<ffffffffa01ccf41>] nfs4_file_open+0x17f/0x1b8 [nfsv4] [<ffffffff8117350e>] do_dentry_open+0x16d/0x2b7 [<ffffffff811743ac>] vfs_open+0x5c/0x65 [<ffffffff81184185>] path_openat+0x785/0x8fb [<ffffffff81184343>] do_filp_open+0x48/0x9e [<ffffffff81174710>] do_sys_open+0x13b/0x1cb [<ffffffff811747b9>] SyS_open+0x19/0x1b [<ffffffff81001c44>] do_syscall_64+0x80/0x17a [<ffffffff8165c2da>] return_from_SYSCALL_64+0x0/0x7a [<ffffffffffffffff>] 0xffffffffffffffff Reported-by: Jianhong Yin <jiyin@redhat.com> Signed-off-by: David Howells <dhowells@redhat.com> Acked-by: Jeff Layton <jlayton@redhat.com> Acked-by: Steve Dickson <steved@redhat.com> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk> |
|
David Howells | d3b97ca4a9 |
FS-Cache: The operation cancellation method needs calling in more places
Any time an incomplete operation is cancelled, the operation cancellation function needs to be called to clean up. This is currently being passed directly to some of the functions that might want to call it, but not all. Instead, pass the cancellation method pointer to the fscache_operation_init() and have that cache it in the operation struct. Further, plug in a dummy cancellation handler if the caller declines to set one as this allows us to call the function unconditionally (the extra overhead isn't worth bothering about as we don't expect to be calling this typically). The cancellation method must thence be called everywhere the CANCELLED state is set. Note that we call it *before* setting the CANCELLED state such that the method can use the old state value to guide its operation. fscache_do_cancel_retrieval() needs moving higher up in the sources so that the init function can use it now. Without this, the following oops may be seen: FS-Cache: Assertion failed FS-Cache: 3 == 0 is false ------------[ cut here ]------------ kernel BUG at ../fs/fscache/page.c:261! ... RIP: 0010:[<ffffffffa0089c1b>] fscache_release_retrieval_op+0x77/0x100 [<ffffffffa008853d>] fscache_put_operation+0x114/0x2da [<ffffffffa008b8c2>] __fscache_read_or_alloc_pages+0x358/0x3b3 [<ffffffffa00b761f>] __nfs_readpages_from_fscache+0x59/0xbf [nfs] [<ffffffffa00b06c5>] nfs_readpages+0x10c/0x185 [nfs] [<ffffffff81124925>] ? alloc_pages_current+0x119/0x13e [<ffffffff810ee5fd>] ? __page_cache_alloc+0xfb/0x10a [<ffffffff810f87f8>] __do_page_cache_readahead+0x188/0x22c [<ffffffff810f8b3a>] ondemand_readahead+0x29e/0x2af [<ffffffff810f8c92>] page_cache_sync_readahead+0x38/0x3a [<ffffffff810ef337>] generic_file_read_iter+0x1a2/0x55a [<ffffffffa00a9dff>] ? nfs_revalidate_mapping+0xd6/0x288 [nfs] [<ffffffffa00a6a23>] nfs_file_read+0x49/0x70 [nfs] [<ffffffff811363be>] new_sync_read+0x78/0x9c [<ffffffff81137164>] __vfs_read+0x13/0x38 [<ffffffff8113721e>] vfs_read+0x95/0x121 [<ffffffff811372f6>] SyS_read+0x4c/0x8a [<ffffffff81557a52>] system_call_fastpath+0x12/0x17 The assertion is showing that the remaining number of pages (n_pages) is not 0 when the operation is being released. Signed-off-by: David Howells <dhowells@redhat.com> Reviewed-by: Steve Dickson <steved@redhat.com> Acked-by: Jeff Layton <jeff.layton@primarydata.com> |
|
David Howells | f09b443d0e |
FS-Cache: Synchronise object death state change vs operation submission
When an object is being marked as no longer live, do this under the object spinlock to prevent a race with operation submission targeted on that object. The problem occurs due to the following pair of intertwined sequences when the cache tries to create an object that would take it over the hard available space limit: NETFS INTERFACE =============== (A) The netfs calls fscache_acquire_cookie(). object creation is deferred to the object state machine and the netfs is allowed to continue. OBJECT STATE MACHINE KTHREAD ============================ (1) The object is looked up on disk by fscache_look_up_object() calling cachefiles_walk_to_object(). The latter finds that the object is not yet represented on disk and calls fscache_object_lookup_negative(). (2) fscache_object_lookup_negative() sets FSCACHE_COOKIE_NO_DATA_YET and clears FSCACHE_COOKIE_LOOKING_UP, thus allowing the netfs to start queuing read operations. (B) The netfs calls fscache_read_or_alloc_pages(). This calls fscache_wait_for_deferred_lookup() which sees FSCACHE_COOKIE_LOOKING_UP become clear, allowing the read to begin. (C) A read operation is set up and passed to fscache_submit_op() to deal with. (3) cachefiles_walk_to_object() calls cachefiles_has_space(), which fails (or one of the file operations to create stuff fails). cachefiles returns an error to fscache. (4) fscache_look_up_object() transits to the LOOKUP_FAILURE state, (5) fscache_lookup_failure() sets FSCACHE_OBJECT_LOOKED_UP and FSCACHE_COOKIE_UNAVAILABLE and clears FSCACHE_COOKIE_LOOKING_UP then transits to the KILL_OBJECT state. (6) fscache_kill_object() clears FSCACHE_OBJECT_IS_LIVE in an attempt to reject any further requests from the netfs. (7) object->n_ops is examined and found to be 0. fscache_kill_object() transits to the DROP_OBJECT state. (D) fscache_submit_op() locks the object spinlock, sees if it can dispatch the op immediately by calling fscache_object_is_active() - which fails since FSCACHE_OBJECT_IS_AVAILABLE has not yet been set. (E) fscache_submit_op() then tests FSCACHE_OBJECT_LOOKED_UP - which is set. It then queues the object and increments object->n_ops. (8) fscache_drop_object() releases the object and eventually fscache_put_object() calls cachefiles_put_object() which suffers an assertion failure here: ASSERTCMP(object->fscache.n_ops, ==, 0); Locking the object spinlock in step (6) around the clearance of FSCACHE_OBJECT_IS_LIVE ensures that the the decision trees in fscache_submit_op() and fscache_submit_exclusive_op() don't see the IS_LIVE flag being cleared mid-decision: either the op is queued before step (7) - in which case fscache_kill_object() will see n_ops>0 and will deal with the op - or the op will be rejected. This, combined with rejecting op submission if the target object is dying, fix the problem. The problem shows up as the following oops: CacheFiles: Assertion failed CacheFiles: 1 == 0 is false ------------[ cut here ]------------ kernel BUG at ../fs/cachefiles/interface.c:339! ... RIP: 0010:[<ffffffffa014fd9c>] [<ffffffffa014fd9c>] cachefiles_put_object+0x2a4/0x301 [cachefiles] ... Call Trace: [<ffffffffa008674b>] fscache_put_object+0x18/0x21 [fscache] [<ffffffffa00883e6>] fscache_object_work_func+0x3ba/0x3c9 [fscache] [<ffffffff81054dad>] process_one_work+0x226/0x441 [<ffffffff81055d91>] worker_thread+0x273/0x36b [<ffffffff81055b1e>] ? rescuer_thread+0x2e1/0x2e1 [<ffffffff81059b9d>] kthread+0x10e/0x116 [<ffffffff81059a8f>] ? kthread_create_on_node+0x1bb/0x1bb [<ffffffff815579ac>] ret_from_fork+0x7c/0xb0 [<ffffffff81059a8f>] ? kthread_create_on_node+0x1bb/0x1bb Signed-off-by: David Howells <dhowells@redhat.com> Reviewed-by: Steve Dickson <steved@redhat.com> Acked-by: Jeff Layton <jeff.layton@primarydata.com> |
|
David Howells | 6515d1dbf4 |
FS-Cache: Handle a new operation submitted against a killed object
Reject new operations that are being submitted against an object if that object has failed its lookup or creation states or has been killed by the cache backend for some other reason, such as having been culled. Signed-off-by: David Howells <dhowells@redhat.com> Reviewed-by: Steve Dickson <steved@redhat.com> Acked-by: Jeff Layton <jeff.layton@primarydata.com> |
|
David Howells | 182d919b84 |
FS-Cache: Count culled objects and objects rejected due to lack of space
Count the number of objects that get culled by the cache backend and the number of objects that the cache backend declines to instantiate due to lack of space in the cache. These numbers are made available through /proc/fs/fscache/stats Signed-off-by: David Howells <dhowells@redhat.com> Reviewed-by: Steve Dickson <steved@redhat.com> Acked-by: Jeff Layton <jeff.layton@primarydata.com> |
|
Milosz Tanski | 920bce20d7 |
FS-Cache: Reduce cookie ref count if submit fails.
I've been seeing issues with disposing cookies under vma pressure. The symptom is that the refcount gets out of sync. In this case we fail to decrement the refcount if submit fails. I found this while auditing the error in and around cookie operations. Signed-off-by: Milosz Tanski <milosz@adfin.com> Signed-off-by: David Howells <dhowells@redhat.com> |
|
David Howells | 7026f1929e |
FS-Cache: Handle removal of unadded object to the fscache_object_list rb tree
When FS-Cache allocates an object, the following sequence of events can occur: -->fscache_alloc_object() -->cachefiles_alloc_object() [via cache->ops->alloc_object] <--[returns new object] -->fscache_attach_object() <--[failed] -->cachefiles_put_object() [via cache->ops->put_object] -->fscache_object_destroy() -->fscache_objlist_remove() -->rb_erase() to remove the object from fscache_object_list. resulting in a crash in the rbtree code. The problem is that the object is only added to fscache_object_list on the success path of fscache_attach_object() where it calls fscache_objlist_add(). So if fscache_attach_object() fails, the object won't have been added to the objlist rbtree. We do, however, unconditionally try to remove the object from the tree. Thanks to NeilBrown for finding this and suggesting this solution. Reported-by: NeilBrown <neilb@suse.de> Signed-off-by: David Howells <dhowells@redhat.com> Tested-by: (a customer of) NeilBrown <neilb@suse.de> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> |
|
Linus Torvalds | 0910c0bdf7 |
Merge branch 'for-3.13/core' of git://git.kernel.dk/linux-block
Pull block IO core updates from Jens Axboe: "This is the pull request for the core changes in the block layer for 3.13. It contains: - The new blk-mq request interface. This is a new and more scalable queueing model that marries the best part of the request based interface we currently have (which is fully featured, but scales poorly) and the bio based "interface" which the new drivers for high IOPS devices end up using because it's much faster than the request based one. The bio interface has no block layer support, since it taps into the stack much earlier. This means that drivers end up having to implement a lot of functionality on their own, like tagging, timeout handling, requeue, etc. The blk-mq interface provides all these. Some drivers even provide a switch to select bio or rq and has code to handle both, since things like merging only works in the rq model and hence is faster for some workloads. This is a huge mess. Conversion of these drivers nets us a substantial code reduction. Initial results on converting SCSI to this model even shows an 8x improvement on single queue devices. So while the model was intended to work on the newer multiqueue devices, it has substantial improvements for "classic" hardware as well. This code has gone through extensive testing and development, it's now ready to go. A pull request is coming to convert virtio-blk to this model will be will be coming as well, with more drivers scheduled for 3.14 conversion. - Two blktrace fixes from Jan and Chen Gang. - A plug merge fix from Alireza Haghdoost. - Conversion of __get_cpu_var() from Christoph Lameter. - Fix for sector_div() with 64-bit divider from Geert Uytterhoeven. - A fix for a race between request completion and the timeout handling from Jeff Moyer. This is what caused the merge conflict with blk-mq/core, in case you are looking at that. - A dm stacking fix from Mike Snitzer. - A code consolidation fix and duplicated code removal from Kent Overstreet. - A handful of block bug fixes from Mikulas Patocka, fixing a loop crash and memory corruption on blk cg. - Elevator switch bug fix from Tomoki Sekiyama. A heads-up that I had to rebase this branch. Initially the immutable bio_vecs had been queued up for inclusion, but a week later, it became clear that it wasn't fully cooked yet. So the decision was made to pull this out and postpone it until 3.14. It was a straight forward rebase, just pruning out the immutable series and the later fixes of problems with it. The rest of the patches applied directly and no further changes were made" * 'for-3.13/core' of git://git.kernel.dk/linux-block: (31 commits) block: replace IS_ERR and PTR_ERR with PTR_ERR_OR_ZERO block: replace IS_ERR and PTR_ERR with PTR_ERR_OR_ZERO block: Do not call sector_div() with a 64-bit divisor kernel: trace: blktrace: remove redundent memcpy() in compat_blk_trace_setup() block: Consolidate duplicated bio_trim() implementations block: Use rw_copy_check_uvector() block: Enable sysfs nomerge control for I/O requests in the plug list block: properly stack underlying max_segment_size to DM device elevator: acquire q->sysfs_lock in elevator_change() elevator: Fix a race in elevator switching and md device initialization block: Replace __get_cpu_var uses bdi: test bdi_init failure block: fix a probe argument to blk_register_region loop: fix crash if blk_alloc_queue fails blk-core: Fix memory corruption if blkcg_init_queue fails block: fix race between request completion and timeout handling blktrace: Send BLK_TN_PROCESS events to all running traces blk-mq: don't disallow request merges for req->special being set blk-mq: mq plug list breakage blk-mq: fix for flush deadlock ... |
|
Christoph Lameter | 170d800af8 |
block: Replace __get_cpu_var uses
__get_cpu_var() is used for multiple purposes in the kernel source. One of them is address calculation via the form &__get_cpu_var(x). This calculates the address for the instance of the percpu variable of the current processor based on an offset. Other use cases are for storing and retrieving data from the current processors percpu area. __get_cpu_var() can be used as an lvalue when writing data or on the right side of an assignment. __get_cpu_var() is defined as : #define __get_cpu_var(var) (*this_cpu_ptr(&(var))) __get_cpu_var() always only does an address determination. However, store and retrieve operations could use a segment prefix (or global register on other platforms) to avoid the address calculation. this_cpu_write() and this_cpu_read() can directly take an offset into a percpu area and use optimized assembly code to read and write per cpu variables. This patch converts __get_cpu_var into either an explicit address calculation using this_cpu_ptr() or into a use of this_cpu operations that use the offset. Thereby address calculations are avoided and less registers are used when code is generated. At the end of the patch set all uses of __get_cpu_var have been removed so the macro is removed too. The patch set includes passes over all arches as well. Once these operations are used throughout then specialized macros can be defined in non -x86 arches as well in order to optimize per cpu access by f.e. using a global register that may be set to the per cpu base. Transformations done to __get_cpu_var() 1. Determine the address of the percpu instance of the current processor. DEFINE_PER_CPU(int, y); int *x = &__get_cpu_var(y); Converts to int *x = this_cpu_ptr(&y); 2. Same as #1 but this time an array structure is involved. DEFINE_PER_CPU(int, y[20]); int *x = __get_cpu_var(y); Converts to int *x = this_cpu_ptr(y); 3. Retrieve the content of the current processors instance of a per cpu variable. DEFINE_PER_CPU(int, y); int x = __get_cpu_var(y) Converts to int x = __this_cpu_read(y); 4. Retrieve the content of a percpu struct DEFINE_PER_CPU(struct mystruct, y); struct mystruct x = __get_cpu_var(y); Converts to memcpy(&x, this_cpu_ptr(&y), sizeof(x)); 5. Assignment to a per cpu variable DEFINE_PER_CPU(int, y) __get_cpu_var(y) = x; Converts to this_cpu_write(y, x); 6. Increment/Decrement etc of a per cpu variable DEFINE_PER_CPU(int, y); __get_cpu_var(y)++ Converts to this_cpu_inc(y) Signed-off-by: Christoph Lameter <cl@linux.com> Signed-off-by: Jens Axboe <axboe@kernel.dk> |
|
David Howells | 94d30ae90a |
FS-Cache: Provide the ability to enable/disable cookies
Provide the ability to enable and disable fscache cookies. A disabled cookie will reject or ignore further requests to: Acquire a child cookie Invalidate and update backing objects Check the consistency of a backing object Allocate storage for backing page Read backing pages Write to backing pages but still allows: Checks/waits on the completion of already in-progress objects Uncaching of pages Relinquishment of cookies Two new operations are provided: (1) Disable a cookie: void fscache_disable_cookie(struct fscache_cookie *cookie, bool invalidate); If the cookie is not already disabled, this locks the cookie against other dis/enablement ops, marks the cookie as being disabled, discards or invalidates any backing objects and waits for cessation of activity on any associated object. This is a wrapper around a chunk split out of fscache_relinquish_cookie(), but it reinitialises the cookie such that it can be reenabled. All possible failures are handled internally. The caller should consider calling fscache_uncache_all_inode_pages() afterwards to make sure all page markings are cleared up. (2) Enable a cookie: void fscache_enable_cookie(struct fscache_cookie *cookie, bool (*can_enable)(void *data), void *data) If the cookie is not already enabled, this locks the cookie against other dis/enablement ops, invokes can_enable() and, if the cookie is not an index cookie, will begin the procedure of acquiring backing objects. The optional can_enable() function is passed the data argument and returns a ruling as to whether or not enablement should actually be permitted to begin. All possible failures are handled internally. The cookie will only be marked as enabled if provisional backing objects are allocated. A later patch will introduce these to NFS. Cookie enablement during nfs_open() is then contingent on i_writecount <= 0. can_enable() checks for a race between open(O_RDONLY) and open(O_WRONLY/O_RDWR). This simplifies NFS's cookie handling and allows us to get rid of open(O_RDONLY) accidentally introducing caching to an inode that's open for writing already. One operation has its API modified: (3) Acquire a cookie. struct fscache_cookie *fscache_acquire_cookie( struct fscache_cookie *parent, const struct fscache_cookie_def *def, void *netfs_data, bool enable); This now has an additional argument that indicates whether the requested cookie should be enabled by default. It doesn't need the can_enable() function because the caller must prevent multiple calls for the same netfs object and it doesn't need to take the enablement lock because no one else can get at the cookie before this returns. Signed-off-by: David Howells <dhowells@redhat.com |
|
David Howells | 1362729b16 |
FS-Cache: Simplify cookie retention for fscache_objects, fixing oops
Simplify the way fscache cache objects retain their cookie. The way I implemented the cookie storage handling made synchronisation a pain (ie. the object state machine can't rely on the cookie actually still being there). Instead of the the object being detached from the cookie and the cookie being freed in __fscache_relinquish_cookie(), we defer both operations: (*) The detachment of the object from the list in the cookie now takes place in fscache_drop_object() and is thus governed by the object state machine (fscache_detach_from_cookie() has been removed). (*) The release of the cookie is now in fscache_object_destroy() - which is called by the cache backend just before it frees the object. This means that the fscache_cookie struct is now available to the cache all the way through from ->alloc_object() to ->drop_object() and ->put_object() - meaning that it's no longer necessary to take object->lock to guarantee access. However, __fscache_relinquish_cookie() doesn't wait for the object to go all the way through to destruction before letting the netfs proceed. That would massively slow down the netfs. Since __fscache_relinquish_cookie() leaves the cookie around, in must therefore break all attachments to the netfs - which includes ->def, ->netfs_data and any outstanding page read/writes. To handle this, struct fscache_cookie now has an n_active counter: (1) This starts off initialised to 1. (2) Any time the cache needs to get at the netfs data, it calls fscache_use_cookie() to increment it - if it is not zero. If it was zero, then access is not permitted. (3) When the cache has finished with the data, it calls fscache_unuse_cookie() to decrement it. This does a wake-up on it if it reaches 0. (4) __fscache_relinquish_cookie() decrements n_active and then waits for it to reach 0. The initialisation to 1 in step (1) ensures that we only get wake ups when we're trying to get rid of the cookie. This leaves __fscache_relinquish_cookie() a lot simpler. *** This fixes a problem in the current code whereby if fscache_invalidate() is followed sufficiently quickly by fscache_relinquish_cookie() then it is possible for __fscache_relinquish_cookie() to have detached the cookie from the object and cleared the pointer before a thread is dispatched to process the invalidation state in the object state machine. Since the pending write clearance was deferred to the invalidation state to make it asynchronous, we need to either wait in relinquishment for the stores tree to be cleared in the invalidation state or we need to handle the clearance in relinquishment. Further, if the relinquishment code does clear the tree, then the invalidation state need to make the clearance contingent on still having the cookie to hand (since that's where the tree is rooted) and we have to prevent the cookie from disappearing for the duration. This can lead to an oops like the following: BUG: unable to handle kernel NULL pointer dereference at 000000000000000c ... RIP: 0010:[<ffffffff8151023e>] _spin_lock+0xe/0x30 ... CR2: 000000000000000c ... ... Process kslowd002 (...) .... Call Trace: [<ffffffffa01c3278>] fscache_invalidate_writes+0x38/0xd0 [fscache] [<ffffffff810096f0>] ? __switch_to+0xd0/0x320 [<ffffffff8105e759>] ? find_busiest_queue+0x69/0x150 [<ffffffff8110ddd4>] ? slow_work_enqueue+0x104/0x180 [<ffffffffa01c1303>] fscache_object_slow_work_execute+0x5e3/0x9d0 [fscache] [<ffffffff81096b67>] ? bit_waitqueue+0x17/0xd0 [<ffffffff8110e233>] slow_work_execute+0x233/0x310 [<ffffffff8110e515>] slow_work_thread+0x205/0x360 [<ffffffff81096ca0>] ? autoremove_wake_function+0x0/0x40 [<ffffffff8110e310>] ? slow_work_thread+0x0/0x360 [<ffffffff81096936>] kthread+0x96/0xa0 [<ffffffff8100c0ca>] child_rip+0xa/0x20 [<ffffffff810968a0>] ? kthread+0x0/0xa0 [<ffffffff8100c0c0>] ? child_rip+0x0/0x20 The parameter to fscache_invalidate_writes() was object->cookie which is NULL. Signed-off-by: David Howells <dhowells@redhat.com> Tested-By: Milosz Tanski <milosz@adfin.com> Acked-by: Jeff Layton <jlayton@redhat.com> |
|
David Howells | caaef6900b |
FS-Cache: Fix object state machine to have separate work and wait states
Fix object state machine to have separate work and wait states as that makes it easier to envision. There are now three kinds of state: (1) Work state. This is an execution state. No event processing is performed by a work state. The function attached to a work state returns a pointer indicating the next state to which the OSM should transition. Returning NO_TRANSIT repeats the current state, but goes back to the scheduler first. (2) Wait state. This is an event processing state. No execution is performed by a wait state. Wait states are just tables of "if event X occurs, clear it and transition to state Y". The dispatcher returns to the scheduler if none of the events in which the wait state has an interest are currently pending. (3) Out-of-band state. This is a special work state. Transitions to normal states can be overridden when an unexpected event occurs (eg. I/O error). Instead the dispatcher disables and clears the OOB event and transits to the specified work state. This then acts as an ordinary work state, though object->state points to the overridden destination. Returning NO_TRANSIT resumes the overridden transition. In addition, the states have names in their definitions, so there's no need for tables of state names. Further, the EV_REQUEUE event is no longer necessary as that is automatic for work states. Since the states are now separate structs rather than values in an enum, it's not possible to use comparisons other than (non-)equality between them, so use some object->flags to indicate what phase an object is in. The EV_RELEASE, EV_RETIRE and EV_WITHDRAW events have been squished into one (EV_KILL). An object flag now carries the information about retirement. Similarly, the RELEASING, RECYCLING and WITHDRAWING states have been merged into an KILL_OBJECT state and additional states have been added for handling waiting dependent objects (JUMPSTART_DEPS and KILL_DEPENDENTS). A state has also been added for synchronising with parent object initialisation (WAIT_FOR_PARENT) and another for initiating look up (PARENT_READY). Signed-off-by: David Howells <dhowells@redhat.com> Tested-By: Milosz Tanski <milosz@adfin.com> Acked-by: Jeff Layton <jlayton@redhat.com> |
|
David Howells | 493f7bc114 |
FS-Cache: Wrap checks on object state
Wrap checks on object state (mostly outside of fs/fscache/object.c) with inline functions so that the mechanism can be replaced. Some of the state checks within object.c are left as-is as they will be replaced. Signed-off-by: David Howells <dhowells@redhat.com> Tested-By: Milosz Tanski <milosz@adfin.com> Acked-by: Jeff Layton <jlayton@redhat.com> |
|
David Howells | 610be24ee4 |
FS-Cache: Uninline fscache_object_init()
Uninline fscache_object_init() so as not to expose some of the FS-Cache internals to the cache backend. Signed-off-by: David Howells <dhowells@redhat.com> Tested-By: Milosz Tanski <milosz@adfin.com> Acked-by: Jeff Layton <jlayton@redhat.com> |
|
David Howells | 969695215f |
FS-Cache: Add transition to handle invalidate immediately after lookup
Add a missing transition to the FS-Cache object state machine to handle an invalidation event occuring between the back end completing the object lookup by calling fscache_obtained_object() (which moves to state OBJECT_AVAILABLE) and the backend returning to fscache_lookup_object() and thence to fscache_object_state_machine() which then does a goto lookup_transit to handle the transition - but lookup_transit doesn't handle EV_INVALIDATE. Without this, the following BUG can be logged: FS-Cache: Unsupported event 2 [5/f7] in state OBJECT_AVAILABLE ------------[ cut here ]------------ kernel BUG at fs/fscache/object.c:357! Where event 2 is EV_INVALIDATE. Signed-off-by: David Howells <dhowells@redhat.com> |
|
David Howells | 8d76349d35 |
FS-Cache: Exclusive op submission can BUG if there's been an I/O error
The function to submit an exclusive op (fscache_submit_exclusive_op()) can BUG if there's been an I/O error because it may see the parent cache object in an unexpected state. It should only BUG if there hasn't been an I/O error. In this case the problem was produced by remounting the cache partition to be R/O. The EROFS state was detected and the cache was aborted, but not everything handled the aborting correctly. SysRq : Emergency Remount R/O EXT4-fs (sda6): re-mounted. Opts: (null) Emergency Remount complete CacheFiles: I/O Error: Failed to update xattr with error -30 FS-Cache: Cache cachefiles stopped due to I/O error ------------[ cut here ]------------ kernel BUG at fs/fscache/operation.c:128! invalid opcode: 0000 [#1] SMP CPU 0 Modules linked in: cachefiles nfs fscache auth_rpcgss nfs_acl lockd sunrpc Pid: 6612, comm: kworker/u:2 Not tainted 3.1.0-rc8-fsdevel+ #1093 /DG965RY RIP: 0010:[<ffffffffa00739c0>] [<ffffffffa00739c0>] fscache_submit_exclusive_op+0x2ad/0x2c2 [fscache] RSP: 0018:ffff880000853d40 EFLAGS: 00010206 RAX: ffff880038ac72a8 RBX: ffff8800181f2260 RCX: ffffffff81f2b2b0 RDX: 0000000000000001 RSI: ffffffff8179a478 RDI: ffff8800181f2280 RBP: ffff880000853d60 R08: 0000000000000002 R09: 0000000000000000 R10: 0000000000000001 R11: 0000000000000001 R12: ffff880038ac7268 R13: ffff8800181f2280 R14: ffff88003a359190 R15: 000000010122b162 FS: 0000000000000000(0000) GS:ffff88003bc00000(0000) knlGS:0000000000000000 CS: 0010 DS: 0000 ES: 0000 CR0: 000000008005003b CR2: 00000034cc4a77f0 CR3: 0000000010e96000 CR4: 00000000000006f0 DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000 DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400 Process kworker/u:2 (pid: 6612, threadinfo ffff880000852000, task ffff880014c3c040) Stack: ffff8800181f2260 ffff8800181f2310 ffff880038ac7268 ffff8800181f2260 ffff880000853dc0 ffffffffa0072375 ffff880037ecfe00 ffff88003a359198 ffff880000853dc0 0000000000000246 0000000000000000 ffff88000a91d308 Call Trace: [<ffffffffa0072375>] fscache_object_work_func+0x792/0xe65 [fscache] [<ffffffff81047e44>] process_one_work+0x1eb/0x37f [<ffffffff81047de6>] ? process_one_work+0x18d/0x37f [<ffffffffa0071be3>] ? fscache_enqueue_dependents+0xd8/0xd8 [fscache] [<ffffffff810482e4>] worker_thread+0x15a/0x21a [<ffffffff8104818a>] ? rescuer_thread+0x188/0x188 [<ffffffff8104bf96>] kthread+0x7f/0x87 [<ffffffff813ad6f4>] kernel_thread_helper+0x4/0x10 [<ffffffff81026b98>] ? finish_task_switch+0x45/0xc0 [<ffffffff813abd1d>] ? retint_restore_args+0xe/0xe [<ffffffff8104bf17>] ? __init_kthread_worker+0x53/0x53 [<ffffffff813ad6f0>] ? gs_change+0xb/0xb Signed-off-by: David Howells <dhowells@redhat.com> |
|
David Howells | 03acc4be5e |
FS-Cache: Initialise the object event mask with the calculated mask
Initialise the object event mask with the calculated mask rather than unmasking undefined events also. Signed-off-by: David Howells <dhowells@redhat.com> |
|
David Howells | ef778e7ae6 |
FS-Cache: Provide proper invalidation
Provide a proper invalidation method rather than relying on the netfs retiring the cookie it has and getting a new one. The problem with this is that isn't easy for the netfs to make sure that it has completed/cancelled all its outstanding storage and retrieval operations on the cookie it is retiring. Instead, have the cache provide an invalidation method that will cancel or wait for all currently outstanding operations before invalidating the cache, and will cause new operations to queue up behind that. Whilst invalidation is in progress, some requests will be rejected until the cache can stack a barrier on the operation queue to cause new operations to be deferred behind it. Signed-off-by: David Howells <dhowells@redhat.com> |
|
David Howells | 9f10523f89 |
FS-Cache: Fix operation state management and accounting
Fix the state management of internal fscache operations and the accounting of what operations are in what states. This is done by: (1) Give struct fscache_operation a enum variable that directly represents the state it's currently in, rather than spreading this knowledge over a bunch of flags, who's processing the operation at the moment and whether it is queued or not. This makes it easier to write assertions to check the state at various points and to prevent invalid state transitions. (2) Add an 'operation complete' state and supply a function to indicate the completion of an operation (fscache_op_complete()) and make things call it. The final call to fscache_put_operation() can then check that an op in the appropriate state (complete or cancelled). (3) Adjust the use of object->n_ops, ->n_in_progress, ->n_exclusive to better govern the state of an object: (a) The ->n_ops is now the number of extant operations on the object and is now decremented by fscache_put_operation() only. (b) The ->n_in_progress is simply the number of objects that have been taken off of the object's pending queue for the purposes of being run. This is decremented by fscache_op_complete() only. (c) The ->n_exclusive is the number of exclusive ops that have been submitted and queued or are in progress. It is decremented by fscache_op_complete() and by fscache_cancel_op(). fscache_put_operation() and fscache_operation_gc() now no longer try to clean up ->n_exclusive and ->n_in_progress. That was leading to double decrements against fscache_cancel_op(). fscache_cancel_op() now no longer decrements ->n_ops. That was leading to double decrements against fscache_put_operation(). fscache_submit_exclusive_op() now decides whether it has to queue an op based on ->n_in_progress being > 0 rather than ->n_ops > 0 as the latter will persist in being true even after all preceding operations have been cancelled or completed. Furthermore, if an object is active and there are runnable ops against it, there must be at least one op running. (4) Add a remaining-pages counter (n_pages) to struct fscache_retrieval and provide a function to record completion of the pages as they complete. When n_pages reaches 0, the operation is deemed to be complete and fscache_op_complete() is called. Add calls to fscache_retrieval_complete() anywhere we've finished with a page we've been given to read or allocate for. This includes places where we just return pages to the netfs for reading from the server and where accessing the cache fails and we discard the proposed netfs page. The bugs in the unfixed state management manifest themselves as oopses like the following where the operation completion gets out of sync with return of the cookie by the netfs. This is possible because the cache unlocks and returns all the netfs pages before recording its completion - which means that there's nothing to stop the netfs discarding them and returning the cookie. FS-Cache: Cookie 'NFS.fh' still has outstanding reads ------------[ cut here ]------------ kernel BUG at fs/fscache/cookie.c:519! invalid opcode: 0000 [#1] SMP CPU 1 Modules linked in: cachefiles nfs fscache auth_rpcgss nfs_acl lockd sunrpc Pid: 400, comm: kswapd0 Not tainted 3.1.0-rc7-fsdevel+ #1090 /DG965RY RIP: 0010:[<ffffffffa007050a>] [<ffffffffa007050a>] __fscache_relinquish_cookie+0x170/0x343 [fscache] RSP: 0018:ffff8800368cfb00 EFLAGS: 00010282 RAX: 000000000000003c RBX: ffff880023cc8790 RCX: 0000000000000000 RDX: 0000000000002f2e RSI: 0000000000000001 RDI: ffffffff813ab86c RBP: ffff8800368cfb50 R08: 0000000000000002 R09: 0000000000000000 R10: ffff88003a1b7890 R11: ffff88001df6e488 R12: ffff880023d8ed98 R13: ffff880023cc8798 R14: 0000000000000004 R15: ffff88003b8bf370 FS: 0000000000000000(0000) GS:ffff88003bd00000(0000) knlGS:0000000000000000 CS: 0010 DS: 0000 ES: 0000 CR0: 000000008005003b CR2: 00000000008ba008 CR3: 0000000023d93000 CR4: 00000000000006e0 DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000 DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400 Process kswapd0 (pid: 400, threadinfo ffff8800368ce000, task ffff88003b8bf040) Stack: ffff88003b8bf040 ffff88001df6e528 ffff88001df6e528 ffffffffa00b46b0 ffff88003b8bf040 ffff88001df6e488 ffff88001df6e620 ffffffffa00b46b0 ffff88001ebd04c8 0000000000000004 ffff8800368cfb70 ffffffffa00b2c91 Call Trace: [<ffffffffa00b2c91>] nfs_fscache_release_inode_cookie+0x3b/0x47 [nfs] [<ffffffffa008f25f>] nfs_clear_inode+0x3c/0x41 [nfs] [<ffffffffa0090df1>] nfs4_evict_inode+0x2f/0x33 [nfs] [<ffffffff810d8d47>] evict+0xa1/0x15c [<ffffffff810d8e2e>] dispose_list+0x2c/0x38 [<ffffffff810d9ebd>] prune_icache_sb+0x28c/0x29b [<ffffffff810c56b7>] prune_super+0xd5/0x140 [<ffffffff8109b615>] shrink_slab+0x102/0x1ab [<ffffffff8109d690>] balance_pgdat+0x2f2/0x595 [<ffffffff8103e009>] ? process_timeout+0xb/0xb [<ffffffff8109dba3>] kswapd+0x270/0x289 [<ffffffff8104c5ea>] ? __init_waitqueue_head+0x46/0x46 [<ffffffff8109d933>] ? balance_pgdat+0x595/0x595 [<ffffffff8104bf7a>] kthread+0x7f/0x87 [<ffffffff813ad6b4>] kernel_thread_helper+0x4/0x10 [<ffffffff81026b98>] ? finish_task_switch+0x45/0xc0 [<ffffffff813abcdd>] ? retint_restore_args+0xe/0xe [<ffffffff8104befb>] ? __init_kthread_worker+0x53/0x53 [<ffffffff813ad6b0>] ? gs_change+0xb/0xb Signed-off-by: David Howells <dhowells@redhat.com> |
|
Tejun Heo | 8b8edefa2f |
fscache: convert object to use workqueue instead of slow-work
Make fscache object state transition callbacks use workqueue instead of slow-work. New dedicated unbound CPU workqueue fscache_object_wq is created. get/put callbacks are renamed and modified to take @object and called directly from the enqueue wrapper and the work function. While at it, make all open coded instances of get/put to use fscache_get/put_object(). * Unbound workqueue is used. * work_busy() output is printed instead of slow-work flags in object debugging outputs. They mean basically the same thing bit-for-bit. * sysctl fscache.object_max_active added to control concurrency. The default value is nr_cpus clamped between 4 and WQ_UNBOUND_MAX_ACTIVE. * slow_work_sleep_till_thread_needed() is replaced with fscache private implementation fscache_object_sleep_till_congested() which waits on fscache_object_wq congestion. * debugfs support is dropped for now. Tracing API based debug facility is planned to be added. Signed-off-by: Tejun Heo <tj@kernel.org> Acked-by: David Howells <dhowells@redhat.com> |
|
David Howells | a53f4f9efa |
SLOW_WORK: CONFIG_SLOW_WORK_PROC should be CONFIG_SLOW_WORK_DEBUG
CONFIG_SLOW_WORK_PROC was changed to CONFIG_SLOW_WORK_DEBUG, but not in all instances. Change the remaining instances. This makes the debugfs file display the time mark and the owner's description again. Signed-off-by: David Howells <dhowells@redhat.com> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> |
|
David Howells | fee096deb4 |
CacheFiles: Catch an overly long wait for an old active object
Catch an overly long wait for an old, dying active object when we want to replace it with a new one. The probability is that all the slow-work threads are hogged, and the delete can't get a look in. What we do instead is: (1) if there's nothing in the slow work queue, we sleep until either the dying object has finished dying or there is something in the slow work queue behind which we can queue our object. (2) if there is something in the slow work queue, we return ETIMEDOUT to fscache_lookup_object(), which then puts us back on the slow work queue, presumably behind the deletion that we're blocked by. We are then deferred for a while until we work our way back through the queue - without blocking a slow-work thread unnecessarily. A backtrace similar to the following may appear in the log without this patch: INFO: task kslowd004:5711 blocked for more than 120 seconds. "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message. kslowd004 D 0000000000000000 0 5711 2 0x00000080 ffff88000340bb80 0000000000000046 ffff88002550d000 0000000000000000 ffff88002550d000 0000000000000007 ffff88000340bfd8 ffff88002550d2a8 000000000000ddf0 00000000000118c0 00000000000118c0 ffff88002550d2a8 Call Trace: [<ffffffff81058e21>] ? trace_hardirqs_on+0xd/0xf [<ffffffffa011c4d8>] ? cachefiles_wait_bit+0x0/0xd [cachefiles] [<ffffffffa011c4e1>] cachefiles_wait_bit+0x9/0xd [cachefiles] [<ffffffff81353153>] __wait_on_bit+0x43/0x76 [<ffffffff8111ae39>] ? ext3_xattr_get+0x1ec/0x270 [<ffffffff813531ef>] out_of_line_wait_on_bit+0x69/0x74 [<ffffffffa011c4d8>] ? cachefiles_wait_bit+0x0/0xd [cachefiles] [<ffffffff8104c125>] ? wake_bit_function+0x0/0x2e [<ffffffffa011bc79>] cachefiles_mark_object_active+0x203/0x23b [cachefiles] [<ffffffffa011c209>] cachefiles_walk_to_object+0x558/0x827 [cachefiles] [<ffffffffa011a429>] cachefiles_lookup_object+0xac/0x12a [cachefiles] [<ffffffffa00aa1e9>] fscache_lookup_object+0x1c7/0x214 [fscache] [<ffffffffa00aafc5>] fscache_object_state_machine+0xa5/0x52d [fscache] [<ffffffffa00ab4ac>] fscache_object_slow_work_execute+0x5f/0xa0 [fscache] [<ffffffff81082093>] slow_work_execute+0x18f/0x2d1 [<ffffffff8108239a>] slow_work_thread+0x1c5/0x308 [<ffffffff8104c0f1>] ? autoremove_wake_function+0x0/0x34 [<ffffffff810821d5>] ? slow_work_thread+0x0/0x308 [<ffffffff8104be91>] kthread+0x7a/0x82 [<ffffffff8100beda>] child_rip+0xa/0x20 [<ffffffff8100b87c>] ? restore_args+0x0/0x30 [<ffffffff8104be17>] ? kthread+0x0/0x82 [<ffffffff8100bed0>] ? child_rip+0x0/0x20 1 lock held by kslowd004/5711: #0: (&sb->s_type->i_mutex_key#7/1){+.+.+.}, at: [<ffffffffa011be64>] cachefiles_walk_to_object+0x1b3/0x827 [cachefiles] Signed-off-by: David Howells <dhowells@redhat.com> |
|
David Howells | 868411be3f |
FS-Cache: Actually requeue an object when requested
FS-Cache objects have an FSCACHE_OBJECT_EV_REQUEUE event that can theoretically be raised to ask the state machine to requeue the object for further processing before the work function returns to the slow-work facility. However, fscache_object_work_execute() was clearing that bit before checking the event mask to see whether the object has any pending events that require it to be requeued immediately. Instead, the bit should be cleared after the check and enqueue. Signed-off-by: David Howells <dhowells@redhat.com> |
|
David Howells | 60d543ca72 |
FS-Cache: Start processing an object's operations on that object's death
Start processing an object's operations when that object moves into the DYING state as the object cannot be destroyed until all its outstanding operations have completed. Furthermore, make sure that read and allocation operations handle being woken up on a dead object. Such events are recorded in the Allocs.abt and Retrvls.abt statistics as viewable through /proc/fs/fscache/stats. The code for waiting for object activation for the read and allocation operations is also extracted into its own function as it is much the same in all cases, differing only in the stats incremented. Signed-off-by: David Howells <dhowells@redhat.com> |
|
David Howells | d461d26dde |
FS-Cache: Make sure FSCACHE_COOKIE_LOOKING_UP cleared on lookup failure
We must make sure that FSCACHE_COOKIE_LOOKING_UP is cleared on lookup failure (if an object reaches the LC_DYING state), and we should clear it before clearing FSCACHE_COOKIE_CREATING. If this doesn't happen then fscache_wait_for_deferred_lookup() may hold allocation and retrieval operations indefinitely until they're interrupted by signals - which in turn pins the dying object until they go away. Signed-off-by: David Howells <dhowells@redhat.com> |
|
David Howells | 6897e3df8f |
FS-Cache: The object-available state can't rely on the cookie to be available
The object-available state in the object processing state machine (as processed by fscache_object_available()) can't rely on the cookie to be available because the FSCACHE_COOKIE_CREATING bit may have been cleared by fscache_obtained_object() prior to the object being put into the FSCACHE_OBJECT_AVAILABLE state. Clearing the FSCACHE_COOKIE_CREATING bit on a cookie permits __fscache_relinquish_cookie() to proceed and detach the cookie from the object. To deal with this, we don't dereference object->cookie in fscache_object_available() if the object has already been detached. In addition, a couple of assertions are added into fscache_drop_object() to make sure the object is unbound from the cookie before it gets there. Signed-off-by: David Howells <dhowells@redhat.com> |
|
David Howells | 52bd75fdb1 |
FS-Cache: Add counters for entry/exit to/from cache operation functions
Count entries to and exits from cache operation table functions. Maintain these as a single counter that's added to or removed from as appropriate. Signed-off-by: David Howells <dhowells@redhat.com> |
|
David Howells | 4fbf4291aa |
FS-Cache: Allow the current state of all objects to be dumped
Allow the current state of all fscache objects to be dumped by doing: cat /proc/fs/fscache/objects By default, all objects and all fields will be shown. This can be restricted by adding a suitable key to one of the caller's keyrings (such as the session keyring): keyctl add user fscache:objlist "<restrictions>" @s The <restrictions> are: K Show hexdump of object key (don't show if not given) A Show hexdump of object aux data (don't show if not given) And paired restrictions: C Show objects that have a cookie c Show objects that don't have a cookie B Show objects that are busy b Show objects that aren't busy W Show objects that have pending writes w Show objects that don't have pending writes R Show objects that have outstanding reads r Show objects that don't have outstanding reads S Show objects that have slow work queued s Show objects that don't have slow work queued If neither side of a restriction pair is given, then both are implied. For example: keyctl add user fscache:objlist KB @s shows objects that are busy, and lists their object keys, but does not dump their auxiliary data. It also implies "CcWwRrSs", but as 'B' is given, 'b' is not implied. Signed-off-by: David Howells <dhowells@redhat.com> |
|
David Howells | 440f0affe2 |
FS-Cache: Annotate slow-work runqueue proc lines for FS-Cache work items
Annotate slow-work runqueue proc lines for FS-Cache work items. Objects include the object ID and the state. Operations include the object ID, the operation ID and the operation type and state. Signed-off-by: David Howells <dhowells@redhat.com> |
|
David Howells | 3d7a641e54 |
SLOW_WORK: Wait for outstanding work items belonging to a module to clear
Wait for outstanding slow work items belonging to a module to clear when unregistering that module as a user of the facility. This prevents the put_ref code of a work item from being taken away before it returns. Signed-off-by: David Howells <dhowells@redhat.com> |
|
David Howells | 36c9559022 |
FS-Cache: Object management state machine
Implement the cache object management state machine. The following documentation is added to illuminate the working of this state machine. It will also be added as: Documentation/filesystems/caching/object.txt ==================================================== IN-KERNEL CACHE OBJECT REPRESENTATION AND MANAGEMENT ==================================================== ============== REPRESENTATION ============== FS-Cache maintains an in-kernel representation of each object that a netfs is currently interested in. Such objects are represented by the fscache_cookie struct and are referred to as cookies. FS-Cache also maintains a separate in-kernel representation of the objects that a cache backend is currently actively caching. Such objects are represented by the fscache_object struct. The cache backends allocate these upon request, and are expected to embed them in their own representations. These are referred to as objects. There is a 1:N relationship between cookies and objects. A cookie may be represented by multiple objects - an index may exist in more than one cache - or even by no objects (it may not be cached). Furthermore, both cookies and objects are hierarchical. The two hierarchies correspond, but the cookies tree is a superset of the union of the object trees of multiple caches: NETFS INDEX TREE : CACHE 1 : CACHE 2 : : : +-----------+ : +----------->| IObject | : +-----------+ | : +-----------+ : | ICookie |-------+ : | : +-----------+ | : | : +-----------+ | +------------------------------>| IObject | | : | : +-----------+ | : V : | | : +-----------+ : | V +----------->| IObject | : | +-----------+ | : +-----------+ : | | ICookie |-------+ : | : V +-----------+ | : | : +-----------+ | +------------------------------>| IObject | +-----+-----+ : | : +-----------+ | | : | : | V | : V : | +-----------+ | : +-----------+ : | | ICookie |------------------------->| IObject | : | +-----------+ | : +-----------+ : | | V : | : V | +-----------+ : | : +-----------+ | | ICookie |-------------------------------->| IObject | | +-----------+ : | : +-----------+ V | : V : | +-----------+ | : +-----------+ : | | DCookie |------------------------->| DObject | : | +-----------+ | : +-----------+ : | | : : | +-------+-------+ : : | | | : : | V V : : V +-----------+ +-----------+ : : +-----------+ | DCookie | | DCookie |------------------------>| DObject | +-----------+ +-----------+ : : +-----------+ : : In the above illustration, ICookie and IObject represent indices and DCookie and DObject represent data storage objects. Indices may have representation in multiple caches, but currently, non-index objects may not. Objects of any type may also be entirely unrepresented. As far as the netfs API goes, the netfs is only actually permitted to see pointers to the cookies. The cookies themselves and any objects attached to those cookies are hidden from it. =============================== OBJECT MANAGEMENT STATE MACHINE =============================== Within FS-Cache, each active object is managed by its own individual state machine. The state for an object is kept in the fscache_object struct, in object->state. A cookie may point to a set of objects that are in different states. Each state has an action associated with it that is invoked when the machine wakes up in that state. There are four logical sets of states: (1) Preparation: states that wait for the parent objects to become ready. The representations are hierarchical, and it is expected that an object must be created or accessed with respect to its parent object. (2) Initialisation: states that perform lookups in the cache and validate what's found and that create on disk any missing metadata. (3) Normal running: states that allow netfs operations on objects to proceed and that update the state of objects. (4) Termination: states that detach objects from their netfs cookies, that delete objects from disk, that handle disk and system errors and that free up in-memory resources. In most cases, transitioning between states is in response to signalled events. When a state has finished processing, it will usually set the mask of events in which it is interested (object->event_mask) and relinquish the worker thread. Then when an event is raised (by calling fscache_raise_event()), if the event is not masked, the object will be queued for processing (by calling fscache_enqueue_object()). PROVISION OF CPU TIME --------------------- The work to be done by the various states is given CPU time by the threads of the slow work facility (see Documentation/slow-work.txt). This is used in preference to the workqueue facility because: (1) Threads may be completely occupied for very long periods of time by a particular work item. These state actions may be doing sequences of synchronous, journalled disk accesses (lookup, mkdir, create, setxattr, getxattr, truncate, unlink, rmdir, rename). (2) Threads may do little actual work, but may rather spend a lot of time sleeping on I/O. This means that single-threaded and 1-per-CPU-threaded workqueues don't necessarily have the right numbers of threads. LOCKING SIMPLIFICATION ---------------------- Because only one worker thread may be operating on any particular object's state machine at once, this simplifies the locking, particularly with respect to disconnecting the netfs's representation of a cache object (fscache_cookie) from the cache backend's representation (fscache_object) - which may be requested from either end. ================= THE SET OF STATES ================= The object state machine has a set of states that it can be in. There are preparation states in which the object sets itself up and waits for its parent object to transit to a state that allows access to its children: (1) State FSCACHE_OBJECT_INIT. Initialise the object and wait for the parent object to become active. In the cache, it is expected that it will not be possible to look an object up from the parent object, until that parent object itself has been looked up. There are initialisation states in which the object sets itself up and accesses disk for the object metadata: (2) State FSCACHE_OBJECT_LOOKING_UP. Look up the object on disk, using the parent as a starting point. FS-Cache expects the cache backend to probe the cache to see whether this object is represented there, and if it is, to see if it's valid (coherency management). The cache should call fscache_object_lookup_negative() to indicate lookup failure for whatever reason, and should call fscache_obtained_object() to indicate success. At the completion of lookup, FS-Cache will let the netfs go ahead with read operations, no matter whether the file is yet cached. If not yet cached, read operations will be immediately rejected with ENODATA until the first known page is uncached - as to that point there can be no data to be read out of the cache for that file that isn't currently also held in the pagecache. (3) State FSCACHE_OBJECT_CREATING. Create an object on disk, using the parent as a starting point. This happens if the lookup failed to find the object, or if the object's coherency data indicated what's on disk is out of date. In this state, FS-Cache expects the cache to create The cache should call fscache_obtained_object() if creation completes successfully, fscache_object_lookup_negative() otherwise. At the completion of creation, FS-Cache will start processing write operations the netfs has queued for an object. If creation failed, the write ops will be transparently discarded, and nothing recorded in the cache. There are some normal running states in which the object spends its time servicing netfs requests: (4) State FSCACHE_OBJECT_AVAILABLE. A transient state in which pending operations are started, child objects are permitted to advance from FSCACHE_OBJECT_INIT state, and temporary lookup data is freed. (5) State FSCACHE_OBJECT_ACTIVE. The normal running state. In this state, requests the netfs makes will be passed on to the cache. (6) State FSCACHE_OBJECT_UPDATING. The state machine comes here to update the object in the cache from the netfs's records. This involves updating the auxiliary data that is used to maintain coherency. And there are terminal states in which an object cleans itself up, deallocates memory and potentially deletes stuff from disk: (7) State FSCACHE_OBJECT_LC_DYING. The object comes here if it is dying because of a lookup or creation error. This would be due to a disk error or system error of some sort. Temporary data is cleaned up, and the parent is released. (8) State FSCACHE_OBJECT_DYING. The object comes here if it is dying due to an error, because its parent cookie has been relinquished by the netfs or because the cache is being withdrawn. Any child objects waiting on this one are given CPU time so that they too can destroy themselves. This object waits for all its children to go away before advancing to the next state. (9) State FSCACHE_OBJECT_ABORT_INIT. The object comes to this state if it was waiting on its parent in FSCACHE_OBJECT_INIT, but its parent died. The object will destroy itself so that the parent may proceed from the FSCACHE_OBJECT_DYING state. (10) State FSCACHE_OBJECT_RELEASING. (11) State FSCACHE_OBJECT_RECYCLING. The object comes to one of these two states when dying once it is rid of all its children, if it is dying because the netfs relinquished its cookie. In the first state, the cached data is expected to persist, and in the second it will be deleted. (12) State FSCACHE_OBJECT_WITHDRAWING. The object transits to this state if the cache decides it wants to withdraw the object from service, perhaps to make space, but also due to error or just because the whole cache is being withdrawn. (13) State FSCACHE_OBJECT_DEAD. The object transits to this state when the in-memory object record is ready to be deleted. The object processor shouldn't ever see an object in this state. THE SET OF EVENTS ----------------- There are a number of events that can be raised to an object state machine: (*) FSCACHE_OBJECT_EV_UPDATE The netfs requested that an object be updated. The state machine will ask the cache backend to update the object, and the cache backend will ask the netfs for details of the change through its cookie definition ops. (*) FSCACHE_OBJECT_EV_CLEARED This is signalled in two circumstances: (a) when an object's last child object is dropped and (b) when the last operation outstanding on an object is completed. This is used to proceed from the dying state. (*) FSCACHE_OBJECT_EV_ERROR This is signalled when an I/O error occurs during the processing of some object. (*) FSCACHE_OBJECT_EV_RELEASE (*) FSCACHE_OBJECT_EV_RETIRE These are signalled when the netfs relinquishes a cookie it was using. The event selected depends on whether the netfs asks for the backing object to be retired (deleted) or retained. (*) FSCACHE_OBJECT_EV_WITHDRAW This is signalled when the cache backend wants to withdraw an object. This means that the object will have to be detached from the netfs's cookie. Because the withdrawing releasing/retiring events are all handled by the object state machine, it doesn't matter if there's a collision with both ends trying to sever the connection at the same time. The state machine can just pick which one it wants to honour, and that effects the other. Signed-off-by: David Howells <dhowells@redhat.com> Acked-by: Steve Dickson <steved@redhat.com> Acked-by: Trond Myklebust <Trond.Myklebust@netapp.com> Acked-by: Al Viro <viro@zeniv.linux.org.uk> Tested-by: Daire Byrne <Daire.Byrne@framestore.com> |