Operations that modify state for a whole page must be synchronized across
all requests within a page group. In the write path, this is calling
end_page_writeback and removing the head request from an inode.
Neither operation should be called until all requests in a page group
have reached the point where they would call it.
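A minimal userspace sketch of the rule above, not the kernel code: each
request records that it has reached the sync point, and only the last one
to arrive performs the whole-page operation. The names pgs_group and
pgs_group_sync are hypothetical, and the sketch is single-threaded; real
code would need locking or atomic bit operations.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical stand-in for a page group; not the kernel's struct nfs_page. */
struct pgs_group {
	unsigned int nr_reqs;    /* requests that share the page */
	unsigned int nr_synced;  /* requests that reached the sync point */
	bool writeback_ended;    /* models end_page_writeback() having run */
};

/*
 * Called once by each request when it reaches the point where it would
 * end writeback. Returns true only for the last request in the group,
 * which is the one allowed to touch whole-page state.
 */
static bool pgs_group_sync(struct pgs_group *g)
{
	assert(g->nr_synced < g->nr_reqs);
	g->nr_synced++;
	if (g->nr_synced < g->nr_reqs)
		return false;
	g->writeback_ended = true;
	return true;
}
```

With a single request per group, the first caller is also the last, which
matches the "no effect yet" behaviour described in this message.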
This patch should have no effect yet since all page groups currently
have one request, but will come into play when pg_test functions are
modified to split pages into sub-page regions.
Signed-off-by: Weston Andros Adamson <dros@primarydata.com>
Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
Operations that modify state for a whole page must be synchronized across
all requests within a page group. In the read path, this is calling
unlock_page and SetPageUptodate. Neither function should be called until
all requests in a page group have reached the point where they would
call it.
This patch should have no effect yet since all page groups currently
have one request, but will come into play when pg_test functions are
modified to split pages into sub-page regions.
Signed-off-by: Weston Andros Adamson <dros@primarydata.com>
Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
Add "page groups" - a circular list of nfs requests (struct nfs_page)
that all reference the same page. This gives nfs read and write paths
the ability to account for sub-page regions independently. This
somewhat follows the design of struct buffer_head's sub-page
accounting.
Only "head" requests are ever added/removed from the inode list in
the buffered write path. "head" and "sub" requests are treated the
same through the read path and the rest of the write/commit path.
Requests are given an extra reference across the life of the list.
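The circular list can be pictured with the following userspace sketch. It
is an illustration under assumptions, not the layout of struct nfs_page:
the type pgl_req and the helpers pgl_group_init, pgl_group_add and
pgl_group_bytes are hypothetical.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical, simplified request; the real struct has many more fields. */
struct pgl_req {
	struct pgl_req *head;  /* first request created for the page */
	struct pgl_req *next;  /* next request in the circular group */
	unsigned int offset;   /* byte offset within the page */
	unsigned int bytes;    /* length of this sub-page region */
};

/* Make req a group of one: it is its own head and points at itself. */
static void pgl_group_init(struct pgl_req *req)
{
	req->head = req;
	req->next = req;
}

/* Link sub into the group directly after prev, keeping the list circular. */
static void pgl_group_add(struct pgl_req *prev, struct pgl_req *sub)
{
	assert(prev->head != NULL);
	sub->head = prev->head;
	sub->next = prev->next;
	prev->next = sub;
}

/* Walk the ring once, starting and ending at the head. */
static unsigned int pgl_group_bytes(const struct pgl_req *head)
{
	const struct pgl_req *r = head;
	unsigned int total = 0;

	do {
		total += r->bytes;
		r = r->next;
	} while (r != head);
	return total;
}
```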
Page groups are never rejoined after being split. If the read/write
request fails and the client falls back to another path (i.e. reverts
to the MDS in the pNFS case), the already split requests are pushed
through the recoalescing code again, which may split them further and
then coalesce them into properly sized requests on the wire.
Fragmentation shouldn't be a problem with the current design, because
we flush all requests in a page group when a non-contiguous request is
added, so the only time resplitting should occur is on a resend of a
read or write.
This patch lays the groundwork for sub-page splitting, but does not
actually do any splitting. For now all page groups have one request
as pg_test functions don't yet split pages. There are several related
patches that are needed to support multiple requests per page group.
Signed-off-by: Weston Andros Adamson <dros@primarydata.com>
Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
This is a step toward allowing pg_test to tell the coalescing code to
reduce the size of requests so that they fit in whatever scheme the
pg_test callback wants to define.
For now, just return the size of the request if there is space, or 0
if there is not. This shouldn't change any behavior as it acts
the same as when the pg_test functions returned bool.
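As a rough illustration of the new return convention (a sketch with
hypothetical names pgt_desc and pgt_test, not the actual pg_test
signature), a size-returning test might look like this:

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical descriptor state: bytes already queued and the size limit. */
struct pgt_desc {
	size_t count;  /* bytes coalesced so far */
	size_t bsize;  /* maximum bytes per RPC */
};

/*
 * Size-returning test: the full request length if it fits, otherwise 0.
 * A caller that treats the result as a boolean sees the old behaviour.
 */
static size_t pgt_test(const struct pgt_desc *desc, size_t req_bytes)
{
	assert(desc->count <= desc->bsize);
	if (req_bytes > desc->bsize - desc->count)
		return 0;
	return req_bytes;
}
```

Returning a size leaves room for a later change in which the callback
returns a value between 0 and the request length to ask for a split.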
Signed-off-by: Weston Andros Adamson <dros@primarydata.com>
Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
@inode is passed but not used.
Signed-off-by: Weston Andros Adamson <dros@primarydata.com>
Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
Remove unused flags PG_NEED_COMMIT and PG_NEED_RESCHED.
Add comments describing how each flag is used.
Signed-off-by: Weston Andros Adamson <dros@primarydata.com>
Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
Most of this code is the same for both the read and write paths, so
combine everything and use the rw_ops when necessary.
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
Combining these functions will let me make a single nfs_rw_common_ops
struct (see the next patch).
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
The read and write paths do exactly the same thing for the rpc_prepare
rpc_op. This patch combines them together into a single function.
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
I create a new struct nfs_rw_ops to capture the differences between
reads and writes. This struct is set when initializing a new
nfs_pgio_descriptor, and is then passed on to the nfs_rw_header when a
new header is allocated.
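The pattern is an ops table chosen once at descriptor setup and inherited
by each header. The sketch below shows only that flow; the members of the
real struct nfs_rw_ops are not listed in this message, so every name here
(rwo_ops, rwo_descriptor, rwo_header, the done callback) is hypothetical.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical ops table holding the per-direction behaviour. */
struct rwo_ops {
	const char *name;
	int (*done)(int status);  /* per-direction completion handling */
};

struct rwo_descriptor {
	const struct rwo_ops *rw_ops;  /* chosen once at init time */
};

struct rwo_header {
	const struct rwo_ops *rw_ops;  /* inherited from the descriptor */
};

static int rwo_read_done(int status)  { return status; }
static int rwo_write_done(int status) { return status < 0 ? status : 0; }

static const struct rwo_ops rwo_read_ops  = { "read",  rwo_read_done };
static const struct rwo_ops rwo_write_ops = { "write", rwo_write_done };

/* The descriptor records which direction it serves. */
static void rwo_desc_init(struct rwo_descriptor *desc, const struct rwo_ops *ops)
{
	assert(ops != NULL);
	desc->rw_ops = ops;
}

/* A newly allocated header copies the pointer from its descriptor. */
static void rwo_header_init(struct rwo_header *hdr, const struct rwo_descriptor *desc)
{
	hdr->rw_ops = desc->rw_ops;
}
```

Common code can then call hdr->rw_ops->done() without knowing whether it
is handling a read or a write.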
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
This allows the layout driver to pass private information between
pg_init and pg_doio.
Signed-off-by: Peng Tao <tao.peng@emc.com>
Signed-off-by: Boaz Harrosh <bharrosh@panasas.com>
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
The 'committed' field is not needed once we have put the struct nfs_page
on the right list.
Also correct the type of the verifier: it is not an array of __be32, but
simply an 8-byte opaque array.
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
Function rename to ensure that the functionality of nfs_unlock_request()
mirrors that of nfs_lock_request(). Then let nfs_unlock_and_release_request()
do the work of what used to be called nfs_unlock_request()...
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
Cc: Fred Isaman <iisaman@netapp.com>
We only have two places where we need to grab a reference when trying
to lock the nfs_page. We're better off making that explicit.
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
Cc: Fred Isaman <iisaman@netapp.com>
We have to unlock the nfs_page before we call nfs_end_page_writeback
to avoid races with functions that expect the page to be unlocked
when PG_locked and PG_writeback are not set.
The problem is that nfs_unlock_request also releases the nfs_page,
causing a deadlock if the release of the nfs_open_context
triggers an iput() while the PG_writeback flag is still set...
The solution is to separate the unlocking and release of the nfs_page,
so that we can do the former before nfs_end_page_writeback and the
latter after.
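The required ordering can be modelled in a few lines. This is a userspace
sketch with hypothetical names (wbo_req and its helpers); the assertions
encode the two constraints from this message rather than anything the
kernel checks itself.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical request state; models only what the ordering depends on. */
struct wbo_req {
	bool locked;     /* request lock bit */
	bool writeback;  /* page still under writeback */
	int refcount;    /* references held on the request */
};

/* Drop the lock only; the caller still holds its reference. */
static void wbo_unlock(struct wbo_req *req)
{
	assert(req->locked);
	req->locked = false;
}

/* Models ending writeback on the page; must run with the request unlocked. */
static void wbo_end_writeback(struct wbo_req *req)
{
	assert(!req->locked);
	req->writeback = false;
}

/* Drop the reference; must not run while writeback is still flagged. */
static void wbo_release(struct wbo_req *req)
{
	assert(!req->writeback);
	req->refcount--;
}

/* The ordering the message asks for: unlock, end writeback, then release. */
static void wbo_write_complete(struct wbo_req *req)
{
	wbo_unlock(req);
	wbo_end_writeback(req);
	wbo_release(req);
}
```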
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
Cc: Fred Isaman <iisaman@netapp.com>
This also has the advantage that it allows directio to use pnfs.
Signed-off-by: Fred Isaman <iisaman@netapp.com>
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
Factors out the code that will need to change when directio
starts using these code paths. This will allow directio to use
the generic pagein and flush routines.
Signed-off-by: Fred Isaman <iisaman@netapp.com>
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
Decouple nfs_pgio_header and nfs_read_data, and have (possibly
multiple) nfs_read_datas each take a refcount on nfs_pgio_header.
For the moment, this keeps nfs_read_header as a way to preallocate a
single nfs_read_data with the nfs_pgio_header. The code doesn't need
this, and would be prettier without it, but given the amount of churn I
am already introducing I didn't want to play with tuning new mempools.
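The decoupling amounts to a shared header that completes only when the
last data structure drops its reference. A minimal sketch, assuming
hypothetical types hrc_header and hrc_data and no concurrency:

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical header shared by several data structures. */
struct hrc_header {
	int refcnt;      /* one reference per attached data */
	bool completed;  /* set when the last reference is dropped */
};

struct hrc_data {
	struct hrc_header *header;
};

/* Attach a data structure to the header and take a reference for it. */
static void hrc_data_attach(struct hrc_data *data, struct hrc_header *hdr)
{
	data->header = hdr;
	hdr->refcnt++;
}

/* Drop the data's reference; the header completes when none remain. */
static void hrc_data_release(struct hrc_data *data)
{
	struct hrc_header *hdr = data->header;

	assert(hdr != NULL && hdr->refcnt > 0);
	data->header = NULL;
	if (--hdr->refcnt == 0)
		hdr->completed = true;
}
```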
This also fixes a bug in pnfs_ld_handle_read_error. In the case of
desc->pg_bsize < PAGE_CACHE_SIZE, the pages list was empty, causing
the replay attempt to do nothing.
Signed-off-by: Fred Isaman <iisaman@netapp.com>
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
Move more pnfs-isms out of the generic commit code.
Bugfixes:
- filelayout_scan_commit_lists doesn't need to get/put the lseg.
In fact since it is run under the inode->i_lock, the lseg_put()
can deadlock.
- Ensure that we distinguish between what needs to be done for
commit-to-data server and what needs to be done for commit-to-MDS
using the new flag PG_COMMIT_TO_DS. Otherwise we may end up calling
put_lseg() on a bucket for a struct nfs_page that got written
through the MDS.
- Fix a case where we were using list_del() on an nfs_page->wb_list
instead of list_del_init().
- filelayout_initiate_commit needs to call filelayout_commit_release
on error instead of the mds_ops->rpc_release(). Otherwise it won't
clear the commit lock.
Cleanups:
- Let the files layout manage the commit lists for the pNFS case.
Don't expose stuff like pnfs_choose_commit_list, and the fact
that the commit buckets hold references to the layout segment
in common code.
- Cast out the put_lseg() calls for the struct nfs_read/write_data->lseg
into the pNFS layer from whence they came.
- Let the pNFS layer manage the NFS_INO_PNFS_COMMIT bit.
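On the list_del() versus list_del_init() fix above: the kernel's
list_del() leaves the removed entry's pointers poisoned, so the entry
cannot later be tested for emptiness or deleted again, whereas
list_del_init() leaves it as a valid empty list. The sketch below
reimplements just enough of that idiom in userspace to show the
difference; the dl_ names are hypothetical stand-ins, not the kernel API.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Minimal doubly linked list, modelled on the kernel's list_head idiom. */
struct dl_node {
	struct dl_node *next, *prev;
};

static void dl_init(struct dl_node *n)
{
	n->next = n;
	n->prev = n;
}

static bool dl_empty(const struct dl_node *n)
{
	return n->next == n;
}

/* Insert entry directly after head. */
static void dl_add(struct dl_node *entry, struct dl_node *head)
{
	entry->next = head->next;
	entry->prev = head;
	head->next->prev = entry;
	head->next = entry;
}

/* Unlink and reinitialise, so the entry reads as "on no list" afterwards. */
static void dl_del_init(struct dl_node *entry)
{
	assert(entry->next != NULL && entry->prev != NULL);
	entry->prev->next = entry->next;
	entry->next->prev = entry->prev;
	dl_init(entry);
}
```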
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
Cc: Fred Isaman <iisaman@netapp.com>
The radix tree is only being used to compile lists of reqs needing commit.
It is simpler to just put the reqs directly into a list.
Signed-off-by: Fred Isaman <iisaman@netapp.com>
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
The last real use of this tag was removed by
commit 7f2f12d963 ("NFS: Simplify nfs_wb_page()").
Signed-off-by: Fred Isaman <iisaman@netapp.com>
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
Don't rely on the PageError flag to tell us if one of the partial reads of
the page failed. Instead, replace that with a dedicated flag in the
struct nfs_page.
Then clean out redundant uses of the PageError flag: the VM no longer
checks it for reads.
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
...and ensure that we recoalesce to take into account differences in
block sizes when falling back to write through the MDS.
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
...and ensure that we recoalesce to take into account differences in
block sizes when falling back to read through the MDS.
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
If an attempt to do pNFS fails, and we have to fall back to writing through
the MDS, then we may want to re-coalesce the requests that we already have
since the block size for the MDS read/writes may be different to that of
the DS read/writes.
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
We need to ensure that the layouts are set up before we can decide to
coalesce requests. To do so, we want to further split up the struct
nfs_pageio_descriptor operations into an initialisation callback, a
coalescing test callback, and a 'do i/o' callback.
This patch cleans up the existing callback methods before adding the
'initialisation' callback.
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
Otherwise we end up overflowing the rpc buffer size on the receive end.
Signed-off-by: Benny Halevy <benny@tonian.com>
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
Now that the inode scalability patches have been merged, it is no longer
safe to call igrab() under the inode->i_lock.
Now that we no longer call nfs_clear_request() until the nfs_page is
being freed, we know that we are always holding a reference to the
nfs_open_context, which again holds a reference to the path, and so
the inode cannot be freed until the last nfs_page has been removed
from the radix tree and freed.
We can therefore skip the igrab()/iput() altogether.
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
We create three major hooks for the pnfs code.
pnfs_mark_request_commit() is called during writeback_done from
nfs_mark_request_commit, which gives the driver an opportunity to
claim it wants control over committing a particular req.
pnfs_choose_commit_list() is called from nfs_scan_list
to choose which list a given req should be added to, based on
where we intend to send it for COMMIT. It is up to the driver
to have preallocated list headers for each destination it may need.
pnfs_commit_list() is how the driver actually takes control; it is
used instead of nfs_commit_list().
In order to pass information between the above functions, we create
a union in nfs_page to hold an lseg (which is possible because the req is
not on any list while in transition), and add some flags to indicate
if we need to use the pnfs code.
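The union trick relies on the list linkage and the lseg pointer never
being live at the same time. A sketch of that invariant, with hypothetical
names (un_req, un_lseg, UN_ON_LIST) rather than the real nfs_page fields
or flag bits:

```c
#include <assert.h>
#include <stddef.h>

struct un_list { struct un_list *next, *prev; };
struct un_lseg { int id; };  /* hypothetical layout segment */

#define UN_ON_LIST 0x1u      /* hypothetical flag: list member is live */

struct un_req {
	unsigned int flags;
	union {
		struct un_list list;   /* valid while the request is on a list */
		struct un_lseg *lseg;  /* valid only while in transition */
	} u;
};

/* Stash the lseg; only legal while the request is off every list. */
static void un_set_lseg(struct un_req *req, struct un_lseg *lseg)
{
	assert(!(req->flags & UN_ON_LIST));
	req->u.lseg = lseg;
}

/* Take the lseg back out before the storage is reused as list linkage. */
static struct un_lseg *un_take_lseg(struct un_req *req)
{
	struct un_lseg *lseg;

	assert(!(req->flags & UN_ON_LIST));
	lseg = req->u.lseg;
	req->u.lseg = NULL;
	return lseg;
}
```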
Signed-off-by: Fred Isaman <iisaman@netapp.com>
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
If we're only doing a single write, and there are no other unstable
writes being queued up, we might want to just flip to using a stable
write RPC call.
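The decision can be sketched as a small predicate. The enum values and
the function sw_choose are hypothetical, and the inputs are simplified to
two counters; the actual patch may look at different state.

```c
#include <assert.h>

enum sw_how { SW_UNSTABLE, SW_FILE_SYNC };  /* hypothetical stability levels */

/*
 * Pick a stability level. With a single write and nothing else awaiting
 * COMMIT, one stable WRITE avoids a separate COMMIT round trip.
 */
static enum sw_how sw_choose(unsigned int writes_in_flush,
			     unsigned int unstable_pending)
{
	assert(writes_in_flush > 0);
	if (writes_in_flush == 1 && unstable_pending == 0)
		return SW_FILE_SYNC;
	return SW_UNSTABLE;
}
```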
Reviewed-by: NeilBrown <neilb@suse.de>
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
This will make it possible to clear the lseg pointer in the same
function as it is put, instead of in the caller nfs_pageio_doio().
Signed-off-by: Fred Isaman <iisaman@netapp.com>
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
Move the pnfs_update_layout call location to nfs_pageio_do_add_request().
Grab the lseg sent in the doio function to nfs_read_rpcsetup and attach
it to each nfs_read_data so it can be sent to the layout driver.
Signed-off-by: Andy Adamson <andros@netapp.com>
Signed-off-by: Andy Adamson <andros@citi.umich.edu>
Signed-off-by: Dean Hildebrand <dhildeb@us.ibm.com>
Signed-off-by: Fred Isaman <iisaman@citi.umich.edu>
Signed-off-by: Fred Isaman <iisaman@netapp.com>
Signed-off-by: Benny Halevy <bhalevy@panasas.com>
Signed-off-by: Boaz Harrosh <bharrosh@panasas.com>
Signed-off-by: Oleg Drokin <green@linuxhacker.ru>
Signed-off-by: Tao Guo <guotao@nrchpc.ac.cn>
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
Add a pg_test layout driver hook which is used to avoid coalescing I/O across
layout stripes.
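One way such a hook could decide is by comparing stripe numbers. This is
a sketch under the assumption of a fixed stripe unit in bytes; the
function st_same_stripe and its parameters are hypothetical and do not
reflect the hook's real signature.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/*
 * Hypothetical stripe check: two byte offsets may share an RPC only if
 * they fall inside the same stripe unit.
 */
static bool st_same_stripe(uint64_t prev_offset, uint64_t req_offset,
			   uint32_t stripe_unit)
{
	assert(stripe_unit > 0);
	return prev_offset / stripe_unit == req_offset / stripe_unit;
}
```

With a 64 KiB stripe unit, offsets 61440 and 65536 land in different
stripes, so the second request would start a new RPC.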
Signed-off-by: Andy Adamson <andros@netapp.com>
Signed-off-by: Andy Adamson <andros@citi.umich.edu>
Signed-off-by: Dean Hildebrand <dhildeb@us.ibm.com>
Signed-off-by: Fred Isaman <iisaman@citi.umich.edu>
Signed-off-by: Fred Isaman <iisaman@netapp.com>
Signed-off-by: Benny Halevy <bhalevy@panasas.com>
Signed-off-by: Boaz Harrosh <bharrosh@panasas.com>
Signed-off-by: Oleg Drokin <green@linuxhacker.ru>
Signed-off-by: Tao Guo <guotao@nrchpc.ac.cn>
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
When an nfs_page is freed, nfs_free_request is called, which in turn
calls nfs_clear_request to clean out the lock and open contexts and free
the pagecache page.
However, a couple of places in the nfs code call nfs_clear_request
themselves. What happens here if the refcount on the request is still high?
We'll be releasing contexts and freeing pointers while the request is
possibly still in use.
Remove those bare calls to nfs_clear_request. That should only be done when
the request is being freed.
Note that when doing this, we need to watch out for tests of req->wb_page.
Previously, nfs_set_page_tag_locked() and nfs_clear_page_tag_locked()
would check the value of req->wb_page to figure out if the page is mapped
into the nfsi->nfs_page_tree. We now indicate the page is mapped using
the new bit PG_MAPPED in req->wb_flags.
Reported-by: Jeff Layton <jlayton@redhat.com>
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
Currently, if an unstable write completes, we cannot redirty the page in
order to reflect a new change in the page data until after we've sent a
COMMIT request.
This patch allows a page rewrite to proceed without the unnecessary COMMIT
step, putting it immediately back onto the dirty page list, undoing the
VM unstable write accounting, and removing the NFS_PAGE_TAG_COMMIT tag from
the NFS radix tree.
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
The addition of nfs_page_mkwrite means that we should no longer need to
create requests inside nfs_writepage().
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
When processes are allowed to attempt to lock a non-contiguous range of nfs
write requests, it is possible for generic_writepages to 'wrap round' the
address space, and call writepage() on a request that is already locked by
the same process.
We avoid the deadlock by checking if the page index is contiguous with the
list of nfs write requests that is already held in our
nfs_pageio_descriptor prior to attempting to lock a new request.
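The contiguity check can be sketched as follows. The structure cg_desc
and both helpers are hypothetical simplifications that track only the
index of the last queued page, not the real nfs_pageio_descriptor.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical view of the descriptor: index of the last page queued. */
struct cg_desc {
	bool has_reqs;             /* false when nothing is queued yet */
	unsigned long last_index;  /* page index of the most recent request */
};

/* A new page may join the held requests only if it directly follows them. */
static bool cg_is_contiguous(const struct cg_desc *desc, unsigned long index)
{
	if (!desc->has_reqs)
		return true;
	return index == desc->last_index + 1;
}

/* Queue a page; callers are expected to have checked contiguity first. */
static void cg_queue(struct cg_desc *desc, unsigned long index)
{
	assert(cg_is_contiguous(desc, index));
	desc->has_reqs = true;
	desc->last_index = index;
}
```

When the check fails, as in the 'wrap round' case, the caller would need
to deal with the requests it already holds before trying to lock the new
one, so that it never waits on a lock it owns.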
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
Currently we do write coalescing in a very inefficient manner: one pass in
generic_writepages() in order to lock the pages for writing, then one pass
in nfs_flush_mapping() and/or nfs_sync_mapping_wait() in order to gather
the locked pages for coalescing into RPC requests of size "wsize".
In fact, it turns out there is actually a deadlock possible here since we
only start I/O on the second pass. If the user signals the process while
we're in nfs_sync_mapping_wait(), for instance, then we may exit before
starting I/O on all the requests that have been queued up.
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>