In ocfs2, fdatasync and fsync are identical.
I think fdatasync should skip committing the transaction when
inode->i_state has only I_DIRTY_SYNC set, which indicates
that only atime and/or mtime were updated.
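The decision described above can be sketched as a small userspace model. This is not the patch itself: the helper name needs_journal_commit and the MODEL_* bit values are assumptions made for illustration, standing in for the kernel's I_DIRTY_SYNC and I_DIRTY_DATASYNC bits in inode->i_state.

```c
#include <stdbool.h>

/* Illustrative stand-ins for the kernel's inode dirty-state bits. */
#define MODEL_I_DIRTY_SYNC     (1u << 0) /* only timestamps and similar changed */
#define MODEL_I_DIRTY_DATASYNC (1u << 1) /* metadata needed to find the data changed */

/*
 * Return true when the sync call has to force a journal commit.
 * fsync (datasync == false) always commits; fdatasync commits only
 * when the DATASYNC bit is set.
 */
static bool needs_journal_commit(unsigned int i_state, bool datasync)
{
	if (datasync && !(i_state & MODEL_I_DIRTY_DATASYNC))
		return false;
	return true;
}
```

With only MODEL_I_DIRTY_SYNC set, an fdatasync caller skips the commit while an fsync caller still commits.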
The following patch improves fdatasync throughput.
#sysbench --num-threads=16 --max-requests=300000 --test=fileio
  --file-block-size=4K --file-total-size=16G --file-test-mode=rndwr
  --file-fsync-mode=fdatasync run
Results:
-2.6.30-rc8
Test execution summary:
    total time:                          107.1445s
    total number of events:              119559
    total time taken by event execution: 116.1050
    per-request statistics:
        min:                             0.0000s
        avg:                             0.0010s
        max:                             0.1220s
        approx. 95 percentile:           0.0016s
Threads fairness:
    events (avg/stddev):                 7472.4375/303.60
    execution time (avg/stddev):         7.2566/0.64
-2.6.30-rc8-patched
Test execution summary:
    total time:                          86.8529s
    total number of events:              300016
    total time taken by event execution: 24.3077
    per-request statistics:
        min:                             0.0000s
        avg:                             0.0001s
        max:                             0.0336s
        approx. 95 percentile:           0.0001s
Threads fairness:
    events (avg/stddev):                 18751.0000/718.75
    execution time (avg/stddev):         1.5192/0.05
Signed-off-by: Hisashi Hifumi <hifumi.hisashi@oss.ntt.co.jp>
Acked-by: Mark Fasheh <mfasheh@suse.com>
Signed-off-by: Joel Becker <joel.becker@oracle.com>
It would be nice to know how often we get checksum failures. Even
better, how many of them we can fix with the single bit ecc. So, we add
a statistics structure. The structure can be installed into debugfs
wherever the user wants.
For ocfs2, we'll put it in the superblock-specific debugfs directory and
pass it down from our higher-level functions. The stats are only
registered with debugfs when the filesystem supports metadata ecc.
Signed-off-by: Joel Becker <joel.becker@oracle.com>
When a dentry is unlinked, the unlinking node takes an EX on the dentry lock
before moving the dentry to the orphan directory. Other nodes that have
this dentry in cache have a PR on the same dentry lock. When the EX is
requested, the other nodes flag the corresponding inode as MAYBE_ORPHANED
during downconvert. The inode is finally deleted when the last node to iput
the inode sees that i_nlink==0 and the MAYBE_ORPHANED flag is set.
A problem arises if a node is forced to free dentry locks because of memory
pressure. If this happens, the node will no longer get downconvert
notifications for the dentries that have been unlinked on another node.
If that node is also actively using the corresponding inode and
happens to be the one performing the last iput on that inode, it will fail
to delete the inode as it will not have the MAYBE_ORPHANED flag set.
This patch fixes this shortcoming by introducing a periodic scan of the
orphan directories to delete such inodes. Care has been taken to distribute
the workload across the cluster so that no one node has to perform the task
all the time.
Signed-off-by: Srinivas Eeda <srinivas.eeda@oracle.com>
Signed-off-by: Joel Becker <joel.becker@oracle.com>
We use ordering ip_alloc_sem -> local alloc locks in ocfs2_write_begin().
So change lock ordering in ocfs2_extend_dir() and ocfs2_expand_inline_dir()
to also use this lock ordering.
Signed-off-by: Jan Kara <jack@suse.cz>
Acked-by: Mark Fasheh <mfasheh@suse.com>
Signed-off-by: Joel Becker <joel.becker@oracle.com>
In ocfs2_finish_quota_recovery() we acquired the global quota file lock and started
recovering the local quota file. During this process we need to get quota
structures, which calls ocfs2_dquot_acquire(), which gets the global quota file lock
again. This second lock can block if some other node has requested the
quota file lock in the meantime. Fix the problem by moving quota file locking
down into the function where it is really needed. Then dqget() or dqput()
won't be called with the lock held.
Signed-off-by: Jan Kara <jack@suse.cz>
Signed-off-by: Joel Becker <joel.becker@oracle.com>
We called vfs_dq_transfer() with the global quota file lock held. This can lead
to deadlocks: if vfs_dq_transfer() has to allocate a new quota structure,
it calls ocfs2_dquot_acquire(), which tries to get the quota file lock again, and
this can block if another node requested the lock in the meantime.
Since we have to call vfs_dq_transfer() with a transaction already started
and the quota file lock ranks above the transaction start, we cannot just rely
on ocfs2_dquot_acquire() or ocfs2_dquot_release() to get the lock
if they need it. We fix the problem by acquiring pointers to all quota
structures needed by vfs_dq_transfer() before calling the function.
This way we are sure that all quota structures are properly allocated and
that they can be freed only after we drop our references to them. Thus we don't need
the quota file lock anywhere inside vfs_dq_transfer().
Signed-off-by: Jan Kara <jack@suse.cz>
Signed-off-by: Joel Becker <joel.becker@oracle.com>
This function is called with dqio_mutex held, but it has to acquire the lock
on the global quota file, which ranks above dqio_mutex. This lock inversion
cannot deadlock, since this code path is taken only during mount when no one
else can race with us, but let's clean it up to silence lockdep.
We just drop dqio_mutex at the beginning of the function and reacquire
it at the end, since we don't need it: no one can race with us at this moment.
Signed-off-by: Jan Kara <jack@suse.cz>
Signed-off-by: Joel Becker <joel.becker@oracle.com>
It is not possible to get a read lock and then try to get the same lock for
writing in one thread, as that can block on a downconvert being requested by
another node, leading to deadlock. So first drop the quota lock for reading and
only after that get it for writing.
Signed-off-by: Jan Kara <jack@suse.cz>
Signed-off-by: Joel Becker <joel.becker@oracle.com>
Until now we have had a 1:1 mapping between the storage device's physical
block size and the logical block size used when addressing the device.
With SATA 4KB drives coming out that will no longer be the case. The
sector size will be 4KB but the logical block size will remain
512 bytes. Hence we need to distinguish between the physical block size
and the logical ditto.
This patch renames hardsect_size to logical_block_size.
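To illustrate the distinction, here is a minimal userspace sketch. Only logical_block_size is a name taken from this patch; the struct blk_limits_model, the physical_block_size field and both helpers are hypothetical and exist only for this example.

```c
#include <stdint.h>

struct blk_limits_model {
	unsigned int logical_block_size;  /* smallest addressable unit, e.g. 512 */
	unsigned int physical_block_size; /* unit the medium actually writes, e.g. 4096 */
};

/* Number of logical blocks that make up one physical block. */
static unsigned int logical_per_physical(const struct blk_limits_model *lim)
{
	return lim->physical_block_size / lim->logical_block_size;
}

/* Non-zero when a logical block address starts on a physical block boundary. */
static int lba_is_physically_aligned(const struct blk_limits_model *lim,
				     uint64_t lba)
{
	return lba % logical_per_physical(lim) == 0;
}
```

For a 512-byte logical and 4KB physical block size, eight logical blocks share one physical block, so only every eighth LBA is physically aligned.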
Signed-off-by: Martin K. Petersen <martin.petersen@oracle.com>
Signed-off-by: Jens Axboe <jens.axboe@oracle.com>
ocfs2 was hand-calling vfs_follow_link(), but there's no point to that.
Let's use page_follow_link_light() and nd_set_link().
Signed-off-by: Joel Becker <joel.becker@oracle.com>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
In the mainline ocfs2 code, the interface for masklog is in files under
/sys/fs/o2cb/masklog, but the comments in fs/ocfs2/cluster/masklog.h
reference the old /proc interface. They are out of date.
This patch modifies the comments in cluster/masklog.h, which also provides
a bash script example on how to change the log mask bits.
Signed-off-by: Coly Li <coly.li@suse.de>
Signed-off-by: Joel Becker <joel.becker@oracle.com>
Currently the kernel defines XATTR_LIST_MAX as 65536
in include/linux/limits.h. This is the largest buffer that is used for
listing xattrs.
But with the ocfs2 xattr tree, we actually have no limit on the number of
names. If the filesystem has more names than can fit in the buffer, the kernel
logs will be polluted with something like this when listing:
(27738,0):ocfs2_iterate_xattr_buckets:3158 ERROR: status = -34
(27738,0):ocfs2_xattr_tree_list_index_block:3264 ERROR: status = -34
So don't print "ERROR" message as this is not an ocfs2 error.
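A rough userspace sketch of the behaviour: the status -34 in the log lines above is -ERANGE on Linux, and it is an expected result when the caller's buffer is too small. The function list_names below is a hypothetical model, not ocfs2 code.

```c
#include <errno.h>
#include <stddef.h>
#include <string.h>

/*
 * Copy NUL-terminated names into buf, back to back.
 * Returns the number of bytes used, or -ERANGE when the names do not fit.
 * -ERANGE is an expected outcome for the caller, so nothing is logged.
 */
static long list_names(const char *const *names, size_t count,
		       char *buf, size_t buf_size)
{
	size_t used = 0;

	for (size_t i = 0; i < count; i++) {
		size_t len = strlen(names[i]) + 1; /* include the NUL */

		if (len > buf_size - used)
			return -ERANGE;
		memcpy(buf + used, names[i], len);
		used += len;
	}
	return (long)used;
}
```

The caller can retry with a larger buffer; the too-small case is reported only through the return value.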
Signed-off-by: Tao Ma <tao.ma@oracle.com>
Signed-off-by: Joel Becker <joel.becker@oracle.com>
* 'upstream-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/jlbec/ocfs2:
ocfs2: Change repository in MAINTAINERS.
ocfs2: Fix a missing credit when deleting from indexed directories.
ocfs2/trivial: Remove unused variable in ocfs2_rename.
ocfs2: Add missing iput() during error handling in ocfs2_dentry_attach_lock()
ocfs2: Fix some printk() warnings.
ocfs2: Fix 2 warning during ocfs2 make.
ocfs2: Reserve 1 more cluster in expanding_inline_dir for indexed dir.
The ocfs2 directory index updates two blocks when we remove an entry -
the dx root and the dx leaf. OCFS2_DELETE_INODE_CREDITS was only
accounting for the dx leaf. This shows up when ocfs2_delete_inode()
runs out of credits in jbd2_journal_dirty_metadata() at
"J_ASSERT_JH(jh, handle->h_buffer_credits > 0);".
The test that caught this was running dirop_file_racer from the
ocfs2-test suite with a 250-character filename PREFIX. Run on a 512B
blocksize, it forces the orphan dir index to grow large enough to
trigger.
Signed-off-by: Joel Becker <joel.becker@oracle.com>
With indexed dir enabled, now we use ocfs2_dir_lookup_result to
wrap all the bh used for dir. So remove the 2 unused variables.
Signed-off-by: Tao Ma <tao.ma@oracle.com>
Signed-off-by: Joel Becker <joel.becker@oracle.com>
In ocfs2_dentry_attach_lock(), if unable to get the dentry lock, we need to
call iput(inode) because a failure here means no d_instantiate(), which means
the normally matching iput() will not be called during dput(dentry).
This patch fixes the oops that accompanies the following message:
(3996,1):dlm_empty_lockres:2708 ERROR: lockres W00000000000000000a1046b06a4382 still has local locks!
kernel BUG in dlm_empty_lockres at /rpmbuild/smushran/BUILD/ocfs2-1.4.2/fs/ocfs2/dlm/dlmmaster.c:2709!
Signed-off-by: Sunil Mushran <sunil.mushran@oracle.com>
Signed-off-by: Joel Becker <joel.becker@oracle.com>
fs/ocfs2/dir.c: In function ‘ocfs2_extend_dir’:
fs/ocfs2/dir.c:2700: warning: ‘ret’ may be used uninitialized in this function
fs/ocfs2/suballoc.c: In function ‘ocfs2_get_suballoc_slot_bit’:
fs/ocfs2/suballoc.c:2216: warning: comparison is always true due to limited range of data type
Signed-off-by: Tao Ma <tao.ma@oracle.com>
Signed-off-by: Joel Becker <joel.becker@oracle.com>
Rearrange locking of i_mutex on destination and call to
ocfs2_rw_lock() so locks are only held while buffers are copied with
the pipe_to_file() actor, and not while waiting for more data on the
pipe.
Signed-off-by: Miklos Szeredi <mszeredi@suse.cz>
Signed-off-by: Jens Axboe <jens.axboe@oracle.com>
In ocfs2_expand_inline_dir, we calculate whether we need 1 extra
cluster if we can't store the dx inline in the root, and save it in
dx_alloc. So add it when we call ocfs2_reserve_clusters.
Signed-off-by: Tao Ma <tao.ma@oracle.com>
Signed-off-by: Mark Fasheh <mfasheh@suse.com>
There's a possible deadlock in generic_file_splice_write(),
splice_from_pipe() and ocfs2_file_splice_write():
- task A calls generic_file_splice_write()
- this calls inode_double_lock(), which locks i_mutex on both
  pipe->inode and target inode
- ordering depends on inode pointers, can happen that pipe->inode is
  locked first
- __splice_from_pipe() needs more data, calls pipe_wait()
- this releases lock on pipe->inode, goes to interruptible sleep
- task B calls generic_file_splice_write(), similarly to the first
- this locks pipe->inode, then tries to lock inode, but that is
  already held by task A
- task A is interrupted, it tries to lock pipe->inode, but fails, as
  it is already held by task B
- ABBA deadlock
Fix this by explicitly ordering locks: the outer lock must be on
target inode and the inner lock (which is later unlocked and relocked)
must be on pipe->inode. This is OK, pipe inodes and target inodes
form two nonoverlapping sets, generic_file_splice_write() and friends
are not called with a target which is a pipe.
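The ordering rule can be modelled in userspace with two pthread mutexes. This is only a sketch: target_lock and pipe_lock stand in for i_mutex on the target inode and on pipe->inode, and the function names are hypothetical.

```c
#include <pthread.h>

/* Models i_mutex of the target inode (outer lock). */
static pthread_mutex_t target_lock = PTHREAD_MUTEX_INITIALIZER;
/* Models i_mutex of pipe->inode (inner lock). */
static pthread_mutex_t pipe_lock = PTHREAD_MUTEX_INITIALIZER;

/* Models waiting for more pipe data: only the inner lock is dropped. */
static void wait_for_pipe_data(void)
{
	pthread_mutex_unlock(&pipe_lock);
	/* ... a real implementation would sleep here until data arrives ... */
	pthread_mutex_lock(&pipe_lock);
}

/* Returns the number of chunks "spliced". */
static int splice_write_model(int chunks)
{
	int done = 0;

	pthread_mutex_lock(&target_lock); /* outer lock: always taken first */
	pthread_mutex_lock(&pipe_lock);   /* inner lock: always taken second */
	while (done < chunks) {
		wait_for_pipe_data();
		done++;
	}
	pthread_mutex_unlock(&pipe_lock);
	pthread_mutex_unlock(&target_lock);
	return done;
}
```

Because every task takes target_lock before pipe_lock, and only pipe_lock is ever dropped and retaken while target_lock is held, no task can end up holding the inner lock while waiting for the outer one.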
Signed-off-by: Miklos Szeredi <mszeredi@suse.cz>
Acked-by: Mark Fasheh <mfasheh@suse.com>
Acked-by: Jens Axboe <jens.axboe@oracle.com>
Cc: stable@kernel.org
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
During recovery, a node recovers orphans in its own slot and those of the dead
node(s). But if the dead nodes were holding orphans in offline slots, they will
be left unrecovered.
If the dead node is the last one to die and is holding orphans in other slots
and is the first one to mount, then it only recovers its own slot, which
leaves orphans in offline slots.
This patch queues complete_recovery to clean orphans for all offline slots
during mount and node recovery.
Signed-off-by: Srinivas Eeda <srinivas.eeda@oracle.com>
Acked-by: Joel Becker <joel.becker@oracle.com>
Signed-off-by: Mark Fasheh <mfasheh@suse.com>
A page can have multiple buffers, and even if a page is not uptodate, some buffers
can be uptodate in a pagesize != blocksize environment.
This aop checks that all buffers which correspond to the part of a file
that we want to read are uptodate. If so, we do not have to issue actual
read IO to the HDD even if the page is not uptodate, because the portion we
want to read is uptodate.
"block_is_partially_uptodate" function is already used by ext2/3/4.
With the following patch, random read/write mixed workloads or random read after
random write workloads can be optimized and we can get a performance improvement.
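The check can be sketched in userspace as follows. This is a simplified model, not the kernel's block_is_partially_uptodate(): the function name range_is_uptodate and the plain bool array standing in for the page's buffer heads are assumptions.

```c
#include <stdbool.h>
#include <stddef.h>

/*
 * uptodate[i] says whether buffer i of the page holds valid data.
 * Returns true when every buffer overlapping [from, from + count)
 * is uptodate, so the read can be served without disk I/O.
 */
static bool range_is_uptodate(const bool *uptodate, size_t nr_buffers,
			      size_t block_size, size_t from, size_t count)
{
	if (count == 0)
		return true;

	size_t first = from / block_size;
	size_t last = (from + count - 1) / block_size;

	if (last >= nr_buffers)
		return false;
	for (size_t i = first; i <= last; i++) {
		if (!uptodate[i])
			return false;
	}
	return true;
}
```

A read that touches only uptodate buffers succeeds even though another buffer in the same page is stale.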
Signed-off-by: Hisashi Hifumi <hifumi.hisashi@oss.ntt.co.jp>
Signed-off-by: Mark Fasheh <mfasheh@suse.com>
For NFS exporting, ocfs2_get_dentry() returns the dentry for a file handle.
ocfs2_get_dentry() may read from disk when the inode is not in memory,
without any cross-cluster lock. This leads to the file system loading a
stale inode.
This patch fixes the above problem.
The solution is that, when the inode is not in memory, we get the cluster
lock (PR) of the alloc inode from which the inode in question was allocated (this
causes the node on which the deletion was done to sync the alloc inode) before
reading the inode itself. Then we check the bitmap in the group (the one the inode
in question was allocated from) to see if the bit is clear. If it is clear, the
inode is stale. If the bit is set, we then check the generation as the existing
code does.
We have to read the inode in question from disk first to know its alloc
slot and alloc bit. If it is not stale, we read it using ocfs2_iget().
The second read should then come from cache.
We also have to add a per-superblock nfs_sync_lock to cover the lock for the
alloc inode and the lock for the inode in question. This is because
ocfs2_get_dentry() and ocfs2_delete_inode() lock them in reverse order.
nfs_sync_lock is taken in EX mode in ocfs2_get_dentry() and in PR mode in
ocfs2_delete_inode(), so that multiple ocfs2_delete_inode() calls can run
concurrently in the normal case.
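The staleness decision (bitmap bit first, then generation) can be sketched as below. This models only the decision, not the locking or the disk reads; enum fh_status, check_handle and the bit layout of the bitmap are assumptions for illustration.

```c
#include <stdbool.h>
#include <stdint.h>

enum fh_status { FH_OK, FH_STALE };

/* Test one bit in a byte-array bitmap (bit 0 is the low bit of byte 0). */
static bool bit_is_set(const uint8_t *bitmap, unsigned int bit)
{
	return (bitmap[bit / 8] >> (bit % 8)) & 1u;
}

/*
 * A file handle is stale when the inode's allocation bit is clear
 * (the inode was freed) or when the on-disk generation differs from
 * the one in the handle (the slot was reused).
 */
static enum fh_status check_handle(const uint8_t *group_bitmap,
				   unsigned int alloc_bit,
				   uint32_t disk_generation,
				   uint32_t fh_generation)
{
	if (!bit_is_set(group_bitmap, alloc_bit))
		return FH_STALE;
	if (disk_generation != fh_generation)
		return FH_STALE;
	return FH_OK;
}
```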
[mfasheh@suse.com: build warning fixes and comment cleanups]
Signed-off-by: Wengang Wang <wen.gang.wang@oracle.com>
Acked-by: Joel Becker <joel.becker@oracle.com>
Signed-off-by: Mark Fasheh <mfasheh@suse.com>
The debugfs file, mle_state, now prints the largest number of mles
in one hash link.
Signed-off-by: Sunil Mushran <sunil.mushran@oracle.com>
Signed-off-by: Mark Fasheh <mfasheh@suse.com>
This patch attempts to fix a fine race between purging and migration.
Signed-off-by: Sunil Mushran <sunil.mushran@oracle.com>
Signed-off-by: Mark Fasheh <mfasheh@suse.com>
This patch removes struct dlm_lock_name and adds the entries directly
to struct dlm_master_list_entry. Under the new scheme, all mles, whether
backed by a lockres or not, will have the name populated in mle->mname.
This allows us to get rid of code that was figuring out the location of
the mle name.
Signed-off-by: Sunil Mushran <sunil.mushran@oracle.com>
Signed-off-by: Mark Fasheh <mfasheh@suse.com>
This patch shows the number of lockres' and mles in the debugfs file, dlm_state.
Signed-off-by: Sunil Mushran <sunil.mushran@oracle.com>
Signed-off-by: Mark Fasheh <mfasheh@suse.com>
This patch inlines dlm_set_lockres_owner() and dlm_change_lockres_owner().
Signed-off-by: Sunil Mushran <sunil.mushran@oracle.com>
Signed-off-by: Mark Fasheh <mfasheh@suse.com>
This patch replaces the lockres counts that tracked the number of
locally and remotely mastered lockres' with a current and total count. The
total count is the number of lockres' that have been created since the dlm
domain was created.
The number of locally and remotely mastered counts can be computed using
the locking_state output.
Signed-off-by: Sunil Mushran <sunil.mushran@oracle.com>
Signed-off-by: Mark Fasheh <mfasheh@suse.com>
The lifetime of a mle is limited to the duration of the lockres mastery
process. While typically this lifetime is fairly short, we have noticed
the number of mles explode under certain circumstances. This patch tracks
the number of mles of each type and should help us determine
how best to speed up the mastery process.
Signed-off-by: Sunil Mushran <sunil.mushran@oracle.com>
Signed-off-by: Mark Fasheh <mfasheh@suse.com>
The previous patch explicitly did not indent dlm_cleanup_master_list()
so as to make the patch readable. This patch properly indents the
function.
Signed-off-by: Sunil Mushran <sunil.mushran@oracle.com>
Signed-off-by: Mark Fasheh <mfasheh@suse.com>
With this patch, the mles are stored in a hash and not a simple list.
This should improve the mle lookup time when the number of outstanding
masteries is large.
Signed-off-by: Sunil Mushran <sunil.mushran@oracle.com>
Signed-off-by: Mark Fasheh <mfasheh@suse.com>
This patch adds code to create and destroy the dlm->master_hash.
Signed-off-by: Sunil Mushran <sunil.mushran@oracle.com>
Signed-off-by: Mark Fasheh <mfasheh@suse.com>
This patch refactors dlm_clean_master_list() so as to make it
easier to convert the mle list to a hash.
Signed-off-by: Sunil Mushran <sunil.mushran@oracle.com>
Signed-off-by: Mark Fasheh <mfasheh@suse.com>
For master mle, the name is stored in the attached lockres in struct qstr.
For block and migration mle, the name is stored inline in struct dlm_lock_name.
This patch attempts to make struct dlm_lock_name look like a struct qstr. While
we could use struct qstr, we don't because we want to avoid having to malloc
and free the lockname string as the mle's lifetime is fairly short.
Signed-off-by: Sunil Mushran <sunil.mushran@oracle.com>
Signed-off-by: Mark Fasheh <mfasheh@suse.com>
This patch encapsulates adding and removing of the mle from the
dlm->master_list. This patch is part of the series of patches that
converts the mle list to a mle hash.
Signed-off-by: Sunil Mushran <sunil.mushran@oracle.com>
Signed-off-by: Mark Fasheh <mfasheh@suse.com>
In ocfs2, the block group search looks for the "emptiest" group
to allocate from. So if the allocator has many equally (or almost
equally) empty groups, new block groups will tend to get spread
out amongst them.
So we add osb_inode_alloc_group in ocfs2_super to record the last
used inode allocation group.
For more details, please see
http://oss.oracle.com/osswiki/OCFS2/DesignDocs/InodeAllocationStrategy.
I have done some basic tests and the results show a tenfold improvement on
some cold-cache stat workloads.
Signed-off-by: Tao Ma <tao.ma@oracle.com>
Signed-off-by: Mark Fasheh <mfasheh@suse.com>
Inode groups used to be allocated from local alloc file,
but since we want all inodes to be contiguous enough, we
will try to allocate them directly from global_bitmap.
Signed-off-by: Tao Ma <tao.ma@oracle.com>
Signed-off-by: Mark Fasheh <mfasheh@suse.com>
In ocfs2, the inode block search looks for the "emptiest" inode
group to allocate from. So if an inode alloc file has many equally
(or almost equally) empty groups, new inodes will tend to get
spread out amongst them, which in turn can put them all over the
disk. This is undesirable because directory operations on conceptually
"nearby" inodes force a large number of seeks.
So we add ip_last_used_group in core directory inodes which records
the last used allocation group. Another field named ip_last_used_slot
is also added in case inode stealing happens. When claiming a new inode,
we pass in the directory's inode so that the allocation can use this
information.
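The idea can be sketched in userspace as below. This is a simplified model, not the ocfs2 allocator: struct alloc_group_model, NO_GROUP and pick_group are hypothetical names, and the last_used argument plays the role of the directory's ip_last_used_group.

```c
#include <stddef.h>

struct alloc_group_model {
	unsigned int free_bits; /* free inode slots left in this group */
};

#define NO_GROUP ((size_t)-1)

/*
 * Prefer the group the parent directory used last, as long as it still
 * has room; otherwise fall back to the group with the most free bits.
 * Returns NO_GROUP when every group is full.
 */
static size_t pick_group(const struct alloc_group_model *groups, size_t nr,
			 size_t last_used)
{
	size_t best = NO_GROUP;

	if (last_used < nr && groups[last_used].free_bits > 0)
		return last_used;
	for (size_t i = 0; i < nr; i++) {
		if (groups[i].free_bits == 0)
			continue;
		if (best == NO_GROUP ||
		    groups[i].free_bits > groups[best].free_bits)
			best = i;
	}
	return best;
}
```

Inodes created in the same directory therefore keep landing in the same group until it fills up, instead of being scattered over whichever group happens to be emptiest.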
For more details, please see
http://oss.oracle.com/osswiki/OCFS2/DesignDocs/InodeAllocationStrategy.
Signed-off-by: Tao Ma <tao.ma@oracle.com>
Signed-off-by: Mark Fasheh <mfasheh@suse.com>
ocfs2_dx_dir_rebalance() is passed the block offset of a dx leaf which needs
rebalancing. Since we rebalance an entire cluster at a time however, this
function needs to calculate the beginning of that cluster, in blocks. The
calculation was wrong, which would result in a read of non-leaf blocks. Fix
the calculation by adding ocfs2_block_to_cluster_start() which is a more
straight-forward way of determining this.
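A minimal sketch of such a calculation, assuming cluster and block sizes are powers of two. The signature here is a simplification for illustration; the real helper takes a super_block rather than a bit count.

```c
#include <stdint.h>

/*
 * Round a block number down to the first block of its cluster.
 * bits_per_cluster is log2(cluster size / block size).
 */
static uint64_t block_to_cluster_start(uint64_t block,
				       unsigned int bits_per_cluster)
{
	uint64_t cluster = block >> bits_per_cluster;

	return cluster << bits_per_cluster;
}
```

With 8 blocks per cluster (bits_per_cluster == 3), block 13 maps to cluster start 8.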
Reported-by: Tristan Ye <tristan.ye@oracle.com>
Signed-off-by: Mark Fasheh <mfasheh@suse.com>
ocfs2_empty_dir() is far more expensive than checking link count. Since both
need to be checked at the same time, we can improve performance by checking
link count first.
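The reordering relies on short-circuit evaluation. In the sketch below, struct dir_model and both functions are hypothetical, and the assumption that a removable directory has a link count of exactly 2 is ours, not stated in this patch.

```c
#include <stdbool.h>

struct dir_model {
	unsigned int nlink;      /* link count of the directory */
	unsigned int nr_entries; /* entries other than "." and ".." */
	unsigned int scans;      /* how many times the full scan ran */
};

/* Stands in for the expensive full directory scan. */
static bool dir_is_empty_scan(struct dir_model *dir)
{
	dir->scans++;
	return dir->nr_entries == 0;
}

/* The cheap link count test runs first; the scan runs only if it passes. */
static bool can_rmdir(struct dir_model *dir)
{
	return dir->nlink == 2 && dir_is_empty_scan(dir);
}
```

When the link count already rules out removal, the scan counter stays at zero.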
Signed-off-by: Mark Fasheh <mfasheh@suse.com>
Since the disk format is finalized, we can set this feature bit in the
supported mask.
Signed-off-by: Mark Fasheh <mfasheh@suse.com>
Acked-by: Joel Becker <Joel.Becker@oracle.com>
This little bit of extra accounting speeds up ocfs2_empty_dir()
dramatically by allowing us to short-circuit the full directory scan.
Signed-off-by: Mark Fasheh <mfasheh@suse.com>
Since we've now got a directory format capable of handling a large number of
entries, we can increase the maximum link count supported. This only gets
increased if the directory indexing feature is turned on.
Signed-off-by: Mark Fasheh <mfasheh@suse.com>
Acked-by: Joel Becker <joel.becker@oracle.com>
The only operation which doesn't get faster with directory indexing is
insert, which still has to walk the entire unindexed directory portion to
find a free block. This patch provides an improvement in directory insert
performance by maintaining a singly linked list of directory leaf blocks
which have space for additional dirents.
Signed-off-by: Mark Fasheh <mfasheh@suse.com>
Acked-by: Joel Becker <joel.becker@oracle.com>
Allow us to store a small number of directory index records in the
ocfs2_dx_root_block. This saves us a disk read on small to medium sized
directories (less than about 250 entries). The inline root is automatically
turned into a root block with extents if the directory size increases beyond
its capacity.
Signed-off-by: Mark Fasheh <mfasheh@suse.com>
Acked-by: Joel Becker <joel.becker@oracle.com>
This patch makes use of Ocfs2's flexible btree code to add an additional
tree to directory inodes. The new tree stores an array of small,
fixed-length records in each leaf block. Each record stores a hash value,
and pointer to a block in the traditional (unindexed) directory tree where a
dirent with the given name hash resides. Lookup exclusively uses this tree
to find dirents, thus providing us with constant time name lookups.
Some of the hashing code was copied from ext3. Unfortunately, it has lots of
unfixed checkpatch errors. I left that as-is so that tracking changes would
be easier.
Signed-off-by: Mark Fasheh <mfasheh@suse.com>
Acked-by: Joel Becker <joel.becker@oracle.com>
Many directory manipulation calls pass around a tuple of a dirent and its
containing buffer_head. Dir indexing has a bit more state, but instead of
adding yet more arguments to functions, we introduce 'struct
ocfs2_dir_lookup_result'. In this patch, it simply holds the same tuple, but
future patches will add more state.
Signed-off-by: Mark Fasheh <mfasheh@suse.com>
Acked-by: Joel Becker <joel.becker@oracle.com>
This patch removes the debugfs file local_alloc_stats as that information
is now included in the fs_state debugfs file.
Signed-off-by: Sunil Mushran <sunil.mushran@oracle.com>
Signed-off-by: Mark Fasheh <mfasheh@suse.com>
This patch creates a per-mount debugfs file, fs_state, which exposes
information such as the cluster stack in use, the states of the downconvert,
recovery and commit threads, the number of journal txns, some allocation stats,
the list of all slots, etc.
Signed-off-by: Sunil Mushran <sunil.mushran@oracle.com>
Signed-off-by: Mark Fasheh <mfasheh@suse.com>
Move the definition of struct recovery_map from journal.c to journal.h. This
is preparation for the next patch.
Signed-off-by: Sunil Mushran <sunil.mushran@oracle.com>
Signed-off-by: Mark Fasheh <mfasheh@suse.com>
This patch creates a debugfs file, o2hb/livesnodes, which exposes the
aggregate list of heartbeating nodes across all heartbeat regions.
Signed-off-by: Sunil Mushran <sunil.mushran@oracle.com>
Signed-off-by: Mark Fasheh <mfasheh@suse.com>
* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs-2.6:
Remove two unneeded exports and make two symbols static in fs/mpage.c
Cleanup after commit 585d3bc06f
Trim includes of fdtable.h
Don't crap into descriptor table in binfmt_som
Trim includes in binfmt_elf
Don't mess with descriptor table in load_elf_binary()
Get rid of indirect include of fs_struct.h
New helper - current_umask()
check_unsafe_exec() doesn't care about signal handlers sharing
New locking/refcounting for fs_struct
Take fs_struct handling to new file (fs/fs_struct.c)
Get rid of bumping fs_struct refcount in pivot_root(2)
Kill unsharing fs_struct in __set_personality()
Change the page_mkwrite prototype to take a struct vm_fault, and return
VM_FAULT_xxx flags. There should be no functional change.
This makes it possible to return much more detailed error information to
the VM (and also can provide more information eg. virtual_address to the
driver, which might be important in some special cases).
This is required for a subsequent fix. And will also make it easier to
merge page_mkwrite() with fault() in future.
Signed-off-by: Nick Piggin <npiggin@suse.de>
Cc: Chris Mason <chris.mason@oracle.com>
Cc: Trond Myklebust <trond.myklebust@fys.uio.no>
Cc: Miklos Szeredi <miklos@szeredi.hu>
Cc: Steven Whitehouse <swhiteho@redhat.com>
Cc: Mark Fasheh <mfasheh@suse.com>
Cc: Joel Becker <joel.becker@oracle.com>
Cc: Artem Bityutskiy <dedekind@infradead.org>
Cc: Felix Blyakher <felixb@sgi.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
A long time ago, xs->base was allocated as a 4K buffer and all the contents
of the bucket were copied to it. Now we use ocfs2_xattr_bucket to
abstract the xattr bucket, and xs->base is initialized to the start of
bu_bhs[0]. So xs->base + offset will overflow when the value root is
stored outside the first block.
Then why have we survived the xattr tests until now? It is because we always
read the bucket contiguously and the kernel mm happens to allocate contiguous
memory for us. We are lucky, but we should fix it. So just get the
right value root as other callers do.
Signed-off-by: Tao Ma <tao.ma@oracle.com>
Acked-by: Joel Becker <joel.becker@oracle.com>
Signed-off-by: Mark Fasheh <mfasheh@suse.com>
We need to use le32_to_cpu to test rec->e_cpos in
ocfs2_dinode_insert_check.
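On-disk fields such as e_cpos are little-endian, so the raw value only compares correctly on little-endian hosts. A portable userspace equivalent of the conversion might look like this; le32_decode is a hypothetical name, not the kernel's le32_to_cpu macro.

```c
#include <stdint.h>

/* Decode a 32-bit little-endian on-disk value, regardless of host byte order. */
static uint32_t le32_decode(const uint8_t raw[4])
{
	return (uint32_t)raw[0] |
	       ((uint32_t)raw[1] << 8) |
	       ((uint32_t)raw[2] << 16) |
	       ((uint32_t)raw[3] << 24);
}
```

Comparing the decoded value against an in-memory counter gives the same answer on both big-endian and little-endian machines.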
Signed-off-by: Tao Ma <tao.ma@oracle.com>
Acked-by: Joel Becker <joel.becker@oracle.com>
Signed-off-by: Mark Fasheh <mfasheh@suse.com>
Replace max_inline_data with max_inline_data_with_xattr
to ensure it is correct when xattrs are stored inline.
Signed-off-by: Tiger Yang <tiger.yang@oracle.com>
Acked-by: Joel Becker <joel.becker@oracle.com>
Signed-off-by: Mark Fasheh <mfasheh@suse.com>
If this is a new directory with inline data, we choose to
reserve the entire inline area for directory contents and
force an external xattr block.
Signed-off-by: Tiger Yang <tiger.yang@oracle.com>
Acked-by: Joel Becker <joel.becker@oracle.com>
Signed-off-by: Mark Fasheh <mfasheh@suse.com>
This patch sets a gap (4 bytes) between the xattr entries and
the name/value area when xattrs are stored in a bucket. This gap is used to
separate entries from name/values when a bucket is full. It had already
been set for xattrs stored in an inode or block.
Signed-off-by: Tiger Yang <tiger.yang@oracle.com>
Signed-off-by: Mark Fasheh <mfasheh@suse.com>
For other metadata in ocfs2, metaecc is checked in ocfs2_read_blocks
with io_mutex held. For an xattr bucket, however, it is calculated over
the whole bucket. So we have to add a spin_lock to prevent multiple
processes from calculating metaecc at the same time.
Signed-off-by: Tao Ma <tao.ma@oracle.com>
Tested-by: Tristan Ye <tristan.ye@oracle.com>
Signed-off-by: Mark Fasheh <mfasheh@suse.com>
When updating ctime for an xattr, the code uses the wrong type of access for
the inode, so use ocfs2_journal_access_di instead.
Reported-and-Tested-by: Tristan Ye <tristan.ye@oracle.com>
Signed-off-by: Tao Ma <tao.ma@oracle.com>
Acked-by: Joel Becker <joel.becker@oracle.com>
Signed-off-by: Mark Fasheh <mfasheh@suse.com>
In dlm_assert_master_handler(), if we get an incorrect assert master from a
node, we reply with EINVAL, asking the asserter to die. The problem is that an
assert is sent only after so many hoops that it is invariably the node that
thinks the asserter is wrong that is actually wrong. So instead of killing the
asserter, this patch kills the assertee.
This patch papers over a race that is still being addressed.
Signed-off-by: Sunil Mushran <sunil.mushran@oracle.com>
Acked-by: Joel Becker <joel.becker@oracle.com>
Signed-off-by: Mark Fasheh <mfasheh@suse.com>
The code was using dlm->spinlock instead of dlm->ast_lock to protect the
ast_list. This patch fixes the issue.
Signed-off-by: Sunil Mushran <sunil.mushran@oracle.com>
Acked-by: Joel Becker <joel.becker@oracle.com>
Signed-off-by: Mark Fasheh <mfasheh@suse.com>
The dentry lock has a different format than other locks. This patch fixes
ocfs2_log_dlm_error() macro to make it print the dentry lock correctly.
Signed-off-by: Sunil Mushran <sunil.mushran@oracle.com>
Acked-by: Joel Becker <joel.becker@oracle.com>
Signed-off-by: Mark Fasheh <mfasheh@suse.com>
Mainline commit d4f7e650e5 attempts to delay
the dlm_thread from sending the drop ref message if the lockres is being
migrated. The problem is that we make the dlm_thread wait for the migration
to complete. This causes a deadlock as dlm_thread also participates in the
lockres migration process.
A better fix for the original oss bugzilla#1012 is in testing.
Signed-off-by: Sunil Mushran <sunil.mushran@oracle.com>
Acked-by: Joel Becker <joel.becker@oracle.com>
Signed-off-by: Mark Fasheh <mfasheh@suse.com>
In __ocfs2_mark_extent_written, when we hit the c_split_covers_rec
case, the old code just replaces the extent
record and forgets to access and dirty the buffer_head. This will
cause a problem when the unwritten extent is in an extent block.
So access and dirty it.
Signed-off-by: Tao Ma <tao.ma@oracle.com>
Signed-off-by: Mark Fasheh <mfasheh@suse.com>
If we race with commit code setting i_transaction to NULL, we could
possibly dereference it. Proper locking requires the journal pointer
(to access journal->j_list_lock), which we don't have. So we have to
change the prototype of the function so that filesystem passes us the
journal pointer. Also add a more detailed comment about why the
function jbd2_journal_begin_ordered_truncate() does what it does and
how it should be used.
Thanks to Dan Carpenter <error27@gmail.com> for pointing to the
suspicious code.
Signed-off-by: Jan Kara <jack@suse.cz>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
Acked-by: Joel Becker <joel.becker@oracle.com>
CC: linux-ext4@vger.kernel.org
CC: ocfs2-devel@oss.oracle.com
CC: mfasheh@suse.de
CC: Dan Carpenter <error27@gmail.com>
We weren't reclaiming the clusters which get free'd from this function,
so any user punching holes in a file would still have those bytes accounted
against him/her. Add the call to vfs_dq_free_space_nodirty() to fix this.
Interestingly enough, the journal credits calculation already took this into
account.
Signed-off-by: Mark Fasheh <mfasheh@suse.com>
Acked-by: Jan Kara <jack@suse.cz>
When two nodes holding PR locks on a resource concurrently attempt to
upconvert the locks to EX, the master sends a BAST to one of the nodes. This
message tells that node to first cancel convert the upconvert request,
followed by downconvert to a NL. Only when this lock is downconverted to NL,
can the master upconvert the first node's lock to EX.
While the fs was doing the cancel convert, it was forgetting to wake up the
dc thread after a successful cancel, leading to a deadlock.
Reported-and-Tested-by: David Teigland <teigland@redhat.com>
Signed-off-by: Sunil Mushran <sunil.mushran@oracle.com>
Signed-off-by: Mark Fasheh <mfasheh@suse.com>
In ocfs2_xattr_value_truncate, we may call b-tree code which will
extend the journal transaction. This has a potential problem: the
already-accessed-but-not-dirtied buffers may be lost. So we'd
better access the bucket after we call ocfs2_xattr_value_truncate.
As for the root buffer of the xattr value, the b-tree code will
access and dirty it, so we don't need to worry about it.
Signed-off-by: Tao Ma <tao.ma@oracle.com>
Signed-off-by: Mark Fasheh <mfasheh@suse.com>
It could happen that some limit has been set via quotactl() and in parallel
->mark_dirty() is called from another thread doing e.g. dquot_alloc_space(). In
such a case ocfs2_write_dquot() must not try to sync the dquot, because that
needs the global quota lock, which ranks above transaction start.
Signed-off-by: Jan Kara <jack@suse.cz>
Signed-off-by: Mark Fasheh <mfasheh@suse.com>
Dropping the last reference to a dentry lock is a complicated operation that
involves dropping a reference to the inode. The quota code in particular needs
to obtain some quota locks at that point, which leads to a potential deadlock.
Thus we defer dropping the inode reference to ocfs2_wq.
Signed-off-by: Jan Kara <jack@suse.cz>
Signed-off-by: Mark Fasheh <mfasheh@suse.com>
Since ->acquire_dquot and ->release_dquot callbacks aren't called under
dqptr_sem anymore, we don't have to start a transaction and obtain locks
so early. So we can just remove all this complicated stuff.
Signed-off-by: Jan Kara <jack@suse.cz>
Acked-by: Mark Fasheh <mfasheh@suse.de>
While reviewing ocfs2 code, I found 2 occurrences of the typo "successfull".
After running grep "successfull " on the kernel tree, I found 22 occurrences in
total -- great minds always think alike :)
This patch fixes all of these typos. Thanks to Randy for his ack and comments.
Signed-off-by: Coly Li <coyli@suse.de>
Acked-by: Randy Dunlap <randy.dunlap@oracle.com>
Acked-by: Roland Dreier <rolandd@cisco.com>
Cc: Jeremy Kerr <jk@ozlabs.org>
Cc: Jeff Garzik <jeff@garzik.org>
Cc: Heiko Carstens <heiko.carstens@de.ibm.com>
Cc: Martin Schwidefsky <schwidefsky@de.ibm.com>
Cc: Theodore Ts'o <tytso@mit.edu>
Cc: Mark Fasheh <mfasheh@suse.com>
Cc: Vlad Yasevich <vladislav.yasevich@hp.com>
Cc: Sridhar Samudrala <sri@us.ibm.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
* 'upstream-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/mfasheh/ocfs2: (138 commits)
ocfs2: Access the right buffer_head in ocfs2_merge_rec_left.
ocfs2: use min_t in ocfs2_quota_read()
ocfs2: remove unneeded lvb casts
ocfs2: Add xattr support checking in init_security
ocfs2: alloc xattr bucket in ocfs2_xattr_set_handle
ocfs2: calculate and reserve credits for xattr value in mknod
ocfs2/xattr: fix credits calculation during index create
ocfs2/xattr: Always updating ctime during xattr set.
ocfs2/xattr: Remove extend_trans call and add its credits from the beginning
ocfs2/dlm: Fix race during lockres mastery
ocfs2/dlm: Fix race in adding/removing lockres' to/from the tracking list
ocfs2/dlm: Hold off sending lockres drop ref message while lockres is migrating
ocfs2/dlm: Clean up errors in dlm_proxy_ast_handler()
ocfs2/dlm: Fix a race between migrate request and exit domain
ocfs2: One more hamming code optimization.
ocfs2: Another hamming code optimization.
ocfs2: Don't hand-code xor in ocfs2_hamming_encode().
ocfs2: Enable metadata checksums.
ocfs2: Validate superblock with checksum and ecc.
ocfs2: Checksum and ECC for directory blocks.
...
... and don't bother in callers. Don't bother with zeroing i_blocks,
while we are at it - it's already been zeroed.
i_mode is not worth the effort; it has no common default value.
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
In commit "ocfs2: Use metadata-specific ocfs2_journal_access_*()
functions", the wrong buffer_head is accessed. So change it
to the right buffer_head.
Signed-off-by: Tao Ma <tao.ma@oracle.com>
Acked-by: Joel Becker <joel.becker@oracle.com>
Signed-off-by: Mark Fasheh <mfasheh@suse.com>
dlmglue.c has lots of code which casts the return value of ocfs2_dlm_lvb().
This is pointless however, as ocfs2_dlm_lvb() returns void *.
Signed-off-by: Mark Fasheh <mfasheh@suse.com>
We must check whether the ocfs2 volume supports xattrs in init_security.
If it does not support xattrs and security is enabled, mknod would fail.
Signed-off-by: Tiger Yang <tiger.yang@oracle.com>
Signed-off-by: Mark Fasheh <mfasheh@suse.com>
In extreme situations, we may need an xattr bucket for setting the
security entry and acl entries during mknod. This only
happens when the block size is too small.
Signed-off-by: Tiger Yang <tiger.yang@oracle.com>
Signed-off-by: Mark Fasheh <mfasheh@suse.com>
Previously we extended the credits for an xattr's large value in
set_value_outside. This can give rise to a credits issue when we set one
security entry and two acl entries during mknod. As we remove extend_trans
from set_value_outside, we must calculate and reserve the credits for the
xattr's large value in mknod.
Signed-off-by: Tiger Yang <tiger.yang@oracle.com>
Signed-off-by: Mark Fasheh <mfasheh@suse.com>
When creating an xattr index block, the old calculation forgot
to add credits for the metadata change of the alloc file. So add
more credits and more comments to explain it.
Signed-off-by: Tao Ma <tao.ma@oracle.com>
Signed-off-by: Mark Fasheh <mfasheh@suse.com>
In xattr set, we should always update ctime if the operation succeeds.
The old code mistakenly put the update in ocfs2_xattr_set_entry,
which is only called when we set an xattr in the inode or xattr block. A
side benefit is that this resolves bug 1052, since in that scenario
ocfs2_calc_xattr_set_need only calculated the xattr set credits while
ocfs2_xattr_set_entry also updated the inode, which isn't part of
the xattr set process itself.
Signed-off-by: Tao Ma <tao.ma@oracle.com>
Signed-off-by: Mark Fasheh <mfasheh@suse.com>
Actually, when setting a new xattr value, we know the credits needed from
the very beginning. This is unlike the extension of a bucket, in which case
we can't figure them out in advance. So remove ocfs2_extend_trans in that
function and calculate the credits before the transaction. This also relieves
acl operations from worrying about the side effects of ocfs2_extend_trans.
Signed-off-by: Tao Ma <tao.ma@oracle.com>
Signed-off-by: Mark Fasheh <mfasheh@suse.com>
dlm_get_lock_resource() is supposed to return a lock resource with a proper
master. If multiple concurrent threads attempt to look up the lockres for the
same lockid while the lock mastery is underway, one or more threads are likely
to return a lockres without a proper master.
This patch makes the threads wait in dlm_get_lock_resource() while the mastery
is underway, ensuring all threads return the lockres with a proper master.
This issue is known to be limited to users using the flock() syscall. For all
other fs operations, the ocfs2 dlmglue layer serializes the dlm op for each
lockid.
Users encountering this bug will see flock() return EINVAL and dmesg show the
following error:
ERROR: Dlm error "DLM_BADARGS" while calling dlmlock on resource <LOCKID>: bad api args
Reported-by: Coly Li <coyli@suse.de>
Signed-off-by: Sunil Mushran <sunil.mushran@oracle.com>
Signed-off-by: Mark Fasheh <mfasheh@suse.com>
This patch adds a new lock, dlm->tracking_lock, to protect adding/removing
lockres' to/from the dlm->tracking_list. We were previously using dlm->spinlock
for the same, but that proved inadequate as we could be freeing a lockres from
a context that did not hold that lock. As the new lock only protects this list,
we can explicitly take it when removing the lockres from the tracking list.
This bug was exposed when testing multiple processes concurrently calling
flock() on the same file.
Signed-off-by: Sunil Mushran <sunil.mushran@oracle.com>
Signed-off-by: Mark Fasheh <mfasheh@suse.com>
During lockres purge, o2dlm sends a drop reference message to the lockres
master. This patch delays the message if the lockres is being migrated.
Fixes oss bugzilla#1012
http://oss.oracle.com/bugzilla/show_bug.cgi?id=1012
Signed-off-by: Sunil Mushran <sunil.mushran@oracle.com>
Signed-off-by: Mark Fasheh <mfasheh@suse.com>
This patch cleans up the errors printed in dlm_proxy_ast_handler(). The errors
now include the node number that sent the (b)ast. It also reduces the number of
endian swaps of the cookie.
Signed-off-by: Sunil Mushran <sunil.mushran@oracle.com>
Signed-off-by: Mark Fasheh <mfasheh@suse.com>
This patch addresses a race between a migrate request message and an exit
domain message.
Instead of blocking exit domains for the duration of the migrate, we ignore
failure to deliver that message. This is because an exiting domain should
not have any active locks and thus has no role to play in the migration.
Signed-off-by: Sunil Mushran <sunil.mushran@oracle.com>
Signed-off-by: Mark Fasheh <mfasheh@suse.com>
The previous optimization used a fast find-highest-bit-set operation to
give us a good starting point in calc_code_bit(). This version lets the
caller cache the previous code buffer bit offset. Thus, the next call
always starts where the last one left off.
This reduces the calculation another 39%, for a total 80% reduction from
the original, naive implementation. At least, on my machine. This also
brings the parity calculation to within an order of magnitude of the
crc32 calculation.
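
As an illustration of the caching idea, here is a minimal user-space sketch,
not the kernel code itself. The names code_bit_naive() and code_bit_cached()
are hypothetical. It assumes code bits are 1-based and that every power-of-two
position holds a parity bit, so a data bit is bumped up once per power of two
at or below its final position. The cached variant resumes the loop from the
count left behind by the previous call.

```c
#include <assert.h>

/*
 * Map a 0-based data bit index to its 1-based code bit position.
 * Positions that are powers of two are reserved for parity, so each
 * one at or below the running position bumps the data bit up by one.
 */
static unsigned int code_bit_naive(unsigned int i)
{
	unsigned int b = i + 1;	/* code bits are 1-based */
	unsigned int p;

	for (p = 0; (1u << p) < (b + 1); p++)
		b++;
	return b;
}

/*
 * Same mapping, but the caller keeps *p_cache (the number of parity
 * positions already skipped) across calls with increasing i, so the
 * loop starts where the previous call stopped.
 */
static unsigned int code_bit_cached(unsigned int i, unsigned int *p_cache)
{
	unsigned int p = *p_cache;
	unsigned int b = i + 1 + p;

	for (; (1u << p) < (b + 1); p++)
		b++;
	*p_cache = p;
	return b;
}
```

A caller walking the bits of a buffer in order would start with a cache of 0
and pass the same cache variable to every call.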
Signed-off-by: Joel Becker <joel.becker@oracle.com>
Signed-off-by: Mark Fasheh <mfasheh@suse.com>
In the calc_code_bit() function, we must find all powers of two beneath
the code bit number, *after* it's shifted by those powers of two. This
requires a loop to see where it ends up.
We can optimize it by starting at its most significant bit. This shaves
32% off the time, for a total of 67.6% shaved off of the original, naive
implementation.
Signed-off-by: Joel Becker <joel.becker@oracle.com>
Signed-off-by: Mark Fasheh <mfasheh@suse.com>
When I wrote ocfs2_hamming_encode(), I was following documentation of
the algorithm and didn't have quite the (possibly still imperfect) grasp
of it I do now. As part of this, I literally hand-coded xor. I would
test a bit, and then add that bit via xor to the parity word.
I can, of course, just do a single xor of the parity word and the source
word (the code buffer bit offset). This cuts CPU usage by 53% on a
mostly populated buffer (an inode containing utmp.h inline).
Joel
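
For readers unfamiliar with the trick, here is a minimal user-space sketch of
the single-xor approach. It is not the actual ocfs2_hamming_encode(): the
names code_bit() and hamming_parity() are hypothetical, and the bit order
within each byte (least significant bit first) is an assumption of the
sketch. Each set data bit folds its code bit position into the parity word
with one xor.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Map a 0-based data bit index to its 1-based code bit position. */
static unsigned int code_bit(unsigned int i)
{
	unsigned int b = i + 1;
	unsigned int p;

	for (p = 0; (1u << p) < (b + 1); p++)
		b++;
	return b;
}

/*
 * Compute the parity word for len bytes of data. Every set data bit
 * contributes its code bit position via a single xor, instead of
 * testing and adding each parity bit one at a time.
 */
static uint32_t hamming_parity(const uint8_t *data, size_t len)
{
	uint32_t parity = 0;
	size_t i;

	for (i = 0; i < len * 8; i++) {
		if (data[i / 8] & (1u << (i % 8)))
			parity ^= code_bit((unsigned int)i);
	}
	return parity;
}
```

One consequence of this form is that flipping a single data bit changes the
parity word by exactly that bit's code bit position, which is what makes
single-bit correction possible.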
Signed-off-by: Joel Becker <joel.becker@oracle.com>
Signed-off-by: Mark Fasheh <mfasheh@suse.com>
Add OCFS2_FEATURE_INCOMPAT_META_ECC to the list of supported features.
Signed-off-by: Joel Becker <joel.becker@oracle.com>
Signed-off-by: Mark Fasheh <mfasheh@suse.com>
The superblock is read via a raw call. Validate it after we find it
from its signature.
Signed-off-by: Joel Becker <joel.becker@oracle.com>
Signed-off-by: Mark Fasheh <mfasheh@suse.com>
Use the db_check field of ocfs2_dir_block_trailer to crc/ecc the
dirblocks.
Signed-off-by: Joel Becker <joel.becker@oracle.com>
Signed-off-by: Mark Fasheh <mfasheh@suse.com>
Future ocfs2 features metaecc and indexed directories need to store a
little bit of data in each dirblock. For compatibility, we place this
in a trailer at the end of the dirblock. The trailer presents itself as an
empty dirent, so that if the features are turned off, it can be reused
without requiring a tunefs scan.
This code adds the trailer and validates it when the block is read in.
[ Mark is the original author, but I reinserted this code before his
dir index work. -- Joel ]
Signed-off-by: Joel Becker <joel.becker@oracle.com>
Signed-off-by: Mark Fasheh <mfasheh@suse.com>
Change the rest of the naked ocfs2_journal_access() calls in
fs/ocfs2/xattr.c to use the appropriate ocfs2_journal_access_*() call
for their metadata type.
Signed-off-by: Joel Becker <joel.becker@oracle.com>
Signed-off-by: Mark Fasheh <mfasheh@suse.com>
ocfs2_remove_value_outside() needs to know the type of buffer it is
looking at. Pass in an ocfs2_xattr_value_buf.
Signed-off-by: Joel Becker <joel.becker@oracle.com>
Signed-off-by: Mark Fasheh <mfasheh@suse.com>
ocfs2_xattr_set_entry is the function that knows what type of block it
is setting into. This is what we wanted from ocfs2_xattr_value_buf.
Plus, moving the value buf up into ocfs2_xattr_set_entry() allows us to
pass it into ocfs2_xattr_set_value_outside() and ocfs2_xattr_cleanup().
Signed-off-by: Joel Becker <joel.becker@oracle.com>
Signed-off-by: Mark Fasheh <mfasheh@suse.com>
ocfs2_xattr_update_entry() updates the entry portion of an xattr buffer.
This can be part of multiple metadata block types, so pass the buffer in
via an ocfs2_xattr_value_buf.
Signed-off-by: Joel Becker <joel.becker@oracle.com>
Signed-off-by: Mark Fasheh <mfasheh@suse.com>
The callers of ocfs2_xattr_value_truncate() now pass in
ocfs2_xattr_value_bufs. These callers are the ones that calculated the
xv location, so they are the right starting point.
Signed-off-by: Joel Becker <joel.becker@oracle.com>
Signed-off-by: Mark Fasheh <mfasheh@suse.com>
Place an ocfs2_xattr_value_buf in ocfs2_xattr_value_truncate() and pass
it down to ocfs2_xattr_shrink_size(). We can also pass it into
ocfs2_xattr_extend_allocation(), replacing its ocfs2_xattr_value_buf.
Signed-off-by: Joel Becker <joel.becker@oracle.com>
Signed-off-by: Mark Fasheh <mfasheh@suse.com>
Place an ocfs2_xattr_value_buf in __ocfs2_xattr_shrink_size() and pass
it down to __ocfs2_remove_xattr_range().
Signed-off-by: Joel Becker <joel.becker@oracle.com>
Signed-off-by: Mark Fasheh <mfasheh@suse.com>
When an ocfs2 extended attribute is large enough to require its own
allocation tree, we root it with an ocfs2_xattr_value_root. However,
these roots can be a part of inodes, xattr blocks, or xattr buckets.
Thus, they need a different journal access function for each container.
We wrap the bh, its journal access function, and the value root (xv) in
a structure called ocfs2_xattr_value_buf. This is a package that can
be passed around. In this first pass, we simply pass it to the
extent tree code.
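
The shape of such a package can be sketched in user-space C as follows. This
is an illustration only: every type and function name ending in _sketch is
hypothetical, and the stand-in access functions merely record which container
type was journaled rather than touching a real journal.

```c
#include <assert.h>

/* Hypothetical stand-ins for the kernel's buffer_head and value root. */
struct buf_sketch {
	int journaled_as;
};

struct value_root_sketch {
	unsigned int clusters;
};

enum { ACCESS_INODE = 1, ACCESS_XATTR_BLOCK = 2 };

/* One journal access function per container type. */
typedef int (*access_func_sketch)(struct buf_sketch *bh);

static int access_inode_sketch(struct buf_sketch *bh)
{
	bh->journaled_as = ACCESS_INODE;
	return 0;
}

static int access_xattr_block_sketch(struct buf_sketch *bh)
{
	bh->journaled_as = ACCESS_XATTR_BLOCK;
	return 0;
}

/* The package: buffer, matching access function, and value root. */
struct value_buf_sketch {
	struct buf_sketch *vb_bh;
	access_func_sketch vb_access;
	struct value_root_sketch *vb_xv;
};

/* Callers journal the buffer without knowing its container type. */
static int value_buf_access(struct value_buf_sketch *vb)
{
	return vb->vb_access(vb->vb_bh);
}
```

Code that receives the package only calls value_buf_access(), so it needs no
knowledge of whether the root lives in an inode, an xattr block, or a bucket.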
Signed-off-by: Joel Becker <joel.becker@oracle.com>
Signed-off-by: Mark Fasheh <mfasheh@suse.com>
The xattr bucket can span multiple blocks on disk. We have wrappers
for this structure in the code. We use the new multi-block ecc calls to
calculate and validate the bucket.
Signed-off-by: Joel Becker <joel.becker@oracle.com>
Signed-off-by: Mark Fasheh <mfasheh@suse.com>
The per-metadata-type ocfs2_journal_access_*() functions hook up jbd2
commit triggers and allow us to compute metadata ecc right before the
buffers are written out. This commit provides ecc for inodes, extent
blocks, group descriptors, and quota blocks. It is not safe to use
extended attributes and metaecc at the same time yet.
The ocfs2_extent_tree and ocfs2_path abstractions in alloc.c both hide
the type of block at their root. Before, it didn't matter, but now the
root block must use the appropriate ocfs2_journal_access_*() function.
To keep this abstract, the structures now have a pointer to the matching
journal_access function and a wrapper call to call it.
A few places use naked ocfs2_write_block() calls instead of adding the
blocks to the journal. We make sure to calculate their checksum and ecc
before the write.
Since we pass around the journal_access functions, let's typedef them
in ocfs2.h.
Signed-off-by: Joel Becker <joel.becker@oracle.com>
Signed-off-by: Mark Fasheh <mfasheh@suse.com>
The majority of ocfs2_new_path() calls are:
ocfs2_new_path(path_root_bh(otherpath),
path_root_el(otherpath));
Let's call that ocfs2_new_path_from_path(). The rest do similar things
from struct ocfs2_extent_tree. Let's call those
ocfs2_new_path_from_et(). This will make the next change easier.
Signed-off-by: Joel Becker <joel.becker@oracle.com>
Signed-off-by: Mark Fasheh <mfasheh@suse.com>
We create wrappers for ocfs2_journal_access() that are specific to the
type of metadata block. This allows us to associate jbd2 commit
triggers with the block. The triggers will compute metadata ecc in a
future commit.
Signed-off-by: Joel Becker <joel.becker@oracle.com>
Signed-off-by: Mark Fasheh <mfasheh@suse.com>
Add block check calls to the read_block validate functions. This is
almost all of the read-side checking of metaecc. xattr buckets are not checked
yet. Writes are also unchecked, and so a read-write mount will quickly fail.
Signed-off-by: Joel Becker <joel.becker@oracle.com>
Signed-off-by: Mark Fasheh <mfasheh@suse.com>
Add a currently-returns-success hook for quota block reads. We'll be
adding checks to this.
Signed-off-by: Joel Becker <joel.becker@oracle.com>
Signed-off-by: Mark Fasheh <mfasheh@suse.com>
This is the code that computes crc32 and ecc for ocfs2 metadata blocks.
There are high-level functions that check whether the filesystem has the
ecc feature, mid-level functions that work on a single block or array of
buffer_heads, and the low-level ecc hamming code that can handle
multiple buffers like crc32_le().
It's not hooked up to the filesystem yet.
Signed-off-by: Joel Becker <joel.becker@oracle.com>
Cc: Christoph Hellwig <hch@lst.de>
Signed-off-by: Mark Fasheh <mfasheh@suse.com>
Define struct ocfs2_block_check, an 8-byte structure containing a 32bit
crc32_le and a 16bit hamming code ecc. This will be used for metadata
checksums. Add the structure to free spaces in the various metadata
structures.
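
A user-space sketch of such an 8-byte layout is shown below. The struct and
field names are hypothetical, and plain fixed-width integers stand in for the
kernel's little-endian on-disk types. A 32-bit crc plus a 16-bit ecc leaves
two bytes over, which the sketch assumes are reserved padding.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

struct block_check_sketch {
	uint32_t bc_crc32;	/* crc32_le of the block */
	uint16_t bc_ecc;	/* hamming code ecc */
	uint16_t bc_reserved;	/* assumed padding up to 8 bytes */
};

/* The structure must fit exactly into 8 bytes of free space. */
_Static_assert(sizeof(struct block_check_sketch) == 8,
	       "block check must be 8 bytes");
```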
Add the OCFS2_FEATURE_INCOMPAT_META_ECC bit.
Signed-off-by: Joel Becker <joel.becker@oracle.com>
Signed-off-by: Mark Fasheh <mfasheh@suse.com>
A new mlog mask has to be added into mlog_attribute before it can
be really used in mlog. ML_QUOTA is only added in masklog.h, so
add it to the array to enable it.
Signed-off-by: Tao Ma <tao.ma@oracle.com>
Signed-off-by: Mark Fasheh <mfasheh@suse.com>
Pass the actual target bucket for insert through to
ocfs2_add_new_xattr_bucket(). Now growing a bucket has no buffer_head
knowledge.
ocfs2_add_new_xattr_bucket() leaves xs->bucket in the proper state for
insert. However, it doesn't update the rest of the search fields in xs,
so we still have to relse() and re-find. That's OK, because everything
is cached.
Signed-off-by: Joel Becker <joel.becker@oracle.com>
Signed-off-by: Mark Fasheh <mfasheh@suse.com>
Lift the buckets from ocfs2_add_new_xattr_cluster() up into
ocfs2_add_new_xattr_bucket(). Now ocfs2_add_new_xattr_cluster()
doesn't deal with buffer_heads. In fact, we no longer have to play
get_bh() tricks at all.
Signed-off-by: Joel Becker <joel.becker@oracle.com>
Signed-off-by: Mark Fasheh <mfasheh@suse.com>
Lift the buckets from ocfs2_adjust_xattr_cross_cluster() up into
ocfs2_add_new_xattr_cluster(). Now ocfs2_adjust_xattr_cross_cluster()
doesn't deal with buffer_heads.
Signed-off-by: Joel Becker <joel.becker@oracle.com>
Signed-off-by: Mark Fasheh <mfasheh@suse.com>
Now that ocfs2_adjust_xattr_cross_cluster() has buckets, it can pass
them into ocfs2_mv_xattr_bucket_cross_cluster(). It no longer has to
care about buffer_heads. The manipulation of first_bh and header_bh
moves up to ocfs2_adjust_xattr_cross_cluster().
Signed-off-by: Joel Becker <joel.becker@oracle.com>
Signed-off-by: Mark Fasheh <mfasheh@suse.com>
We want to be passing around buckets instead of buffer_heads. Let's get
them into ocfs2_adjust_xattr_cross_cluster.
Signed-off-by: Joel Becker <joel.becker@oracle.com>
Signed-off-by: Mark Fasheh <mfasheh@suse.com>
Now that ocfs2_mv_xattr_buckets() can move a partial cluster's worth of
buckets, ocfs2_mv_xattr_bucket_cross_cluster() can use it.
Signed-off-by: Joel Becker <joel.becker@oracle.com>
Signed-off-by: Mark Fasheh <mfasheh@suse.com>
If you look at ocfs2_mv_xattr_bucket_cross_cluster(), you'll notice that
two-thirds of the code is almost identical to ocfs2_mv_xattr_buckets().
The only difference is that ocfs2_mv_xattr_buckets() moves a whole
cluster's worth, while ocfs2_mv_xattr_bucket_cross_cluster() moves half
the cluster.
We change ocfs2_mv_xattr_buckets() to allow moving partial clusters.
The original caller of ocfs2_mv_xattr_buckets() still moves the whole
cluster's worth - it just passes a start_bucket of 0.
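
The effect of a start_bucket parameter can be sketched in user-space C as
below. This is an illustration under stated assumptions, not the kernel
function: buckets are modeled as plain ints, BUCKETS_PER_CLUSTER is a made-up
constant, and mv_buckets_sketch() is a hypothetical name. It moves everything
from start_bucket to the end of the source cluster to the front of the
destination.

```c
#include <assert.h>
#include <string.h>

#define BUCKETS_PER_CLUSTER 8	/* hypothetical value for the sketch */

/*
 * Move buckets [start_bucket, BUCKETS_PER_CLUSTER) from src to the
 * front of dst and clear them in src. Returns the number moved.
 */
static unsigned int mv_buckets_sketch(int *dst, int *src,
				      unsigned int start_bucket)
{
	unsigned int n;

	if (start_bucket >= BUCKETS_PER_CLUSTER)
		return 0;

	n = BUCKETS_PER_CLUSTER - start_bucket;
	memcpy(dst, src + start_bucket, n * sizeof(*src));
	memset(src + start_bucket, 0, n * sizeof(*src));
	return n;
}
```

Passing 0 moves the whole cluster's worth, while passing
BUCKETS_PER_CLUSTER / 2 moves only the second half.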
Signed-off-by: Joel Becker <joel.becker@oracle.com>
Signed-off-by: Mark Fasheh <mfasheh@suse.com>
ocfs2_cp_xattr_cluster() takes the last cluster of an xattr extent,
copies its buckets to the front of a new extent, and then shrinks the bucket
count of the original extent. So it's really moving the data, not
copying it.
While we're here, the function doesn't need a buffer_head for the old
extent, just the block number.
Signed-off-by: Joel Becker <joel.becker@oracle.com>
Signed-off-by: Mark Fasheh <mfasheh@suse.com>
The buffer copy loop of ocfs2_mv_xattr_bucket_cross_cluster() actually
looks a lot like ocfs2_cp_xattr_bucket(). Let's just use that instead.
We also use bucket operations to update the buckets at the start of each
extent.
Signed-off-by: Joel Becker <joel.becker@oracle.com>
Signed-off-by: Mark Fasheh <mfasheh@suse.com>
I was unsure of the JOURNAL_ACCESS parameters in
ocfs2_cp_xattr_cluster(). They're based on the function argument
't_is_new', but I couldn't quite figure out how t_is_new mapped to
allocation. ocfs2_cp_xattr_cluster() actually overwrites the target,
regardless of t_is_new.
Well, I just figured it out. So I'm adding a big fat comment for those
who come after me. ocfs2_divide_xattr_cluster() has the same behavior.
Signed-off-by: Joel Becker <joel.becker@oracle.com>
Signed-off-by: Mark Fasheh <mfasheh@suse.com>
ocfs2_cp_xattr_cluster() takes the last bucket of a full extent and
copies it over to a new extent. It then updates the headers of both
extents to reflect the new state. It is passed the first bh of
the first bucket in order to update that first extent's bucket count.
It reads and dirties the first bh of the new extent for the same reason.
However, future code wants to always dirty the entire bucket when it
is changed. So it is changed to read the entire bucket it is updating
for both extents.
Signed-off-by: Joel Becker <joel.becker@oracle.com>
Signed-off-by: Mark Fasheh <mfasheh@suse.com>
ocfs2_extend_xattr_bucket() takes an extent of buckets and shifts some
of them down to make room for a new xattr. It is passed the first bh of
the first bucket, because that is where we store the number of buckets
in the extent.
However, future code wants to always dirty the entire bucket when it
is changed. So let's pass the entire bucket into this function, skip
any block reads (we have them), and add the access/dirty logic. We also
can skip passing in the target bucket bh - we only need its block
number.
Signed-off-by: Joel Becker <joel.becker@oracle.com>
Signed-off-by: Mark Fasheh <mfasheh@suse.com>
We move the transaction into the loop because in
ocfs2_remove_extent we will double the credits in
ocfs2_extend_rotate_transaction. So with a large number of loop
iterations, we would quickly waste a lot of journal space.
Signed-off-by: Tao Ma <tao.ma@oracle.com>
Signed-off-by: Joel Becker <joel.becker@oracle.com>
Signed-off-by: Mark Fasheh <mfasheh@suse.com>
ocfs2_bucket_value_truncate() currently takes the first bh of the
bucket, and magically plays around with the value bh - even though
the bucket structure in the calling function already has it.
In addition, future code wants to always dirty the entire bucket when it
is changed. So let's pass the entire bucket into this function, skip
any block reads (we have them), and add the access/dirty logic.
ocfs2_xattr_update_value_size() is no longer necessary, as it only did
one thing other than journal access/dirty.
Signed-off-by: Joel Becker <joel.becker@oracle.com>
Signed-off-by: Mark Fasheh <mfasheh@suse.com>
Fix 2 minor things in quota. They are both found by sparse check.
1. an endian bug in ocfs2_local_quota_add_chunk.
2. change olq_alloc_dquot to static.
Signed-off-by: Tao Ma <tao.ma@oracle.com>
Signed-off-by: Mark Fasheh <mfasheh@suse.com>
fs/ocfs2/quota_local.c: In function 'olq_set_dquot':
fs/ocfs2/quota_local.c:844: warning: format '%lld' expects type 'long long int', but argument 7 has type '__le64'
fs/ocfs2/quota_local.c:844: warning: format '%lld' expects type 'long long int', but argument 8 has type '__le64'
fs/ocfs2/quota_global.c: In function '__ocfs2_sync_dquot':
fs/ocfs2/quota_global.c:457: warning: format '%lld' expects type 'long long int', but argument 8 has type 's64'
fs/ocfs2/quota_global.c:457: warning: format '%lld' expects type 'long long int', but argument 10 has type 's64'
Signed-off-by: Jan Kara <jack@suse.cz>
Signed-off-by: Mark Fasheh <mfasheh@suse.com>
Make the function return an error status and not a buffer pointer so that
it's consistent with ocfs2_read_quota_block().
Signed-off-by: Jan Kara <jack@suse.cz>
Signed-off-by: Mark Fasheh <mfasheh@suse.com>
We have to mark buffer as uptodate before calling ocfs2_journal_access() and
ocfs2_set_buffer_uptodate() does not do this for us.
Signed-off-by: Jan Kara <jack@suse.cz>
Signed-off-by: Mark Fasheh <mfasheh@suse.com>
ocfs2_bread() has become ocfs2_read_virt_blocks(), with a prototype to
match ocfs2_read_blocks(). The quota code, converting from
ocfs2_bread(), wraps the call to ocfs2_read_virt_blocks() in
ocfs2_read_quota_block(). Unfortunately, the prototype of
ocfs2_read_quota_block() matches the old prototype of ocfs2_bread().
The problem is that ocfs2_bread() returned the buffer head, and callers
assumed that a NULL pointer was indicative of error. It wasn't. This
is why ocfs2_bread() took an int*err argument as well.
The new prototype of ocfs2_read_virt_blocks() avoids this error handling
confusion. Let's change ocfs2_read_quota_block() to match.
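
The calling convention being adopted can be sketched in user-space C as
follows. The names are hypothetical and the "read" is faked. The point is
only that the return value carries the error status while the buffer comes
back through an out parameter, so callers never have to infer an error from a
NULL pointer.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>

struct buf_sketch {
	int data;
};

static struct buf_sketch cached_block = { 42 };

/*
 * Return 0 on success or a negative errno on failure. The buffer is
 * handed back through *bh and is NULL whenever an error is returned.
 */
static int read_block_sketch(int blkno, struct buf_sketch **bh)
{
	*bh = NULL;
	if (blkno < 0)
		return -EIO;	/* stand-in for a failed read */

	*bh = &cached_block;
	return 0;
}
```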
Signed-off-by: Joel Becker <joel.becker@oracle.com>
Acked-by: Jan Kara <jack@suse.cz>
Signed-off-by: Mark Fasheh <mfasheh@suse.com>
Enable quota usage tracking on mount and disable it on umount. Also
add support for quota on and quota off quotactls and usrquota and
grpquota mount options. Add quota features among supported ones.
Signed-off-by: Jan Kara <jack@suse.cz>
Signed-off-by: Mark Fasheh <mfasheh@suse.com>
Implement functions for recovery after a crash. The functions just
read the local quota file and sync the info to the global quota file.
Signed-off-by: Jan Kara <jack@suse.cz>
Signed-off-by: Mark Fasheh <mfasheh@suse.com>
This patch creates a work queue for periodic syncing of locally cached quota
information to the global quota files. We constantly queue a delayed work
item, to get the periodic behavior.
Signed-off-by: Mark Fasheh <mfasheh@suse.com>
Acked-by: Jan Kara <jack@suse.cz>
Add quota calls for allocation and freeing of inodes and space, also update
estimates on number of needed credits for a transaction. Move out inode
allocation from ocfs2_mknod_locked() because vfs_dq_init() must be called
outside of a transaction.
Signed-off-by: Jan Kara <jack@suse.cz>
Signed-off-by: Mark Fasheh <mfasheh@suse.com>
For each quota type, each node has a local quota file. In this file it stores
the changes users have made to disk usage via this node. Once in a while this
information is synced to the global file (and thus with other nodes) so that
limits enforcement works at least approximately.
Global quota files contain all the information about usage and limits. It's
mostly handled by the generic VFS code (which implements a trie of structures
inside a quota file). We only have to provide functions to convert structures
from on-disk format to in-memory one. We also have to provide wrappers for
various quota functions starting transactions and acquiring necessary cluster
locks before the actual IO is really started.
Signed-off-by: Jan Kara <jack@suse.cz>
Signed-off-by: Mark Fasheh <mfasheh@suse.com>
Mark system files as not subject to quota accounting. This prevents
possible recursions into quota code and thus deadlocks.
Signed-off-by: Jan Kara <jack@suse.cz>
Signed-off-by: Mark Fasheh <mfasheh@suse.com>
OCFS2 can easily support nested transactions. We just have to take care
not to spoil the statistics or acquire the semaphore unnecessarily.
Signed-off-by: Jan Kara <jack@suse.cz>
Signed-off-by: Mark Fasheh <mfasheh@suse.com>
During an xattr set, when we move a xattr which was stored in inode to the
outside bucket, we have to delete it and it will use the old value of
xis->not_found. xis->not_found is removed by ocfs2_calc_xattr_set_need
though, so we must restore it.
Signed-off-by: Tao Ma <tao.ma@oracle.com>
Signed-off-by: Mark Fasheh <mfasheh@suse.com>
When we extend one xattr's value to a large size, the old value size might
be smaller than the size of a value root. In those cases, we still need to
guess the metadata allocation.
Reported-by: Tiger Yang <tiger.yang@oracle.com>
Signed-off-by: Tao Ma <tao.ma@oracle.com>
Signed-off-by: Mark Fasheh <mfasheh@suse.com>
JBD2 is fully backwards compatible with JBD and it's been tested enough with
Ocfs2 that we can clean this code up now.
Signed-off-by: Mark Fasheh <mfasheh@suse.com>
Now that we've centralized the ocfs2_read_virt_blocks() code, let's use
it in ocfs2_read_dir_block().
Signed-off-by: Joel Becker <joel.becker@oracle.com>
Signed-off-by: Mark Fasheh <mfasheh@suse.com>
The ocfs2_read_dir_block() function really maps an inode's virtual
blocks to physical ones before calling ocfs2_read_blocks(). Let's
extract that to common code, because other places might want to do that.
Other than the block number being virtual, ocfs2_read_virt_blocks()
takes the same arguments as ocfs2_read_blocks(). It converts those
virtual block numbers to physical before calling ocfs2_read_blocks()
directly. If the blocks asked for are discontiguous, this can mean
multiple calls to ocfs2_read_blocks(), but this is mostly hidden from
the caller.
Like ocfs2_read_blocks(), the caller can pass in an existing
buffer_head. This is usually done to pick up some readahead I/O.
ocfs2_read_virt_blocks() checks the buffer_head's block number
against the extent map - it must match.
Signed-off-by: Joel Becker <joel.becker@oracle.com>
Signed-off-by: Mark Fasheh <mfasheh@suse.com>
Add an optional validation hook to ocfs2_read_blocks(). Now the
validation function is only called when a block was actually read off of
disk. It is not called when the buffer was in cache.
We add a buffer state bit BH_NeedsValidate to flag these buffers. It
must always be one higher than the last JBD2 buffer state bit.
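
The intended behavior of the hook can be sketched in user-space C as below.
All names are hypothetical, the disk read is faked, and a simple flag stands
in for the buffer state bits. The validator runs only when the block was not
already cached. Dropping a block that fails validation from the cache is a
choice of this sketch, not something the text above specifies.

```c
#include <assert.h>
#include <stddef.h>

struct buf_sketch {
	int uptodate;	/* nonzero when the block is already cached */
	int signature;
};

typedef int (*validate_func_sketch)(struct buf_sketch *bh);

static int validate_calls;	/* counts how often the validator ran */

static int check_signature(struct buf_sketch *bh)
{
	validate_calls++;
	return bh->signature == 0x4f43 ? 0 : -1;
}

static int read_block_sketch(struct buf_sketch *bh,
			     validate_func_sketch validate)
{
	int status;

	if (bh->uptodate)
		return 0;		/* cache hit: no validation */

	bh->uptodate = 1;		/* stand-in for the disk read */
	if (!validate)
		return 0;		/* the hook is optional */

	status = validate(bh);
	if (status)
		bh->uptodate = 0;	/* sketch choice: drop a bad block */
	return status;
}
```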
The dinode, dirblock, extent_block, and xattr_block validators are
lifted to this scheme directly. The group_descriptor validator needs to
be split into two pieces. The first part only needs the gd buffer and
is passed to ocfs2_read_block(). The second part requires the dinode as
well, and is called every time. It's only 3 compares, so it's tiny.
This also allows us to clean up the non-fatal gd check used by resize.c.
It now has no magic argument.
Signed-off-by: Joel Becker <joel.becker@oracle.com>
Signed-off-by: Mark Fasheh <mfasheh@suse.com>
We weren't consistently checking xattr blocks after we read them.
Most places checked the signature, but none checked xb_blkno or
xb_fs_signature. Create a toplevel ocfs2_read_xattr_block() that does
the read and the validation.
Signed-off-by: Joel Becker <joel.becker@oracle.com>
Signed-off-by: Mark Fasheh <mfasheh@suse.com>
We have ocfs2_bread() as a vestige of the original ext-based dir code.
It's only used by directories, though. Turn it into
ocfs2_read_dir_block(), with a prototype matching the other metadata
read functions. It's set up to validate dirblocks when the time comes.
Signed-off-by: Joel Becker <joel.becker@oracle.com>
Signed-off-by: Mark Fasheh <mfasheh@suse.com>
We weren't consistently checking extent blocks after we read them.
Most places checked the signature, but none checked h_blkno or
h_fs_signature. Create a toplevel ocfs2_read_extent_block() that does
the read and the validation.
Signed-off-by: Joel Becker <joel.becker@oracle.com>
Signed-off-by: Mark Fasheh <mfasheh@suse.com>
Random places in the code would check a group descriptor bh to see if it
was valid. The previous commit unified descriptor block reads,
validating all block reads in the same place. Thus, these checks are no
longer necessary. Rather than eliminate them, however, we change them
to BUG_ON() checks. This ensures the assumptions remain true. All of
the code paths to these checks have been audited to ensure they come
from a validated descriptor read.
Signed-off-by: Joel Becker <joel.becker@oracle.com>
Signed-off-by: Mark Fasheh <mfasheh@suse.com>
We have a clean call for validating group descriptors, but every place
that wants one always does a read_block()+validate() call pair. Create
a toplevel ocfs2_read_group_descriptor() that does the right
thing. This allows us to leverage the single call point later for
fancier handling. We also add validation of gd->bg_generation against
the superblock and gd->bg_blkno against the block we thought we read.
Signed-off-by: Joel Becker <joel.becker@oracle.com>
Signed-off-by: Mark Fasheh <mfasheh@suse.com>
Currently the validation of group descriptors is directly duplicated so
that one version can error the filesystem and the other (resize) can
just report the problem. Consolidate to one function that takes a
boolean. Wrap that function with the old call for the old users.
This is in preparation for lifting the read+validate step into a
single function.
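The "one function that takes a boolean, wrapped by the old call" shape might
look like the sketch below. All names are hypothetical, the single check shown
is only an example of a descriptor sanity test, and mock_fs_errors merely
stands in for whatever "erroring the filesystem" does in the real code.

```c
#include <errno.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical, trimmed-down group descriptor. */
struct mock_gd {
    uint16_t bg_bits;
    uint16_t bg_free_bits_count;
};

static int mock_fs_errors; /* counts how often we "errored the filesystem" */

/* Single validator; report_only selects the non-fatal (resize) behaviour. */
static int mock_validate_gd_full(const struct mock_gd *gd, bool report_only)
{
    if (gd->bg_free_bits_count > gd->bg_bits) {
        if (!report_only)
            mock_fs_errors++; /* stands in for marking the fs in error */
        return -EIO;
    }
    return 0;
}

/* The old entry point, kept as a thin wrapper for the existing callers. */
static int mock_validate_gd(const struct mock_gd *gd)
{
    return mock_validate_gd_full(gd, false);
}
```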
Signed-off-by: Joel Becker <joel.becker@oracle.com>
Signed-off-by: Mark Fasheh <mfasheh@suse.com>
Random places in the code would check a dinode bh to see if it was
valid. Not only did they do different levels of validation, they
handled errors in different ways.
The previous commit unified inode block reads, validating all block
reads in the same place. Thus, these haphazard checks are no longer
necessary. Rather than eliminate them, however, we change them to
BUG_ON() checks. This ensures the assumptions remain true. All of the
code paths to these checks have been audited to ensure they come from a
validated inode read.
Signed-off-by: Joel Becker <joel.becker@oracle.com>
Signed-off-by: Mark Fasheh <mfasheh@suse.com>
The ocfs2 code currently reads inodes off disk with a simple
ocfs2_read_block() call. Each place that does this has a different set
of sanity checks it performs. Some check only the signature. A couple
validate the block number (the block read vs di->i_blkno). A couple
others check for VALID_FL. Only one place validates i_fs_generation. A
couple check nothing. Even when an error is found, they don't all do
the same thing.
We wrap inode reading into ocfs2_read_inode_block(). This will validate
all the above fields, going readonly if they are invalid (they never
should be). ocfs2_read_inode_block_full() is provided for the places
that want to pass read_block flags. Every caller is passing a struct
inode with a valid ip_blkno, so we don't need a separate blkno argument
either.
We will remove the validation checks from the rest of the code in a
later commit, as they are no longer necessary.
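As a rough userspace illustration of checking all four fields in one place
(signature, block number, VALID_FL, generation), consider the sketch below.
The struct layout, the flag value and the signature string are assumptions
made for the example, and the function returns an error code rather than
taking the filesystem readonly.

```c
#include <errno.h>
#include <stdint.h>
#include <string.h>

#define MOCK_INODE_SIGNATURE "INODE01" /* assumed signature string */
#define MOCK_VALID_FL        0x1u      /* assumed flag value */

/* Hypothetical, trimmed-down on-disk inode. */
struct mock_dinode {
    char     i_signature[8];
    uint32_t i_flags;
    uint32_t i_fs_generation;
    uint64_t i_blkno;
};

/* Run every sanity check in one place; 0 means the block looks valid. */
static int mock_validate_inode_block(const struct mock_dinode *di,
                                     uint64_t blkno_read,
                                     uint32_t fs_generation)
{
    if (memcmp(di->i_signature, MOCK_INODE_SIGNATURE,
               strlen(MOCK_INODE_SIGNATURE)) != 0)
        return -EINVAL;               /* wrong signature */
    if (di->i_blkno != blkno_read)
        return -EINVAL;               /* not the block we asked for */
    if (!(di->i_flags & MOCK_VALID_FL))
        return -EINVAL;               /* inode not marked valid */
    if (di->i_fs_generation != fs_generation)
        return -EINVAL;               /* belongs to another filesystem */
    return 0;
}
```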
Signed-off-by: Joel Becker <joel.becker@oracle.com>
Signed-off-by: Mark Fasheh <mfasheh@suse.com>
This patch adds the Kconfig option "CONFIG_OCFS2_FS_POSIX_ACL"
and mount options "acl" to enable acls in Ocfs2.
Signed-off-by: Tiger Yang <tiger.yang@oracle.com>
Signed-off-by: Mark Fasheh <mfasheh@suse.com>
We need to get the parent directory's acls and let the new child inherit them.
To do this, we add additional calculations for data/metadata allocation.
Signed-off-by: Tiger Yang <tiger.yang@oracle.com>
Signed-off-by: Mark Fasheh <mfasheh@suse.com>
This function is used to update acl xattrs during file mode changes.
Signed-off-by: Tiger Yang <tiger.yang@oracle.com>
Signed-off-by: Mark Fasheh <mfasheh@suse.com>
This function is used to enhance permission checking with POSIX ACLs.
Signed-off-by: Tiger Yang <tiger.yang@oracle.com>
Signed-off-by: Mark Fasheh <mfasheh@suse.com>
This patch adds POSIX ACL (access control list) APIs in ocfs2. We convert
struct posix_acl to many ocfs2_acl_entry and regard them as an extended
attribute entry.
Signed-off-by: Tiger Yang <tiger.yang@oracle.com>
Signed-off-by: Mark Fasheh <mfasheh@suse.com>
This function does the work of ocfs2_xattr_get under an open lock.
Signed-off-by: Tiger Yang <tiger.yang@oracle.com>
Signed-off-by: Mark Fasheh <mfasheh@suse.com>
Security attributes must be set when creating a new inode.
We do this in three steps.
- First, get security xattr's name and value by security_operation
- Calculate and reserve the meta data and clusters needed by this security
xattr before starting transaction
- Finally, we set it before add_entry
Signed-off-by: Tiger Yang <tiger.yang@oracle.com>
Signed-off-by: Mark Fasheh <mfasheh@suse.com>
This patch adds security xattr set/get/list APIs to
support security attributes in Ocfs2.
Signed-off-by: Tiger Yang <tiger.yang@oracle.com>
Signed-off-by: Mark Fasheh <mfasheh@suse.com>
This function is used to set xattrs in a started transaction. It is only
called during inode creation, for the initial security/acl xattrs of the
new inode. These xattrs can be put into the ibody or an extent block, so
the xattr bucket is not used in this case.
Signed-off-by: Tiger Yang <tiger.yang@oracle.com>
Signed-off-by: Mark Fasheh <mfasheh@suse.com>
Move out inode allocation from ocfs2_mknod_locked() because
vfs_dq_init() must be called outside of a transaction.
Signed-off-by: Jan Kara <jack@suse.cz>
Signed-off-by: Tiger Yang <tiger.yang@oracle.com>
Signed-off-by: Mark Fasheh <mfasheh@suse.com>
This patch genericizes the high level handling of extent removal.
ocfs2_remove_btree_range() is nearly identical to
__ocfs2_remove_inode_range(), except that extent tree operations have been
used where necessary. We update ocfs2_remove_inode_range() to use the
generic helper. Now extent tree based structures have an easy way to
truncate ranges.
Signed-off-by: Mark Fasheh <mfasheh@suse.com>
Acked-by: Joel Becker <joel.becker@oracle.com>
In the current ocfs2 xattr code, an xattr set is divided into
many steps and many transactions are used, so the xattr set
process does not behave like a real transaction. This patch
tries to merge all of the transactions into one. Another
benefit is that acl can use it easily now.
I don't merge the transactions for deleting xattrs when we
remove an inode. The reason is that if we have a large number
of xattrs and every xattr has a large value (large enough
for outside storage), the whole transaction would be
huge and it looks like jbd can't handle it (I hit a
jbd complaint once). The old inode removal is also divided
into many steps, so I'd like to leave it as it is.
Note:
In xattr set, I try to avoid ocfs2_extend_trans, since if
the credits aren't enough for the extension, it will commit
all the dirty blocks and create a new transaction, which may
lead to inconsistency in metadata. All remaining
ocfs2_extend_trans calls are safe now.
Signed-off-by: Tao Ma <tao.ma@oracle.com>
Signed-off-by: Mark Fasheh <mfasheh@suse.com>
In ocfs2 xattr set, we reserve metadata and clusters at every place
they are needed. This is time-consuming and inefficient, so this
patch tries to reserve metadata and clusters at the beginning of
ocfs2_xattr_set.
Signed-off-by: Tao Ma <tao.ma@oracle.com>
Signed-off-by: Mark Fasheh <mfasheh@suse.com>
Move clusters free process into dealloc context so that
they can be freed after the transaction.
Signed-off-by: Tao Ma <tao.ma@oracle.com>
Signed-off-by: Mark Fasheh <mfasheh@suse.com>
Currently in ocfs2 xattr set, the whole process is divided into many small
parts that are wrapped in different transactions, so the set does not
behave like a real transaction. We want to integrate it
into a real one.
In some cases we will allocate some clusters and free some in just one
transaction. E.g., one xattr is larger than the inline size, so it and its
value root are stored within the inode while the value is outside in a
cluster. If we then update it with a smaller value (larger than the
size of the root but smaller than the inline size), we may need to free the
outside cluster while allocating a new bucket (one cluster), since the
inode may now be full. The old solution locks the global_bitmap (if the
local alloc failed in a stress test) and then the truncate log. This
causes an ABBA deadlock with the truncate log flush.
This patch adds the cluster free to dealloc_ctxt, so that we can record
the freed clusters during the transaction and then free them after we
release the global_bitmap in xattr set.
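The idea of recording frees during the transaction and performing them
afterwards can be sketched as a small deferred-free list. Everything here is
hypothetical (names, the fixed-size array, the callback); it only shows the
ordering, not how the real dealloc context is implemented.

```c
#include <stddef.h>
#include <stdint.h>

#define MOCK_MAX_DEFERRED 8 /* arbitrary capacity for the sketch */

/* Hypothetical deferred-free context. */
struct mock_dealloc_ctxt {
    uint64_t clusters[MOCK_MAX_DEFERRED];
    size_t   count;
};

/* During the transaction: only remember what must be freed later. */
static int mock_defer_free(struct mock_dealloc_ctxt *ctxt, uint64_t cluster)
{
    if (ctxt->count == MOCK_MAX_DEFERRED)
        return -1;
    ctxt->clusters[ctxt->count++] = cluster;
    return 0;
}

/* After the transaction and the bitmap lock are gone: do the real frees. */
static size_t mock_run_deallocs(struct mock_dealloc_ctxt *ctxt,
                                void (*free_cluster)(uint64_t))
{
    size_t freed = 0;

    while (ctxt->count > 0) {
        free_cluster(ctxt->clusters[--ctxt->count]);
        freed++;
    }
    return freed;
}

/* Demo callback: sums the cluster numbers it was asked to free. */
static uint64_t mock_freed_total;

static void mock_count_free(uint64_t cluster)
{
    mock_freed_total += cluster;
}
```

Because nothing is freed until mock_run_deallocs() runs, the code that holds
the allocation lock never needs the lock that protects the free path.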
Signed-off-by: Tao Ma <tao.ma@oracle.com>
Signed-off-by: Mark Fasheh <mfasheh@suse.com>
When the first block of a bucket is filled up with xattr
entries, we normally extend the bucket. But if we are
just replacing one xattr with a small length, we don't need
to extend it. This is important since we will calculate
what we need before the transaction and in this situation
no resources will be allocated.
Signed-off-by: Tao Ma <tao.ma@oracle.com>
Signed-off-by: Mark Fasheh <mfasheh@suse.com>
When we call ocfs2_init_xattr_bucket, we deem that the new buffer head
will be written to disk immediately, so we just use sb_getblk. But in
some cases the buffer may have already been in ocfs2 uptodate cache,
so we only call ocfs2_set_buffer_uptodate if the buffer head isn't
in the cache.
Signed-off-by: Tao Ma <tao.ma@oracle.com>
Signed-off-by: Joel Becker <joel.becker@oracle.com>
Signed-off-by: Mark Fasheh <mfasheh@suse.com>
Joel has refactored the xattr bucket code and made the xattr bucket a
general wrapper. In ocfs2_defrag_xattr_bucket, the bucket has already been
passed in, so there is no need to allocate a new one and read it.
Signed-off-by: Tao Ma <tao.ma@oracle.com>
Signed-off-by: Joel Becker <joel.becker@oracle.com>
Signed-off-by: Mark Fasheh <mfasheh@suse.com>
The ocfs2_xattr_set_entry_in_bucket() function is already working on an
ocfs2_xattr_bucket structure, so let's use the bucket API.
Signed-off-by: Joel Becker <joel.becker@oracle.com>
Signed-off-by: Mark Fasheh <mfasheh@suse.com>
Use the ocfs2_xattr_bucket abstraction for reading and writing the
bucket in ocfs2_defrag_xattr_bucket().
Signed-off-by: Joel Becker <joel.becker@oracle.com>
Signed-off-by: Mark Fasheh <mfasheh@suse.com>
Use the ocfs2_xattr_bucket abstraction in
ocfs2_xattr_create_index_block() and its helpers. We get more efficient
reads, a lot less buffer_head munging, and nicer code to boot. While
we're at it, ocfs2_xattr_update_xattr_search() becomes void.
Signed-off-by: Joel Becker <joel.becker@oracle.com>
Signed-off-by: Mark Fasheh <mfasheh@suse.com>
Change the ocfs2_xattr_bucket_find() function to use ocfs2_xattr_bucket
as its abstraction. This makes for more efficient reads, as buckets are
linear blocks, and also has improved caching characteristics. It also
reads better.
Signed-off-by: Joel Becker <joel.becker@oracle.com>
Signed-off-by: Mark Fasheh <mfasheh@suse.com>
The ocfs2_xattr_bucket structure is a nice abstraction, but it is a bit
large to have on the stack. Just like ocfs2_path, let's allocate it
with a ocfs2_xattr_bucket_new() function.
We can now store the inode on the bucket, cleaning up all the other
bucket functions. While we're here, we catch another place or two that
wasn't using ocfs2_read_xattr_bucket().
Updates:
- No longer allocating xis.bucket, as it will never be used.
Signed-off-by: Joel Becker <joel.becker@oracle.com>
Signed-off-by: Mark Fasheh <mfasheh@suse.com>
Now that the places that copy whole buckets are using struct
ocfs2_xattr_bucket, we can do the copy in a dedicated function.
Signed-off-by: Joel Becker <joel.becker@oracle.com>
Signed-off-by: Mark Fasheh <mfasheh@suse.com>
A common action is to call ocfs2_journal_access() and
ocfs2_journal_dirty() on the buffer heads of an xattr bucket. Let's
create nice wrappers.
While we're there, let's drop the places that try to be smart by writing
only the first and last blocks of a bucket. A bucket is contiguous, so
writing the whole thing is actually more efficient.
Signed-off-by: Joel Becker <joel.becker@oracle.com>
Signed-off-by: Mark Fasheh <mfasheh@suse.com>
The ocfs2_read_xattr_bucket() function would read an xattr bucket into a
list of buffer heads. However, we have a nice ocfs2_xattr_bucket
structure. Let's have it fill that out instead.
In addition, ocfs2_read_xattr_bucket() would initialize buffer heads for
a bucket that's never been on disk before. That's confusing. Let's
call that functionality ocfs2_init_xattr_bucket().
The functions ocfs2_cp_xattr_bucket() and ocfs2_half_xattr_bucket() are
updated to use the ocfs2_xattr_bucket structure rather than raw bh
lists. That way they can use the new read/init calls. In addition,
they drop the wasted read of an existing target bucket.
Signed-off-by: Joel Becker <joel.becker@oracle.com>
Signed-off-by: Mark Fasheh <mfasheh@suse.com>
A common theme is walking all the buffer heads on an ocfs2_xattr_bucket
and releasing them. Let's wrap that.
Signed-off-by: Joel Becker <joel.becker@oracle.com>
Signed-off-by: Mark Fasheh <mfasheh@suse.com>
The xattr code often wants to access the ocfs2_xattr_header at the start
of a bucket. Rather than walk the pointer chains, let's just create
another nice macro. As a side benefit, we can get rid of the mostly
spurious ->bu_xh element on the bucket structure. The idea is ripped
from the ocfs2_path code.
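A userspace sketch of such accessor macros, covering the block number, the
per-block data pointer and the header at the start of the bucket, could look
like this. The mock_ types, the macro names and the field names are
assumptions for illustration and are not copied from the patch.

```c
#include <stdint.h>

#define MOCK_MAX_BLOCKS 16 /* assumed upper bound, for the sketch only */

/* Hypothetical stand-ins for buffer_head and the on-disk header. */
struct mock_bh {
    uint64_t b_blocknr;
    char    *b_data;
};

struct mock_xattr_header {
    uint16_t xh_count;
};

struct mock_bucket {
    struct mock_bh *bu_bhs[MOCK_MAX_BLOCKS];
    int bu_blocks;
};

/* Block number of the bucket: the block number of its first buffer. */
#define mock_bucket_blkno(_b)     ((_b)->bu_bhs[0]->b_blocknr)
/* Data pointer of the n-th block in the bucket. */
#define mock_bucket_block(_b, _n) ((_b)->bu_bhs[(_n)]->b_data)
/* Header at the start of the bucket, derived instead of stored. */
#define mock_bucket_xh(_b) \
    ((struct mock_xattr_header *)mock_bucket_block((_b), 0))
```

Deriving the header from block 0 is what makes a separate cached header
pointer on the bucket unnecessary.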
Signed-off-by: Joel Becker <joel.becker@oracle.com>
Signed-off-by: Mark Fasheh <mfasheh@suse.com>
The xattr code often wants to access the data pointer for blocks in an
xattr bucket. This is usually found by dereferencing the bh array
hanging off of the ocfs2_xattr_bucket structure. Rather than do this
all the time, let's provide a nice little macro. The idea is ripped
from the ocfs2_path code.
Signed-off-by: Joel Becker <joel.becker@oracle.com>
Signed-off-by: Mark Fasheh <mfasheh@suse.com>
The xattr code often wants to know the block number of an xattr bucket.
This is usually found by dereferencing the first bh hanging off of the
ocfs2_xattr_bucket structure. Rather than do this all the time, let's
provide a nice little macro. The idea is ripped from the ocfs2_path
code.
Signed-off-by: Joel Becker <joel.becker@oracle.com>
Signed-off-by: Mark Fasheh <mfasheh@suse.com>
The ocfs2_xattr_bucket structure keeps track of the buffers for one
xattr bucket. Let's prefix the fields for easier code navigation.
Signed-off-by: Joel Becker <joel.becker@oracle.com>
Signed-off-by: Mark Fasheh <mfasheh@suse.com>
* git://git.kernel.org/pub/scm/linux/kernel/git/davem/net-next-2.6: (1429 commits)
net: Allow dependancies of FDDI & Tokenring to be modular.
igb: Fix build warning when DCA is disabled.
net: Fix warning fallout from recent NAPI interface changes.
gro: Fix potential use after free
sfc: If AN is enabled, always read speed/duplex from the AN advertising bits
sfc: When disabling the NIC, close the device rather than unregistering it
sfc: SFT9001: Add cable diagnostics
sfc: Add support for multiple PHY self-tests
sfc: Merge top-level functions for self-tests
sfc: Clean up PHY mode management in loopback self-test
sfc: Fix unreliable link detection in some loopback modes
sfc: Generate unique names for per-NIC workqueues
802.3ad: use standard ethhdr instead of ad_header
802.3ad: generalize out mac address initializer
802.3ad: initialize ports LACPDU from const initializer
802.3ad: remove typedef around ad_system
802.3ad: turn ports is_individual into a bool
802.3ad: turn ports is_enabled into a bool
802.3ad: make ntt bool
ixgbe: Fix set_ringparam in ixgbe to use the same memory pools.
...
Fixed trivial IPv4/6 address printing conflicts in fs/cifs/connect.c due
to the conversion to %pI (in this networking merge) and the addition of
doing IPv6 addresses (from the earlier merge of CIFS).
Define the OCFS2_FEATURE_COMPAT_JBD2 bit in the filesystem header.
Signed-off-by: Joel Becker <joel.becker@oracle.com>
Signed-off-by: Mark Fasheh <mfasheh@suse.com>
When we create xattr bucket during the process of xattr set, we always
need to update the ocfs2_xattr_search since even if the bucket size is
the same as block size, the offset will change because of the removal
of the ocfs2_xattr_block header.
Signed-off-by: Tao Ma <tao.ma@oracle.com>
Signed-off-by: Mark Fasheh <mfasheh@suse.com>
Conflicts:
fs/nfsd/nfs4recover.c
Manually fixed above to use new creds API functions, e.g.
nfs4_save_creds().
Signed-off-by: James Morris <jmorris@namei.org>
We're panicking in ocfs2_read_blocks_sync() if a jbd-managed buffer is seen.
At first glance, this seems ok but in reality it can happen. My test case
was to just run 'exorcist'. A struct inode is being pushed out of memory but
is then re-read at a later time, before the buffer has been checkpointed by
jbd. This causes a BUG to be hit in ocfs2_read_blocks_sync().
Reviewed-by: Joel Becker <joel.becker@oracle.com>
Signed-off-by: Mark Fasheh <mfasheh@suse.com>
In init_dlmfs_fs(), if kmem_cache_create() fails, the code uses the return value from
bdi_init(). The correct behavior is to set status to -ENOMEM before going to "bail:".
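The bug pattern is a stale status variable on a goto-based error path. A
minimal userspace sketch, with hypothetical names and boolean parameters
standing in for the outcome of the two calls:

```c
#include <errno.h>
#include <stdbool.h>

/* Hypothetical init sequence: one call that returns a status, then one
 * call that reports failure by returning NULL. */
static int mock_init(bool bdi_ok, bool cache_ok)
{
    int status;

    status = bdi_ok ? 0 : -EIO; /* stands in for bdi_init() */
    if (status)
        goto bail;

    if (!cache_ok) {            /* stands in for a NULL from kmem_cache_create() */
        status = -ENOMEM;       /* without this, status would still be 0 */
        goto bail;
    }

    return 0;

bail:
    return status;
}
```

If the assignment to -ENOMEM is left out, the failure path returns the 0 left
over from the earlier successful call and the caller believes init succeeded.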
Signed-off-by: Coly Li <coyli@suse.de>
Acked-by: Sunil Mushran <sunil.mushran@oracle.com>
Signed-off-by: Mark Fasheh <mfasheh@suse.com>
In ocfs2_unlock_ast(), call wake_up() on lockres before releasing
the spin lock on it. As soon as the spin lock is released, the
lockres can be freed.
Signed-off-by: David Teigland <teigland@redhat.com>
Signed-off-by: Mark Fasheh <mfasheh@suse.com>
The locking_state dump, ocfs2_dlm_seq_show, reads the lvb on locks where it
has not yet been initialized by a lock call.
Signed-off-by: David Teigland <teigland@redhat.com>
Acked-by: Joel Becker <joel.becker@oracle.com>
Signed-off-by: Mark Fasheh <mfasheh@suse.com>
Conflicts:
security/keys/internal.h
security/keys/process_keys.c
security/keys/request_key.c
Fixed conflicts above by using the non 'tsk' versions.
Signed-off-by: James Morris <jmorris@namei.org>
Wrap access to task credentials so that they can be separated more easily from
the task_struct during the introduction of COW creds.
Change most current->(|e|s|fs)[ug]id to current_(|e|s|fs)[ug]id().
Change some task->e?[ug]id to task_e?[ug]id(). In some places it makes more
sense to use RCU directly rather than a convenient wrapper; these will be
addressed by later patches.
Signed-off-by: David Howells <dhowells@redhat.com>
Reviewed-by: James Morris <jmorris@namei.org>
Acked-by: Serge Hallyn <serue@us.ibm.com>
Acked-by: Mark Fasheh <mfasheh@suse.com>
Cc: Joel Becker <joel.becker@oracle.com>
Cc: ocfs2-devel@oss.oracle.com
Signed-off-by: James Morris <jmorris@namei.org>
ocfs2_xattr_block_get() calls ocfs2_xattr_search() to find an external
xattr, but doesn't check the search result that is passed back via struct
ocfs2_xattr_search. Add a check for search result, and pass back -ENODATA if
the xattr search failed. This avoids a later NULL pointer error.
Signed-off-by: Tiger Yang <tiger.yang@oracle.com>
Signed-off-by: Mark Fasheh <mfasheh@suse.com>
Signed-off-by: Dmitri Monakhov <dmonakhov@openvz.org>
Cc: Jeff Moyer <jmoyer@redhat.com>
Cc: Mark Fasheh <mark.fasheh@oracle.com>
Cc: Joel Becker <Joel.Becker@oracle.com>
Cc: Nick Piggin <nickpiggin@yahoo.com.au>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Mark Fasheh <mfasheh@suse.com>
In ocfs2/xattr, we must make sure that xattrs which have the same hash value
exist in the same bucket so that the search scheme can work. But in the old
implementation, when we want to extend a bucket, we just move half of the
xattrs to the new bucket. This works in most cases, but if we are unlucky
we will split 2 xattrs with the same hash value into 2 different buckets.
This means that an xattr from the previous bucket cannot be found anymore.
This patch fixes this problem by finding the right position when extending
the bucket and extending an empty bucket if needed.
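One way to pick a split point that never separates equal hashes is to search
outward from the middle for the nearest position where the hash changes. The
sketch below is an illustration under that assumption, with a hypothetical
function name; it is not the algorithm from the patch.

```c
#include <stddef.h>
#include <stdint.h>

/*
 * Given name hashes sorted ascending, return the index of the first entry
 * that should move to the new bucket. Entries with equal hashes are never
 * separated. Returns count when no boundary exists, i.e. the new bucket
 * starts out empty.
 */
static size_t mock_find_split(const uint32_t *hashes, size_t count)
{
    size_t middle = count / 2;
    size_t delta;

    for (delta = 0; delta < count; delta++) {
        size_t right = middle + delta;

        /* A boundary at i means hashes[i - 1] != hashes[i]. */
        if (right > 0 && right < count &&
            hashes[right - 1] != hashes[right])
            return right;

        if (delta <= middle) {
            size_t left = middle - delta;

            if (left > 0 && left < count &&
                hashes[left - 1] != hashes[left])
                return left;
        }
    }
    return count;
}
```

For the hashes {1, 2, 2, 2} a plain half split would cut between two entries
with hash 2; the search moves the boundary to index 1 instead.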
Signed-off-by: Tao Ma <tao.ma@oracle.com>
Cc: Joel Becker <joel.becker@oracle.com>
Signed-off-by: Mark Fasheh <mfasheh@suse.com>
In ocfs2_page_mkwrite, we return -EINVAL when we find that the page mapping
isn't up to date, which causes the user space program to get SIGBUS and
exit. The reason is that during a racing writeable mmap, we do
unmap_mapping_range in ocfs2_data_downconvert_worker. The good thing is
that if we return 0 in page_mkwrite, the VFS will retry the fault and then
call page_mkwrite again, so it is safe to return 0 here.
Signed-off-by: Tao Ma <tao.ma@oracle.com>
Signed-off-by: Mark Fasheh <mfasheh@suse.com>
Patch sets journal descriptor to NULL after the journal is shutdown.
This ensures that jbd2_journal_release_jbd_inode(), which removes the
jbd2 inode from txn lists, can be called safely from ocfs2_clear_inode()
even after the journal has been shutdown.
Signed-off-by: Sunil Mushran <sunil.mushran@oracle.com>
Signed-off-by: Joel Becker <joel.becker@oracle.com>
Signed-off-by: Mark Fasheh <mfasheh@suse.com>
On failure, ocfs2_start_trans() returns values like ERR_PTR(-ENOMEM),
so checking whether handle is NULL is wrong. Fix the checks to use IS_ERR().
Jan has made the patch for the other parts of ocfs2 (thanks, Jan), so
this is just the fix for fs/ocfs2/xattr.c.
Signed-off-by: Tao Ma <tao.ma@oracle.com>
Cc: Jan Kara <jack@suse.cz>
Signed-off-by: Joel Becker <joel.becker@oracle.com>
Signed-off-by: Mark Fasheh <mfasheh@suse.com>
We forgot to set i_nlink to 0 when returning due to error from ocfs2_mknod_locked()
and thus inode was not properly released via ocfs2_delete_inode() (e.g. claimed
space was not released). Fix it.
Signed-off-by: Jan Kara <jack@suse.cz>
Signed-off-by: Joel Becker <joel.becker@oracle.com>
Signed-off-by: Mark Fasheh <mfasheh@suse.com>
new_inode() does not return ERR_PTR() but NULL in case of failure. Correct
checking of the return value.
Signed-off-by: Jan Kara <jack@suse.cz>
Signed-off-by: Joel Becker <joel.becker@oracle.com>
Signed-off-by: Mark Fasheh <mfasheh@suse.com>
On failure, ocfs2_start_trans() returns values like ERR_PTR(-ENOMEM).
Thus checks for !handle are wrong. Fix them to use IS_ERR().
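To see why a !handle check misses the failure, here is a userspace imitation
of the kernel's error-pointer convention. The mock_ helpers only mimic
ERR_PTR()/IS_ERR()/PTR_ERR(), and mock_start_trans() is a hypothetical
stand-in for ocfs2_start_trans().

```c
#include <errno.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define MOCK_MAX_ERRNO 4095

/* Encode a negative errno in a pointer value, as ERR_PTR() does. */
static inline void *mock_err_ptr(long error)
{
    return (void *)(intptr_t)error;
}

/* Recover the errno from an error pointer, as PTR_ERR() does. */
static inline long mock_ptr_err(const void *ptr)
{
    return (long)(intptr_t)ptr;
}

/* True when ptr lies in the top MOCK_MAX_ERRNO addresses, as IS_ERR() does. */
static inline bool mock_is_err(const void *ptr)
{
    return (uintptr_t)ptr >= (uintptr_t)-MOCK_MAX_ERRNO;
}

struct mock_handle {
    int credits;
};

/* Hypothetical stand-in for ocfs2_start_trans(): never returns NULL. */
static struct mock_handle *mock_start_trans(struct mock_handle *h, int credits)
{
    if (credits <= 0)
        return mock_err_ptr(-ENOMEM);
    h->credits = credits;
    return h;
}
```

A failed mock_start_trans() yields a non-NULL pointer, so "if (!handle)" is
never taken and the caller goes on to dereference an error value.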
Signed-off-by: Jan Kara <jack@suse.cz>
Signed-off-by: Joel Becker <joel.becker@oracle.com>
Signed-off-by: Mark Fasheh <mfasheh@suse.com>
Fix some typos in the xattr annotations.
Signed-off-by: Tao Ma <tao.ma@oracle.com>
Reported-by: Coly Li <coyli@suse.de>
Signed-off-by: Joel Becker <joel.becker@oracle.com>
Signed-off-by: Mark Fasheh <mfasheh@suse.com>
Since now ocfs2 supports empty xattr buckets, we will never remove
the xattr index tree even if all the xattrs are removed, so this
function will never be called. So remove it.
Signed-off-by: Tao Ma <tao.ma@oracle.com>
Signed-off-by: Joel Becker <joel.becker@oracle.com>
Signed-off-by: Mark Fasheh <mfasheh@suse.com>
ocfs2_xattr_block_get() looks up the xattr in a startlingly familiar
way; it's identical to the function ocfs2_xattr_block_find(). Let's just
use the latter in the former.
Signed-off-by: Joel Becker <joel.becker@oracle.com>
Signed-off-by: Mark Fasheh <mfasheh@suse.com>
There are a couple places that get an xattr bucket that may be reading
an existing one or may be allocating a new one. They should specify the
correct journal access mode depending on which case applies.
Signed-off-by: Joel Becker <joel.becker@oracle.com>
Signed-off-by: Mark Fasheh <mfasheh@suse.com>
The ocfs2_xattr_update_xattr_search() function can return an error when
trying to read blocks off of disk. The caller needs to check this error
before using those (possibly invalid) blocks.
Signed-off-by: Joel Becker <joel.becker@oracle.com>
Signed-off-by: Mark Fasheh <mfasheh@suse.com>
If the xattr disk structures are corrupt, return -EIO, not -EFAULT.
Signed-off-by: Joel Becker <joel.becker@oracle.com>
Signed-off-by: Mark Fasheh <mfasheh@suse.com>
The xattr.c code is currently memcmp()ing naked buffer pointers.
Create the OCFS2_IS_VALID_XATTR_BLOCK() macro to match its peers and use
that.
In addition, failed signature checks were returning -EFAULT, which is
completely wrong. Return -EIO.
Signed-off-by: Joel Becker <joel.becker@oracle.com>
Signed-off-by: Mark Fasheh <mfasheh@suse.com>
Make the handler_map array as large as the possible value range to avoid
a fencepost error.
[ Utilize alternate method -- Joel ]
Signed-off-by: Tiger Yang <tiger.yang@oracle.com>
Signed-off-by: Joel Becker <joel.becker@oracle.com>
Signed-off-by: Mark Fasheh <mfasheh@suse.com>
include/linux/xattr.h already defines the xattr prefixes,
so remove the duplicate definitions in xattr.c.
Signed-off-by: Tiger Yang <tiger.yang@oracle.com>
Signed-off-by: Joel Becker <joel.becker@oracle.com>
Signed-off-by: Mark Fasheh <mfasheh@suse.com>
Because we merged the xattr sources into one file, some functions
no longer belong in the header file.
Signed-off-by: Tiger Yang <tiger.yang@oracle.com>
Signed-off-by: Joel Becker <joel.becker@oracle.com>
Signed-off-by: Mark Fasheh <mfasheh@suse.com>
This patch fixes the license in xattr.c and xattr.h.
Signed-off-by: Tiger Yang <tiger.yang@oracle.com>
Signed-off-by: Joel Becker <joel.becker@oracle.com>
Signed-off-by: Mark Fasheh <mfasheh@suse.com>
Using NIPQUAD() with NIPQUAD_FMT, %d.%d.%d.%d or %u.%u.%u.%u
can be replaced with %pI4
Signed-off-by: Harvey Harrison <harvey.harrison@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Nothing uses prepare_write or commit_write. Remove them from the tree
completely.
[akpm@linux-foundation.org: schedule simple_prepare_write() for unexporting]
Signed-off-by: Nick Piggin <npiggin@suse.de>
Cc: Christoph Hellwig <hch@lst.de>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
* git://git.kernel.org/pub/scm/linux/kernel/git/viro/bdev: (66 commits)
[PATCH] kill the rest of struct file propagation in block ioctls
[PATCH] get rid of struct file use in blkdev_ioctl() BLKBSZSET
[PATCH] get rid of blkdev_locked_ioctl()
[PATCH] get rid of blkdev_driver_ioctl()
[PATCH] sanitize blkdev_get() and friends
[PATCH] remember mode of reiserfs journal
[PATCH] propagate mode through swsusp_close()
[PATCH] propagate mode through open_bdev_excl/close_bdev_excl
[PATCH] pass fmode_t to blkdev_put()
[PATCH] kill the unused bsize on the send side of /dev/loop
[PATCH] trim file propagation in block/compat_ioctl.c
[PATCH] end of methods switch: remove the old ones
[PATCH] switch sr
[PATCH] switch sd
[PATCH] switch ide-scsi
[PATCH] switch tape_block
[PATCH] switch dcssblk
[PATCH] switch dasd
[PATCH] switch mtd_blkdevs
[PATCH] switch mmc
...
* get rid of fake struct file/struct dentry in __blkdev_get()
* merge __blkdev_get() and do_open()
* get rid of flags argument of blkdev_get()
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
ocfs2_read_blocks() currently requires the CACHED flag for cached I/O.
However, that's the common case. Let's flip it around and provide an
IGNORE_CACHE flag for the special users. This has the added benefit of
cleaning up the code some (ignore_cache takes on its special meaning
earlier in the loop).
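The flipped default can be pictured with a small userspace sketch: cached
reads need no flag, and only the special callers pass an opt-out. The
demo_ names and the flag value are assumptions for illustration.

```c
#define DEMO_BH_IGNORE_CACHE 0x1 /* assumed flag value, for illustration */

/* Hypothetical cached block; not a kernel structure. */
struct demo_block {
    int cached;     /* nonzero once the contents are in cache */
    int disk_reads; /* how many times we went to "disk" */
};

/* Cached reads are the default; callers opt out with DEMO_BH_IGNORE_CACHE. */
static int demo_read_blocks(struct demo_block *blk, int flags)
{
    int ignore_cache = flags & DEMO_BH_IGNORE_CACHE;

    if (blk->cached && !ignore_cache)
        return 0;      /* served from cache, no I/O */

    blk->disk_reads++; /* this is where I/O would be submitted */
    blk->cached = 1;
    return 0;
}
```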
Signed-off-by: Joel Becker <joel.becker@oracle.com>
Signed-off-by: Mark Fasheh <mfasheh@suse.com>
ocfs2's cached buffer I/O goes through ocfs2_read_block(s)(). dir.c had
a naked wait_on_buffer() to wait for some readahead, but it should
use ocfs2_read_block() instead.
Signed-off-by: Joel Becker <joel.becker@oracle.com>
Signed-off-by: Mark Fasheh <mfasheh@suse.com>
dir.c is the only place using ocfs2_bread(), so let's make it static to
that file.
Signed-off-by: Joel Becker <joel.becker@oracle.com>
Signed-off-by: Mark Fasheh <mfasheh@suse.com>
More than 30 callers of ocfs2_read_block() pass exactly OCFS2_BH_CACHED.
Only six pass a different flag set. Rather than have every caller care,
let's make ocfs2_read_block() take no flags and always do a cached read.
The remaining six places can call ocfs2_read_blocks() directly.
Signed-off-by: Joel Becker <joel.becker@oracle.com>
Signed-off-by: Mark Fasheh <mfasheh@suse.com>
Now that synchronous readers are using ocfs2_read_blocks_sync(), all
callers of ocfs2_read_blocks() are passing an inode. Use it
unconditionally. Since it's there, we don't need to pass the
ocfs2_super either.
Signed-off-by: Joel Becker <joel.becker@oracle.com>
Signed-off-by: Mark Fasheh <mfasheh@suse.com>
The ocfs2_read_blocks() function currently handles sync reads, cached
reads, and sometimes-cached reads. We're going to add some
functionality to it, so first we should simplify it. The uncached,
synchronous reads are much easier to handle as a separate function, so we
introduce ocfs2_read_blocks_sync().
Signed-off-by: Joel Becker <joel.becker@oracle.com>
Signed-off-by: Mark Fasheh <mfasheh@suse.com>
According to Christoph Hellwig's advice, we really don't need
a ->list to handle one xattr's list. Just a map from index to
xattr prefix is enough. And I also refactor the old list method
with the reference from fs/xfs/linux-2.6/xfs_xattr.c and the
xattr list method in btrfs.
Signed-off-by: Tao Ma <tao.ma@oracle.com>
Signed-off-by: Mark Fasheh <mfasheh@suse.com>
According to Christoph Hellwig's advice, the hash value of an EA
is calculated only from its suffix.
Signed-off-by: Tao Ma <tao.ma@oracle.com>
Signed-off-by: Mark Fasheh <mfasheh@suse.com>
Per Christoph Hellwig's suggestion - don't split these up. It's not like we
gained much by having the two tiny files around.
Signed-off-by: Mark Fasheh <mfasheh@suse.com>
i and b_len don't really need to be u64's. Xattr extent lengths should be
limited by the VFS, and then the size of our on-disk length field.
Signed-off-by: Mark Fasheh <mfasheh@suse.com>
As Mark mentioned, removing an empty xattr bucket may be
time-consuming, so this patch lets empty buckets exist
in xattr operations. The modifications include:
1. Remove the bucket and extent record deletion function from
xattr delete.
2. In xattr set:
1) Don't clean the last entry so that if the bucket is empty,
the hash value of the bucket is the hash value of the entry
which is deleted last.
2) During insert, if we meet with an empty bucket, just use the
1st entry.
3. In binary search of xattr bucket, use the bucket hash value(which
stored in the 1st xattr entry) to find the right place.
Signed-off-by: Tao Ma <tao.ma@oracle.com>
Signed-off-by: Mark Fasheh <mfasheh@suse.com>
During the process of xattr insertion, we use binary search
to find the right place and "low" is set to it. But when
there is an xattr which has the same name hash as the inserted
one, low is the wrong value. So set it to the right position.
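A sketch of the corrected search, assuming a sorted array of name hashes and
a hypothetical function name: on an exact hash match, low is set to the
matching index before leaving the loop, so the new entry lands next to its
peer.

```c
#include <stdint.h>

/* Return the index at which an entry with name_hash should be inserted. */
static int mock_find_insert_pos(const uint32_t *hashes, int count,
                                uint32_t name_hash)
{
    int low = 0;
    int high = count - 1;

    while (low <= high) {
        int mid = low + (high - low) / 2;

        if (name_hash > hashes[mid]) {
            low = mid + 1;
        } else if (name_hash < hashes[mid]) {
            high = mid - 1;
        } else {
            low = mid; /* equal hash: insert next to the match */
            break;
        }
    }
    return low;
}
```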
Signed-off-by: Tao Ma <tao.ma@oracle.com>
Signed-off-by: Mark Fasheh <mfasheh@suse.com>
Patch adds check for [no]user_xattr in ocfs2_show_options() that completes
the list of all mount options.
Signed-off-by: Sunil Mushran <sunil.mushran@oracle.com>
Signed-off-by: Mark Fasheh <mfasheh@suse.com>
ocfs2 wants JBD2 for many reasons, not the least of which is that JBD is
limiting our maximum filesystem size.
It's a pretty trivial change. Most functions are just renamed. The
only functional change is moving to Jan's inode-based ordered data mode.
It's better, too.
Because JBD2 reads and writes JBD journals, this is compatible with any
existing filesystem. It can even interact with JBD-based ocfs2 as long
as the journal is formatted for JBD.
We provide a compatibility option so that paranoid people can still use
JBD for the time being. This will go away shortly.
[ Moved call of ocfs2_begin_ordered_truncate() from ocfs2_delete_inode() to
ocfs2_truncate_for_delete(). --Mark ]
Signed-off-by: Joel Becker <joel.becker@oracle.com>
Signed-off-by: Mark Fasheh <mfasheh@suse.com>
Now that ocfs2 limits inode numbers to 32bits, add a mount option to
disable the limit. This parallels XFS. 64bit systems can handle the
larger inode numbers.
[ Added description of inode64 mount option in ocfs2.txt. --Mark ]
Signed-off-by: Joel Becker <joel.becker@oracle.com>
Signed-off-by: Mark Fasheh <mfasheh@suse.com>
ocfs2 inode numbers are block numbers. For any filesystem with less
than 2^32 blocks, this is not a problem. However, when ocfs2 starts
using JBD2, it will be able to support filesystems with more than 2^32
blocks. This would result in inode numbers higher than 2^32.
The problem is that stat(2) can't handle those numbers on 32bit
machines. The simple solution is to have ocfs2 allocate all inodes
below that boundary.
The suballoc code is changed to honor an optional block limit. Only the
inode suballocator sets that limit - all other allocations stay unlimited.
The biggest trick is to grow the inode suballocator beneath that limit.
There's no point in allocating block groups that are above the limit,
then rejecting their elements later on. We want to prevent the inode
allocator from ever having block groups above the limit. This involves
a little gyration with the local alloc code. If the local alloc window
is above the limit, it signals the caller to try the global bitmap but
does not disable the local alloc file (which can be used for other
allocations).
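The optional limit can be sketched roughly as follows. This is an illustration
only: struct alloc_ctx, its max_block field and group_within_limit() are
hypothetical names, and treating a limit of 0 as "unlimited" is an assumption.
The idea is that a block group is rejected up front if any of its blocks would
sit above the limit.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical allocation context: max_block == 0 means "no limit". */
struct alloc_ctx {
    uint64_t max_block;
};

/*
 * True if a block group covering blocks [start, start + len) may be used
 * by this allocator. len must be at least 1.
 */
static bool group_within_limit(const struct alloc_ctx *ac,
                               uint64_t start, uint64_t len)
{
    if (ac->max_block == 0)
        return true;                    /* unlimited allocator */

    /* The last block of the group must not exceed the limit. */
    return start + len - 1 <= ac->max_block;
}
```

An inode allocator would set max_block to UINT32_MAX, while every other
allocator would leave it at 0.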
[ Minor cleanup - removed an ML_NOTICE comment. --Mark ]
Signed-off-by: Joel Becker <joel.becker@oracle.com>
Signed-off-by: Mark Fasheh <mfasheh@suse.com>
In ocfs2_xattr_free_block, we take a cluster lock on xb_alloc_inode while we
have a transaction open. This will deadlock the downconvert thread, so fix
it.
We can clean up how xattr blocks are removed while here - this patch also
moves the mechanism of releasing xattr block (including both value, xattr
tree and xattr block) into this function.
Signed-off-by: Tao Ma <tao.ma@oracle.com>
Signed-off-by: Mark Fasheh <mfasheh@suse.com>
In ocfs2_extend_trans, when we can't extend the current
transaction, we commit it and restart with a new one. So if
the credits we allocated previously haven't been used (the
block wasn't dirtied before our extend), we will not have
enough credits for any future operation (which will cause jbd
to complain and bug out). So check for this and re-extend it.
Signed-off-by: Tao Ma <tao.ma@oracle.com>
Signed-off-by: Mark Fasheh <mfasheh@suse.com>
The original get/put_extent_tree() functions held a reference on
et_root_bh. However, every single caller already has a safe reference,
making the get/put cycle irrelevant.
We change ocfs2_get_*_extent_tree() to ocfs2_init_*_extent_tree(). It
no longer gets a reference on et_root_bh. ocfs2_put_extent_tree() is
removed. Callers now have a simpler init+use pattern.
Signed-off-by: Joel Becker <joel.becker@oracle.com>
Signed-off-by: Mark Fasheh <mfasheh@suse.com>
struct ocfs2_extent_tree_operations provides methods for the different
on-disk btrees in ocfs2. Describing what those methods do is probably a
good idea.
Signed-off-by: Joel Becker <joel.becker@oracle.com>
Signed-off-by: Mark Fasheh <mfasheh@suse.com>
We now have three different kinds of extent trees in ocfs2: inode data
(dinode), extended attributes (xattr_tree), and extended attribute
values (xattr_value). There is a nice abstraction for them,
ocfs2_extent_tree, but it is hidden in alloc.c. All the calling
functions have to pick amongst a varied API and pass in type bits and
often extraneous pointers.
A better way is to make ocfs2_extent_tree a first-class object.
Everyone converts their object to an ocfs2_extent_tree() via the
ocfs2_get_*_extent_tree() calls, then uses the ocfs2_extent_tree for all
tree calls to alloc.c.
This simplifies a lot of callers, making for readability. It also
provides an easy way to add additional extent tree types, as they only
need to be defined in alloc.c with a ocfs2_get_<new>_extent_tree()
function.
Signed-off-by: Joel Becker <joel.becker@oracle.com>
Signed-off-by: Mark Fasheh <mfasheh@suse.com>
A couple places check an extent_tree for a valid inode. We move that
out to add an eo_insert_check() operation. It can be called from
ocfs2_insert_extent() and elsewhere.
We also have the wrapper calls ocfs2_et_insert_check() and
ocfs2_et_sanity_check() ignore NULL ops. That way we don't have to
provide useless operations for xattr types.
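A rough sketch of a wrapper that tolerates a missing operation. The types here
are simplified stand-ins, not the real ocfs2 structures, and all names are
hypothetical.

```c
#include <errno.h>
#include <stddef.h>

struct tree;

struct tree_ops {
    int (*insert_check)(struct tree *t);    /* optional: may be NULL */
};

struct tree {
    const struct tree_ops *ops;
    int has_valid_inode;
};

/* Only the inode-backed tree type has something to check. */
static int dinode_insert_check(struct tree *t)
{
    return t->has_valid_inode ? 0 : -EINVAL;
}

static const struct tree_ops dinode_tree_ops = {
    .insert_check = dinode_insert_check,
};

/* The xattr tree type simply leaves the operation out. */
static const struct tree_ops xattr_tree_ops = {
    .insert_check = NULL,
};

/* Wrapper: a missing operation counts as "nothing to check". */
static int tree_insert_check(struct tree *t)
{
    if (!t->ops->insert_check)
        return 0;
    return t->ops->insert_check(t);
}
```

Callers always go through the wrapper, so they never need to know which tree
types implement the check.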
Signed-off-by: Joel Becker <joel.becker@oracle.com>
Signed-off-by: Mark Fasheh <mfasheh@suse.com>
A caller knows what kind of extent tree they have. There's no reason
they have to call ocfs2_get_extent_tree() with a NULL when they could
just as easily call a specific function to their type of extent tree.
Introduce ocfs2_dinode_get_extent_tree(),
ocfs2_xattr_tree_get_extent_tree(), and
ocfs2_xattr_value_get_extent_tree(). They only take the necessary
arguments, calling into the underlying __ocfs2_get_extent_tree() to do
the real work.
__ocfs2_get_extent_tree() is the old ocfs2_get_extent_tree(), but
without needing any switch-by-type logic.
ocfs2_get_extent_tree() is now a wrapper around the specific calls. It
exists because a couple alloc.c functions can take et_type. This will
go later.
Another benefit is that ocfs2_xattr_value_get_extent_tree() can take a
struct ocfs2_xattr_value_root* instead of void*. This gives us
typechecking where we didn't have it before.
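The typechecking benefit can be illustrated with a small userspace sketch. The
structures are simplified stand-ins rather than the real on-disk layouts, and
every name in it is hypothetical. The generic worker takes a void*, while the
typed entry points accept only the matching structure, so passing the wrong
object draws a compiler diagnostic.

```c
#include <stddef.h>

/* Simplified stand-ins for the on-disk structures (not the real layouts). */
struct extent_list { unsigned int count; };
struct dinode { int flags; struct extent_list list; };
struct xattr_value_root { struct extent_list list; };

struct extent_tree;

struct extent_tree_ops {
    /* Locate the root extent list inside the containing object. */
    struct extent_list *(*root_el)(struct extent_tree *et);
};

struct extent_tree {
    const struct extent_tree_ops *ops;
    void *object;                   /* the structure containing the tree */
    struct extent_list *root_el;
};

static struct extent_list *dinode_root_el(struct extent_tree *et)
{
    return &((struct dinode *)et->object)->list;
}

static struct extent_list *xattr_value_root_el(struct extent_tree *et)
{
    return &((struct xattr_value_root *)et->object)->list;
}

static const struct extent_tree_ops dinode_ops = {
    .root_el = dinode_root_el,
};
static const struct extent_tree_ops xattr_value_ops = {
    .root_el = xattr_value_root_el,
};

/* Generic worker: untyped object, no switch-by-type logic. */
static void init_tree(struct extent_tree *et,
                      const struct extent_tree_ops *ops, void *object)
{
    et->ops = ops;
    et->object = object;
    et->root_el = ops->root_el(et);
}

/* Typed entry points: each takes only the arguments it needs. */
static void dinode_init_tree(struct extent_tree *et, struct dinode *di)
{
    init_tree(et, &dinode_ops, di);
}

static void xattr_value_init_tree(struct extent_tree *et,
                                  struct xattr_value_root *xv)
{
    init_tree(et, &xattr_value_ops, xv);
}
```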
Signed-off-by: Joel Becker <joel.becker@oracle.com>
Signed-off-by: Mark Fasheh <mfasheh@suse.com>
Provide an optional extent_tree_operation to specify the
max_leaf_clusters of an ocfs2_extent_tree. If not provided, the value
is 0 (unlimited).
Signed-off-by: Joel Becker <joel.becker@oracle.com>
Signed-off-by: Mark Fasheh <mfasheh@suse.com>
ocfs2_num_free_extents() re-implements the logic of
ocfs2_get_extent_tree(). Now that ocfs2_get_extent_tree() does not
allocate, let's use it in ocfs2_num_free_extents() to simplify the code.
The inode validation code in ocfs2_num_free_extents() is not needed.
All callers are passing in pre-validated inodes.
Signed-off-by: Joel Becker <joel.becker@oracle.com>
Signed-off-by: Mark Fasheh <mfasheh@suse.com>
The root_el of an ocfs2_extent_tree needs to be calculated from
et->et_object. Make it an operation on et->et_ops.
Signed-off-by: Joel Becker <joel.becker@oracle.com>
Signed-off-by: Mark Fasheh <mfasheh@suse.com>
The 'private' pointer was a way to store off xattr values, which don't
live at a set place in the bh. But the concept of "the object
containing the extent tree" is much more generic. For an inode it's the
struct ocfs2_dinode, for an xattr value it's the value. Let's save off
the 'object' at all times. If NULL is passed to
ocfs2_get_extent_tree(), 'object' is set to bh->b_data.
Signed-off-by: Joel Becker <joel.becker@oracle.com>
Signed-off-by: Mark Fasheh <mfasheh@suse.com>
Rather than allocating a struct ocfs2_extent_tree, just put it on the
stack. Fill it with ocfs2_get_extent_tree() and drop it with
ocfs2_put_extent_tree(). Now the callers don't have to ENOMEM, yet
still safely ref the root_bh.
Signed-off-by: Joel Becker <joel.becker@oracle.com>
Signed-off-by: Mark Fasheh <mfasheh@suse.com>
The members of the ocfs2_extent_tree structure gain a prefix of 'et_'.
All users are updated.
Signed-off-by: Joel Becker <joel.becker@oracle.com>
Signed-off-by: Mark Fasheh <mfasheh@suse.com>
The ocfs2_extent_tree_operations structure gains a field prefix on its
members. The ->eo_sanity_check() operation gains a wrapper function for
completeness. All of the extent tree operation wrappers gain a
consistent name (ocfs2_et_*()).
Signed-off-by: Joel Becker <joel.becker@oracle.com>
Signed-off-by: Mark Fasheh <mfasheh@suse.com>
This patch fixes the following build warnings:
fs/ocfs2/xattr.c: In function 'ocfs2_half_xattr_bucket':
fs/ocfs2/xattr.c:3282: warning: format '%d' expects type 'int', but argument 7 has type 'long int'
fs/ocfs2/xattr.c:3282: warning: format '%d' expects type 'int', but argument 8 has type 'long int'
fs/ocfs2/xattr.c:3282: warning: format '%d' expects type 'int', but argument 7 has type 'long int'
fs/ocfs2/xattr.c:3282: warning: format '%d' expects type 'int', but argument 8 has type 'long int'
fs/ocfs2/xattr.c:3282: warning: format '%d' expects type 'int', but argument 7 has type 'long int'
fs/ocfs2/xattr.c:3282: warning: format '%d' expects type 'int', but argument 8 has type 'long int'
fs/ocfs2/xattr.c: In function 'ocfs2_xattr_set_entry_in_bucket':
fs/ocfs2/xattr.c:4092: warning: format '%d' expects type 'int', but argument 6 has type 'size_t'
fs/ocfs2/xattr.c:4092: warning: format '%d' expects type 'int', but argument 6 has type 'size_t'
fs/ocfs2/xattr.c:4092: warning: format '%d' expects type 'int', but argument 6 has type 'size_t'
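For reference, a small userspace sketch of matching conversion specifiers to
argument types. The function name is hypothetical, and this shows one common
way to silence such warnings, not necessarily the exact change made by the
patch.

```c
#include <stddef.h>
#include <stdio.h>

/* Format a long and a size_t with conversion specifiers that match. */
static int format_sizes(char *buf, size_t len, long offset, size_t size)
{
    /* %ld matches long and %zu matches size_t; %d is wrong for both. */
    return snprintf(buf, len, "offset %ld size %zu", offset, size);
}
```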
Signed-off-by: Mark Fasheh <mfasheh@suse.com>
This patch adds the s_incompat flag for extended attribute support. This
helps us ensure that older versions of Ocfs2 or ocfs2-tools will not be able
to mount a volume with xattr support.
Signed-off-by: Tiger Yang <tiger.yang@oracle.com>
Signed-off-by: Mark Fasheh <mfasheh@suse.com>
In inode removal, we need to iterate all the buckets, remove any
externally-stored EA values and delete the xattr buckets.
Signed-off-by: Tao Ma <tao.ma@oracle.com>
Signed-off-by: Mark Fasheh <mfasheh@suse.com>
Where the previous patches added the ability of list/get xattr in buckets
for ocfs2, this patch enables ocfs2 to store large numbers of EAs.
The original design doc is written by Mark Fasheh, and it can be found in
http://oss.oracle.com/osswiki/OCFS2/DesignDocs/IndexedEATrees. I only had to
make small modifications to it.
First, because the bucket size is 4K, a new field named xh_free_start is added
in ocfs2_xattr_header to indicate the next valid name/value offset in a bucket.
It is used when we store new EA name/value. With this field, we can find the
place more quickly and what's more, we don't need to sort the name/value every
time to let the last entry indicate the next unused space. This makes the
insert operation more efficient for blocksizes smaller than 4k.
Because of the new xh_free_start, another field named as xh_name_value_len is
also added in ocfs2_xattr_header. It records the total length of all the
name/values in the bucket. We need this so that we can check it and defragment
the bucket if there is not enough contiguous free space.
An xattr insertion looks like this:
1. xattr_index_block_find: find the right bucket by the name_hash, say bucketA.
2. check whether there is enough space in bucketA. If yes, insert it directly
and modify xh_free_start and xh_name_value_len accordingly. If not, check
xh_name_value_len to see whether we can store this by defragmenting the bucket.
If yes, defragment it and go on with the insertion.
3. If defragmentation doesn't work, check whether there is a new empty bucket in
the clusters within this extent record. If yes, init the new bucket and move
all the buckets after bucketA one by one to the next bucket. Move half of the
entries in bucketA to the next bucket and go on with the insertion.
4. If there is no new bucket, grow the extent tree.
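The space check in step 2 can be sketched with a toy model. This is only an
illustration under stated assumptions: the header and entry sizes are made up,
name/value data is assumed to be packed downward from the end of the bucket
(so free_start is the lowest offset in use), and the rule that a name/value
pair may not straddle a block is ignored. plan_insert() and the enum are
hypothetical names.

```c
#include <stddef.h>
#include <stdint.h>

#define BUCKET_SIZE 4096u   /* bucket size stated in the text */
#define HEADER_SIZE 56u     /* hypothetical header size */
#define ENTRY_SIZE  16u     /* hypothetical size of one index entry */

struct bucket {
    uint16_t count;             /* number of entries in the index */
    uint16_t free_start;        /* lowest offset used by name/value data */
    uint16_t name_value_len;    /* total bytes of live name/value data */
};

enum insert_action {
    INSERT_DIRECT,              /* enough contiguous free space */
    INSERT_AFTER_DEFRAG,        /* enough space once holes are squeezed out */
    INSERT_NEEDS_SPLIT,         /* bucket is genuinely full */
};

static enum insert_action plan_insert(const struct bucket *b, uint16_t nv_len)
{
    /* End of the index once the new entry has been added to it. */
    uint32_t index_end = HEADER_SIZE + ((uint32_t)b->count + 1) * ENTRY_SIZE;

    /* Contiguous gap between the index and the lowest name/value. */
    if (b->free_start >= index_end && b->free_start - index_end >= nv_len)
        return INSERT_DIRECT;

    /* Would everything fit if the live name/values were packed? */
    if (index_end + b->name_value_len + nv_len <= BUCKET_SIZE)
        return INSERT_AFTER_DEFRAG;

    return INSERT_NEEDS_SPLIT;
}
```

Keeping name_value_len alongside free_start is what makes the second test
possible without walking every entry.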
As for xattr deletion, we will delete an xattr bucket when all its xattrs
are removed and move all the buckets after it to the previous one. When all
the xattr buckets in an extent record are freed, free this extent record
from ocfs2_xattr_tree.
Signed-off-by: Tao Ma <tao.ma@oracle.com>
Signed-off-by: Mark Fasheh <mfasheh@suse.com>
In xattr bucket, we want to limit the maximum size of a btree leaf,
otherwise we'll lose the benefits of hashing because we'll have to search
large leaves.
So add a new field in ocfs2_extent_tree which indicates the maximum leaf cluster
size we want so that we can prevent ocfs2_insert_extent() from merging the leaf
record even if it is contiguous with an adjacent record.
Other btree types are not affected by this change.
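A simplified sketch of how such a limit could gate merging. The record layout
is reduced to a logical offset and a length, physical contiguity is not
modelled, and can_merge() is a hypothetical name. A limit of 0 is taken to
mean "unlimited", which is an assumption of this sketch.

```c
#include <stdbool.h>
#include <stdint.h>

/* Reduced extent record: logical start and length, both in clusters. */
struct extent_rec {
    uint32_t cpos;
    uint32_t clusters;
};

/* max_leaf_clusters == 0 means the tree type has no limit. */
static bool can_merge(const struct extent_rec *left,
                      const struct extent_rec *right,
                      uint32_t max_leaf_clusters)
{
    /* Records must be adjacent to be merge candidates at all. */
    if (left->cpos + left->clusters != right->cpos)
        return false;

    /* Refuse the merge if the combined record would exceed the limit. */
    if (max_leaf_clusters &&
        (uint64_t)left->clusters + right->clusters > max_leaf_clusters)
        return false;

    return true;
}
```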
Signed-off-by: Tao Ma <tao.ma@oracle.com>
Signed-off-by: Mark Fasheh <mfasheh@suse.com>
Add code to lookup a given extended attribute in the xattr btree. Lookup
follows this general scheme:
1. Use ocfs2_xattr_get_rec to find the xattr extent record
2. Find the xattr bucket within the extent which may contain this xattr
3. Iterate the bucket to find the xattr. In ocfs2_xattr_block_get(), we need
to recalculate the block offset and name offset for the right position of
name/value.
Signed-off-by: Tao Ma <tao.ma@oracle.com>
Signed-off-by: Mark Fasheh <mfasheh@suse.com>
Ocfs2 breaks up xattr index tree leaves into 4k regions, called buckets.
Attributes are stored within a given bucket, depending on hash value.
After a discussion with Mark, we decided that the per-bucket index
(xe_entry[]) would only exist in the 1st block of a bucket. Likewise,
name/value pairs will not straddle more than one block. This allows the
majority of operations to work directly on the buffer heads in a leaf block.
This patch adds code to iterate the buckets in an EA. A new abstraction,
ocfs2_xattr_bucket, is added. It records the bhs in this bucket and
ocfs2_xattr_header. This keeps the code neat, improving readability.
Signed-off-by: Tao Ma <tao.ma@oracle.com>
Signed-off-by: Mark Fasheh <mfasheh@suse.com>
When necessary, an ocfs2_xattr_block will embed an ocfs2_extent_list to
store large numbers of EAs. This patch adds a new type in
ocfs2_extent_tree_type and adds the implementation so that we can re-use the
b-tree code to handle the storage of many EAs.
Signed-off-by: Tao Ma <tao.ma@oracle.com>
Signed-off-by: Mark Fasheh <mfasheh@suse.com>
This patch implements storing extended attributes either in the inode or in a
single external block. We only store EAs in-inode when blocksize > 512 or that
inode block has free space for it. When an EA's value is larger than 80
bytes, we will store the value via b-tree outside inode or block.
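The value placement rule can be written down as a tiny sketch. The 80-byte
threshold comes from the text above; the macro, enum and function names are
hypothetical.

```c
#include <stddef.h>

/* Threshold quoted in the text; the macro name is made up. */
#define XATTR_INLINE_VALUE_MAX 80

enum value_storage {
    VALUE_INLINE,   /* stored next to the name, in the inode or block */
    VALUE_BTREE,    /* stored in clusters tracked by a b-tree */
};

/* Values larger than the threshold go out to a b-tree. */
static enum value_storage pick_value_storage(size_t value_len)
{
    return value_len > XATTR_INLINE_VALUE_MAX ? VALUE_BTREE : VALUE_INLINE;
}
```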
Signed-off-by: Tiger Yang <tiger.yang@oracle.com>
Signed-off-by: Mark Fasheh <mfasheh@suse.com>
Add the structures and helper functions we want for handling inline extended
attributes. We also update the inline-data handlers so that they properly
function in the event that we have both inline data and inline attributes
sharing an inode block.
Signed-off-by: Tiger Yang <tiger.yang@oracle.com>
Signed-off-by: Mark Fasheh <mfasheh@suse.com>
Add some thin wrappers around ocfs2_insert_extent() for each of the 3
different btree types, ocfs2_dinode_insert_extent(),
ocfs2_xattr_value_insert_extent() and ocfs2_xattr_tree_insert_extent(). The
last is for the xattr index btree, which will be used in a followup patch.
All the old callers in file.c etc will call ocfs2_dinode_insert_extent(),
while the other two handle the xattr cases. The initialization of the extent
tree is handled by these functions.
When storing an xattr value which is too large, we will allocate some clusters
for it, and here ocfs2_extent_list and ocfs2_extent_rec will also be used. In
order to re-use the b-tree operation code, a new parameter named "private"
is added into ocfs2_extent_tree and it is used to indicate the root of
ocfs2_extent_list. The reason is that we can't deduce the root from the
buffer_head now. It may be in an inode, an ocfs2_xattr_block or even worse,
in any place in an ocfs2_xattr_bucket.
Signed-off-by: Tao Ma <tao.ma@oracle.com>
Signed-off-by: Mark Fasheh <mfasheh@suse.com>
The old uptodate code only handles removing one buffer_head from an
ocfs2 inode's buffer cache. With xattr clusters, we may need to remove
multiple buffer_heads at a time.
Signed-off-by: Tao Ma <tao.ma@oracle.com>
Signed-off-by: Mark Fasheh <mfasheh@suse.com>
Ocfs2 uses a very flexible structure for storing extended attributes on
disk. A small number of attributes are stored directly in the inode block - up
to 256 bytes worth. If that fills up, attributes are also stored in an
external block, linked to from the inode block. That block can in turn
expand to a btree, capable of storing large numbers of attributes.
Individual attribute values are stored inline if they're small enough
(currently about 80 bytes, this can be changed though), and otherwise are
expanded to a btree. The theoretical limit to the size of an individual
attribute is about the same as an inode, though the kernel's upper bound on
the size of an attribute's data is far smaller.
Signed-off-by: Tao Ma <tao.ma@oracle.com>
Signed-off-by: Mark Fasheh <mfasheh@suse.com>
Factor out the non-inode specifics of ocfs2_do_extend_allocation() into a more generic
function, ocfs2_do_cluster_allocation(). ocfs2_do_extend_allocation calls
ocfs2_do_cluster_allocation() now, but the latter can be used for other
btree types as well.
Signed-off-by: Tao Ma <tao.ma@oracle.com>
Signed-off-by: Mark Fasheh <mfasheh@suse.com>
In the old extent tree operation, we assume that we
are using the ocfs2_extent_list in ocfs2_dinode as the tree root.
As xattr will also use ocfs2_extent_list to store large values
for an xattr entry, we refactor the tree operation so that xattr
can use it directly.
The refactoring includes 4 steps:
1. Abstract set/get of last_eb_blk and update_clusters since they may
be stored in different location for dinode and xattr.
2. Add a new structure named ocfs2_extent_tree to indicate the
extent tree the operation will work on.
3. Remove all the use of fe_bh and di, use root_bh and root_el in
extent tree instead. So now all the fe_bh is replaced with
et->root_bh, el with root_el accordingly.
4. Make ocfs2_lock_allocators generic. Currently it can only be used
for file extend allocation. But the whole function is useful when we want
to store large EAs.
Note: This patch doesn't touch ocfs2_commit_truncate() since it is not used
for anything other than truncate inode data btrees.
Signed-off-by: Tao Ma <tao.ma@oracle.com>
Signed-off-by: Mark Fasheh <mfasheh@suse.com>
ocfs2_extend_meta_needed(), ocfs2_calc_extend_credits() and
ocfs2_reserve_new_metadata() are all useful for extent tree operations. But
they are all limited to an inode btree because they use a struct
ocfs2_dinode parameter. Change their parameter to struct ocfs2_extent_list
(the part of an ocfs2_dinode they actually use) so that the xattr btree code
can use these functions.
Signed-off-by: Tao Ma <tao.ma@oracle.com>
Signed-off-by: Mark Fasheh <mfasheh@suse.com>
ocfs2_num_free_extents() is used to find the number of free extent records
in an inode btree. Hence, it takes an "ocfs2_dinode" parameter. We want to
use this for extended attribute trees in the future, so genericize the
interface to take a buffer head. A future patch will allow that buffer_head
to contain any structure rooting an ocfs2 btree.
Signed-off-by: Tao Ma <tao.ma@oracle.com>
Signed-off-by: Mark Fasheh <mfasheh@suse.com>
A per-mount debugfs file, "local_alloc", is created which when read will
expose live state of the node's local alloc file. Performance impact is
minimal, only a bit of memory overhead per mount point. Still, the code is
hidden behind CONFIG_OCFS2_FS_STATS. This feature will help us debug
local alloc performance problems on a live system.
Signed-off-by: Mark Fasheh <mfasheh@suse.com>
Ocfs2's local allocator disables itself for the duration of a mount point
when it has trouble allocating a large enough area from the primary bitmap.
That can cause performance problems, especially for disks which were only
temporarily full or fragmented. This patch allows for the allocator to
shrink its window first, before being disabled. Later, it can also be
re-enabled so that any performance drop is minimized.
To do this, we allow the value of osb->local_alloc_bits to be shrunk when
needed. The default value is recorded in a mostly read-only variable so that
we can re-initialize when required.
Locking had to be updated so that we could protect changes to
local_alloc_bits. Mostly this involves protecting various local alloc values
with the osb spinlock. A new state is also added, OCFS2_LA_THROTTLED, which
is used when the local allocator has shrunk, but is not disabled. If the
available space dips below 1 megabyte, the local alloc file is disabled. In
either case, local alloc is re-enabled 30 seconds after the event, or when
an appropriate amount of bits is seen in the primary bitmap.
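The state changes described above can be sketched as a small state machine.
This is not the ocfs2 implementation: the struct, the function names and the
LA_* constants are hypothetical, halving the window on each failure is an
assumption, and the 1 megabyte floor is modelled as a minimum window size
(min_bits) supplied by the caller.

```c
#include <stdint.h>

enum la_state { LA_ENABLED, LA_THROTTLED, LA_DISABLED };

struct local_alloc {
    enum la_state state;
    uint32_t bits;          /* current window size, in clusters */
    uint32_t default_bits;  /* window size chosen at mount time */
    uint32_t min_bits;      /* number of clusters in 1 megabyte */
};

/* Called when the primary bitmap cannot supply a window of la->bits. */
static void la_alloc_failed(struct local_alloc *la)
{
    uint32_t smaller = la->bits / 2;    /* halving is an assumption */

    if (smaller < la->min_bits) {
        la->state = LA_DISABLED;        /* too small to be worth keeping */
        return;
    }
    la->bits = smaller;
    la->state = LA_THROTTLED;           /* shrunk, but still in use */
}

/* Called after the timeout, or when enough free bits are seen again. */
static void la_reenable(struct local_alloc *la)
{
    la->bits = la->default_bits;        /* restore the mount-time value */
    la->state = LA_ENABLED;
}
```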
Signed-off-by: Mark Fasheh <mfasheh@suse.com>
Do this instead of tracking absolute local alloc size. This avoids
needless re-calculation of bits from bytes in localalloc.c. Additionally,
the value is now in a more natural unit for internal file system bitmap
work.
Signed-off-by: Mark Fasheh <mfasheh@suse.com>
This is actually pretty easy since fs/dlm already handles the bulk of the
work. The Ocfs2 userspace cluster stack module already uses fs/dlm as the
underlying lock manager, so I only had to add the right calls.
Cluster-aware POSIX locks ("plocks") can be turned off by the same means as
UNIX locks - mount with 'noflocks', or create a local-only Ocfs2 volume.
Internally, the file system uses two sets of file_operations, depending on
whether cluster aware plocks are required. This turns out to be easier than
implementing local-only versions of ->lock.
Signed-off-by: Mark Fasheh <mfasheh@suse.com>
This is a much better version of a previous patch to make the parser
tables constant. Rather than changing the typedef, we put the "const" in
all the various places where it's required, allowing the __initconst
exception for nfsroot which was the cause of the previous trouble.
This was posted for review some time ago and I believe it's been in -mm
since then.
Signed-off-by: Steven Whitehouse <swhiteho@redhat.com>
Cc: Alexander Viro <aviro@redhat.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Plug ocfs2 into ->fiemap. Some portions of ocfs2_get_clusters() had to be
refactored so that the extent cache can be skipped in favor of going
directly to the on-disk records. This makes it easier for us to determine
which extent is the last one in the btree. Also, I'm not sure we want to be
caching fiemap lookups anyway as they're not directly related to data
read/write.
Signed-off-by: Mark Fasheh <mfasheh@suse.com>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
Cc: ocfs2-devel@oss.oracle.com
Cc: linux-fsdevel@vger.kernel.org
ocfs2 will become read-only if we try to read bytes past
the end of i_size. This can be easily reproduced by the following steps:
1. mkfs an ocfs2 volume with bs=4k cs=4k and nosparse.
2. create a small file (say less than 100 bytes); the file will be
allocated 1 cluster.
3. read 8196 bytes from the kernel using O_DIRECT which exceeds the limit.
4. The ocfs2 volume becomes read-only and dmesg shows:
OCFS2: ERROR (device sda13): ocfs2_direct_IO_get_blocks:
Inode 66010 has a hole at block 1
File system is now read-only due to the potential of on-disk corruption.
Please run fsck.ocfs2 once the file system is unmounted.
So suppress the ERROR message.
Signed-off-by: Tao Ma <tao.ma@oracle.com>
Signed-off-by: Mark Fasheh <mfasheh@suse.com>
The ocfs2_stack_driver_request() function failed to increment the
refcount of an already-active stack. It only did the increment on the
first reference. Whoops.
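A simplified userspace sketch of the corrected logic. The struct, the global
and stack_request() are hypothetical names and the real function does more
than this; the point is only that both branches must account for the new
reference.

```c
#include <stddef.h>

struct stack_plugin {
    const char *name;
    unsigned int count;     /* number of users of this stack */
};

static struct stack_plugin *active_stack;

/* Returns 0 on success, -1 if a different stack is already active. */
static int stack_request(struct stack_plugin *p)
{
    if (active_stack) {
        if (active_stack != p)
            return -1;
        /* Already active: the increment the buggy version skipped. */
        active_stack->count++;
        return 0;
    }

    /* First reference: activate the stack. */
    active_stack = p;
    p->count = 1;
    return 0;
}
```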
Signed-off-by: Joel Becker <joel.becker@oracle.com>
Tested-by: Marcos Matsunaga <marcos.matsunaga@oracle.com>
Signed-off-by: Mark Fasheh <mfasheh@suse.com>
We were setting i_blocks based on allocation before the extent insert, which
is wrong as the value is a calculation based on ip_clusters which gets
updated as a result of the insert. This patch moves the line in question
to just after the call to ocfs2_insert_extent().
Without this fix, inline directories were temporarily having an i_blocks
value of zero immediately after expansion to extents.
Reported-and-tested-by: Tristan Ye <tristan.ye@oracle.com>
Signed-off-by: Mark Fasheh <mfasheh@suse.com>
When we fail to insert extent in ocfs2_expand_inline_dir(), we should go to
out_commit, not out.
Signed-off-by: Tao Ma <tao.ma@oracle.com>
Signed-off-by: Mark Fasheh <mfasheh@suse.com>
This fixes a bug introduced with 539d8264093560b917ee3afe4c7f74e5da09d6a5:
[PATCH 2/2] ocfs2: Fix race between mount and recovery
ocfs2_mark_dead_nodes() was reading journal inodes while holding the
spinlock protecting our in-memory recovery state. The fix is very simple -
the disk state is protected by a cluster lock that's already held, so we
just move the spinlock down past the read.
Reviewed-by: Joel Becker <joel.becker@oracle.com>
Signed-off-by: Mark Fasheh <mfasheh@suse.com>
ocfs2/cluster/netdebug.c: fix warning
fs/ocfs2/cluster/netdebug.c:154: warning: format '%lu' expects
type 'long unsigned int', but argument 17 has type 'suseconds_t'
Signed-off-by: Alexander Beregalov <a.beregalov@gmail.com>
Signed-off-by: Mark Fasheh <mfasheh@suse.com>
Commit 0f475b2abe (ocfs2/net: Silence build
warnings) made sense as far as it fixed compile warnings, but it was not
required that it made the functions global.
Signed-off-by: Adrian Bunk <bunk@kernel.org>
Signed-off-by: Mark Fasheh <mfasheh@suse.com>
The mutex is released on a successful return, so it would seem that it
should be released on an error return as well.
The semantic patch that finds this problem is as follows:
(http://www.emn.fr/x-info/coccinelle/)
// <smpl>
@@
expression l;
@@
mutex_lock(l);
... when != mutex_unlock(l)
when any
when strict
(
if (...) { ... when != mutex_unlock(l)
+ mutex_unlock(l);
return ...;
}
|
mutex_unlock(l);
)
// </smpl>
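In plain C, the usual shape of the fix is a single unlock label that both the
success and the error path pass through. The sketch below uses a toy lock (a
flag) so that it stays self-contained; all names are hypothetical, and kernel
code would use mutex_lock()/mutex_unlock() instead.

```c
#include <errno.h>
#include <stdbool.h>

/* Toy stand-in for a mutex, just enough to observe lock state. */
struct toy_mutex { bool locked; };

static void toy_lock(struct toy_mutex *m)   { m->locked = true; }
static void toy_unlock(struct toy_mutex *m) { m->locked = false; }

static int update_value(struct toy_mutex *m, int *value, int new_value)
{
    int ret = 0;

    toy_lock(m);

    if (new_value < 0) {
        ret = -EINVAL;
        goto out_unlock;        /* error path must not return directly */
    }
    *value = new_value;

out_unlock:
    toy_unlock(m);              /* reached on success and on error */
    return ret;
}
```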
Signed-off-by: Julia Lawall <julia@diku.dk>
Signed-off-by: Mark Fasheh <mfasheh@suse.com>
This patch fixes an oops that is reproduced when one races writes to a mmap-ed
region with another process truncating the file.
Signed-off-by: Sunil Mushran <sunil.mushran@oracle.com>
Signed-off-by: Mark Fasheh <mfasheh@suse.com>
As the fs recovery is asynchronous, there is a small chance that another
node can mount (and thus recover) the slot before the recovery thread
gets to it.
If this happens, the recovery thread will block indefinitely on the
journal/slot lock as that lock will be held for the duration of the mount
(by design) by the node assigned to that slot.
The solution implemented is to keep track of the journal replays using
a recovery generation in the journal inode, which will be incremented by the
thread replaying that journal. The recovery thread, before attempting the
blocking lock on the journal/slot lock, will compare the generation on disk
with what it has cached and skip recovery if it does not match.
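The generation check can be sketched as follows. This is a userspace
illustration with hypothetical names, not the ocfs2 code: whoever replays the
journal bumps the generation, and a recovery thread whose cached value no
longer matches knows the slot has already been handled.

```c
#include <stdbool.h>
#include <stdint.h>

/* Reduced stand-in for the journal inode's on-disk state. */
struct journal_inode {
    uint32_t recovery_generation;
};

/* Replaying the journal increments the on-disk generation. */
static void replay_journal(struct journal_inode *ji)
{
    /* ... the actual replay would happen here ... */
    ji->recovery_generation++;
}

/* Recovery is still needed only if nobody replayed the journal meanwhile. */
static bool slot_needs_recovery(uint32_t cached_gen,
                                const struct journal_inode *ji)
{
    return cached_gen == ji->recovery_generation;
}
```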
This bug appears to have been inadvertently introduced during the mount/umount
vote removal by mainline commit 34d024f843. In the
mount voting scheme, the messaging would indirectly indicate that the slot
was being recovered.
Signed-off-by: Sunil Mushran <sunil.mushran@oracle.com>
Signed-off-by: Mark Fasheh <mfasheh@suse.com>
This patch renames the ij_pad to ij_recovery_generation in struct ocfs2_dinode.
This will be used to keep count of journal replays after an unclean shutdown.
Signed-off-by: Sunil Mushran <sunil.mushran@oracle.com>
Signed-off-by: Mark Fasheh <mfasheh@suse.com>
* kill nameidata * argument; map the 3 bits in ->flags anybody cares
about to new MAY_... ones and pass with the mask.
* kill redundant gfs2_iop_permission()
* sanitize ecryptfs_permission()
* fix remaining places where ->permission() instances might barf on new
MAY_... found in mask.
The obvious next target in that direction is permission(9)
folded fix for nfs_permission() breakage from Miklos Szeredi <mszeredi@suse.cz>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
Kmem cache passed to constructor is only needed for constructors that are
themselves multiplexers. Nobody uses this "feature", nor does anybody use
the passed kmem cache in a non-trivial way, so pass only a pointer to the object.
Non-trivial places are:
arch/powerpc/mm/init_64.c
arch/powerpc/mm/hugetlbpage.c
This is flag day, yes.
Signed-off-by: Alexey Dobriyan <adobriyan@gmail.com>
Acked-by: Pekka Enberg <penberg@cs.helsinki.fi>
Acked-by: Christoph Lameter <cl@linux-foundation.org>
Cc: Jon Tollefson <kniht@linux.vnet.ibm.com>
Cc: Nick Piggin <nickpiggin@yahoo.com.au>
Cc: Matt Mackall <mpm@selenic.com>
[akpm@linux-foundation.org: fix arch/powerpc/mm/hugetlbpage.c]
[akpm@linux-foundation.org: fix mm/slab.c]
[akpm@linux-foundation.org: fix ubifs]
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
The configfs operations ->make_item() and ->make_group() currently
return a new item/group. A return of NULL signifies an error. Because
of this, -ENOMEM is the only return code bubbled up the stack.
Multiple folks have requested the ability to return specific error codes
when these operations fail. This patch adds that ability by changing the
->make_item/group() ops to return ERR_PTR() values. These errors are
bubbled up appropriately. NULL returns are changed to -ENOMEM for
compatibility.
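For readers unfamiliar with the convention, here is a userspace
re-implementation of the ERR_PTR() idea. The helpers mirror the kernel's
names but are written from scratch for this sketch, and make_item() is a
hypothetical stand-in, not the configfs operation itself.

```c
#include <errno.h>
#include <stdint.h>
#include <stdlib.h>

#define MAX_ERRNO 4095

/* Encode a negative errno value in a pointer. */
static inline void *ERR_PTR(long error)
{
    return (void *)(intptr_t)error;
}

/* Recover the errno value from an error pointer. */
static inline long PTR_ERR(const void *ptr)
{
    return (long)(intptr_t)ptr;
}

/* Error pointers occupy the top MAX_ERRNO values of the address space. */
static inline int IS_ERR(const void *ptr)
{
    return (uintptr_t)ptr >= (uintptr_t)-MAX_ERRNO;
}

struct item { int id; };

/* Hypothetical make_item(): reports why it failed instead of NULL. */
static struct item *make_item(const char *name)
{
    struct item *it;

    if (!name || !name[0])
        return ERR_PTR(-EINVAL);    /* bad name, not an allocation failure */

    it = malloc(sizeof(*it));
    if (!it)
        return ERR_PTR(-ENOMEM);

    it->id = 1;
    return it;
}
```

A caller tests the result with IS_ERR() and passes PTR_ERR() up the stack, so
-EINVAL and -ENOMEM are no longer conflated.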
Also updated are the in-kernel users of configfs.
This is a rework of reverted commit 11c3b79218.
Signed-off-by: Joel Becker <joel.becker@oracle.com>
* 'upstream-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/mfasheh/ocfs2:
[PATCH] ocfs2: fix oops in mmap_truncate testing
configfs: call drop_link() to cleanup after create_link() failure
configfs: Allow ->make_item() and ->make_group() to return detailed errors.
configfs: Fix failing mkdir() making racing rmdir() fail
configfs: Fix deadlock with racing rmdir() and rename()
configfs: Make configfs_new_dirent() return error code instead of NULL
configfs: Protect configfs_dirent s_links list mutations
configfs: Introduce configfs_dirent_lock
ocfs2: Don't snprintf() without a format.
ocfs2: Fix CONFIG_OCFS2_DEBUG_FS #ifdefs
ocfs2/net: Silence build warnings on sparc64
ocfs2: Handle error during journal load
ocfs2: Silence an error message in ocfs2_file_aio_read()
ocfs2: use simple_read_from_buffer()
ocfs2: fix printk format warnings with OCFS2_FS_STATS=n
[PATCH 2/2] ocfs2: Instrument fs cluster locks
[PATCH 1/2] ocfs2: Add CONFIG_OCFS2_FS_STATS config option
This patch fixes a mmap_truncate bug which was found by ocfs2 test suite.
In an ocfs2 cluster with more than 1 node, run the program mmap_truncate, which races
mmap writes and truncates from multiple processes. While the test is
running, a stat from another node forces writeout, causing an oops in
ocfs2_get_block() because it sees a buffer to write which isn't allocated.
This patch fixes the bug by clearing the dirty and uptodate bits in the buffer,
leaving the buffer unmapped, and returning.
The fix was suggested by Mark Fasheh, and I coded up the patch.
Signed-off-by: Coly Li <coyli@suse.de>
Signed-off-by: Mark Fasheh <mfasheh@suse.com>
The configfs operations ->make_item() and ->make_group() currently
return a new item/group. A return of NULL signifies an error. Because
of this, -ENOMEM is the only return code bubbled up the stack.
Multiple folks have requested the ability to return specific error codes
when these operations fail. This patch adds that ability by changing the
->make_item/group() ops to return an int.
Also updated are the in-kernel users of configfs.
Signed-off-by: Joel Becker <joel.becker@oracle.com>
Some system files are per-slot. Their names include the slot number.
ocfs2_sprintf_system_inode_name() uses the system inode definitions to
fill in the slot number with snprintf().
For global system files, there is no node number, and the name was
printed as a format with no arguments. -Wformat-nonliteral and
-Wformat-security don't like this. Instead, use a static "%s" format
and the name as the argument.
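A minimal userspace illustration of the pattern, with a hypothetical helper
name. Passing the name as an argument to a constant "%s" format means any '%'
characters in it are copied rather than interpreted.

```c
#include <stdio.h>

/* Copy `name` into `buf` without ever treating it as a format string. */
static int copy_name(char *buf, size_t len, const char *name)
{
    /* snprintf(buf, len, name) would interpret '%' sequences in name. */
    return snprintf(buf, len, "%s", name);
}
```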
Signed-off-by: Joel Becker <joel.becker@oracle.com>
A couple places use OCFS2_DEBUG_FS where they really mean
CONFIG_OCFS2_DEBUG_FS.
Reported-by: Robert P. J. Day <rpjday@crashcourse.ca>
Signed-off-by: Joel Becker <joel.becker@oracle.com>
suseconds_t is type long on most arches except sparc64 where it is type int.
This patch silences the following warnings that are generated when building
on it.
netdebug.c: In function 'nst_seq_show':
netdebug.c:152: warning: format '%lu' expects type 'long unsigned int', but argument 13 has type 'suseconds_t'
netdebug.c:152: warning: format '%lu' expects type 'long unsigned int', but argument 15 has type 'suseconds_t'
netdebug.c:152: warning: format '%lu' expects type 'long unsigned int', but argument 17 has type 'suseconds_t'
netdebug.c: In function 'sc_seq_show':
netdebug.c:332: warning: format '%lu' expects type 'long unsigned int', but argument 19 has type 'suseconds_t'
netdebug.c:332: warning: format '%lu' expects type 'long unsigned int', but argument 21 has type 'suseconds_t'
netdebug.c:332: warning: format '%lu' expects type 'long unsigned int', but argument 23 has type 'suseconds_t'
netdebug.c:332: warning: format '%lu' expects type 'long unsigned int', but argument 25 has type 'suseconds_t'
netdebug.c:332: warning: format '%lu' expects type 'long unsigned int', but argument 27 has type 'suseconds_t'
netdebug.c:332: warning: format '%lu' expects type 'long unsigned int', but argument 29 has type 'suseconds_t'
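One common way to silence this class of warning is an explicit cast, so the argument type matches the format on every architecture. The sketch below is user-space code with a hypothetical helper name, format_timeval(); it is not the netdebug.c change itself.

```c
#include <assert.h>
#include <stdio.h>
#include <sys/time.h>

/* Print a timeval; the casts make both arguments match %lu everywhere. */
int format_timeval(char *buf, size_t len, const struct timeval *tv)
{
	assert(buf != NULL && tv != NULL);

	return snprintf(buf, len, "%lu.%06lu",
			(unsigned long)tv->tv_sec,
			(unsigned long)tv->tv_usec);
}
```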
Signed-off-by: Sunil Mushran <sunil.mushran@oracle.com>
Signed-off-by: Mark Fasheh <mfasheh@suse.com>
This patch ensures the mount fails if the fs is unable to load the journal.
Signed-off-by: Wengang Wang <wen.gang.wang@oracle.com>
Acked-by: Sunil Mushran <sunil.mushran@oracle.com>
Signed-off-by: Mark Fasheh <mfasheh@suse.com>
This patch silences an EINVAL error message in ocfs2_file_aio_read()
that is always due to a user error.
Signed-off-by: Sunil Mushran <sunil.mushran@oracle.com>
Signed-off-by: Mark Fasheh <mfasheh@suse.com>
Signed-off-by: Akinobu Mita <akinobu.mita@gmail.com>
Acked-by: Joel Becker <joel.becker@oracle.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Mark Fasheh <mfasheh@suse.com>
Fix printk format warnings when OCFS2_FS_STATS=n:
linux-next-20080528/fs/ocfs2/dlmglue.c: In function 'ocfs2_dlm_seq_show':
linux-next-20080528/fs/ocfs2/dlmglue.c:2623: warning: format '%llu' expects type 'long long unsigned int', but argument 3 has type 'int'
linux-next-20080528/fs/ocfs2/dlmglue.c:2623: warning: format '%llu' expects type 'long long unsigned int', but argument 4 has type 'int'
linux-next-20080528/fs/ocfs2/dlmglue.c:2623: warning: format '%llu' expects type 'long long unsigned int', but argument 7 has type 'int'
linux-next-20080528/fs/ocfs2/dlmglue.c:2623: warning: format '%llu' expects type 'long long unsigned int', but argument 8 has type 'int'
Signed-off-by: Randy Dunlap <randy.dunlap@oracle.com>
Signed-off-by: Mark Fasheh <mfasheh@suse.com>
This patch adds code to track the number of times the fs takes
various cluster locks as well as the times associated with it.
The information is made available to users via debugfs.
This patch was originally written by Jan Kara <jack@suse.cz>.
Signed-off-by: Sunil Mushran <sunil.mushran@oracle.com>
Signed-off-by: Mark Fasheh <mfasheh@suse.com>
The stack-glue merge changed the way we use flags in dlmglue in that we now
use the fs/dlm equivalents. Unfortunately, a merge error left the new flock
code only partially updated. This took a while to show up though, because
the lock level constants are actually identical between o2dlm and fs/dlm.
The *_CONVERT and *_NOQUEUE flags have different values though, which is
eventually causing a crash in flags_to_o2dlm().
Signed-off-by: Mark Fasheh <mfasheh@suse.com>
Patch fixes a race that can result in an oops while adding a
lockres to the dlm lockres tracking list.
Bug introduced by mainline commit 29576f8bb5.
Signed-off-by: Sunil Mushran <sunil.mushran@oracle.com>
Signed-off-by: Mark Fasheh <mfasheh@suse.com>
The ->hangup() call was only used to execute ocfs2_hb_ctl. Now that
the generic stack glue code handles this, the underlying stack drivers
don't need to know about it.
Signed-off-by: Joel Becker <joel.becker@oracle.com>
Signed-off-by: Mark Fasheh <mfasheh@suse.com>
Take o2hb_stop() out of the o2cb code and make it part of the generic
stack glue as ocfs2_leave_group(). This also allows us to remove the
ocfs2_get_hb_ctl_path() function - everything to do with hb_ctl is now
part of stackglue.c. o2cb no longer needs a ->hangup() function.
Signed-off-by: Joel Becker <joel.becker@oracle.com>
Signed-off-by: Mark Fasheh <mfasheh@suse.com>
ocfs2 needs to call out to the hb_ctl program at unmount for all cluster
stacks. The first step is to move the hb_ctl_path sysctl out of the
o2cb code and into the generic stack glue.
Signed-off-by: Joel Becker <joel.becker@oracle.com>
Signed-off-by: Mark Fasheh <mfasheh@suse.com>
This patch silences the build warnings concerning o2net_init_nst()
and friends when building without CONFIG_DEBUG_FS enabled.
Signed-off-by: Sunil Mushran <sunil.mushran@oracle.com>
Signed-off-by: Mark Fasheh <mfasheh@suse.com>
This patch silences the build warnings concerning dlm_debug_init()
and friends when building without CONFIG_DEBUG_FS enabled.
Signed-off-by: Sunil Mushran <sunil.mushran@oracle.com>
Signed-off-by: Mark Fasheh <mfasheh@suse.com>
This patch silences the build warnings concerning o2net_debugfs_init()
and friends when building without CONFIG_DEBUG_FS enabled.
Signed-off-by: Sunil Mushran <sunil.mushran@oracle.com>
Signed-off-by: Mark Fasheh <mfasheh@suse.com>
The static structure describing the userspace cluster plugin for ocfs2
was named 'user_stack', which is a real pain when people are grep(1)ing
the tree for the program stack object 'user_stack'. Change the name to
something distinct and namespaced.
Signed-off-by: Joel Becker <joel.becker@oracle.com>
Signed-off-by: Mark Fasheh <mfasheh@suse.com>
kmalloc() during a localalloc window move can trigger the mm to prune
the dcache, which in turn can trigger the fs to delete an inode, causing
it to start a recursive transaction.
The fix also makes the change in kmalloc during localalloc shutdown
just to be safe.
Fixes oss bugzilla#901
http://oss.oracle.com/bugzilla/show_bug.cgi?id=901
Signed-off-by: Sunil Mushran <sunil.mushran@oracle.com>
Signed-off-by: Mark Fasheh <mfasheh@suse.com>
This patch adds the ability to change attributes of a symlink.
Fixes oss bugzilla#963
http://oss.oracle.com/bugzilla/show_bug.cgi?id=963
Signed-off-by: Sunil Mushran <sunil.mushran@oracle.com>
Signed-off-by: Mark Fasheh <mfasheh@suse.com>
This patch makes the following needlessly global functions static:
- stringify_lockname()
- dlm_debug_put()
Signed-off-by: Adrian Bunk <bunk@kernel.org>
Acked-by: Sunil Mushran <sunil.mushran@oracle.com>
Signed-off-by: Mark Fasheh <mfasheh@suse.com>
This patch makes the needlessly global struct o2cb_stack_ops static.
Signed-off-by: Adrian Bunk <bunk@kernel.org>
Acked-by: Joel Becker <joel.becker@oracle.com>
Signed-off-by: Mark Fasheh <mfasheh@suse.com>
This patch makes the needlessly global struct ocfs2_control_device
static.
Signed-off-by: Adrian Bunk <bunk@kernel.org>
Acked-by: Joel Becker <joel.becker@oracle.com>
Signed-off-by: Mark Fasheh <mfasheh@suse.com>
Commit 52f7c21b61 was intended to move
/sys/o2cb to /sys/fs/o2cb, providing /sys/o2cb as a symlink for
backwards compatibility. However, the merge apparently added the
symlink but failed to move the directory, resulting in a duplicate
filename error. It's a one-line change that was missing.
Signed-off-by: Joel Becker <joel.becker@oracle.com>
Acked-by: Randy Dunlap <randy.dunlap@oracle.com>
Signed-off-by: Mark Fasheh <mfasheh@suse.com>
Add a new BDI capability flag: BDI_CAP_NO_ACCT_WB. If this flag is
set, then don't update the per-bdi writeback stats from
test_set_page_writeback() and test_clear_page_writeback().
Misc cleanups:
- convert bdi_cap_writeback_dirty() and friends to static inline functions
- create a flag that includes all three dirty/writeback related flags,
since almost all users will want to have them together
Signed-off-by: Miklos Szeredi <mszeredi@suse.cz>
Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Some ioctl()s can cause writes to the filesystem. Take these, and make them
use mnt_want/drop_write() instead.
[AV: updated]
Acked-by: Al Viro <viro@ZenIV.linux.org.uk>
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Dave Hansen <haveblue@us.ibm.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
This patch exposes o2net information via debugfs. The information includes
the list of sockets (sock_containers) as well as the list of outstanding
messages (send_tracking). Useful for o2dlm debugging.
(This patch is derived from an earlier one written by Zach Brown that
exposed the same information via /proc.)
[Mark: checkpatch fixes]
Signed-off-by: Sunil Mushran <sunil.mushran@oracle.com>
Reviewed-by: Joel Becker <joel.becker@oracle.com>
Signed-off-by: Mark Fasheh <mfasheh@suse.com>
fs/ocfs2/dlm/ocfs2_dlm.ko and fs/ocfs2/dlm/ocfs2_dlmfs.ko get built if
CONFIG_FS_OCFS2 is specified. This isn't quite how it should happen any more
- the "o2cb" dlm modules should only be built if CONFIG_FS_OCFS2_O2CB is
set, so update the dlm Makefile accordingly.
Signed-off-by: Mark Fasheh <mfasheh@suse.com>
Acked-by: Randy Dunlap <randy.dunlap@oracle.com>
Acked-by: Joel Becker <joel.becker@oracle.com>
We keep seeing bug reports related to NULL pointer derefs in
o2net_set_nn_state(). When I originally wrote up the configurable timeout
patch, I had tried to plan for multiple clusters. This was silly.
The timeout routines all use o2nm_single_cluster so there's no point in
passing an argument at all. This patch removes the arguments and kills those
bugs dead.
Signed-off-by: Jeff Mahoney <jeffm@suse.com>
Signed-off-by: Mark Fasheh <mfasheh@suse.com>
if (...) BUG(); should be replaced with BUG_ON(...) when the test has no
side-effects to allow a definition of BUG_ON that drops the code completely.
The semantic patch that makes this change is as follows:
(http://www.emn.fr/x-info/coccinelle/)
// <smpl>
@ disable unlikely @ expression E,f; @@
(
if (<... f(...) ...>) { BUG(); }
|
- if (unlikely(E)) { BUG(); }
+ BUG_ON(E);
)
@@ expression E,f; @@
(
if (<... f(...) ...>) { BUG(); }
|
- if (E) { BUG(); }
+ BUG_ON(E);
)
// </smpl>
Signed-off-by: Julia Lawall <julia@diku.dk>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Mark Fasheh <mfasheh@suse.com>
As far as I can see there is nothing in ocfs2_ioctl that requires the BKL,
so use unlocked_ioctl
Signed-off-by: Andi Kleen <ak@suse.de>
Signed-off-by: Mark Fasheh <mfasheh@suse.com>
ocfs2_rename() was being too aggressive with the rename lock - we only need
it for certain forms of directory rename.
Signed-off-by: Jan Kara <jack@suse.cz>
Signed-off-by: Mark Fasheh <mfasheh@suse.com>
The function ocfs2_start_trans always returns either a valid pointer or a
value made with ERR_PTR, so its result should be tested with IS_ERR, not
with a test for 0.
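Why a test for 0 misses the failure can be shown with a user-space imitation of the kernel's error-pointer helpers. The three helpers below only mimic ERR_PTR(), PTR_ERR() and IS_ERR(); struct handle, start_trans() and do_update() are hypothetical.

```c
#include <assert.h>
#include <errno.h>
#include <stdint.h>

#define MAX_ERRNO 4095

/* User-space imitations of the kernel helpers, for illustration only. */
static inline void *ERR_PTR(long error)
{
	assert(error < 0 && error >= -MAX_ERRNO);
	return (void *)(intptr_t)error;
}

static inline long PTR_ERR(const void *ptr)
{
	return (long)(intptr_t)ptr;
}

static inline int IS_ERR(const void *ptr)
{
	return (uintptr_t)ptr >= (uintptr_t)-MAX_ERRNO;
}

struct handle {
	int credits;
};

static struct handle the_handle;

/* Returns a valid handle or an ERR_PTR value, never NULL. */
struct handle *start_trans(int credits)
{
	if (credits <= 0)
		return ERR_PTR(-EINVAL);
	the_handle.credits = credits;
	return &the_handle;
}

int do_update(int credits)
{
	struct handle *h = start_trans(credits);

	if (IS_ERR(h))		/* "if (!h)" would never fire here */
		return (int)PTR_ERR(h);
	return 0;
}
```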
Signed-off-by: Julia Lawall <julia@diku.dk>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Mark Fasheh <mfasheh@suse.com>
Inode allocation is modified to look in other nodes' allocators during
extreme out of space situations. We retry our own slot when space is freed
back to the global bitmap, or whenever we've allocated more than 1024 inodes
from another slot.
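The retry rule can be modelled roughly as below. The structure, the function names and the idea of tracking a single counter are assumptions made for this sketch; only the 1024 threshold and the two retry triggers come from the description above.

```c
#include <assert.h>

#define MAX_INODES_TO_STEAL	1024
#define INVALID_SLOT		(-1)

/* Hypothetical bookkeeping for inode stealing. */
struct steal_state {
	int steal_slot;		/* slot last stolen from, or INVALID_SLOT */
	int num_stolen;		/* inodes taken from other slots so far */
};

/* Called when space is freed back to the global bitmap. */
void steal_reset(struct steal_state *s)
{
	s->steal_slot = INVALID_SLOT;
	s->num_stolen = 0;
}

/* Called after an inode was allocated from another node's slot. */
void steal_note(struct steal_state *s, int slot)
{
	assert(slot != INVALID_SLOT);
	s->steal_slot = slot;
	s->num_stolen++;
}

/* Nonzero when the next allocation should try our own slot first. */
int should_retry_own_slot(const struct steal_state *s)
{
	return s->steal_slot == INVALID_SLOT ||
	       s->num_stolen > MAX_INODES_TO_STEAL;
}
```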
Signed-off-by: Tao Ma <tao.ma@oracle.com>
Signed-off-by: Mark Fasheh <mfasheh@suse.com>
In inode stealing, we no longer restrict the allocation to
happen in the local node. So it is necessary for us to add
a new member in ocfs2_alloc_context to indicate which slot
we are using for allocation. We also modify the process of
local alloc so that this member can be used there also.
Signed-off-by: Tao Ma <tao.ma@oracle.com>
Signed-off-by: Sunil Mushran <sunil.mushran@oracle.com>
Signed-off-by: Mark Fasheh <mfasheh@suse.com>
In some cases (inode stealing from other nodes), we may not want
ocfs2_reserve_suballoc_bits to allocate new groups from the
global_bitmap since it may already be full. So add a new parameter
for this.
Signed-off-by: Tao Ma <tao.ma@oracle.com>
Signed-off-by: Sunil Mushran <sunil.mushran@oracle.com>
Signed-off-by: Mark Fasheh <mfasheh@suse.com>
In ocfs2_figure_merge_contig_type, we judge whether there exists
a cross extent block merge and enable it by setting CONTIG_LEFT
and CONTIG_RIGHT accordingly.
Signed-off-by: Tao Ma <tao.ma@oracle.com>
Signed-off-by: Mark Fasheh <mfasheh@suse.com>
In ocfs2_merge_rec_left, when we find the merge extent is "CONTIG_RIGHT"
with the first extent record of the next extent block, we will merge it to
the next extent block and change all the related extent blocks accordingly.
In ocfs2_merge_rec_right, when we find the merge extent is "CONTIG_LEFT"
with the last extent record of the previous extent block, we will merge
it to the previous extent block and change all the related extent blocks
accordingly.
As for CONTIG_LEFTRIGHT, we will handle CONTIG_RIGHT first so that when
the index is zero, the merge process will be more efficient and easier.
Signed-off-by: Tao Ma <tao.ma@oracle.com>
Signed-off-by: Mark Fasheh <mfasheh@suse.com>
/sys/fs is where we really want file system specific sysfs objects.
Ocfs2-tools has been updated to look in /sys/fs/o2cb. We can maintain
backwards compatibility with old ocfs2-tools by using a sysfs symlink. After
some time (2 years), the symlink can be safely removed. This patch also adds
documentation to make it easier for people to figure out what /sys/fs/o2cb
is used for.
Signed-off-by: Mark Fasheh <mfasheh@suse.com>
Currently, o2net connects to a node on hb_up and disconnects on
hb_down and net timeout.
Disconnecting on net timeout is ok, but it should attempt to
reconnect back. This is because sometimes nodes get overloaded
enough that the network connection breaks but the disk hb does not.
And if we get into that situation, we either fence (unnecessarily)
or wait for its disk hb to die (and sometimes hang in the process).
So in this updated scheme, when the network disconnects, we keep
attempting to reconnect till we succeed or we get a disk hb down
event.
If the other node is really dead, then we will eventually get a
node down event. If not, we should be able to connect again and
continue.
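The updated scheme amounts to a small state machine. The following is a simplified model, not the o2net code; struct peer and the handler names are hypothetical, and each handler returns 1 when another connect attempt should be scheduled.

```c
#include <assert.h>

/* Hypothetical per-node state. */
struct peer {
	int hb_up;		/* disk heartbeat currently alive */
	int connected;		/* network connection established */
};

int peer_hb_up(struct peer *p)
{
	p->hb_up = 1;
	return !p->connected;
}

int peer_hb_down(struct peer *p)
{
	p->hb_up = 0;
	p->connected = 0;
	return 0;		/* node is down: stop reconnecting */
}

int peer_net_timeout(struct peer *p)
{
	p->connected = 0;
	return p->hb_up;	/* keep trying while the disk hb is alive */
}

int peer_connect_result(struct peer *p, int ok)
{
	assert(ok == 0 || ok == 1);
	if (ok && p->hb_up) {
		p->connected = 1;
		return 0;
	}
	return p->hb_up;	/* failed: retry unless hb went down */
}
```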
Signed-off-by: Tao Ma <tao.ma@oracle.com>
Signed-off-by: Mark Fasheh <mfasheh@suse.com>
A previous patch added KERN_NOTICE to printks printing the lockres that
cluttered the output. This patch removes the log level. For people concerned
with syslog clutter, please note we now use this facility to print lockres
only during an error.
Signed-off-by: Sunil Mushran <sunil.mushran@oracle.com>
Signed-off-by: Mark Fasheh <mfasheh@suse.com>
__dlm_print_one_lock_resource was printing lockname incorrectly.
Also, we now use printk directly instead of mlog as the latter prints
the line context which is not useful for this print.
Signed-off-by: Sunil Mushran <sunil.mushran@oracle.com>
Signed-off-by: Joel Becker <joel.becker@oracle.com>
Signed-off-by: Mark Fasheh <mfasheh@suse.com>
This patch helps in consolidating debugging related functions in dlmdebug.c.
Signed-off-by: Sunil Mushran <sunil.mushran@oracle.com>
Signed-off-by: Joel Becker <joel.becker@oracle.com>
Signed-off-by: Mark Fasheh <mfasheh@suse.com>
This patch dumps all the lockres' on the purgelist it can fit in one page
into a debugfs file. Useful for debugging.
Signed-off-by: Sunil Mushran <sunil.mushran@oracle.com>
Signed-off-by: Joel Becker <joel.becker@oracle.com>
Signed-off-by: Mark Fasheh <mfasheh@suse.com>
This patch dumps all mles it can fit in one page into a debugfs file.
Useful for debugging.
Signed-off-by: Sunil Mushran <sunil.mushran@oracle.com>
Signed-off-by: Joel Becker <joel.becker@oracle.com>
Signed-off-by: Mark Fasheh <mfasheh@suse.com>
This patch moves some mle related definitions from dlmmaster.c
to dlmcommon.h. Future patches need these definitions to dump mle
debugging information.
Signed-off-by: Sunil Mushran <sunil.mushran@oracle.com>
Signed-off-by: Joel Becker <joel.becker@oracle.com>
Signed-off-by: Mark Fasheh <mfasheh@suse.com>
This patch dumps all the lockres' along with all the locks into
a debugfs file. Useful for debugging.
Signed-off-by: Sunil Mushran <sunil.mushran@oracle.com>
Signed-off-by: Joel Becker <joel.becker@oracle.com>
Signed-off-by: Mark Fasheh <mfasheh@suse.com>
This patch dumps the dlm state (dlm_ctxt) into a debugfs file.
Useful for debugging.
Signed-off-by: Sunil Mushran <sunil.mushran@oracle.com>
Signed-off-by: Joel Becker <joel.becker@oracle.com>
Signed-off-by: Mark Fasheh <mfasheh@suse.com>
This patch creates the debugfs directories that will hold the
files to be used to dump the dlm state.
Signed-off-by: Sunil Mushran <sunil.mushran@oracle.com>
Signed-off-by: Joel Becker <joel.becker@oracle.com>
Signed-off-by: Mark Fasheh <mfasheh@suse.com>
This patch links all the lockres' to a tracking list in dlm_ctxt.
We will use this in an upcoming patch that will walk the entire
list and dump the lockres states to a debugfs file.
Signed-off-by: Sunil Mushran <sunil.mushran@oracle.com>
Signed-off-by: Joel Becker <joel.becker@oracle.com>
Signed-off-by: Mark Fasheh <mfasheh@suse.com>
This patch makes the o2dlm allocate memory for lockres, lockname and lock
structures from slabcaches rather than kmalloc. This allows us to not only
make these allocs more efficient but also allows us to track the memory being
consumed by these structures.
Signed-off-by: Sunil Mushran <sunil.mushran@oracle.com>
Signed-off-by: Joel Becker <joel.becker@oracle.com>
Signed-off-by: Mark Fasheh <mfasheh@suse.com>
This patch renames dlm_mle_slabcache to prevent namespace clashes with fs/dlm.
Signed-off-by: Sunil Mushran <sunil.mushran@oracle.com>
Signed-off-by: Joel Becker <joel.becker@oracle.com>
Signed-off-by: Mark Fasheh <mfasheh@suse.com>
ocfs2 now supports plug-ins for the classic O2CB stack as well as
userspace cluster stacks in conjunction with fs/dlm. This allows zero,
one, or both of the plug-ins to be selected in Kconfig. For local mounts
(non-clustered), neither plug-in is needed. Both plug-ins can be loaded
at one time; the runtime will select the one needed for the cluster
system in use.
Signed-off-by: Joel Becker <joel.becker@oracle.com>
Signed-off-by: Mark Fasheh <mfasheh@suse.com>
Add ocfs2_stack_user.ko to the Makefile so that it builds.
Signed-off-by: Joel Becker <joel.becker@oracle.com>
Signed-off-by: Mark Fasheh <mfasheh@suse.com>
The masklog code is in the o2cb stack, but ocfs2_lockid.h now needs to
be included by the user stack. The BUG() in ocfs2_lock_type_string()
does not need masklog support, so change it to a regular BUG_ON().
Signed-off-by: Joel Becker <joel.becker@oracle.com>
Signed-off-by: Mark Fasheh <mfasheh@suse.com>
Add code to use fs/dlm.
[ Modified to be part of the stack_user module -- Joel ]
Signed-off-by: David Teigland <teigland@redhat.com>
Signed-off-by: Joel Becker <joel.becker@oracle.com>
Signed-off-by: Mark Fasheh <mfasheh@suse.com>
The "SETV" message sets the filesystem locking protocol version as
negotiated by the client. The client negotiates based on the maximum
version advertised in /sys/fs/ocfs2/max_locking_protocol.
Signed-off-by: Joel Becker <joel.becker@oracle.com>
Signed-off-by: Mark Fasheh <mfasheh@suse.com>
This is the second part of the ocfs2_control handshake. After
negotiating the ocfs2_control protocol, the daemon tells the filesystem
what the local node id is via the SETN message.
Signed-off-by: Joel Becker <joel.becker@oracle.com>
Signed-off-by: Mark Fasheh <mfasheh@suse.com>
When the control daemon sees a node go down, it sends a DOWN message
through the ocfs2_control device.
Signed-off-by: Joel Becker <joel.becker@oracle.com>
Signed-off-by: Mark Fasheh <mfasheh@suse.com>
When a control daemon opens the ocfs2_control device, it must perform a
handshake to tell the filesystem it is something capable of monitoring
cluster status. Only after the handshake is complete will the filesystem
allow mounts.
This is the first part of the handshake. The daemon reads all supported
ocfs2_control protocols, then writes in the protocol it will use.
Signed-off-by: Joel Becker <joel.becker@oracle.com>
Signed-off-by: Mark Fasheh <mfasheh@suse.com>
The ocfs2_control misc device is how a userspace control daemon (controld)
talks to the filesystem. Introduce the bare-bones filesystem ops.
Signed-off-by: Joel Becker <joel.becker@oracle.com>
Signed-off-by: Mark Fasheh <mfasheh@suse.com>
Add a skeleton for the stack_user module. It's just the barebones module
code.
Signed-off-by: Joel Becker <joel.becker@oracle.com>
Signed-off-by: Mark Fasheh <mfasheh@suse.com>
Userspace can now query and specify the cluster stack in use via the
/sys/fs/ocfs2/cluster_stack file. By default, it is 'o2cb', which is
the classic stack. Thus, old tools that do not know how to modify this
file will work just fine. The stack cannot be modified if there is a
live filesystem.
ocfs2_cluster_connect() now takes the expected cluster stack as an
argument. This way, the filesystem and the stack glue ensure they are
speaking to the same backend.
If the stack is 'o2cb', the o2cb stack plugin is used. For any other
value, the fsdlm stack plugin is selected.
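The rules above can be modelled in a few lines of user-space C. Everything here is a sketch under assumptions: the structure, the function names, the name length limit and the error codes are illustrative, and "pcmk" in the usage note is just an example value for a non-o2cb stack.

```c
#include <assert.h>
#include <errno.h>
#include <string.h>

#define STACK_NAME_MAX 15	/* illustrative limit, not the kernel's */

enum plugin { PLUGIN_O2CB, PLUGIN_FSDLM };

/* Hypothetical stack glue state. */
struct glue_state {
	char stack[STACK_NAME_MAX + 1];
	int live_filesystems;
};

void glue_init(struct glue_state *g)
{
	strcpy(g->stack, "o2cb");	/* default: the classic stack */
	g->live_filesystems = 0;
}

/* Models a write to the cluster_stack file. */
int glue_set_stack(struct glue_state *g, const char *name)
{
	size_t len = strlen(name);

	if (len == 0 || len > STACK_NAME_MAX)
		return -EINVAL;
	if (g->live_filesystems > 0)
		return -EBUSY;		/* no change with a live filesystem */
	memcpy(g->stack, name, len + 1);
	return 0;
}

enum plugin glue_select_plugin(const struct glue_state *g)
{
	return strcmp(g->stack, "o2cb") == 0 ? PLUGIN_O2CB : PLUGIN_FSDLM;
}

/* The filesystem passes the stack it expects; a mismatch fails. */
int glue_connect(struct glue_state *g, const char *expected)
{
	assert(expected != NULL);
	if (strcmp(g->stack, expected) != 0)
		return -EINVAL;
	g->live_filesystems++;
	return 0;
}
```

After glue_set_stack(&g, "pcmk"), glue_select_plugin() reports the fsdlm plugin, and once glue_connect() has succeeded any further attempt to change the stack returns -EBUSY.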
Signed-off-by: Joel Becker <joel.becker@oracle.com>
Signed-off-by: Mark Fasheh <mfasheh@suse.com>
The filesystem gains the USERSPACE_STACK incompat bit and the
s_cluster_info field on the superblock. When a userspace stack is in
use, the name of the stack is stored on-disk for mount-time
verification.
The "cluster_stack" option is added to mount(2) processing. The mount
process needs to pass the matching stack name. If the passed name and
the on-disk name do not match, the mount is failed.
When using the classic o2cb stack, the incompat bit is *not* set and no
mount option is used other than the usual heartbeat=local. Thus, the
filesystem is compatible with older tools.
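The mount-time check might look roughly like this. verify_mount_stack() and its parameters are hypothetical, and the -EINVAL return value is an assumption; the sketch only encodes the two cases described above.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <string.h>

/*
 * userspace_incompat: nonzero if the USERSPACE_STACK incompat bit is set.
 * ondisk:             stack name stored in the superblock.
 * mount_opt:          value of the cluster_stack mount option, or NULL.
 */
int verify_mount_stack(int userspace_incompat, const char *ondisk,
		       const char *mount_opt)
{
	if (!userspace_incompat)
		return 0;	/* classic o2cb: nothing stored to compare */

	assert(ondisk != NULL);
	if (mount_opt == NULL || strcmp(ondisk, mount_opt) != 0)
		return -EINVAL;	/* names differ: fail the mount */
	return 0;
}
```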
Signed-off-by: Joel Becker <joel.becker@oracle.com>
Signed-off-by: Mark Fasheh <mfasheh@suse.com>
Introduce a set of sysfs files that describe the current stack glue
state. The files live under /sys/fs/ocfs2. The locking_protocol file
displays the version of ocfs2's locking code. The
loaded_cluster_plugins file displays all of the currently loaded stack
plugins. When filesystems are mounted, the active_cluster_plugin file
will display the plugin in use.
Signed-off-by: Joel Becker <joel.becker@oracle.com>
Signed-off-by: Mark Fasheh <mfasheh@suse.com>
We define the ocfs2_stack_plugin structure to represent a stack driver.
The o2cb stack code is split into stack_o2cb.c. This becomes the
ocfs2_stack_o2cb.ko module.
The stackglue generic functions are similarly split into the
ocfs2_stackglue.ko module. This module now provides an interface to
register drivers. The ocfs2_stack_o2cb driver registers itself. As
part of this interface, ocfs2_stackglue can load drivers on demand.
This is accomplished in ocfs2_cluster_connect().
ocfs2_cluster_disconnect() is now notified when a _hangup() is pending.
If a hangup is pending, it will not release the driver module and will
let _hangup() do that.
Signed-off-by: Joel Becker <joel.becker@oracle.com>
Define the ocfs2_stack_operations structure. Build o2cb_stack_ops from
all of the o2cb-specific stack functions. Change the generic stack glue
functions to call the stack_ops instead of the o2cb functions directly.
The o2cb functions are moved to stack_o2cb.c. The headers are cleaned up
to where only needed headers are included.
In this code, stackglue.c and stack_o2cb.c refer to some shared
extern variables. When they become modules, that will change.
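An operations vector of this kind can be sketched as follows. The member names, the glue_*() wrappers and the node number 7 are illustrative assumptions, not the contents of ocfs2_stack_operations.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical operations vector for a cluster stack. */
struct stack_operations {
	int (*connect)(const char *group);
	int (*disconnect)(const char *group);
	int (*this_node)(unsigned int *node);
};

/* Stack-specific functions, standing in for the o2cb ones. */
static int o2cb_connected;

static int o2cb_connect(const char *group)
{
	(void)group;
	o2cb_connected = 1;
	return 0;
}

static int o2cb_disconnect(const char *group)
{
	(void)group;
	o2cb_connected = 0;
	return 0;
}

static int o2cb_this_node(unsigned int *node)
{
	*node = 7;		/* made-up node number */
	return 0;
}

static const struct stack_operations o2cb_stack_ops = {
	.connect	= o2cb_connect,
	.disconnect	= o2cb_disconnect,
	.this_node	= o2cb_this_node,
};

static const struct stack_operations *stack_ops = &o2cb_stack_ops;

/* Generic glue: callers never name the o2cb functions directly. */
int glue_connect(const char *group)
{
	assert(stack_ops != NULL);
	return stack_ops->connect(group);
}

int glue_disconnect(const char *group)
{
	return stack_ops->disconnect(group);
}

int glue_this_node(unsigned int *node)
{
	return stack_ops->this_node(node);
}
```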
Signed-off-by: Joel Becker <joel.becker@oracle.com>
Signed-off-by: Mark Fasheh <mfasheh@suse.com>
Split off the o2cb-specific functionality from the generic stack glue
calls. This is a precursor to wrapping the o2cb functionality in an
operations vector.
Signed-off-by: Joel Becker <joel.becker@oracle.com>
Signed-off-by: Mark Fasheh <mfasheh@suse.com>
The stack glue initialization function needs a better name so that it can be
used cleanly when stackglue becomes a module.
Signed-off-by: Joel Becker <joel.becker@oracle.com>
Signed-off-by: Mark Fasheh <mfasheh@suse.com>
dlmglue.c was still referencing a raw o2dlm lksb in one instance. Let's
create a generic ocfs2_dlm_dump_lksb() function. This allows underlying
DLMs to print whatever they want about their lock.
We then move the o2dlm dump into stackglue.c where it belongs.
Signed-off-by: Joel Becker <joel.becker@oracle.com>
Signed-off-by: Mark Fasheh <mfasheh@suse.com>
When using fsdlm, -EAGAIN is returned in the async callback for NOQUEUE
requests. Fix up dlmglue to expect this.
Signed-off-by: David Teigland <teigland@redhat.com>
Signed-off-by: Joel Becker <joel.becker@oracle.com>
Signed-off-by: Mark Fasheh <mfasheh@suse.com>
o2dlm has the non-standard behavior of providing a cancel callback
(unlock_ast) even when the cancel has failed (the locking operation
succeeded without canceling). This is called CANCELGRANT after the
status code sent to the callback. fs/dlm does not provide this
callback, so dlmglue must be changed to live without it.
o2dlm_unlock_ast_wrapper() in stackglue now ignores CANCELGRANT calls.
Because dlmglue no longer sees CANCELGRANT, ocfs2_unlock_ast() no longer
needs to check for it. ocfs2_locking_ast() must catch that a cancel was
tried and clear the cancel state.
Making these changes opens up a locking race. dlmglue uses the
OCFS2_LOCK_BUSY flag to ensure only one thread is calling the dlm at any
one time. But dlmglue must unlock the lockres before calling into the
dlm. In the small window of time between unlocking the lockres and
calling the dlm, the downconvert thread can try to cancel the lock. The
downconvert thread is checking the OCFS2_LOCK_BUSY flag - it doesn't
know that ocfs2_dlm_lock() has not yet been called.
Because ocfs2_dlm_lock() has not yet been called, the cancel operation
will just be a no-op. There's nothing to cancel. With CANCELGRANT,
dlmglue uses the CANCELGRANT callback to clear up the cancel state.
When it comes around again, it will retry the cancel. Eventually, the
first thread will have called into ocfs2_dlm_lock(), and either the
lock or the cancel will succeed. The downconvert thread can then do its
downconvert.
Without CANCELGRANT, there is nothing to clean up the cancellation
state. The downconvert thread does not know to retry its operations.
More importantly, the original lock may be blocking on the other node
that is trying to cancel us. With neither able to make progress, the
ast is never called and the cancellation state is never cleaned up that
way. dlmglue is deadlocked.
The OCFS2_LOCK_PENDING flag is introduced to remedy this window. It is
set at the same time OCFS2_LOCK_BUSY is. Thus, the downconvert thread
can check whether the lock is cancelable. If not, it just loops around
to try again. Once ocfs2_dlm_lock() is called, the thread then clears
OCFS2_LOCK_PENDING and wakes the downconvert thread. Now, if the
downconvert thread finds the lock BUSY, it can safely try to cancel it.
Whether the cancel works or not, the state will be properly set and the
lock processing can continue.
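The flag protocol can be illustrated with a single-threaded model. It leaves out the locking and wakeups and uses hypothetical names and flag values; it only shows when the downconvert thread may attempt a cancel.

```c
#include <assert.h>

#define LOCK_BUSY	0x1u	/* a dlm operation is in flight or about to be */
#define LOCK_PENDING	0x2u	/* BUSY is set but the dlm was not called yet */

struct lockres_model {
	unsigned int flags;
};

/* Locking thread: set both flags before dropping the lockres lock. */
void begin_dlm_call(struct lockres_model *l)
{
	l->flags |= LOCK_BUSY | LOCK_PENDING;
}

/* Locking thread: the call into the dlm has been made. */
void dlm_call_issued(struct lockres_model *l)
{
	assert(l->flags & LOCK_PENDING);
	l->flags &= ~LOCK_PENDING;
}

/* The ast arrived: the operation is finished. */
void dlm_call_done(struct lockres_model *l)
{
	l->flags &= ~(LOCK_BUSY | LOCK_PENDING);
}

/* Downconvert thread: cancel only makes sense once the dlm was called. */
int can_cancel(const struct lockres_model *l)
{
	return (l->flags & LOCK_BUSY) && !(l->flags & LOCK_PENDING);
}
```

While can_cancel() returns 0 during the pending window, the downconvert thread in this model simply loops and checks again later.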
Signed-off-by: Joel Becker <joel.becker@oracle.com>
Signed-off-by: Mark Fasheh <mfasheh@suse.com>
It doesn't make sense to query for a node number before connecting to the
cluster stack. This should be safe to do because node_num is only just
printed, and we're actually only moving the setting of node num a small
amount further in the mount process.
[ Disconnect when node query fails -- Joel ]
Reviewed-by: Joel Becker <joel.becker@oracle.com>
Signed-off-by: Mark Fasheh <mfasheh@suse.com>
The last bit of classic stack used directly in ocfs2 code is o2hb.
Specifically, the check for heartbeat during mount and the call to
ocfs2_hb_ctl during unmount.
We create an extra API, ocfs2_cluster_hangup(), to encapsulate the call
to ocfs2_hb_ctl. Other stacks will just leave hangup() empty.
The check for heartbeat is moved into ocfs2_cluster_connect(). It will
be matched by a similar check for other stacks.
With this change, only stackglue.c includes cluster/ headers.
Signed-off-by: Joel Becker <joel.becker@oracle.com>
Signed-off-by: Mark Fasheh <mfasheh@suse.com>
ocfs2 asks the cluster stack for the local node's node number for two
reasons: to fill the slot map and to print it. While the slot map isn't
necessary for userspace cluster stacks, the printing is very nice for
debugging. Thus we add ocfs2_cluster_this_node() as a generic API to get
this value. It is anticipated that the slot map will not be used under a
userspace cluster stack, so validity checks of the node num only need to
exist in the slot map code. Otherwise, it just gets used and printed as an
opaque value.
[ Fixed up some "int" versus "unsigned int" issues and made osb->node_num
truly opaque. --Mark ]
Signed-off-by: Joel Becker <joel.becker@oracle.com>
Signed-off-by: Mark Fasheh <mfasheh@suse.com>
This step introduces a cluster stack agnostic API for initializing and
exiting. fs/ocfs2/dlmglue.c no longer uses o2cb/o2dlm knowledge to
connect to the stack. It is all handled in stackglue.c.
heartbeat.c no longer needs to know how it gets called.
ocfs2_do_node_down() is now a clean recovery trigger.
The big gotcha is the ordering of initializations and de-initializations done
underneath ocfs2_cluster_connect(). ocfs2_dlm_init() used to do all
o2dlm initialization in one block. Thus, the o2dlm functionality of
ocfs2_cluster_connect() is very straightforward. ocfs2_dlm_shutdown(),
however, did a few things between de-registration of the eviction
callback and actually shutting down the domain. Now de-registration and
shutdown of the domain are wrapped within the single
ocfs2_cluster_disconnect() call. I've checked the code paths to make
sure we can safely tear down things in ocfs2_dlm_shutdown() before
calling ocfs2_cluster_disconnect(). The filesystem has already set
itself to ignore the callback.
Signed-off-by: Joel Becker <joel.becker@oracle.com>
Signed-off-by: Mark Fasheh <mfasheh@suse.com>
Wrap the lock status block (lksb) in a union. Later we will add a union
element for the fs/dlm lksb. Create accessors for the status and lvb
fields.
Other than a debugging function, dlmglue.c does not directly reference
the o2dlm locking path anymore.
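A rough picture of the wrapping, with made-up names: struct o2dlm_lksb_model stands in for the o2dlm lock status block, and LVB_LEN is an illustrative size. The accessors are the only code that knows which union member is live.

```c
#include <assert.h>
#include <stddef.h>

#define LVB_LEN 64	/* illustrative size */

/* Stand-in for the o2dlm lock status block. */
struct o2dlm_lksb_model {
	int status;
	char lvb[LVB_LEN];
};

union cluster_lksb {
	struct o2dlm_lksb_model lksb_o2dlm;
	/* a member for the fs/dlm lksb would be added here later */
};

/* Accessors hide the stack-specific layout from generic code. */
static inline int lksb_status(const union cluster_lksb *lksb)
{
	assert(lksb != NULL);
	return lksb->lksb_o2dlm.status;
}

static inline void *lksb_lvb(union cluster_lksb *lksb)
{
	assert(lksb != NULL);
	return lksb->lksb_o2dlm.lvb;
}
```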
Signed-off-by: Joel Becker <joel.becker@oracle.com>
Signed-off-by: Mark Fasheh <mfasheh@suse.com>
Change the ocfs2_dlm_lock/unlock() functions to return -errno values.
This is the first step towards eliminating dlm_status in
fs/ocfs2/dlmglue.c. The change also passes -errno values to
->unlock_ast().
[ Fix a return code in dlmglue.c and change the error translation table into
an array of ints. --Mark ]
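A translation table of this shape might look like the sketch below. The enum is a cut-down, renamed model of the o2dlm status codes, and the specific errno choices are assumptions for illustration.

```c
#include <assert.h>
#include <errno.h>

/* Cut-down model of a dlm status enum; names and values are made up. */
enum model_dlm_status {
	MDLM_NORMAL = 0,
	MDLM_NOTQUEUED,
	MDLM_DENIED,
	MDLM_NOLOCKMGR,
	MDLM_MAXSTATS
};

/* Array of ints indexed by status; the mapping is illustrative. */
static const int status_to_errno_map[MDLM_MAXSTATS] = {
	[MDLM_NORMAL]		= 0,
	[MDLM_NOTQUEUED]	= -EAGAIN,
	[MDLM_DENIED]		= -EPERM,
	[MDLM_NOLOCKMGR]	= -ENOLCK,
};

int dlm_status_to_errno(enum model_dlm_status status)
{
	assert((unsigned int)status < (unsigned int)MDLM_MAXSTATS);
	return status_to_errno_map[status];
}
```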
Signed-off-by: Joel Becker <joel.becker@oracle.com>
Signed-off-by: Mark Fasheh <mfasheh@suse.com>
The ocfs2 generic code should use the values in <linux/dlmconstants.h>.
stackglue.c will convert them to o2dlm values.
Signed-off-by: Joel Becker <joel.becker@oracle.com>
Signed-off-by: Mark Fasheh <mfasheh@suse.com>
This is the first in a series of patches to isolate ocfs2 from the
underlying cluster stack. Here we wrap the dlm locking functions with
ocfs2-specific calls. Because ocfs2 always uses the same dlm lock status
callbacks, we can eliminate the callbacks from the filesystem visible
functions.
Signed-off-by: Joel Becker <joel.becker@oracle.com>
Signed-off-by: Mark Fasheh <mfasheh@suse.com>
The old slot map had a few limitations:
- It was limited to one block, so the maximum slot count was 255.
- Each slot was signed 16bits, limiting node numbers to INT16_MAX.
- An empty slot was marked by the magic 0xFFFF (-1).
The new slot map format provides 32bit node numbers (UINT32_MAX), a
separate space to mark a slot in use, and extra room to grow. The slot
map is now bounded by i_size, not a block.
Signed-off-by: Joel Becker <joel.becker@oracle.com>
Signed-off-by: Mark Fasheh <mfasheh@suse.com>
The slot map file is merely an array of __le16. Wrap it in a structure for
cleaner reference.
Signed-off-by: Joel Becker <joel.becker@oracle.com>
Signed-off-by: Mark Fasheh <mfasheh@suse.com>