The ref count of certain glocks was elevated too far during unlink,
which caused umount to fail. This fixes it.
Signed-off-by: Steven Whitehouse <swhiteho@redhat.com>
The attribute logic for immutable was wrong, so there was
no way to remove this attribute once set. This fixes the
bug.
Signed-off-by: Steven Whitehouse <swhiteho@redhat.com>
The gh_owner field shouldn't be set or reset outside the glock code.
These were left over from when recursive locking was allowed. It
isn't any more, so they are not needed.
Signed-off-by: Steven Whitehouse <swhiteho@redhat.com>
The original code ordered the blocks allocated in the build_height
routine backwards causing excessive disk seeks during a read of the
metadata. This patch reverses the order to try and reduce disk seeks.
Example: A five level metadata tree, I = Inode, P = Pointers, D = Data
You need to read the blocks in the order:
I P5 P4 P3 P2 P1 D
in order to read a single data block. The new code now orders the blocks
in this way. The old code used to order them as:
I P1 P2 P3 P4 P5 D
requiring two extra seeks on average. Note that for files which are
grown by gradual extension rather than by truncate or by llseek/write
at a large offset, this doesn't apply. In the case of writing to a
file linearly, this routine will only be called upon to extend the
height of the tree by one block at a time, so the ordering is
determined by when it's called rather than by the internals of the
routine itself. Optimising that part of the ordering is a much
harder problem.
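The ordering idea can be sketched as follows. This is an illustration, not the build_height code: the helper name assign_in_read_order() is hypothetical, and it assumes the new blocks arrive as one contiguous ascending run.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/*
 * Hypothetical helper (not from the GFS2 source). first_block is assumed
 * to be the start of a contiguous run of `levels` newly allocated blocks.
 * level_block[0] is the pointer block nearest the data (P1) and
 * level_block[levels - 1] is the one nearest the inode (P5 in a five
 * level tree).
 */
static void assign_in_read_order(uint64_t first_block, size_t levels,
                                 uint64_t *level_block)
{
    assert(levels > 0);
    for (size_t i = 0; i < levels; i++) {
        /* The highest level gets the lowest block number, so a read
         * that walks P5, P4, ..., P1 sees ascending block numbers. */
        level_block[levels - 1 - i] = first_block + i;
    }
}
```

With a run starting at block 100 and five levels, P5 lands on block 100 and P1 on block 104, which matches the I P5 P4 P3 P2 P1 D read order.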
Signed-off-by: Steven Whitehouse <swhiteho@redhat.com>
This adds readpages support (and also corrects a small bug in
the readpage error path at the same time). Hopefully this will
improve performance by allowing GFS to submit larger lumps of
I/O at a time.
In order to simplify the setting of BH_Boundary, it currently gets
set when we hit the end of an indirect pointer block. There is
always a boundary at this point with the current allocation code.
It doesn't get all the boundaries right though, so there is still
room for improvement in this.
See comments in fs/gfs2/ops_address.c for further information about
readpages with GFS2.
Signed-off-by: Steven Whitehouse <swhiteho@redhat.com>
Well, I managed to track down the bug in gfs2 that was causing
my grief. Below is a patch for the problem. Please incorporate
as you see fit. Or should I say: as you see git.
The problem was basically that you never set d_ops for the root
inode, so the wrong hash algorithm was being used. But only for
the root directory. Turns out that if I used subdirectories, it
used the proper hash and my files were found just fine.
Signed-off-by: Robert S Peterson <rpeterso@redhat.com>
Signed-off-by: Steven Whitehouse <swhiteho@redhat.com>
As pointed out by Wendy Cheng, the logic in GFS2's writepage() function
wasn't quite right with respect to invalidating pages when a file has been
truncated. This patch fixes that.
CC: Wendy Cheng <wcheng@redhat.com>
Signed-off-by: Steven Whitehouse <swhiteho@redhat.com>
This patch contains the following possible cleanups:
- make needlessly global code static
- #if 0 unused functions
- remove the following global function that was both unused and
unimplemented:
- super.c: gfs2_do_upgrade()
Signed-off-by: Adrian Bunk <bunk@stusta.de>
Signed-off-by: Steven Whitehouse <swhiteho@redhat.com>
Despite my earlier careful search, there was a recursive lock left
in the deallocation code. This removes it. It should also speed up
deallocation by reducing the number of locking operations. We now
use two "try lock" operations on the two locks involved in inode
deallocation, which allows us to grab the locks out of order
(compared with NFS, which grabs the inode lock first and the iopen
lock later). It is ok for a try lock to fail here: failure means
that someone else is still using the inode, so it wouldn't be
possible to deallocate it anyway.
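A minimal userspace sketch of the "try both, back off on failure" pattern follows. It is not the glock code: struct try_lock, try_acquire(), release() and try_dealloc() are hypothetical names, and a C11 atomic_flag stands in for a lock with a non-blocking acquire.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>

/* Hypothetical stand-in for a lock that supports a non-blocking attempt. */
struct try_lock {
    atomic_flag held;
};

static bool try_acquire(struct try_lock *l)
{
    /* test_and_set returns the previous value: false means we got it. */
    return !atomic_flag_test_and_set(&l->held);
}

static void release(struct try_lock *l)
{
    atomic_flag_clear(&l->held);
}

/*
 * Take both locks with try operations. Returns true if both were
 * obtained, false if either was busy. A busy lock is treated as
 * "someone else still uses the inode", so nothing is deallocated.
 */
static bool try_dealloc(struct try_lock *iopen, struct try_lock *inode)
{
    if (!try_acquire(iopen))
        return false;
    if (!try_acquire(inode)) {
        release(iopen);     /* back off rather than block */
        return false;
    }
    /* ... the deallocation itself would go here ... */
    release(inode);
    release(iopen);
    return true;
}
```

Because neither acquire ever blocks, the order in which the two locks are tried cannot deadlock against a path that takes them in the opposite order.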
This fixes the bug reported to me by Rob Kenna.
Cc: Rob Kenna <rkenna@redhat.com>
Signed-off-by: Steven Whitehouse <swhiteho@redhat.com>
This saves the journal recovery result and makes it visible through sysfs.
User space needs to know if the node actually recovered the journal or
tried and gave up.
Signed-off-by: David Teigland <teigland@redhat.com>
Signed-off-by: Steven Whitehouse <swhiteho@redhat.com>
This patch changes the last user of recursive locking so that
it no longer needs this feature and removes it from the glock
layer. This makes the glock code a lot simpler and easier to
understand. It's also a prerequisite to adding support for the
AOP_TRUNCATED_PAGE return code (or at least it is if you don't
want your brain to melt in the process)
I've left in a couple of checks just in case there is some place
else in the code which is still using this feature that I didn't
spot yet, but they can probably be removed long term.
Signed-off-by: Steven Whitehouse <swhiteho@redhat.com>
We don't need the inherited flags since this action can be
implied by setting the flags on directories where they
wouldn't otherwise make sense. It reduces the number of extra
flags by two. Also updated the list of flags to take account of
one extra ext2/3 flag.
Cc: Andreas Dilger <adilger@clusterfs.com>
Signed-off-by: Steven Whitehouse <swhiteho@redhat.com>
Remove select of SYSFS as requested by Greg KH. Change whitespace to
tabs rather than spaces in places where it was incorrect and removed
'default m' as suggested by Adrian Bunk.
Reorganised Makefile as suggested by Sam Ravnborg.
Cc: Sam Ravnborg <sam@ravnborg.org>
Cc: Adrian Bunk <bunk@stusta.de>
Cc: Greg KH <greg@kroah.com>
Signed-off-by: Steven Whitehouse <swhiteho@redhat.com>
As per Andrew Morton's comments, remove unneeded casts and use
wait_event_interruptible() rather than open code the wait.
Cc: Andrew Morton <akpm@osdl.org>
Signed-off-by: Steven Whitehouse <swhiteho@redhat.com>
1. Comment whitespace fix
2. Removed unused header files from dir.c
3. Split the gfs2_dir_get_buffer() function into two functions
Cc: Christoph Hellwig <hch@infradead.org>
Signed-off-by: Steven Whitehouse <swhiteho@redhat.com>
In order to make the file and line number reporting work
correctly, this has been moved back into the header file.
Signed-off-by: Steven Whitehouse <swhiteho@redhat.com>
This is one of the changes related to journal recovery I mentioned a
couple weeks ago. We can get into a situation where there are only
readonly nodes currently mounting the fs, but there are journals that need
to be recovered. Since the readonly nodes can't recover journals, the
next rw mounter needs to go through and check all journals and recover any
that are dirty (i.e. what the first node to mount the fs does). This rw
mounter needs to skip the journals held by the existing readonly nodes.
Skipping those journals amounts to using the TRY flag on the journal locks
so acquiring the lock of a journal held by a readonly node will fail
instead of blocking indefinitely.
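The skip logic can be sketched like this. It is a simplified model, not the recovery code: struct journal, journal_try_lock() and recover_all() are hypothetical, and a boolean stands in for the cluster lock state.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical model of one journal as seen by the rw mounter. */
struct journal {
    bool held_by_other;  /* e.g. a readonly node currently holds its lock */
    bool dirty;          /* needs recovery */
    bool recovered;      /* set once this node has replayed it */
};

/* Hypothetical try-lock: fails at once instead of blocking. */
static bool journal_try_lock(const struct journal *j)
{
    return !j->held_by_other;
}

/* Check every journal, skipping those whose lock cannot be taken.
 * Returns the number of journals recovered. */
static size_t recover_all(struct journal *jd, size_t n)
{
    size_t count = 0;

    for (size_t i = 0; i < n; i++) {
        if (!journal_try_lock(&jd[i]))
            continue;            /* held by a live node: skip it */
        if (jd[i].dirty) {
            jd[i].dirty = false;
            jd[i].recovered = true;
            count++;
        }
    }
    return count;
}
```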
Signed-off-by: David Teigland <teigland@redhat.com>
Signed-off-by: Steven Whitehouse <swhiteho@redhat.com>
At some stage, a mutex was added to gfs2_glock_put() without
checking all its call sites. Two of them were called from
under a spinlock causing random delays at various points and
crashes.
Signed-off-by: Steven Whitehouse <swhiteho@redhat.com>
When allocating memory to sort directory entries, use vmalloc()
rather than kmalloc() since for larger directories, the required
size can easily be greater than the 128k maximum of kmalloc().
Also add the first steps towards handling the AOP_TRUNCATED_PAGE
return code in the glock code by flagging all places where we
request a glock while holding a page lock.
Signed-off-by: Steven Whitehouse <swhiteho@redhat.com>
A typo in the directory code was causing postmark to fail
somewhere in the allocation code, since it was unable to
find newly allocated directory leaf blocks under certain
circumstances.
Signed-off-by: Steven Whitehouse <swhiteho@redhat.com>
A small update to the journaling code to change the way that
the "extra" blocks are accounted for in the journal. These are
used at a rate of one per 503 metadata blocks or one per 251
journaled data blocks (or just one if the total number of journaled
blocks in the transaction is smaller). Since we are using them at
two different rates the old method of accounting for them no longer
works and we count them up as required.
Since the "per transaction" accounting can't handle this (there is no
fixed number of header blocks per transaction) we have to account for
it in the general journal code. We now require that each transaction
reserves more blocks than it actually needs to take account of the
possible extra blocks.
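As a rough illustration of the arithmetic, the sketch below counts extra blocks from the two rates quoted above. It assumes that the two rates apply independently and that each rounds up; that is one reading of the description, not a copy of the journal code, and extra_blocks() is a hypothetical name.

```c
#include <assert.h>

#define META_PER_EXTRA  503u  /* metadata blocks per extra block (from the text) */
#define JDATA_PER_EXTRA 251u  /* journaled data blocks per extra block (from the text) */

static unsigned int div_round_up(unsigned int n, unsigned int d)
{
    assert(d != 0);
    return (n + d - 1) / d;
}

/*
 * Assumption: each kind of block is counted separately and a partial
 * group still costs one whole extra block.
 */
static unsigned int extra_blocks(unsigned int meta, unsigned int jdata)
{
    return div_round_up(meta, META_PER_EXTRA) +
           div_round_up(jdata, JDATA_PER_EXTRA);
}
```

Under this reading, 504 metadata blocks need two extra blocks while 503 need only one, which is why a fixed per-transaction figure no longer works.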
Also a final fix to dir.c to ensure that all ref counts are handled
correctly.
Signed-off-by: Steven Whitehouse <swhiteho@redhat.com>
The last patch missed some other instances of incorrect ref counting;
this fixes all of those too.
Signed-off-by: Steven Whitehouse <swhiteho@redhat.com>
This fixes a ref count bug that sometimes showed up at umount time
(causing it to hang) but is otherwise mostly harmless. At the same
time there are some clean ups, including making the log operations
structures const and moving a memory allocation so that it's not done
in the fast path of checking to see if there is an outstanding
transaction related to a particular glock.
Removes the sd_log_wrap variable which was updated, but never actually
used anywhere. Updates the gfs2 ioctl() to run without the kernel lock
(which it never needed anyway). Removes the "invalidate inodes" loop
from GFS2's put_super routine. This is done in kill super anyway so
we don't need to do it here. The loop was also bogus in that if there
are any inodes "stuck" at this point it's a bug and we need to know
about it rather than hide it by hanging forever.
Signed-off-by: Steven Whitehouse <swhiteho@redhat.com>
This puts the finishing touches to the ioctl support and also
removes a couple of unused fields from GFS2's private per file
structure.
Signed-off-by: Steven Whitehouse <swhiteho@redhat.com>
Some interfaces have changed. In particular one of the posix
locking functions has changed prototype, along with the
address space operation invalidatepage and the block getting
callback to the direct IO function.
In addition add the splice file operations. These will need to
be updated to support AOP_TRUNCATED_PAGE before they will be
of much use to us.
Signed-off-by: Steven Whitehouse <swhiteho@redhat.com>
This is designed as a fs independent way to set flags on a
particular inode. The values of the ioctl() and flags are
designed to be identical to the ext2/3 values. Assuming that
this plan is acceptable to people in general, the plan is to
then move other fs across to using the same set of #defines,
etc.
Signed-off-by: Steven Whitehouse <swhiteho@redhat.com>
Update the debugging code in trans.c and at the same time improve
the debugging code for gfs2_holders. The new code should be pretty
fast during the normal case and provide just as much information
in case of errors (or more).
One small function from glock.c has moved to glock.h as a static inline so
that its return address won't get in the way of the debugging.
Signed-off-by: Steven Whitehouse <swhiteho@redhat.com>
Replace the lock_for_trans()/lock_for_flush() functions with an rwsem.
In fact the sd_log_flush_lock becomes an rwsem (the write part of it)
and is extended slightly to cover everything that the lock_for_flush()
used to cover. The read part of the lock replaces lock_for_trans().
This corrects the races in the original code and reduces the code size.
Signed-off-by: Steven Whitehouse <swhiteho@redhat.com>
Look for "nodir" in the hostdata mount option which is used to
set the NODIR flag in dlm_new_lockspace().
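A sketch of the option check, assuming a plain substring search over the hostdata string is sufficient. The helper name hostdata_has_nodir() is hypothetical and the real parsing may differ.

```c
#include <assert.h>
#include <stdbool.h>
#include <string.h>

/*
 * Hypothetical helper: report whether the hostdata mount option
 * mentions "nodir". Assumes a simple substring match is enough.
 */
static bool hostdata_has_nodir(const char *hostdata)
{
    return hostdata != NULL && strstr(hostdata, "nodir") != NULL;
}
```

The caller would then set the NODIR flag in the flags it passes to dlm_new_lockspace() when this returns true.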
Signed-off-by: David Teigland <teigland@redhat.com>
Signed-off-by: Steven Whitehouse <swhiteho@redhat.com>
This reduces the size of the directory code by about 3k and gets
readdir() to use the functions which were introduced in the previous
directory code update.
Two memory allocations are merged into one. Eliminates zeroing of some
buffers which were never used before they were initialised by
other data.
There is still scope for further improvement in the directory code.
On the logging side, a hand created mutex has been replaced by a
standard Linux mutex in the log allocation code.
Signed-off-by: Steven Whitehouse <swhiteho@redhat.com>
The various flags on inodes will in future be set and queried
via the extended attributes interface, so this interface is no
longer required.
Signed-off-by: Steven Whitehouse <swhiteho@redhat.com>
Due to a typo, the dir leaf split operation was (for the first
split in a directory) writing the new hash values at the
wrong offset. This is now fixed.
Also some other tidy ups are included:
- We use GFS2's hash function for dentries (see ops_dentry.c) so that
we don't have to keep recalculating the hash values.
- A lot of common code is eliminated between the various directory
lookup routines.
- Better error checking on directory lookup (previously different
routines checked for different errors)
- The leaf split operation has a couple of redundant operations
removed from it, so it should be faster.
There is still scope for further clean ups in the directory
code, and readdir in particular could do with slimming down a bit.
Signed-off-by: Steven Whitehouse <swhiteho@redhat.com>
In order to separate out the filesystem's metadata from "normal"
files and directories, a new filesystem type has been created.
It is called gfs2meta and mounting it gives access to the files
that were previously under .gfs2_admin (well still are until mkfs
is altered, which is next on the agenda).
It's not currently possible to mount both gfs2 and gfs2meta on the
same block device at the same time. A future patch will allow that
to happen.
Signed-off-by: Steven Whitehouse <swhiteho@redhat.com>
Fix a bug I introduced earlier with a kfree() and usage of
a structure in the wrong order. Also try and get the counts
of the journaled data buffers "more correct". Still some work
to do in this area though.
Signed-off-by: Steven Whitehouse <swhiteho@redhat.com>
We no longer lookup ".gfs2_admin" in the root directory in order to
find it, but instead use the inode number given in the superblock.
Both the root directory and the admin directory are now looked up using
the same routine, so the redundant code is removed.
Also, there is no longer a reference to the root inode in the
GFS2 super block. When required this can be retrieved via
sb->s_root->d_inode instead.
Assuming that we introduce a metadata filesystem type for GFS, then
this is a first step towards that goal.
Signed-off-by: Steven Whitehouse <swhiteho@redhat.com>
For every filesystem operation where we need a transaction, we
now make one less memory allocation.
Signed-off-by: Steven Whitehouse <swhiteho@redhat.com>
As suggested by Pekka Enberg <penberg@cs.helsinki.fi>.
The DIV_RU macro is renamed DIV_ROUND_UP and moved to kernel.h.
The other macros are gone from gfs2.h, as are (although not requested
by Pekka Enberg) a number of included header files, which are now
included individually. The inode number comparison function is
now an inline function.
The DT2IF and IF2DT may be addressed in a future patch.
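For reference, a round-up division macro of this kind is usually written as below. The macro body is the common form; blocks_needed() is a hypothetical caller added only to show a use.

```c
#include <assert.h>

/* Common form of the round-up division macro. Note that d is
 * evaluated twice, so it should not have side effects. */
#define DIV_ROUND_UP(n, d) (((n) + (d) - 1) / (d))

/* Hypothetical example: how many blocks of block_size bytes are
 * needed to hold len bytes. */
static unsigned long blocks_needed(unsigned long len, unsigned long block_size)
{
    assert(block_size != 0);
    return DIV_ROUND_UP(len, block_size);
}
```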
Signed-off-by: Steven Whitehouse <swhiteho@redhat.com>
As requested by:
Pavel Machek <pavel@suse.cz>
Pavel's other comments will be dealt with in later patches.
Signed-off-by: Steven Whitehouse <swhiteho@redhat.com>
All printk calls now have KERN_ set where required and a couple of
kmalloc(), memset(.., 0, ...) calls changed to kzalloc().
This is in response to comments from:
Pekka Enberg <penberg@cs.helsinki.fi> and
Eric Sesterhenn <snakebyte@gmx.de>
Signed-off-by: Steven Whitehouse <swhiteho@redhat.com>
This is the second of two patches removing support for range
locks from the DLM
Signed-off-by: David Teigland <teigland@redhat.com>
Signed-off-by: Steven Whitehouse <swhiteho@redhat.com>
Heinz had spotted that I'd forgotten to test in databuf_lo_add()
that the data buffer in question hadn't already been added to
the list. This was causing an infinite loop later on in the
"before commit" routine.
This means that GFS2 is now ready to be tested by everybody.
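The missing check described above can be illustrated with a small userspace model of a circular doubly linked list. This is a sketch, not the lops.c code: struct databuf and databuf_add_once() are hypothetical, and it assumes an entry is self-linked whenever it is not on a list.

```c
#include <assert.h>
#include <stdbool.h>

/* Minimal circular doubly linked list, modelled on the kernel idiom. */
struct list_head {
    struct list_head *next, *prev;
};

static void list_init(struct list_head *h) { h->next = h; h->prev = h; }
static bool list_empty(const struct list_head *h) { return h->next == h; }

static void list_add(struct list_head *entry, struct list_head *head)
{
    entry->next = head->next;
    entry->prev = head;
    head->next->prev = entry;
    head->next = entry;
}

/* Hypothetical stand-in for a data buffer tracked by the log. */
struct databuf {
    struct list_head list;  /* self-linked while not on any list */
};

/* Add the buffer only if it is not already linked. Adding an entry a
 * second time can leave a cycle that never returns to the list head,
 * so a later walk of the list would never terminate. */
static bool databuf_add_once(struct databuf *bd, struct list_head *log_list)
{
    if (!list_empty(&bd->list))
        return false;       /* already queued */
    list_add(&bd->list, log_list);
    return true;
}

/* Walk the list and count entries; terminates only if the list is sane. */
static unsigned int list_count(const struct list_head *head)
{
    unsigned int n = 0;

    for (const struct list_head *p = head->next; p != head; p = p->next)
        n++;
    return n;
}
```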
Signed-off-by: Steven Whitehouse <swhiteho@redhat.com>
As well as a number of minor bug fixes, this patch changes GFS
to use mutexes rather than semaphores. This results in better
information in case there are any locking problems.
Signed-off-by: Steven Whitehouse <swhiteho@redhat.com>
There was a bug in the unstuffing logic which caused a crash
under certain circumstances. This is now fixed.
Signed-off-by: Steven Whitehouse <swhiteho@redhat.com>
Two internal files which are read through the gfs2_internal_read()
routine were already locked when the routine was called and thus
do not need locking at the readpages level.
This patch introduces a struct file which is used as a sentinel
so that readpage will only perform locking in the case that the
struct file passed to it is _not_ equal to this sentinel.
Since the comments in the generic kernel code indicate that the
struct file will never be used for anything other than passing
straight through to readpage(), this should be ok.
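The sentinel technique comes down to a pointer comparison, sketched here in userspace. The type struct file_stub and the names internal_file_sentinel and readpage_needs_lock() are hypothetical stand-ins, not the identifiers used in GFS2.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for struct file. */
struct file_stub {
    int unused;
};

/* Only the address of this object matters; its contents are never read. */
static struct file_stub internal_file_sentinel;

/* Lock unless the caller passed the sentinel, which signals that the
 * lock is already held by the internal read path. */
static bool readpage_needs_lock(const struct file_stub *file)
{
    return file != &internal_file_sentinel;
}
```

Any other pointer, including NULL, takes the normal locking path, so only the internal read path that passes the sentinel skips the lock.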
Signed-off-by: Steven Whitehouse <swhiteho@redhat.com>
This patch adds back O_DIRECT support with various caveats
attached:
1. Journaled data can be read via O_DIRECT since it's now the
same on disk format as normal data files.
2. Journaled data writes with O_DIRECT will be failed silently
back to normal writes (should we really do this I wonder or
should we return an error instead?)
3. Stuffed files will be failed back to normal buffered I/O
4. All the usual corner cases (write beyond current end of file,
write to an unallocated block) will also revert to normal buffered I/O.
The I/O path is slightly odd as reads arrive at the page cache layer
with the lock for the file already held, but writes arrive unlocked.
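The caveats listed above amount to a small decision table, sketched below. The struct and function names are hypothetical, and the function only models the listed rules; it is not the code path in the patch.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical description of one O_DIRECT request. */
struct dio_request {
    bool is_write;
    bool jdata;        /* file uses journaled data */
    bool stuffed;      /* file data is stored inside the inode block */
    bool beyond_eof;   /* write extends past the current end of file */
    bool unallocated;  /* write targets an unallocated block */
};

/* Returns true if the request can go direct, false if it falls back
 * to normal buffered I/O. */
static bool can_use_direct_io(const struct dio_request *r)
{
    if (r->stuffed)
        return false;          /* caveat 3 */
    if (!r->is_write)
        return true;           /* caveat 1: reads, including jdata */
    if (r->jdata)
        return false;          /* caveat 2 */
    if (r->beyond_eof || r->unallocated)
        return false;          /* caveat 4 */
    return true;
}
```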
Signed-off-by: Steven Whitehouse <swhiteho@redhat.com>
There were one or two fields in structures which didn't get changed
last time back to their gfs1 sizes and alignments. One or two constants
have also changed back to their original values which were missed the
first time.
It's possible that indirect pointer blocks might need to change. If
they don't we'll have to rewrite them all on upgrade due to a change
in the amount of padding that they use.
Signed-off-by: Steven Whitehouse <swhiteho@redhat.com>
Umount is now working correctly again. The bug was due to
not getting an extra ref count when mounting the fs. We
should have bumped it by two (once for the internal pointer
to the root inode from the super block and once for the
inode hanging off the dcache entry for root).
Also this patch tidies up the code dealing with looking up
and creating inodes. We now pass Linux inodes (with gfs2_inodes
attached) rather than the other way around and this reduces code
duplication in various places.
Signed-off-by: Steven Whitehouse <swhiteho@redhat.com>
This is a very large patch, with a few still to be resolved issues
so you might want to check out the previous head of the tree since
this is known to be unstable. Fixes for the various bugs will be
forthcoming shortly.
This patch removes the special data format which has been used
up till now for journaled data files. Directories still retain the
old format so that they will remain on disk compatible with earlier
releases. As a result you can now do the following with journaled
data files:
1) mmap them
2) export them over NFS
3) convert to/from normal files whenever you want to (the zero length
restriction is gone)
In addition the level at which GFS' locking is done has changed for all
files (since they all now use the page cache) such that the locking is
done at the page cache level rather than the level of the fs operations.
This should mean that things like loopback mounts and other things which
touch the page cache directly should now work.
Current known issues:
1. There is a lock mode inversion problem related to the resource
group hold function which needs to be resolved.
2. Any significant amount of I/O causes an oops with an offset of hex 320
(NULL pointer dereference) which appears to be related to a journaled data
buffer appearing on a list where it shouldn't be.
3. Direct I/O writes are disabled for the time being (will reappear later)
4. There is probably a deadlock between the page lock and GFS' locks under
certain combinations of mmap and fs operation I/O.
5. Issue relating to ref counting on internally used inodes causes a hang
on umount (discovered before this patch, and not fixed by it)
6. One part of the directory metadata is different from GFS1 and will need
to be resolved before next release.
Signed-off-by: Steven Whitehouse <swhiteho@redhat.com>
Update the function in GFS2 which deals with truncation of
partial blocks. Some of the code is "borrowed" from ext3
since it appears to give a good model of how to do this
operation. The function is renamed gfs2_block_truncate_page
accordingly.
Signed-off-by: Steven Whitehouse <swhiteho@redhat.com>
Add the new external read function. It's temporarily in jdata.c
even though the prototype is in ops_file.h - this will change
shortly. The current implementation will change to a page cache
one when that happens.
In order to effect the above changes, the various internal inodes
now have Linux inodes attached to them. We keep the references to
the Linux inodes, rather than the gfs2_inodes in the super block.
In order to get everything to work correctly I've had to reorder
the init sequence on mount (which I should probably have done
earlier when .gfs2_admin was made visible).
Signed-off-by: Steven Whitehouse <swhiteho@redhat.com>
Copy & rename various jdata functions into dir.c. The plan
being that directory metadata format will not change although
the journalled data format for "normal" files will change.
Signed-off-by: Steven Whitehouse <swhiteho@redhat.com>
The code in this file is no longer used as the rindex file is
now accessible to userspace, so can be read/written directly.
Signed-off-by: Steven Whitehouse <swhiteho@redhat.com>
For some reason a function pointer was being passed through
the truncate code which only ever took one value. This removes
the function pointer and replaces it with a single call to
the function in question.
Signed-off-by: Steven Whitehouse <swhiteho@redhat.com>
Since we'll need to pin data if we are going to journal it,
I'm renaming this function to make it less confusing. It might also
be worth moving it into lops.c since there are no users outside that
file.
Signed-off-by: Steven Whitehouse <swhiteho@redhat.com>
I'd like to be rid of these memory allocations entirely so far as is
possible. For the moment though, mark them GFP_NOFS to make them less
harmful.
Signed-off-by: Steven Whitehouse <swhiteho@redhat.com>
Removing the gfs2_databuf structure and using gfs2_bufdata instead
is a step towards allowing journaling of data without requiring the
metadata header on each journaled block. The idea is to merge the
code paths for ordered data with that of journaled data, with the
log operations in lops.c taking account of the different types of
buffers as they are presented to it. Largely the code path for
metadata will be similar too, but obviously through a different set
of log operations.
Signed-off-by: Steven Whitehouse <swhiteho@redhat.com>
Passes the flag through to ensure that the correct log operations are
invoked when the flag is set.
Signed-off-by: Steven Whitehouse <swhiteho@redhat.com>
This adds an extra argument to gfs2_trans_add_bh() to indicate whether the
bh being added to the transaction is metadata or data. It's currently unused
since all existing callers set it to 1 (metadata) but following patches will
make use of it.
Signed-off-by: Steven Whitehouse <swhiteho@redhat.com>
We no longer allocate a dinode on the stack in init_dinode()
and we no longer use gfs2_dinode_out (eliminating one copy) and
gfs2_meta_header_in (eliminating another copy). The meta_header_in
function is now no longer referenced from outside gfs2_ondisk.c, so
make it static.
Signed-off-by: Steven Whitehouse <swhiteho@redhat.com>
Update the dlm interface module to take account of the recently
removed third argument to kobject_uevent()
Signed-off-by: David Teigland <teigland@redhat.com>
Signed-off-by: Steve Whitehouse <swhiteho@redhat.com>
This brings the lock modules up to date and removes the stray
.mod.c file which accidentally got included in the last check in.
Signed-off-by: David Teigland <teigland@redhat.com>
Signed-off-by: Steven Whitehouse <swhiteho@redhat.com>
This patch contains the pluggable locking modules
for GFS2.
Signed-off-by: David Teigland <teigland@redhat.com>
Signed-off-by: Steven Whitehouse <swhiteho@redhat.com>
This patch contains all the core files for GFS2.
Signed-off-by: David Teigland <teigland@redhat.com>
Signed-off-by: Steven Whitehouse <swhiteho@redhat.com>